In an effort to publicly write more blogs, read more research papers, and formulate better opinions on safety evaluations, I’m starting this series of notes where I read through a research paper, particularly on multi-agent evals, and provide some of my thoughts. In this post, I focus on MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents by Zhu et al.
What is the MultiAgentBench paper about?
In a nutshell…
Researchers evaluate task completion, task performance, and coordination of multi-agent setups in various task-based and social-simulation-based scenarios. They initialize and execute the workflows of multi-agent setups using a framework they created (called MARBLE), then use their MultiAgentBench benchmark to evaluate the systems. From these experiments, they found that 4o mini-based setups generally perform better than others across various scenarios. The researchers also mention that while coordination contributes to task performance, it is model capability that primarily drives it.
So, why is this useful?
The researchers state that having a benchmark that measures coordination on top of task performance across multiple domains addresses the lack of evaluation approaches that cover the dynamics of multi-agent systems. If people continue to create more complex multi-agent systems for more complex use cases, then it would be useful to have a benchmark that captures multi-agent interactions that potentially lead to higher performance (or more safety concerns!).
What exactly is MARBLE?
To provide more context, Multi-Agent Coordination Backbone with LLM Engine (MARBLE) is a framework that allows agents to coordinate and accomplish tasks within environments. It contains multiple modules, all centered on the Coordination Engine, the module that initializes and controls the others, such as the Agent Graph and the Cognitive Module. The Coordination Engine initializes the agents, the relationships between these agents, and the tasks that need to be accomplished based on the supported coordination protocols, which are the following (a minimal sketch of these topologies follows the list):
Centralized Protocols
Star: One central planner that assigns tasks to all actors
Tree: A hierarchy containing a top-level planner, sub-planners, and actors — like a traditional corporation.
Decentralized Protocols
Graph: A network of interconnected actors that communicate with each other
Chain: Actors arranged sequentially
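To make the topologies concrete, here is a minimal sketch of how the four protocols could be wired as adjacency lists. This is my own illustration in Python, not MARBLE’s actual API; the function names and the AgentGraph alias are assumptions.

```python
from typing import Dict, List

# Each agent maps to the list of agents it can exchange messages/tasks with.
AgentGraph = Dict[str, List[str]]


def star(planner: str, actors: List[str]) -> AgentGraph:
    """Centralized: one central planner assigns tasks to every actor."""
    return {planner: list(actors), **{a: [planner] for a in actors}}


def tree(planner: str, sub_planners: Dict[str, List[str]]) -> AgentGraph:
    """Centralized: top-level planner -> sub-planners -> actors, like a corporate hierarchy."""
    g: AgentGraph = {planner: list(sub_planners)}
    for sub, actors in sub_planners.items():
        g[sub] = [planner] + actors
        for a in actors:
            g[a] = [sub]
    return g


def chain(actors: List[str]) -> AgentGraph:
    """Decentralized: actors arranged sequentially, each talking to its neighbors."""
    return {
        a: [n for n in (actors[i - 1] if i > 0 else None,
                        actors[i + 1] if i < len(actors) - 1 else None) if n]
        for i, a in enumerate(actors)
    }


def graph(edges: List[tuple]) -> AgentGraph:
    """Decentralized: a network of interconnected actors that communicate with each other."""
    g: AgentGraph = {}
    for u, v in edges:
        g.setdefault(u, []).append(v)
        g.setdefault(v, []).append(u)
    return g


if __name__ == "__main__":
    print(star("planner", ["a1", "a2", "a3"]))
    print(chain(["a1", "a2", "a3"]))
```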
What exactly is MultiAgentBench?
MultiAgentBench, which is used to evaluate MARBLE, is essentially a multi-agent benchmark that measures task completion, task performance, and coordination via new metrics the researchers have designed. For task completion, the researchers introduced milestone-based key performance indicators (KPIs). Basically, an LLM-based detector checks whether a milestone has been achieved and assigns contributions to agents. For task performance, the metric is either an LLM judge’s evaluation or a rules-based score, depending on whether the scenario is task-based or social-simulation-based.
Task-based Scenarios (Mutual Goals)
Research Tasks: Agents co-author a research proposal based on a topic
Minecraft: Agents collaboratively build structures in a shared environment
Database Error Analysis: Agents diagnose and fix database errors
Coding Challenges: Agents solve coding problems and develop software
Social-simulation-based Scenarios (Conflicting Goals)
Werewolf: Two groups of agents face off and deceive each other in a predefined narrative.
Bargaining: Agents try to negotiate over resources
For coordination, the metric is simply the average of a communication score and a planning score computed from aggregated data.
A few more details on the task completion, task performance, and coordination metrics
Task Completion Metrics
So, scenarios can be task-based or social-simulation-based, but they all have goals, and each goal has milestones associated with it. The researchers set up an LLM-based detector that checks, at every iteration, whether the system of agents has achieved a milestone. If a milestone has been achieved, contributions are recorded. The individual KPI of an agent is then the ratio of the milestones it contributed to (n_j) to the total number of milestones (M). The overall KPI is the average of these ratios across all agents (N).
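Here is a minimal sketch of how I read the milestone KPIs, assuming the detector has already recorded which agent contributed to which milestone (the data structures are mine, not the authors’ code): each agent’s KPI is n_j / M, and the overall KPI averages that ratio over the N agents.

```python
from typing import Dict, Set


def milestone_kpis(contributions: Dict[str, Set[str]], milestones: Set[str]) -> Dict[str, float]:
    """Per-agent KPI: contributions maps agent id -> set of milestone ids it contributed to."""
    M = len(milestones)
    return {agent: len(hit & milestones) / M for agent, hit in contributions.items()}


def overall_kpi(contributions: Dict[str, Set[str]], milestones: Set[str]) -> float:
    """Overall KPI: average of the per-agent ratios across all N agents."""
    per_agent = milestone_kpis(contributions, milestones)
    return sum(per_agent.values()) / len(per_agent)


if __name__ == "__main__":
    milestones = {"m1", "m2", "m3", "m4"}
    contributions = {"agent_a": {"m1", "m2"}, "agent_b": {"m2", "m3", "m4"}}
    print(milestone_kpis(contributions, milestones))  # {'agent_a': 0.5, 'agent_b': 0.75}
    print(overall_kpi(contributions, milestones))     # 0.625
```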
Task Performance Metrics
Aside from the milestones, the final outputs per scenario are checked. Depending on the scenario, the evaluation metric is either an LLM judge’s score or a rules-based score. For example, the research scenario uses an LLM judge to score the research proposal created by the agents on a five-point scale. The LLM judge, in this scenario, looks at two core aspects, innovation and feasibility, and provides an overall score to the multi-agent system. To ensure that the LLM judges are reliable, alignment with human evaluators is checked (perhaps using inter-rater reliability metrics?).
Coordination Metrics
Coordination/collaboration is the average of the communication and planning scores, which are evaluated by an LLM judge on a five-point scale. Communication has its own evaluation prompt, which shows that it is judged on effective decision-making, clarity, adherence to social relationships, alignment with agent profiles, and overall effectiveness. Planning, on the other hand, is judged on clarity of task assignment, definition of roles, workload distribution, effectiveness of outcomes, and overall strategic coordination.
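Putting the coordination score together then looks roughly like the sketch below. The rubric strings only paraphrase the criteria listed above, and the judge is a stand-in callable, not the paper’s actual prompts or judge model.

```python
from statistics import mean
from typing import Callable

# An LLM judge takes the aggregated logs plus a rubric and returns a 1-5 score.
Judge = Callable[[str, str], float]

COMMUNICATION_RUBRIC = (
    "Score 1-5 based on effective decision-making, clarity, adherence to social "
    "relationships, alignment with agent profiles, and overall effectiveness."
)
PLANNING_RUBRIC = (
    "Score 1-5 based on clarity of task assignment, definition of roles, workload "
    "distribution, effectiveness of outcomes, and overall strategic coordination."
)


def coordination_score(logs: str, judge: Judge) -> float:
    """Average the LLM-judged communication and planning scores."""
    return mean([judge(logs, COMMUNICATION_RUBRIC), judge(logs, PLANNING_RUBRIC)])


if __name__ == "__main__":
    # Stub judge for illustration only; a real setup would call an LLM here.
    fake_judge: Judge = lambda logs, rubric: 4.0
    print(coordination_score("aggregated conversation and planning logs", fake_judge))  # 4.0
```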
Thoughts
On the metrics
As someone who is particularly concerned about potential collusion and miscoordination leading to safety issues, I’m not so interested in the finding that 4o mini-based multi-agent setups perform better than others in these scenarios. What I really wanted to scrutinize was how the metrics were formulated and put together to form strong conclusions on coordination (and perhaps how it relates to performance and safety?). I’m not surprised that I did not get what I wanted from this paper, since I find the metrics a bit fragmented. To be completely honest, I’m not sure how the new milestone-based KPIs were tied back to the task scores; it was also not clearly stated how the task scores relate to the coordination metric. What I got was the finding that the coordination metric isn’t the primary driver of task performance, but even for this I’m not entirely convinced, as there are many factors that could lead to high coordination and low task performance.
Despite this, I do think there’s value in building on some of the ways the researchers measured task completion, task performance, and coordination. I would have appreciated an approach that used the milestones as markers, not just for agent contributions, but for when agents communicated, planned, collaborated, tried to convince one another, or changed strategies. Using milestones in this manner is beneficial because it adds a temporal aspect to the analysis. Perhaps checking for milestones per iteration in that way could have allowed the researchers to characterize some agents as detractors, neutrals, or champions, similar to how some organizations map stakeholders. Alternatively, it could have been used to detect change (who switched sides?) when the researchers did the emergent behavior analysis.
On another note, I think there’s room to improve the coordination score since it does not fully encapsulate what it means to coordinate. Averaging the LLM-based communication and planning scores based on aggregated data (conversation logs?) does not capture dynamic changes across iterations, since the aggregation provides more of an overall snapshot. The communication score will not show when an agent strategically stays quiet or over-communicates, both of which can be marks of a better collaborator (I’d like a teammate who knows when to shut up!). It just shows the overall effectiveness of an agent when it comes to communicating. Additionally, there are situations wherein an agent collaborates with one agent by responding tersely or ambiguously to another, opposing agent. This is not captured by an aggregated communication score that puts a premium on clear communication. Surely there are other pathways worth exploring when it comes to formulating some kind of coordination score without being too simplistic, but this task is admittedly hard; I don’t blame anyone for sticking to the average of communication and planning scores. In fact, I would probably first try detecting coordination protocol changes (or checking if any agent switches sides during an iteration) in a sandboxed multi-agent system, and only then try to create a single coordination score that fully describes what it means to coordinate.
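To make that last idea a bit more concrete, here is a toy sketch of what I have in mind (entirely my own illustration, not anything from the paper): track which side or coalition each agent appears to be on at every iteration, then flag the iterations where that changes.

```python
from typing import Dict, List


def detect_side_switches(history: List[Dict[str, str]]) -> Dict[str, List[int]]:
    """history[i] maps agent id -> inferred side/coalition at iteration i.

    Returns, per agent, the iterations at which its side changed.
    """
    switches: Dict[str, List[int]] = {}
    for i in range(1, len(history)):
        for agent, side in history[i].items():
            if agent in history[i - 1] and history[i - 1][agent] != side:
                switches.setdefault(agent, []).append(i)
    return switches


if __name__ == "__main__":
    history = [
        {"wolf_1": "wolves", "villager_1": "village", "villager_2": "village"},
        {"wolf_1": "wolves", "villager_1": "village", "villager_2": "wolves"},
    ]
    print(detect_side_switches(history))  # {'villager_2': [1]}
```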
On multi-agent setup
Aside from the metrics, I would have liked to see a multi-agent setup without direct task configurations, strict roles, or clear objectives. The agents would be deployed in an environment where they have access to tools and the liberty to team up with or go against other agents to come up with a solution. Yes, this setup is trickier to create a rules-based score (or even an LLM-based score) for, but I think it would be easier to see the relationship between high performance and collaborative tendencies by monitoring milestones plus behavioral changes in free environments than by trying to create a hard score for toy examples with strict objectives. This would more closely replicate agents roaming around the Internet, or deploying a team of general agents to solve unique problems, which are some of the ways I could see people using powerful multi-agent systems.