
Multi-Agent Teams Hold Experts Back

Published 1 Feb 2026 in cs.MA and cs.AI | (2602.01011v2)

Abstract: Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.

Summary

  • The paper demonstrates that multi-agent LLM teams systematically underperform their top individual agent due to diluted expertise.
  • It employs controlled experiments on human teamwork tasks and ML benchmarks, revealing synergy gaps from 8.1% to 37.6%.
  • The study highlights that consensus-driven compromise hinders expertise leveraging, indicating a need for architectural adjustments.

Multi-Agent LLM Teams Fail to Harness Expert Performance

Problem Formulation and Motivation

The paper "Multi-Agent Teams Hold Experts Back" (2602.01011) rigorously interrogates the hypothesis that groups of LLM-based agents collaborating via unconstrained deliberative protocols can achieve strong synergy—specifically, whether teams match or exceed the performance of their most capable individual. Drawing on the organizational psychology literature, the authors contrast these open-ended self-organizing multi-agent LLM teams with human teams, where strong synergy is often demonstrated once expertise is identified within the group.

The central finding is that LLM teams, across a broad spectrum of tasks and agent compositions, systematically fail to capitalize on expertise within the group. This underperformance holds even when expertise is explicitly revealed to the group, and is observable in both classic human teamwork benchmarks and challenging modern ML evaluation datasets.

Figure 1: Multi-agent teams fail to leverage expertise—(left) the expert outperforms the team, (center) team discussion prioritizes consensus over deference, (right) increasing team size exacerbates performance loss.

Experimental Framework

The evaluation is conducted in two experimental regimes:

  1. Controlled Human Teamwork Tasks: Canonical decision tasks (NASA Moon Survival, Lost at Sea, Student Body President) are adapted for LLM agents. Expertise is operationalized through controlled assignment of ground truth or privileged information, allowing for precise manipulation of expertise distribution (concentrated vs. distributed).
  2. Frontier ML Benchmarks: State-of-the-art models (diverse GPT and Claude variants among others) are assigned to multi-agent teams and assessed on benchmarks such as MMLU Pro, GPQA Diamond, SimpleQA, HLE (Humanity's Last Exam), and MATH-500. Expertise here is defined operationally as the agent(s) with the correct answer per problem, naturally distributing expertise across problems.

Teams of four models deliberate over four discussion rounds before submitting a collective answer. Four information conditions are tested: no information about expertise, non-explicit expertise, explicit revelation of expertise, and an upper bound derived from always selecting the best individual for each problem.
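The deliberation loop described above can be sketched as follows. This is a minimal illustration, not the paper's released harness: `query_model` is a hypothetical stand-in for an actual LLM call, and the prompts and answer-extraction logic are placeholders.

```python
import random

def query_model(agent, prompt):
    # Hypothetical stand-in for a real LLM API call; here each agent
    # just returns a canned message so the loop is runnable.
    return f"{agent}: opinion given context of {len(prompt)} chars"

def deliberate(agents, task, rounds=4):
    """Sketch of unconstrained deliberation: every agent sees the
    running transcript and speaks once per round, in random order."""
    transcript = []
    for _ in range(rounds):
        order = random.sample(agents, len(agents))  # randomized speaking order
        for agent in order:
            context = task + "\n" + "\n".join(transcript)
            transcript.append(query_model(agent, context))
    # Collective answer: naively, the final message of the discussion.
    return transcript[-1], transcript

answer, log = deliberate(["gpt", "claude", "gemini", "llama"], "Rank the items...")
```

With four agents and four rounds, the transcript contains sixteen messages; a real harness would parse a structured team answer from the final round rather than taking the last utterance verbatim.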

Key Quantitative Results

Across all settings, LLM teams exhibit persistent synergy gaps—quantifiable underperformance relative to the top individual agent. The severity of this gap varies with task, team size, and information structure but is consistently nontrivial.

Figure 2: Concentrated Expertise—teams fail to close the gap to the expert, with explicit expert revelation offering minimal improvement.

On classic human teamwork tasks, relative synergy gaps ranged from 55% to over 110% in L1 error relative to the expert baseline, depending on expertise distribution and task (Appendix Table). Even with hand-optimized prompts that aggressively encourage deference to revealed experts, teams exhibited only marginal improvements, with the primary bottleneck identified as a failure to leverage rather than identify expertise.

On ML benchmarks, synergy gaps were consistently observed:

  • SimpleQA: 18.7%
  • GPQA Diamond: 16.4%
  • HLE Text-Only: 37.6%
  • MATH-500: 15.2%
  • MMLU Pro: 8.1%

Figure 3: MMLU Pro—team accuracy trails the "At Least One Correct" upper bound, indicating missed opportunities for expert exploitation.

Figure 4: SimpleQA—relative synergy gaps persist across ML benchmarks, with teams underperforming the expert-optimized baseline.

Figure 5: GPQA Diamond—strong individual performance by one agent is not capitalized upon collectively.

These findings generalize across the studied tasks: expertise leveraging is consistently inadequate even under conditions most favorable to its emergence.

Expertise Dilution and Team Size

A pronounced expertise dilution effect is observed as team size increases: performance falls progressively further below the expert baseline in larger teams, and the effect is robust to model family heterogeneity.

Figure 6: Rising error with team size in NASA Moon Survival demonstrates robust expertise dilution irrespective of composition.

Figure 7: Unaveraged run-level data confirm expertise dilution is not an artifact of aggregation—strong negative trend in individual outcomes.

Figure 8: Averaged by configuration, all model mixture types (Anthropic/OpenAI/mixed) demonstrate parallel performance deterioration as teams scale.

Mechanistic Insights: Compromise over Deference

Conversational analysis of the deliberation logs reveals that LLM teams predominantly engage in integrative compromise—averaging conflicting views between expert and non-expert members—instead of adopting the expert opinion wholesale (epistemic deference). Non-experts often propose middle-ground solutions even when epistemic authority is explicitly indicated, and experts may exhibit "epistemic flexibility," diluting their own authoritative stances in the face of group consensus pressure.

The paper attributes this behavioral bias toward consensus to RLHF and similar alignment procedures, which reinforce non-divisive, agreeable interaction. It reports a significant negative correlation between compromise frequency and performance, and a positive correlation between deference events and improved outcomes in contexts requiring expertise utilization.

(Figure 1 panel 2 is illustrative: non-experts compromise rather than defer. Transcript and code analyses in the appendix provide detailed evidence.)

Robustness against Adversarial Input

An intriguing duality emerges: the same group convergence mechanisms that hinder expertise leveraging also confer robustness to adversarial sabotage. When one team member is instructed to act adversarially, the team's performance degrades minimally. The consensus process naturally neutralizes outlier (including adversarial) contributions by majority weighting.

Figure 9: Lost at Sea—adversarial member inclusion results in minimal degradation, as group consensus suppresses single-member sabotage.

Figure 10: NASA Moon Survival—robustness to adversaries persists across team sizes and model configurations.
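This two-sided effect of consensus can be illustrated with a toy rank-aggregation example (not the paper's method; the averaging rule and data are invented for illustration): the same averaging that neutralizes a single saboteur would equally dilute a single correct expert.

```python
def consensus_rank(rankings):
    """Toy aggregation: average each item's rank across agents,
    then order items by that mean rank (lower = more important)."""
    n = len(rankings[0])
    mean_rank = [sum(r[i] for r in rankings) / len(rankings) for i in range(n)]
    return sorted(range(n), key=lambda i: mean_rank[i])

# Three agents agree on an ordering; one adversary inverts it.
# The average barely moves, so the saboteur is neutralized --
# but a lone correct expert would be diluted in exactly the same way.
majority = [[0, 1, 2, 3]] * 3
adversary = [[3, 2, 1, 0]]
```

Here `consensus_rank(majority + adversary)` still returns the majority ordering `[0, 1, 2, 3]`, showing why a single outlier, malicious or expert, cannot shift the group.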

Practical and Theoretical Implications

These results have profound implications for multi-agent LLM system design:

  • Emergent self-organization is insufficient for expertise exploitation: Without explicit role/wiring, LLM teams will not match human-level strong synergy, especially in tasks dependent on differential knowledge.
  • Prompt engineering and explicit expert labeling is not enough: Even aggressively optimized prompts for deference fail to close the gap; architectural and alignment changes are likely required.
  • Compromise behavior creates a robustness-utility tradeoff: Alignment protocols that favor agreeableness and consensus will inherently suppress both adversarial influence and the integration of true expertise.
  • Scaling teams worsens performance in expertise-asymmetric regimes: Naive scaling (e.g., increasing the number of agents) is actually harmful for expertise utilization.

In the broader context, the work supports a more skeptical interpretation of unconstrained multi-agent “wisdom of crowds” effects in LLM collectives, sharply delineating their limits in high-expertise/reliability domains.

Directions for Future Work

Improving expertise leveraging in LLM teams likely requires training paradigms or architectural changes that explicitly reward epistemic authority backed by demonstrated evidence, possibly via mechanisms inspired by human group dynamics (trust calibration, hierarchical role emergence, demonstration channels) or via structural models for dynamically detecting epistemic status. Balancing robustness and utility remains a central obstacle: empowering designated experts can, by the same token, amplify the influence of adversarial agents.

Conclusion

LLM multi-agent teams, in their current form, are robust yet fundamentally limited: they systematically underperform the best individual in expertise-rich scenarios due to a bias towards consensus-driven compromise over rational epistemic deference. While this compromise affords resilience against adversarial threats, it fatally constrains the capacity for emergent team-level superperformance. Overcoming these limitations will require interventions at the level of alignment, authority modeling, and possibly rethinking the design of open-ended AI collectives.

Explain it Like I'm 14

Overview

This paper looks at how teams of AI chatbots (large language models, or LLMs) work together on problems. The big question: can a team of AI agents do at least as well as the best individual agent on the team? In human teamwork, when people know who the expert is, they usually follow that person and match their level. The paper tests whether AI teams can do the same without being given strict roles or rules.

Key Questions

  • Can AI teams reach “strong synergy,” meaning the team does as well as or better than the best individual member?
  • If teams fail, is the problem finding who the expert is or actually listening to and using the expert’s knowledge?
  • What team behaviors (like how they talk, team size, and handling bad actors) affect whether teams use expertise correctly?

How Did They Study It?

The researchers set up small teams of AI models and let them talk for several rounds to agree on a final answer. They tested them on two types of tasks:

Human-inspired teamwork tasks

These are classic group decision problems used with humans:

  • NASA Moon Survival and Lost at Sea: rank items by importance in survival scenarios.
  • Student Body President: pick the best candidate using both shared and hidden information.

These tasks let the researchers control who has expert information and whether the team knows who the expert is.

Modern AI test benchmarks

They also used tough academic and reasoning tests (like MMLU Pro, GPQA Diamond, HLE, MATH-500, SimpleQA) where different models are good at different questions. This checks how teams perform when expertise naturally varies from problem to problem.

Team setup and conditions

Teams had 4 AI agents that:

  • Shared their initial opinions,
  • Discussed for 4 rounds,
  • Produced a final team answer.

They tried four information conditions:

  • No Information: nobody gets special info about expertise.
  • Expert Not Mentioned: one or more agents have expert info, but the team is not told who.
  • Reveal Expert: the team is explicitly told who the expert is.
  • Best Individual: the expert agent answers alone (the benchmark for “best member”).

Measuring success

They used “strong synergy”: does the team match or beat the best individual? The “synergy gap” is how much the team falls short compared to the best member. For the AI benchmarks, they also used “At Least One Correct,” which means imagining the team perfectly picks the right agent for each question—an upper bound on how good the team could be.
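These two metrics are easy to compute from per-problem correctness records. The sketch below assumes each agent's results are stored as 0/1 lists over the same problems; the function and variable names are illustrative, not taken from the paper's released code.

```python
def best_individual(scores):
    """Mean accuracy of the single best agent.
    `scores` maps agent name -> list of 0/1 correctness per problem."""
    return max(sum(s) / len(s) for s in scores.values())

def at_least_one_correct(scores):
    """Upper bound: fraction of problems where at least one agent is
    correct, i.e. what a perfect expert-picker would score."""
    per_problem = list(zip(*scores.values()))
    return sum(1 for p in per_problem if any(p)) / len(per_problem)

def synergy_gap(team_acc, baseline_acc):
    """Relative shortfall of the team versus a baseline (best member
    or the At-Least-One-Correct bound); positive = team underperforms."""
    return (baseline_acc - team_acc) / baseline_acc

# Three agents, four problems, complementary strengths.
scores = {"a": [1, 1, 0, 0], "b": [0, 1, 1, 0], "c": [0, 0, 0, 1]}
```

On this toy data the best individual scores 0.5, while the At-Least-One-Correct bound is 1.0, illustrating how complementary expertise widens the gap between what any one agent achieves and what a team could in principle achieve.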

Conversation analysis and extra tests

They analyzed the team chats to see how agents behave:

  • Epistemic deference: non-experts trust and follow the expert.
  • Integrative compromise: agents average everyone’s opinions.
  • Strategic persistence: experts stick to their correct view.
  • Epistemic flexibility: experts compromise with others.

They also varied team size (2, 4, 8 agents) and added an adversarial agent (someone trying to sabotage the team) to test robustness.

What Did They Find?

  • Teams rarely match the best individual. Across tasks, teams fell behind their best agent by about 8% to 38%. For example, on a tough benchmark (HLE), teams were around 38% worse than the ideal “pick-the-right-agent-each-time” upper bound.
  • Knowing the expert doesn’t fix the problem. Even when the team is told who has the most relevant expertise, they still don’t use that expertise well. The bottleneck isn’t finding the expert; it’s leveraging the expert’s knowledge during discussion.
  • Teams average opinions instead of trusting the expert. In chats, non-experts often aim for “middle-ground” solutions (integrative compromise) rather than deferring to the expert. This averaging behavior strongly correlates with worse performance.
  • Bigger teams do worse. As team size increases, expertise gets “diluted”—the expert’s correct answer gets pulled toward an average, and performance drops farther below the best member.
  • Consensus behavior blocks sabotage. The same tendency to average opinions helps filter out bad advice from an adversary. Teams stayed fairly robust when one agent tried to push wrong answers.
  • Compared to humans, LLMs struggle to defer. Human teams typically match the expert when expertise is revealed. LLM teams, even when told who the expert is, keep compromising instead of appropriately deferring.

Why This Is Important

Think of a group project where one classmate clearly understands the topic best. If the group averages everyone’s ideas instead of trusting the expert when it matters, the final result is worse. This is what the AI teams are doing. The paper suggests a trade-off:

  • Consensus and agreeableness = safer against bad actors,
  • But consensus = worse at using real experts when they’re available.

This matters for building multi-agent AI systems. If different AI models are strong in different areas, we want teams that can detect and properly follow the right expert at the right time—without always averaging everything.

Takeaway and Potential Impact

  • Current self-organizing AI teams tend to prioritize agreement over expertise, which lowers performance.
  • Prompting alone (telling the team who the expert is) isn’t enough; we may need new training or design methods that teach models when to defer and when to persist.
  • Until then, practical systems may need clear roles, rules, or human oversight to ensure the team actually uses the best member’s knowledge.
  • Future work should aim to balance two goals: leverage the expert when it matters, while still staying robust to manipulation.

Knowledge Gaps

Below is a single, consolidated list of concrete knowledge gaps, limitations, and open questions that remain unresolved by the paper. Each item is framed to be actionable for future research.

  • External validity beyond the chosen tasks: Do the observed expertise-leveraging failures generalize to multimodal tasks (e.g., vision, code, tool use), real-world collaborative workflows, and longer-horizon projects rather than short intellective problems?
  • Limited benchmark coverage and sampling: Results on ML benchmarks are based on 100 problems per benchmark across 2 seeds; how sensitive are the findings to larger samples, different domains, and more diverse task types (e.g., planning, interactive environments)?
  • Closed-model reproducibility: Many team compositions include proprietary frontier models (e.g., GPT-5, Claude variants); can the findings be replicated with open-source models and fully transparent training regimes?
  • Identification vs. leveraging measurement fidelity: The decomposition into identification and leveraging gaps relies on information conditions; can more direct measurement (e.g., per-round expert selection accuracy, trust weight trajectories) more cleanly separate identification from leveraging failures?
  • Speaking-order effects: The paper randomizes speaking order but does not manipulate it; does expert-first speaking or targeted turn-taking materially change leveraging outcomes or the dilution effect?
  • Number of rounds and conversational length: Teams engage in four rounds; how do leveraging, dilution, and adversarial robustness trends change with more (or fewer) rounds, and with adaptive termination criteria?
  • Communication structure and minimal scaffolding: The study uses unconstrained group deliberation; what is the minimal conversational structure (e.g., turn constraints, explicit voting phases, forced justification protocols) that mitigates expertise dilution without heavy role engineering?
  • Role and hierarchy cues: Equal-status settings are assumed; do explicit status cues, authority signals, or lightweight leadership assignment improve epistemic deference without sacrificing robustness?
  • Trust and confidence calibration: The paper does not measure confidence signaling; can calibrated confidence (self-reported probabilities, justification quality scores) improve appropriate deference to genuine experts?
  • Causal link to alignment procedures: The hypothesized connection between RLHF/agreeableness and consensus-seeking is correlational; can controlled comparisons (base vs. RLHF models, varying RLHF intensity, alternative post-training objectives) establish causal effects on expertise leveraging?
  • Training interventions for deference: What specific fine-tuning or preference-modeling objectives (e.g., “contextual deference” rewards) produce better expert preemption, and what trade-off curves exist between leveraging and adversarial robustness?
  • Alternative aggregation mechanisms: The paper avoids learned aggregators; do trust-weighted voting, Bayesian social learning, or meta-agents trained to allocate authority outperform unconstrained deliberation while preserving the benefits of discussion?
  • Demonstrability of solutions: The discussion invokes demonstrability but does not quantify it; how can one operationalize and measure “demonstrability” (e.g., verifiable proofs, executable checks) and test its impact on deference and synergy?
  • Expertise detection granularity: In ML benchmarks, “Reveal Expert” indicates “most expertise” rather than “has the correct answer”; does stronger, per-problem correctness signaling or richer expertise metadata change outcomes?
  • Upper-bound choice and fairness: Synergy gaps on ML tasks are measured relative to the “At Least One Correct” upper bound; how do conclusions change when using the best individual model baseline or other realistic upper bounds?
  • Distributed expertise dynamics: The paper shows failures with distributed expertise but does not analyze interaction patterns needed to integrate complementary knowledge; which protocols enable accurate synthesis across disjoint expert subdomains?
  • Team-size scaling beyond 8: Expertise dilution is shown up to 8 agents; how do effects scale further, and can structured sub-teaming or hierarchical aggregation invert the dilution trend at larger scales?
  • Adversarial setting breadth: Adversarial robustness tests use a single saboteur with ranking tasks; do results hold under stronger, adaptive adversaries, colluding subgroups, misinformation strategies, or realistic red-teaming in ML benchmarks?
  • Cross-lingual and cultural variability: The study is monolingual; do deference and compromise behaviors (and their effects) differ across languages or culturally inflected conversational norms?
  • Conversation analysis generalization: Behavioral coding is performed mainly on psychology tasks with limited human validation; do the same compromise/deference patterns explain failures on ML benchmarks, and can larger, multi-annotator studies validate these mechanisms?
  • Confidence-weighted persistence in experts: Expert persistence correlates with performance in some tasks; what governs when experts should be flexible vs. resolute, and can models learn an adaptive policy for persistence based on evidence strength?
  • Persona and instruction effects: The paper uses “aggressively-tuned” prompts but does not systematically vary personas (e.g., “domain specialist,” “group facilitator”); which persona/instruction designs materially alter deference and leveraging?
  • Tool use and external verification: Teams deliberate purely via text; does access to tools (calculators, retrieval, code execution, checkers) improve demonstrability and hence appropriate deference?
  • Real-world deployment constraints: The harness abstracts away resource limits, long-term memory, and task decomposition; how do these practical constraints interact with expertise leveraging in deployed multi-agent systems?
  • Minimal interventions to achieve strong synergy: What is the smallest change (prompting, turn-taking, voting rule, trust signal) that closes the 8–38% synergy gap without heavy engineering or sacrificing robustness?
  • Formal models of social learning: The paper offers a qualitative account; can formal models (e.g., Bayesian authority, opinion dynamics, social choice) predict when teams should preempt vs. integrate, and guide protocol design?

Practical Applications

Immediate Applications

Below are actionable, sector-linked uses you can deploy now, based on the paper’s findings that self-organizing LLM teams underperform their best member, primarily due to poor expertise leveraging and consensus-seeking behavior.

  • Industry (Software/AI): Expert-weighted orchestration for multi-model systems
    • Use case: Route tasks to the historically best-performing model per domain and apply weighted voting rather than unconstrained deliberation.
    • Tools/workflows: “Expertise Router” that assigns weights via per-domain accuracy logs; “Expert Override” that lets the top-rated model preempt others.
    • Assumptions/dependencies: Access to heterogeneous models; reliable historical evaluation data; calibration of confidence and accuracy.
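A minimal sketch of such an expert-weighted router, assuming a registry of per-domain historical accuracies (the weighting scheme, names, and default prior are illustrative, not a published implementation):

```python
from collections import defaultdict

def route_weighted_vote(answers, reliability, domain):
    """Weight each model's proposed answer by its historical accuracy
    in the task's domain, instead of counting all votes equally.

    answers:     model -> proposed answer
    reliability: model -> {domain: historical accuracy in [0, 1]}
    Models with no record fall back to an uninformative 0.5 prior.
    """
    tally = defaultdict(float)
    for model, answer in answers.items():
        tally[answer] += reliability.get(model, {}).get(domain, 0.5)
    return max(tally, key=tally.get)

# The domain expert (m1) outweighs two weaker dissenters: 0.9 vs. 0.8,
# so the router avoids the unweighted-majority failure mode.
answers = {"m1": "A", "m2": "B", "m3": "B"}
reliability = {"m1": {"math": 0.9}, "m2": {"math": 0.4}, "m3": {"math": 0.4}}
```

The design choice here mirrors the paper's diagnosis: the aggregation rule, not the prompt, is what enforces deference, so a weak expert signal cannot be averaged away by agreeable non-experts.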
  • Industry (Software/AI, Robotics): Team-size caps and explicit role hierarchies in agent frameworks
    • Use case: Limit agent teams to small sizes and enforce hierarchies (lead expert, critic, verifier) instead of equal-status, free-form discussion.
    • Tools/workflows: Orchestration policies in AgentOps, AutoGen, or similar frameworks; “Expert Veto” rules; structured turn-taking.
    • Assumptions/dependencies: Clear task definitions; availability of domain experts (human or model); ability to modify agent topologies and prompts.
  • Industry (Trust & Safety, Finance, Moderation): Robust consensus for adversarial environments
    • Use case: Leverage consensus-seeking in multi-agent ensembles for adversarial robustness in content moderation, fraud detection, or abuse triage.
    • Tools/workflows: Majority-vote filters combined with anomaly detection; adversary dilution via ensemble aggregation.
    • Assumptions/dependencies: Acceptance of trade-offs (robustness vs. expertise utilization); labeled adversarial datasets.
  • Cross-sector (Software, Procurement, QA): Synergy evaluation harness for benchmarking multi-agent deployments
    • Use case: Adopt the paper’s open-source harness to quantify the “synergy gap” and decide whether multi-agent setups outperform single experts in your workflows.
    • Tools/workflows: “Synergy Score Dashboard” for internal QA; pre-deployment benchmarking against best-member baselines.
    • Assumptions/dependencies: Availability of representative tasks with ground truth; reproducible benchmarking; cost budget for multi-model inference.
  • Industry (Software/AI, Operations): Confidence-calibrated expert selection
    • Use case: Aggregate model-specific confidence with per-domain reliability to select one model’s answer rather than averaging multiple.
    • Tools/workflows: Calibrated confidence scoring; reliability registries; abstain/escalation policies when confidence is low.
    • Assumptions/dependencies: Robust confidence calibration; domain-specific validation data.
  • Industry (Software/AI, Human-in-the-loop): Fallback gating and escalation workflows
    • Use case: When agents disagree or drift toward compromise, gate outputs through the designated expert or escalate to a human reviewer.
    • Tools/workflows: Decision gates triggered by disagreement metrics; “Expert Preemption Mode”; structured escalation trees.
    • Assumptions/dependencies: Human oversight capacity; clear escalation criteria; logs to detect compromise behaviors.
  • Academia (HCI, Organizational Psychology, ML): Course modules and replications
    • Use case: Teaching labs that replicate NASA/Lost-at-Sea/Student Body President tasks to study LLM team dynamics and expertise leveraging.
    • Tools/workflows: Use the harness for controlled experiments; annotate discussions for epistemic deference vs. compromise.
    • Assumptions/dependencies: Annotators or coding schemes; IRB processes when mixing human and LLM studies.
  • Policy (AI Governance, Standards): Deployment and audit guidelines
    • Use case: Require “synergy gap” reporting for multi-agent systems in regulated domains; mandate expert identification and explicit weighting strategies for intellective tasks.
    • Tools/workflows: Procurement checklists; audit protocols; conformance tests (e.g., strong vs. weak synergy thresholds).
    • Assumptions/dependencies: Sector-specific regulation readiness; availability of standardized benchmarks; stakeholder alignment on thresholds.
  • Daily Life (Personal AI use): Practical ensemble hygiene
    • Use case: For multi-bot or multi-model assistants, pick the best-known domain expert or keep the team small; avoid “compromise for correctness.”
    • Tools/workflows: Per-domain model selection; toggles for “defer-to-expert” vs. “seek-consensus” modes depending on task type.
    • Assumptions/dependencies: User awareness of domain boundaries; access to multiple assistants/models; basic model performance tracking.
  • Industry (Prompting/Orchestration): Authority prompts and deference cues
    • Use case: Employ carefully crafted prompts that signal expert authority and encourage preemption over averaging—paired with structural controls that enforce it.
    • Tools/workflows: Role-specific system prompts; weighted speak-order; constrained turn budgets for non-experts.
    • Assumptions/dependencies: Prompt efficacy varies; structural enforcement is often more impactful than phrasing alone.

Long-Term Applications

These opportunities require further research, scaling, or development to address the expertise-leveraging bottleneck while balancing adversarial robustness.

  • Cross-sector (ML Alignment): Deference-aware alignment objectives
    • Use case: Modify RLHF/RLAIF to teach models when to preempt non-expert views (epistemic deference) versus integrate evidence.
    • Tools/workflows: “Authority-aware” reward models; datasets labeled for deference vs. compromise behaviors.
    • Assumptions/dependencies: New training corpora; measurement protocols for miscalibrated deference; safety considerations.
  • Industry/Academia (ML Systems): Learned expertise routers and reliability estimators
    • Use case: Train meta-models to estimate per-problem model reliability and select/weight experts to approximate the “At Least One Correct” upper bound.
    • Tools/workflows: Per-problem reliability predictors; ELO-like ratings; continuous online updates.
    • Assumptions/dependencies: Large-scale logging; robust generalization across tasks; cost constraints for multi-model evaluation.
  • Industry (Software/AI): Debate protocols with proof-based preemption
    • Use case: When tasks admit formal verification (math, programming), adopt “proof/trace preemption,” where verifiable arguments override consensus.
    • Tools/workflows: Integration with theorem provers, test suites, linters; structured “proof-first” rounds.
    • Assumptions/dependencies: Availability of verifiers; demonstrability of correctness; domain suitability.
  • Robotics/Energy/Autonomy: Authority-structured multi-agent control
    • Use case: In safety-critical planning (multi-robot, grid operations), enforce leader–critic hierarchies and expert vetoes rather than free-form discussion.
    • Tools/workflows: Role-locked policies; formal policy verification; runtime monitoring for expertise dilution.
    • Assumptions/dependencies: Certified controllers; clear expertise definitions; compliance with safety standards.
  • Healthcare (Clinical Decision Support): “AI tumor board” with expert gating
    • Use case: Multi-model review of cases where oncology/radiology models can propose, but expert models or clinicians preempt ensemble compromise.
    • Tools/workflows: Deference to validated models; human-in-the-loop escalation; synergy-based evaluation in prospective studies.
    • Assumptions/dependencies: Regulatory approval; rigorous validation; traceability and auditing.
  • Finance (Quant, Risk): Expertise-aware trading and risk pipelines
    • Use case: Route tasks to specialized models (macro, microstructure, credit) and enforce expert preemption under high-confidence signals.
    • Tools/workflows: Reliability registries; kill-switches when disagreement persists; post-trade audits using synergy metrics.
    • Assumptions/dependencies: Robust backtesting; model governance; risk limits.
  • Education (EdTech): Teacher-aware AI facilitation
    • Use case: Classroom AIs that identify the “expert” source (teacher, vetted material) and avoid averaging with weaker sources on intellective tasks.
    • Tools/workflows: Authority tagging; defer-to-curriculum rules; assessment-integrated reliability checks.
    • Assumptions/dependencies: Curriculum metadata; school policies; evaluation rubrics.
  • Safety/Trust & Safety: Dual-objective aggregation (expertise utilization + manipulation resistance)
    • Use case: Design aggregators that explicitly optimize the trade-off the paper identifies, tuning for context (e.g., high-stakes correctness vs. high-risk adversarial exposure).
    • Tools/workflows: Multi-objective optimization; context switches; threat modeling.
    • Assumptions/dependencies: Clear task criticality labels; adversary models; continuous monitoring.
  • Policy/Standards: Sector-wide benchmarks and reporting
    • Use case: Create standards that distinguish intellective vs. preference tasks and require evidence of strong synergy or expert deference where correctness is demonstrable.
    • Tools/workflows: Conformance tests; transparency reports on synergy gaps; certification pathways.
    • Assumptions/dependencies: Multi-stakeholder coordination; standardized datasets; enforcement mechanisms.
  • ML Research: Conversation-level annotations and datasets
    • Use case: Build large corpora labeled for epistemic deference, integrative compromise, and strategic persistence to enable supervised training and evaluation.
    • Tools/workflows: Annotation frameworks; cross-disciplinary coding schemes (social epistemology + negotiation theory).
    • Assumptions/dependencies: High-quality annotation; reproducible coding standards; privacy-safe logs.
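Several of the workflows above (expert gating, expert preemption under high-confidence signals, dual-objective aggregation) reduce to the same rule: defer to a tagged expert when its confidence clears a threshold, otherwise fall back to consensus. A minimal illustrative sketch, assuming a toy dict-based agent interface; all names and the threshold value are hypothetical, not from the paper:

```python
from collections import Counter

def aggregate(answers, confidences, expert_id=None, threshold=0.9):
    """Toy expertise-aware aggregator (hypothetical interface).

    answers:     dict mapping agent id -> proposed answer
    confidences: dict mapping agent id -> self-reported confidence in [0, 1]
    If a tagged expert is sufficiently confident, adopt its answer directly
    (epistemic deference / preemption); otherwise take a majority vote
    (the integrative-compromise fallback)."""
    if expert_id is not None and confidences.get(expert_id, 0.0) >= threshold:
        return answers[expert_id]  # preemption: expert overrides the group
    # Consensus fallback: most common answer across all agents.
    return Counter(answers.values()).most_common(1)[0][0]

# Usage: the expert's answer wins despite being outvoted 2-to-1.
ans = {"a": "X", "b": "X", "c": "Y"}
conf = {"a": 0.5, "b": 0.4, "c": 0.95}
print(aggregate(ans, conf, expert_id="c"))  # → "Y" (expert preempts)
print(aggregate(ans, conf))                 # → "X" (majority vote)
```

The `threshold` knob is where the paper's trade-off would live: lowering it weights expertise more aggressively, while raising it preserves the consensus behavior that provides adversarial robustness.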

Glossary

  • Accuracy-based tasks: Evaluation settings where higher scores indicate better performance. "For accuracy-based tasks (ML benchmarks) where higher is better, the relative synergy gap is:"
  • Adversarial robustness: The capacity of a system to maintain performance in the presence of malicious contributors. "this consensus-seeking behavior provides adversarial robustness."
  • Aggregation rules: Predefined methods for combining the outputs or opinions of multiple agents. "most prior work enforces coordination through fixed roles, workflows, or aggregation rules"
  • Alignment procedures: Post-training methods that shape model behavior to be helpful and agreeable. "Current alignment procedures optimize models to be helpful and agreeable through RLHF"
  • At Least One Correct (upper bound): The theoretical maximum performance if the team perfectly identifies and uses the agent with the correct answer on each problem. "For ML benchmarks, we define the At Least One Correct upper bound as the performance achievable if the team perfectly identified and leveraged the agent with the correct answer on each problem."
  • Communication topologies: Structured patterns that determine which agents communicate with which others. "most systems assume fixed communication topologies—which agents can communicate with which others."
  • Consensus-seeking: A tendency to prefer agreement or middle-ground solutions over deference to expertise. "we find that this consensus-seeking behavior provides adversarial robustness."
  • Controlled ablations: Experimental tests that isolate factors by systematically removing or varying components. "controlled ablations reveal the primary failure is leveraging, not identification."
  • Debate-then-vote protocols: Collaboration formats where agents discuss (debate) before aggregating decisions via voting. "find that majority voting drives nearly all gains in debate-then-vote protocols"
  • Deliberation: Back-and-forth reasoning among agents to coordinate and decide. "we study settings in which agents must self-organize through deliberation."
  • Distributed expertise: Knowledge relevant to a task is partitioned across multiple agents with complementary strengths. "Distributed expertise, where task-relevant knowledge is partitioned mutually exclusively across multiple team members."
  • Epistemic deference: Yielding to expert authority by adopting the expert’s view directly. "Epistemic deference (preemption): recognizing an expert and adopting their view directly"
  • Epistemic flexibility: An expert’s willingness to accommodate group feedback. "Epistemic Flexibility (EF)."
  • Evidence integration: Treating expert opinion as additional evidence to be weighed rather than as authoritative. "Evidence integration: treating expert opinion as additional evidence to be weighed and combined with other views."
  • Expert leveraging: Effectively using the expert’s knowledge to guide team decisions. "expert leveraging, rather than identification, is the primary bottleneck."
  • Expert Not Mentioned condition: An experimental setting where expertise exists but the team is not told who the expert is. "making the Expert Not Mentioned condition most representative of real-world scenarios."
  • Expertise dilution effect: The phenomenon where larger teams increasingly average away expert input, reducing performance. "We document an expertise dilution effect where performance degrades with team size"
  • Frontier ML benchmarks: Challenging, state-of-the-art evaluation tasks used to assess cutting-edge models. "Across human-inspired and frontier ML benchmarks, we find that—unlike human teams—LLM teams consistently fail to match their expert agent's performance"
  • Identification Gap: The performance difference capturing how well teams autonomously recognize the expert. "The Identification Gap measures the difference between team performance when expertise is not mentioned versus when revealed"
  • Intellective tasks: Tasks with demonstrably correct answers that allow verification of solutions. "We focus on intellective tasks—those with demonstrably correct answers—"
  • Integrative compromise: Averaging expert and non-expert views instead of properly weighting expertise. "integrative compromise—averaging expert and non-expert views rather than appropriately weighting expertise—"
  • L1 distance: A metric computed as the sum of absolute differences between positions in two rankings. "Performance is measured by L1 distance from expert ranking"
  • Learned ensembling: Using a trained aggregator to combine model outputs rather than spontaneous deliberation. "closer to learned ensembling than deliberative collaboration."
  • Majority voting: Decision aggregation by selecting the option with the most votes. "find that majority voting drives nearly all gains in debate-then-vote protocols"
  • Mixture-of-Agents: A method that uses one model to dynamically aggregate other models’ outputs. "Mixture-of-Agents uses LLMs as dynamic aggregation functions over other model outputs"
  • Model heterogeneity: Teams composed of different models with distinct training and strengths. "We study teams of different frontier models, each with distinct pretraining data and comparative advantages"
  • No-info setting: An adversarial experiment setup where only the adversary gets special information. "We use the no-info setting (only the adversary receives special information) to isolate the adversarial effect"
  • Organizational psychology: The study of behavior in organizational contexts, including team dynamics. "Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy"
  • Pearson correlations: A statistical measure of linear association between variables. "We compute Pearson correlations between behavior frequencies and the synergy gap."
  • Preemption thesis: The idea that an expert’s judgment should replace a layperson’s own reasoning. "Drawing on the preemption thesis from philosophy of authority"
  • RLHF (Reinforcement Learning from Human Feedback): A post-training approach that optimizes models using human feedback signals. "Current alignment procedures optimize models to be helpful and agreeable through RLHF"
  • Relative Synergy Gap: A normalized measure of team underperformance relative to the best individual. "Relative Synergy Gap measures the difference between At Least One Correct and Team (Expert Not Mentioned) as a fraction of At Least One Correct performance."
  • Reveal Expert condition: An experimental setting where the identity of the expert is explicitly disclosed. "NASA Moon Survival (Reveal Expert condition) shows ranking error increasing with team size"
  • Role assignment: Pre-specifying functional roles for agents in multi-agent systems. "Many systems pre-specify static roles such as proposer, critic, or refiner."
  • Self-organizing LLM teams: Teams that coordinate and decide without predefined roles or workflows. "we study whether self-organizing LLM teams achieve strong synergy"
  • SEM (Standard Error of the Mean): A statistic quantifying the uncertainty of an estimated mean. "Error bars are ± SEM."
  • Strong synergy: Team performance matches or exceeds the best individual member’s performance. "strong synergy asks whether a team can match or exceed the performance of its strongest individual member"
  • Synergy gap: The difference between team performance and the best individual, indicating underperformance. "We measure the synergy gap against the At Least One Correct upper bound"
  • Task decomposition: Splitting a problem into parts assigned across agents. "prior work asks: how do we optimally decompose tasks across agents and aggregate their outputs?"
  • Weak synergy: Team performance exceeds the average of individual members but not the best member. "human teams usually achieve only weak synergy (exceeding the average of members' individual performances)"
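Two of the glossary metrics have direct computational forms: the Relative Synergy Gap for accuracy-based tasks, and the L1 distance between rankings used on the NASA Moon Survival task. A sketch of both, assuming accuracies in [0, 1] and rankings given as ordered lists; variable names are my own, not the paper's:

```python
def relative_synergy_gap(aloc_acc, team_acc):
    """Relative Synergy Gap for accuracy-based tasks (higher is better):
    the shortfall of team accuracy from the At Least One Correct upper
    bound, as a fraction of that upper bound."""
    return (aloc_acc - team_acc) / aloc_acc

def l1_ranking_distance(team_ranking, expert_ranking):
    """L1 distance between two rankings of the same items: the sum of
    absolute differences between each item's position in the team
    ranking and its position in the expert reference ranking."""
    expert_pos = {item: i for i, item in enumerate(expert_ranking)}
    return sum(abs(i - expert_pos[item])
               for i, item in enumerate(team_ranking))

# Usage: a team at 50% accuracy against an 80% upper bound, and a
# ranking that swaps the expert's top two items.
print(relative_synergy_gap(0.80, 0.50))  # → 0.375, i.e. a 37.5% gap
print(l1_ranking_distance(["oxygen", "water", "map"],
                          ["water", "oxygen", "map"]))  # → 2
```

Note that for the ranking task lower L1 distance is better, so the gap computation there runs in the opposite direction from the accuracy-based formula above.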
