
DR-Arena: Automated Evaluation for DR Agents

Updated 19 January 2026
  • DR-Arena is a dynamic evaluation framework for Deep Research Agents that leverages live web data to continuously generate and refine complex tasks.
  • It constructs Real-Time Information Trees to enhance multi-hop reasoning and wide aggregation by expanding both depth and breadth based on live content.
  • The framework uses an automated Examiner and an Adaptive Evolvement Loop to iteratively adjust task complexity, closely aligning its results with human judgments.

DR-Arena is a fully automated evaluation framework for Deep Research (DR) Agents—LLMs specialized for autonomous investigation and synthesis of information. Addressing the limitations of static benchmarks, such as temporal staleness and data contamination, DR-Arena continuously constructs dynamic tasks from live web trends and employs an LLM-based Examiner and a state-machine controller—the Adaptive Evolvement Loop—to test agents on two orthogonal dimensions: Deep reasoning and Wide coverage. The entire system operates as a closed-loop evaluator, iteratively generating, adjudicating, and refining evaluation challenges based on agents’ real-time performance, aiming to robustly demarcate capability boundaries on live information tasks (Gao et al., 15 Jan 2026).

1. System Architecture and Workflow

DR-Arena comprises three principal, tightly coupled modules: Real-Time Information Tree Constructor, Automated Examiner, and Adaptive Evolvement Loop. Evaluation proceeds through three main stages:

  • Dynamic Information Tree Construction: The system initiates a web scrape from a live root page, systematically expanding an Information Tree—a directed acyclic graph G = (V, E)—by crawling links (for depth) and sibling pages (for width). Nodes V = {v_1, …, v_n} store page content and metadata; edges E = {e_ij} capture hyperlinked relations, supplemented with semantic labels R_ij inferred from anchor context.
  • Automated Task Generation and Judgement: For each agent match, the Examiner generates a “Deep & Wide” query and an evidence-based scoring rubric. Both agents respond, and submissions are adjudicated against hard (logical/factual) and soft (formatting/citation/helpfulness) rubrics.
  • Adaptive Evolvement Loop: A state-machine controller dynamically escalates the task’s complexity, adjusting the Information Tree depth D and width W parameters based on agent performance until a decisive score gap or round cap is reached.

The interaction cycle is: build/expand the tree → generate query/rubric → collect/judge responses → adjust tree/task per outcome → repeat, as formalized in Algorithm 1 of (Gao et al., 15 Jan 2026).
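The interaction cycle above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's released code: the class and function names, the stub `Verdict` type, and the simplified escalation rule are all assumptions.

```python
# Minimal sketch of DR-Arena's closed-loop cycle (cf. Algorithm 1).
# All names and the escalation logic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str             # e.g. "A_MUCH_BETTER", "A_BETTER", "Tie_High", ...
    failure: str = "None"  # loser's failure type: Deep / Wide / Both / None

    @property
    def decisive(self) -> bool:
        return self.label.endswith("MUCH_BETTER")

def run_match(examiner, agents, max_rounds=10, depth=2, width=2):
    """build/expand tree -> generate query/rubric -> judge -> evolve -> repeat."""
    verdict = Verdict("Tie_Low")
    for _ in range(max_rounds):
        query, rubric = examiner.generate_task(depth, width)
        answers = [agent(query) for agent in agents]
        verdict = examiner.judge(query, rubric, answers)
        if verdict.decisive:               # decisive score gap: stop early
            break
        # escalate complexity along the dimension the loser failed on
        if verdict.failure in ("Deep", "Both"):
            depth += 1
        if verdict.failure in ("Wide", "Both"):
            width += 1
    return verdict, depth, width
```

A real examiner would wrap LLM calls behind `generate_task` and `judge`; any object with those two methods can drive the loop.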

2. Real-Time Information Tree Construction

Information Trees provide temporally aligned, contamination-resistant evaluation tasks. The construction logic is:

  • Depth Expansion: Increase multi-hop reasoning requirements by crawling child links.
  • Width Expansion: Enhance aggregation requirements by fetching sibling nodes.
  • Semantic Relations: Edges are enriched with semantic relation labels derived from anchor text/context, increasing the evaluation's reasoning specificity.
  • Data Structure:
    • InformationTree.nodes contains {id, URL, content, metadata}
    • InformationTree.edges contains {src_id, dst_id, relation_label}
    • depth_limit d, width_limit w
  • Expansion Triggers: The tree expands whenever the ancestor count |Ancestors(P)| < D or the sibling count |Siblings(P)| < W for the selected path P. No explicit node scoring criteria are defined in the paper.

This dynamic construction ensures queries are crafted from fresh, live contexts rather than static, pre-dated datasets.
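The node/edge layout and expansion triggers above can be captured in a small data-structure sketch. The field names mirror the paper's description, but the class itself and its helper methods are illustrative assumptions.

```python
# Hypothetical sketch of the Information Tree data structure (Section 2);
# field names follow the paper's description, the rest is an assumption.
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    url: str
    content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Edge:
    src_id: int
    dst_id: int
    relation_label: str  # semantic relation inferred from anchor context

@dataclass
class InformationTree:
    nodes: dict = field(default_factory=dict)   # id -> Node
    edges: list = field(default_factory=list)
    depth_limit: int = 3                        # D
    width_limit: int = 3                        # W

    def add_page(self, node, parent_id=None, relation=""):
        self.nodes[node.id] = node
        if parent_id is not None:
            self.edges.append(Edge(parent_id, node.id, relation))

    def ancestors(self, node_id):
        """Walk parent edges back to the root."""
        parents = {e.dst_id: e.src_id for e in self.edges}
        chain = []
        while node_id in parents:
            node_id = parents[node_id]
            chain.append(node_id)
        return chain

    def siblings(self, node_id):
        parents = {e.dst_id: e.src_id for e in self.edges}
        p = parents.get(node_id)
        if p is None:
            return []
        return [e.dst_id for e in self.edges
                if e.src_id == p and e.dst_id != node_id]

    def needs_expansion(self, node_id):
        """Expansion triggers: |Ancestors(P)| < D or |Siblings(P)| < W."""
        return (len(self.ancestors(node_id)) < self.depth_limit,
                len(self.siblings(node_id)) < self.width_limit)
```

In the full system, `needs_expansion` would gate the crawler's ExpandDepth/ExpandWidth calls.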

3. Automated Examiner: Task Generation and Adjudication

The Examiner, implemented as a single LLM (Gemini-3-Pro), alternates between two roles:

  • Task Generator ("Interviewer"):
    • Inputs: Selected path P in the tree; deep context C_deep = Ancestors(P); wide context C_wide = Siblings(P), up to width W.
    • Depth Principle: Entity names are masked; the query demands a reasoning chain of length ≥ D.
    • Width Principle: Forces aggregation across ≥ W sibling nodes.
    • Outputs: Natural-language query Q (entities hidden, facts scattered), gold answer A*, and rubric R = {Checklist_Depth, Checklist_Width}.
  • Judge ("Adjudicator"):
    • Inputs: Query Q, rubric R, answers A_a and A_b.
    • Verification: Hard constraints (logical/factual checkpoints) penalize critical violations. Soft constraints (format, citation density, helpfulness) resolve ties.
    • Scoring: Verdicts are drawn from discrete categories (A_Much_Better, A_Better, Tie_High, Tie_Low, …), with the loser’s failure type reported ({Deep, Wide, Both, None}).

This structure emphasizes both multi-hop logical deduction and wide aggregation, challenging agents across two orthogonal axes.
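The two-tier rubric check can be sketched as follows. The predicate-list interface, the scoring weights, and the verdict thresholds are assumptions made for illustration; the paper specifies only that hard checkpoints gate the verdict and soft criteria break ties.

```python
# Hedged sketch of hard-vs-soft rubric adjudication (Section 3).
# The checklist format and thresholds are illustrative assumptions.
def adjudicate(answer_a, answer_b, hard_checks, soft_checks):
    """hard_checks/soft_checks: lists of predicates str -> bool."""
    def score(ans):
        hard = sum(bool(c(ans)) for c in hard_checks)
        soft = sum(bool(c(ans)) for c in soft_checks)
        return hard, soft

    (ha, sa), (hb, sb) = score(answer_a), score(answer_b)
    if ha != hb:                       # hard-rubric gap is decisive
        winner = "A" if ha > hb else "B"
        return f"{winner}_MUCH_BETTER" if abs(ha - hb) > 1 else f"{winner}_BETTER"
    if sa != sb:                       # soft criteria only resolve ties
        return "A_BETTER" if sa > sb else "B_BETTER"
    return "Tie_High" if ha == len(hard_checks) else "Tie_Low"
```

Here checkpoints are plain predicates; in DR-Arena the Examiner LLM itself evaluates each checklist item against the answer text.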

4. Adaptive Evolvement Loop

The Adaptive Evolvement Loop is a state-machine that escalates or de-escalates complexity using two parameters, D (depth) and W (width):

  • Transition Rules:

| Verdict / Failure Type    | Depth Update (D_{t+1}) | Width Update (W_{t+1}) | Node Selection (P) |
|---------------------------|------------------------|------------------------|--------------------|
| Tie_Low_Quality           | D_t                    | max(2, W_t − 1)        | P ← parent(P)      |
| Tie_High_Quality / Both   | D_t + 1                | W_t + 1                | —                  |
| Loser failure = Deep      | D_t + 1                | W_t                    | —                  |
| Loser failure = Wide      | D_t                    | W_t + 1                | —                  |

  • Expansion: If the selected node is a leaf or lacks siblings, the crawler triggers ExpandDepth or ExpandWidth, respectively.
  • Termination: The loop ends upon a “Much Better” verdict, reaching a score threshold, or a round maximum.

This mechanism actively concentrates compute on the most informative capacity boundaries for each agent match.
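The transition table above maps directly onto a small pure function. The function signature and return convention are assumptions; the update rules themselves follow the table.

```python
# Illustrative encoding of the Adaptive Evolvement Loop's transition rules
# (Section 4). Signature and return convention are assumptions.
def evolve(depth, width, verdict, failure="None"):
    """Return (next_depth, next_width, reselect_parent) for one round."""
    if verdict == "Tie_Low_Quality":
        return depth, max(2, width - 1), True   # back off, move to parent(P)
    if verdict == "Tie_High_Quality" or failure == "Both":
        return depth + 1, width + 1, False      # escalate both dimensions
    if failure == "Deep":
        return depth + 1, width, False          # harder multi-hop reasoning
    if failure == "Wide":
        return depth, width + 1, False          # broader aggregation
    return depth, width, False                  # decisive verdict: no change
```

Note the floor of 2 on width after a low-quality tie, matching the max(2, W_t − 1) rule in the table.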

5. Evaluation Metrics and Empirical Results

DR-Arena’s evaluation metrics target alignment with human judgements from the LMSYS Search Arena:

  • Spearman’s Rank Correlation:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i is the rank difference for agent i and n = 6.

  • Key Findings:
    • Spearman: 0.94
    • Pearson: 0.74
    • Agent Ranking: GPT-5.1-Search, Gemini-2.5-Pro, o3-Search, Grok-4, Claude-Opus-4.1, Perplexity-Sonar (nearly identical to the human ranking; only #5 and #6 are swapped)
    • Ablation: Removing evolvement loop reduces Spearman to 0.77; dropping rubric-based judgement drops to 0.83.
    • Human Audit: Cohen’s κ = 0.91, 96.9% correct evolvement, 92.2% efficient stops (on 30 matches).
    • Statistical Significance: Correlation between the system’s round count and the agents’ Elo gap: r = −0.61, p = 0.045.

These results substantiate DR-Arena as a reliable surrogate for costly human adjudication, tightly correlating with manual preference rankings.
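The Spearman formula above, combined with the reported #5/#6 swap, can be checked on a toy ranking. The rank vectors here are illustrative; only the swap pattern comes from the reported results.

```python
# Worked check of the Spearman rank-correlation formula (Section 5)
# on an illustrative ranking of n = 6 agents.
def spearman(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

arena_ranks = [1, 2, 3, 4, 5, 6]  # hypothetical DR-Arena ordering
human_ranks = [1, 2, 3, 4, 6, 5]  # human ordering with only #5/#6 swapped
r_s = spearman(arena_ranks, human_ranks)  # 33/35 ≈ 0.943
```

A single adjacent swap at the bottom of a 6-item ranking yields r_s = 33/35 ≈ 0.94, consistent with the reported Spearman of 0.94.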

6. Comparison with Related Frameworks

DR-Arena complements and contrasts with frameworks such as Decentralized Arena ("De-Arena"), which utilizes democratic collective intelligence across many LLMs with pairwise evaluation and distributed judging (Yin et al., 19 May 2025). While De-Arena applies “coarse-to-fine” ranking of up to 66 models and automatic question selection—attaining Spearman correlation of up to 0.974 with human labels—DR-Arena focuses on deep, live, evidence-based tasks and adaptive complexity escalation for stress-testing individual model boundaries. Unlike De-Arena’s model-as-judge approach, DR-Arena centralizes automated adjudication in a single LLM Examiner and dynamically exploits web freshness for evaluation diversity.

7. Discussion, Limitations, and Prospects

DR-Arena’s construction of evaluation tasks from live web data obviates temporal misalignment and dataset memorization issues. The closed-loop system adaptively escalates task complexity, efficiently probing bounds of deep reasoning and wide aggregation capabilities. Notable limitations include:

  • Parametric Bias: The Examiner may hallucinate rubric items or override evidence with its own parametric knowledge.
  • External Dependencies: Reliance on live web pages and APIs introduces variance from site restrictions or regional geoblocking.
  • Creativity Penalty: Strict rubric enforcement may penalize creative or lateral responses not conforming to the tree.
  • Future Work: Possible directions include hybrid human–LLM adjudication, diversified rubric sources, and more robust handling of rubric hallucination.

In summary, DR-Arena establishes a state-of-the-art, fully automated, and adaptive framework for evaluating DR Agents on dynamic, real-world information synthesis problems, achieving robust alignment with manual human judgments and highlighting agent boundaries via targeted stress-testing (Gao et al., 15 Jan 2026).
