Wiki Live Challenge Benchmark

Updated 9 February 2026
  • WLC is a live benchmark that evaluates Deep Research Agents by comparing their outputs to dynamically curated, community-vetted Wikipedia Good Articles.
  • Its Wiki Eval framework pairs 39 atomized writing criteria with fact-coverage and citation-verifiability checks, revealing gaps in depth, neutrality, and citation support.
  • The challenge offers actionable diagnostics for improving research synthesis, fostering continuous advancements in autonomous agent performance.

The Wiki Live Challenge (WLC) is a live benchmark that rigorously evaluates Deep Research Agents (DRAs) by having them generate Wikipedia-style articles, which are then compared against the latest Wikipedia Good Articles (GAs). By leveraging community-vetted, expert-level references and a fine-grained, criteria-driven evaluation architecture, WLC advances both the measurement and the capabilities of autonomous research and synthesis systems. Its methodology exposes persistent gaps in style, depth, neutrality, and factual grounding that evade traditional QA and summarization benchmarks, offering actionable diagnostics for agent and LLM research (Wang et al., 2 Feb 2026).

1. Dataset Creation and Contamination Resistance

WLC’s foundation is the dynamic curation of high-complexity, human-reviewed Wikipedia articles. Between March 1 and December 1, 2025, the benchmark harvested every newly created Wikipedia page, filtering for “Good Article” (GA) status, which is conferred only after stringent community review for clarity, neutrality, breadth, and verifiability. From 304 such candidates, the methodology ranked articles using a composite of two complexity proxies: (1) the number of unique reference URLs and (2) structural depth, measured by the count of subheadings. List-only and stub-style entries were excluded to focus on full-length, discursive examples. The top 100 articles, spanning 15 major domains including History, Mathematics, Natural Sciences, and Philosophy, became the core dataset. This rolling, live-update protocol keeps the benchmark current and guards against model contamination, since DRAs cannot simply memorize these recent, continuously changing sources.
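
To make the selection step concrete, the following Python sketch shows one way the composite ranking could be implemented. The `CandidatePage` fields, the normalization, and the equal 0.5/0.5 weighting are assumptions: the source describes a composite of the two proxies without giving its exact form.

```python
# Hypothetical sketch of WLC's candidate ranking; field names and the
# composite's form are assumptions, not the paper's specification.
from dataclasses import dataclass

@dataclass
class CandidatePage:
    title: str
    ref_urls: set[str]     # unique reference URLs cited by the page
    subheadings: int       # structural depth, counted as subheadings
    is_list_or_stub: bool  # list-only / stub-style entries are excluded

def rank_candidates(pages: list[CandidatePage], top_k: int = 100) -> list[CandidatePage]:
    """Rank GA candidates by a composite of the two complexity proxies."""
    eligible = [p for p in pages if not p.is_list_or_stub]
    # Normalize each proxy by the pool maximum, then average with equal weight.
    max_refs = max(len(p.ref_urls) for p in eligible) or 1
    max_subs = max(p.subheadings for p in eligible) or 1
    def complexity(p: CandidatePage) -> float:
        return 0.5 * len(p.ref_urls) / max_refs + 0.5 * p.subheadings / max_subs
    return sorted(eligible, key=complexity, reverse=True)[:top_k]
```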

2. The Wiki Eval Evaluation Framework

WLC proposes Wiki Eval, a two-pronged, criteria-driven evaluation suite rooted in Wikipedia’s Good Article guidelines (omitting the “Stability” and “Illustration” criteria):

A. Wiki Writing:

Assessment of writing quality follows 39 atomized criteria, aligned with four key axes (the seven Lead-Section checks fall within the eighteen Well-Written checks, so the counts total 39):

  • Well-Written (18 checks): Enforces encyclopedic tone, minimal unexplained jargon, and no “lies-to-children” oversimplifications.
  • Lead-Section Structure (7 checks): Evaluates the opening section, including the presence of a lead that summarizes the topic, its context, notability, and major controversies.
  • Words-to-Watch (11 categories): Detects problematic language such as puffery, unsupported attributions, expressions of doubt or editorializing, euphemisms, clichés, relative time references, and unspecified referents.
  • Neutral Point of View (10 checks): Enforces balanced representation, proper attribution of opinions, avoidance of undue weight for fringe positions, nonjudgmental language, and parity in the treatment of majority and minority views and in title conventions.

Evaluation utilizes an “LLM-as-Judge” paradigm: for each Wikipedia reference and DRA output pair, a high-performance LLM (Gemini-2.5-pro) judges whether the agent’s attempt satisfies each criterion, outputting a binary result $\text{Judge}(w_i, g_i) \in \{0,1\}$; these results are summed into an aggregate writing score.
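
A minimal sketch of this judging loop follows. `call_judge_llm` is a hypothetical wrapper around the judge model (Gemini-2.5-pro in the paper), and the criterion strings and prompt wording shown are illustrative, not the official rubric.

```python
# Sketch of the Wiki Writing judging loop; criteria and prompt are illustrative.
WRITING_CRITERIA = [
    "Maintains an encyclopedic tone throughout",
    "Avoids unexplained jargon",
    # ... the remaining atomized criteria (39 in total)
]

def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError("wire up the judge model API here")

def writing_score(wiki_article: str, agent_article: str) -> int:
    """Aggregate score: sum of binary Judge(w_i, g_i) decisions over criteria."""
    total = 0
    for criterion in WRITING_CRITERIA:
        prompt = (
            f"Criterion: {criterion}\n\n"
            f"Reference (Wikipedia GA):\n{wiki_article}\n\n"
            f"Candidate (agent output):\n{agent_article}\n\n"
            "Does the candidate satisfy the criterion? Answer PASS or FAIL."
        )
        verdict = call_judge_llm(prompt)
        total += 1 if verdict.strip().upper().startswith("PASS") else 0
    return total
```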

B. Wiki Fact:

Fact-based evaluation encompasses two orthogonal axes:

  • Coverage vs. Wikipedia:

For each extracted gold fact $f_i$ from the reference set $F = \{f_1, \dots, f_{|F|}\}$, candidate statements $\{s_j\}$ from the agent output are matched, and a fact-checking LLM determines consistency:

$$\text{Fact}(f_i, G) = \begin{cases} 1 & \text{if consistent} \\ 0 & \text{if inconsistent or conflicting} \end{cases}$$

Coverage score: $\text{Cov.Wiki.} = \frac{1}{|F|}\sum_{i=1}^{|F|}\text{Fact}(f_i, G)$

  • Citation Verifiability:

For each agent statement–URL pair, the referenced page’s HTML is fetched, and the fact-checker verifies whether the cited page supports the statement (a code sketch of both Wiki Fact scores follows these definitions):

Reference accuracy: $\text{Ref.Acc.} = \frac{1}{|S|}\sum_{k=1}^{|S|}\text{Fact}(s_k, R)$
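
A minimal sketch of both scores, assuming a hypothetical `check_fact` wrapper around the fact-checking model (Gemini-2.5-flash in the paper) and a plain standard-library HTML fetch for cited pages:

```python
# Sketch of the two Wiki Fact scores; `check_fact` is a hypothetical stand-in
# for the LLM fact-checker, and the fetch ignores paywall/robots handling.
import urllib.request

def check_fact(claim: str, evidence: str) -> bool:
    """Return True iff the evidence is consistent with the claim (LLM-backed)."""
    raise NotImplementedError("wire up the fact-checking model here")

def coverage_vs_wikipedia(gold_facts: list[str], agent_article: str) -> float:
    """Cov.Wiki. = (1/|F|) * sum_i Fact(f_i, G) over extracted gold facts."""
    hits = sum(check_fact(f, agent_article) for f in gold_facts)
    return hits / len(gold_facts)

def reference_accuracy(statement_url_pairs: list[tuple[str, str]]) -> float:
    """Ref.Acc. = (1/|S|) * sum_k Fact(s_k, R): verify each agent statement
    against the HTML of the page it cites."""
    supported = 0
    for statement, url in statement_url_pairs:
        with urllib.request.urlopen(url) as resp:  # fetch the cited page
            page_html = resp.read().decode("utf-8", errors="replace")
        supported += check_fact(statement, page_html)
    return supported / len(statement_url_pairs)
```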

3. Experimental Protocol

Judging for writing is performed by Gemini-2.5-pro, while Gemini-2.5-flash handles fact extraction and checking. Twelve DRAs are evaluated, spanning open-source systems (DeepResearcher, Tongyi DeepResearch, and LangChain pipelines using GPT-4.1 and GPT-5) and proprietary models (OpenAI o3, Gemini-2.5-pro, Gemini-3-pro, Qwen-3-max, Perplexity, Grok, Doubao). Generation took place between December 15 and 19, 2025, under prompts forbidding direct Wikipedia access. Agreement between the LLM judges and humans was measured via Pairwise Agreement Rate (PAR) on 390 annotations. ANOVA tested writing-score robustness across topical categories; the null hypothesis $H_0$ of no category effect was retained (except for DeepResearcher), with $p > 0.05$ and low $\eta^2$.
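
For illustration, the following sketch reproduces this style of category-effect test: a one-way ANOVA over per-article writing scores grouped by domain, with $\eta^2$ as the effect size. The score arrays are placeholders, not the paper’s data.

```python
# Illustrative one-way ANOVA across topical domains, plus eta-squared.
import numpy as np
from scipy.stats import f_oneway

def anova_with_eta_squared(groups: list[np.ndarray]) -> tuple[float, float]:
    """Return (p_value, eta^2) for a one-way ANOVA over score groups."""
    _, p_value = f_oneway(*groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = sum(((g - grand_mean) ** 2).sum() for g in groups)
    return p_value, ss_between / ss_total

history = np.array([21.0, 19.5, 22.0, 20.0])  # placeholder scores
math_ = np.array([20.5, 21.0, 19.0, 22.5])
science = np.array([22.0, 20.0, 21.5, 23.0])
p, eta2 = anova_with_eta_squared([history, math_, science])
# H0 (no category effect) is retained when p > 0.05 and eta^2 is small.
print(f"p = {p:.3f}, eta^2 = {eta2:.3f}")
```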

4. Empirical Results and Analysis

The experimental evaluation reveals substantial performance gaps between DRAs and expert-level Wikipedia articles:

  • Writing Quality Gap:

Top-scoring agents, Gemini-3-pro and LangChain-GPT-5, recorded 58 and 53 criterion “wins” respectively against the 39-criterion rubric (roughly 150% of wins versus the baseline), yet no model approached a perfect 39/39. The lowest open-source scores fell as low as 2.28.

  • Factual and Citation Coverage:

Coverage against the Wikipedia fact set remained limited; the best system (Gemini-3-pro) covered only 30.76% of gold facts (Cov.Wiki.), reflecting deep deficits in autonomous research rather than mere summarization. Citation verifiability (Ref.Acc.) peaked at 67.6% for GPT-5 via LangChain, but many open-source systems failed to format citations and were unscorable.

  • Conflict Rates:

Agents exhibited direct contradiction rates of 7–11% against Wikipedia facts and citation-conflict rates of up to 6.9%, signaling persistent hallucination and shortcomings in fact grounding.

  • Domain Difficulty:

Difficulty correlated moderately with page views ($r = +0.482$) but not with article length or link count. History and Mathematics proved most challenging (win rates below 20%); Philosophy & Religion and Natural Sciences were less so (above 40%).

5. Diagnostic Power and Research Implications

WLC advances agent research beyond static and model-generated evaluation benchmarks. By anchoring the challenge in dynamic, community-vetted, expert-level references and employing fine-grained, atomized evaluation criteria, WLC locates deficiencies in agents’ reasoning, structure, neutrality, and content retrieval with high specificity. Its “live” construction, leveraging freshly minted GAs and an evolving reference corpus, prevents overfitting, fosters continuous agent adaptation, and creates robust grounds for methodological progress. The framework’s actionable diagnostics help direct future system improvement, whether in lead structure, neutral tone, factual synthesis, or citation formatting.

6. Limitations and Future Directions

Current constraints include pool size: the rate at which new GAs are promoted limits the dataset to only several hundred articles. A proposed extension is to include A-Class or Featured Articles to increase scale. Reference opacity is another challenge, as some proprietary DRAs cite inaccessible or paywalled sources, rendering Ref.Acc. observational rather than definitive. While prompts restrict Wikipedia access, absolute leakage prevention needs further technical enforcement. Incorporating multimedia checks (addressing GA criterion 6, Illustration), dynamic evaluation strategies, and human-in-the-loop reranking constitutes a promising avenue for benchmark enrichment. A plausible implication is that as agents reach higher parity with expert-level GAs, further criteria and evolving standards will be necessary to maintain evaluation rigor.

7. Significance for Autonomous Scholarship

WLC provides real-time, community-vetted rigor in autonomous agent evaluation, systematically mapping persistent deficits in scope, style, neutrality, and grounding. By providing granular, actionable feedback via Wiki Eval, the benchmark charts a pathway for closing the performance gap between current DRAs and expert-level scholarship, contributing substantively to the goal of trustworthy, expert-grade, automated research synthesis (Wang et al., 2 Feb 2026).

References

  1. Wang et al., 2 Feb 2026.
