
LongBench Pro: Bilingual Long-Context Benchmark

Updated 13 January 2026
  • LongBench Pro is a bilingual, multitask benchmark that evaluates LLMs’ long-context comprehension and reasoning using naturally occurring, multi-domain documents.
  • It employs a Human–Model Collaborative Construction pipeline to generate and refine 1,500 annotated samples across 36 tasks, ensuring scalability and accuracy.
  • Empirical evaluations reveal that optimizing context length and leveraging chain-of-thought prompts improve performance, while cross-lingual challenges remain a current research focus.

LongBench Pro is a bilingual, multitask benchmark designed to evaluate long-context comprehension and reasoning abilities in LLMs. It addresses the limitations of prior benchmarks by providing realistic, multi-domain, naturally occurring documents in English and Chinese, with input lengths from 8,000 to 256,000 tokens. The benchmark encompasses 1,500 samples across 11 primary and 25 secondary real-world tasks, each annotated with fine-grained metrics and a multi-dimensional taxonomy of context requirement, document length, and calibrated difficulty. LongBench Pro employs a Human–Model Collaborative Construction pipeline to balance annotation quality and scalability, leveraging model-generated drafts refined by expert annotators. The benchmark has been used to evaluate 46 state-of-the-art LLMs, yielding empirical insights into the effectiveness of long-context modeling paradigms, parameter scaling, and cross-lingual alignment (Chen et al., 6 Jan 2026).

1. Rationale and Design Principles

The expansion of LLM context windows—to hundreds of thousands or even millions of tokens—renders existing benchmarks insufficient. Previous benchmarks either substitute synthetic toy tasks (e.g., RULER, MRCR, GSM-∞), which lack real-world complexity, or rely on fully manual annotation (e.g., LongBench v2), which is difficult to scale and does not generalize to extreme lengths or bilingual/multitask scenarios. LongBench Pro addresses these gaps by targeting:

  • Realism: Sourcing 1,500 samples from naturally occurring English and Chinese documents (news, science, law, code, tables, etc.).
  • Comprehensiveness: Covering 11 primary and 25 secondary tasks, representative of contemporary NLP needs.
  • Scalability: Supporting input contexts from 8K up to 256K tokens, exposing truly long-document challenges.
  • Fine-Grained Taxonomy: Applying a multi-dimensional labeling of context requirement, length, and difficulty.
  • Annotation Efficiency: Leveraging a collaborative pipeline where frontier LLMs draft candidate items and experts verify correctness, reducing the cost and cognitive load relative to fully manual annotation (Chen et al., 6 Jan 2026).

2. Multi-Dimensional Taxonomy and Sample Labeling

Each LongBench Pro sample is classified along three orthogonal axes:

  • Context Requirement:
    • Full Dependency ("global"): Response demands integration of information dispersed throughout the document.
    • Partial Dependency ("local"): Only a specific contiguous passage is relevant.
  • Document Length: Six buckets, quantized as 8K, 16K, 32K, 64K, 128K, 256K tokens (using the Qwen tokenizer), with ±20% flexibility.
  • Difficulty: Four calibrated levels, defined by LLM pass-rates:
    • Extreme: ≤1 high-tier model answers correctly.
    • Hard: ≤1 mid-tier model answers correctly (if not Extreme).
    • Moderate: ≤1 low-tier model answers correctly (if not above).
    • Easy: Otherwise.

Difficulty is assessed by grouping models into high/mid/low tiers (based on performance), then applying task-specific metric thresholds (e.g., ≥0.65 for summarization) to determine correctness (Chen et al., 6 Jan 2026).
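The tiered rules above form a simple cascade, which can be sketched as follows. The rule order (Extreme → Hard → Moderate → Easy) and the 0.65 summarization threshold come from the description above; the function names, signatures, and pass-count bookkeeping are illustrative assumptions, not the benchmark's actual code.

```python
# Sketch of the cascaded difficulty labeling described above.
# Pass counts = number of models in a given tier that answered correctly.

def is_correct(score: float, threshold: float = 0.65) -> bool:
    """A model 'answers correctly' when its task metric clears the
    task-specific threshold (e.g., >= 0.65 for summarization)."""
    return score >= threshold

def label_difficulty(high_passes: int, mid_passes: int, low_passes: int) -> str:
    """Apply the difficulty rules in order, most restrictive first."""
    if high_passes <= 1:
        return "Extreme"   # at most one high-tier model succeeds
    if mid_passes <= 1:
        return "Hard"      # not Extreme, but mid-tier models mostly fail
    if low_passes <= 1:
        return "Moderate"  # only low-tier models mostly fail
    return "Easy"          # broadly solvable across tiers
```

Because the rules are evaluated top-down, a sample that stumps the high tier is labeled Extreme regardless of how the lower tiers fare.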

3. Task Coverage and Metrics

LongBench Pro supports a diverse set of tasks relevant for long-document NLP, as summarized below:

| Primary Task | Example Subtask | Context Requirement | Metric |
| --- | --- | --- | --- |
| Retrieval & Ranking | Global Cohesive Retrieval | Full | NDCG@k |
| Sequencing & Structure | Global Timeline Reconstruction | Full | Pairwise Acc. |
| Evidence-Grounded QA | Multi-Doc Integration QA | Full | Accuracy |
| Summarization & Synthesis | Global-coverage Summary | Full | SemSim+ROUGE-L |
| Attribution & Citation Alignment | Full-sentence Alignment | Full | F1 |
| Aggregation & Clustering | Doc Clustering | Full | SubEM |
| Consistency & Compliance Checking | Global Conflict Localization | Full | F1 |
| Structured & Numeric Reasoning | Multi-source Consistency Verif. | Full | SubEM |
| Version & Code Diff Analysis | Multi-Version Impact | Full | F1 |
| Rule Induction & In-Context Learning | Rule Induction | Full | SubEM |
| Dialogue Memory & Long-Horizon | Long-Range Entity Tracking | Full | Accuracy |

The summarization score for relevant subtasks is defined as

$$\mathrm{Score}_{\text{sum}} = 0.5\,\max_i \mathrm{SemSim}(S_{\text{gen}}, S_{\text{ref}}^{i}) + 0.5\,\max_i \mathrm{ROUGE\text{-}L}(S_{\text{gen}}, S_{\text{ref}}^{i}),$$

where $S_{\text{gen}}$ is the generated summary and $S_{\text{ref}}^{i}$ is the $i$-th reference summary (Chen et al., 6 Jan 2026).
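A minimal sketch of this score, assuming whitespace tokenization for ROUGE-L (computed as the LCS-based F-measure) and treating SemSim as a pluggable similarity function; the benchmark's actual tokenizer and embedding model are not specified in this summary.

```python
# Sketch of the combined summarization score: 0.5 * max_i SemSim + 0.5 * max_i ROUGE-L.
# `sem_sim` is a placeholder for whatever semantic-similarity model is used.

def rouge_l(gen: str, ref: str) -> float:
    """ROUGE-L F-measure via longest common subsequence over whitespace tokens."""
    g, r = gen.split(), ref.split()
    # Standard LCS dynamic-programming table.
    dp = [[0] * (len(r) + 1) for _ in range(len(g) + 1)]
    for i, gt in enumerate(g):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if gt == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(g)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(g), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def summarization_score(gen: str, refs: list, sem_sim) -> float:
    """Best-reference combination, mirroring the max_i in the formula above."""
    return 0.5 * max(sem_sim(gen, r) for r in refs) + 0.5 * max(rouge_l(gen, r) for r in refs)
```

Taking the maximum over references (rather than the mean) rewards a generation that matches any one acceptable reference summary well.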

4. Human–Model Collaborative Construction Pipeline

The annotation pipeline is structured as follows:

  1. Document Collection: Curate 1,500 English & Chinese documents drawn from multiple domains, covered across all length buckets and vetted for privacy or copyright concerns.
  2. Model Drafting: Five SOTA LLMs (Gemini-2.5-Pro, GPT-5, Claude-4, DeepSeek-V3.2, Qwen3-235B-A22B-Thinking) are prompted to generate multiple candidate samples per task, each comprising question, reference answer, design rationale, and solution process.
  3. Human Verification & Selection: Expert annotators verify task–context alignment and answer correctness, and retain only samples that are challenging, i.e., that “fool” at least one drafting model.
  4. Question Standardization: For each sample, two prompt templates are produced—Non-Thinking ("Output the '[Answer]' identifier first…") and Thinking ("Think step by step…").
  5. Answer Review: Model predictions are combined with independent human review to audit both precision (all required components correct) and recall (no valid components missed). Discrepancies are adjudicated by domain experts.
  6. Difficulty Labeling: Difficulty levels are set using representative models in each tier per sample (Chen et al., 6 Jan 2026).
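Step 3's retention rule, keeping only samples that fool at least one drafting model, can be sketched as a filter. The sample/prediction shapes, the model-name strings, and the 0.65 default threshold are illustrative assumptions for demonstration, not the pipeline's actual interfaces.

```python
# Hypothetical filter for step 3 of the pipeline: a candidate sample survives
# only if at least one drafting model scores below the correctness threshold.

DRAFTING_MODELS = ["gemini-2.5-pro", "gpt-5", "claude-4",
                   "deepseek-v3.2", "qwen3-235b-a22b-thinking"]

def is_challenging(predictions: dict, threshold: float) -> bool:
    """predictions maps model name -> metric score on this sample."""
    return any(predictions[m] < threshold for m in DRAFTING_MODELS)

def filter_candidates(samples: list, all_predictions: dict, threshold: float = 0.65) -> list:
    """Keep samples whose per-model scores show at least one 'fooled' drafter."""
    return [s for s in samples if is_challenging(all_predictions[s["id"]], threshold)]
```

This filter biases the benchmark toward items that are non-trivial even for the frontier models that drafted them, which is what makes the subsequent tier-based difficulty labeling informative.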

5. Evaluation Methodology and Metrics

LongBench Pro applies task-specific metrics, with all scores normalized to [0, 1] and reported ×100:

  • Retrieval & Ranking: NDCG@k
  • Sequencing/Clustering: Pairwise Accuracy (fraction of correctly ordered item pairs among all $\binom{n}{2}$ pairs)
  • QA Tasks: Accuracy
  • Citation & Violations: F1 ($2\,\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$)
  • Single-Answer Generation: SubEM (strict exact-match over subcomponents, average for several tasks)
  • Summarization: Combined SemSim and ROUGE-L

All metric definitions follow the formulations specified in the benchmark documentation (Chen et al., 6 Jan 2026).
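The two ranking metrics in the list above can be written down concretely. The sketch below gives generic implementations of pairwise accuracy and NDCG@k (in the standard log2-discount form over graded relevances); it is not the benchmark's scoring code, and the function names are illustrative.

```python
from itertools import combinations
from math import log2

def pairwise_accuracy(pred_order: list, gold_order: list) -> float:
    """Fraction of the C(n, 2) item pairs whose relative order in the
    prediction matches the gold ordering."""
    pos_p = {x: i for i, x in enumerate(pred_order)}
    pos_g = {x: i for i, x in enumerate(gold_order)}
    pairs = list(combinations(gold_order, 2))
    correct = sum((pos_p[a] < pos_p[b]) == (pos_g[a] < pos_g[b]) for a, b in pairs)
    return correct / len(pairs)

def ndcg_at_k(ranked_rels: list, all_rels: list, k: int) -> float:
    """NDCG@k: discounted cumulative gain of the predicted ranking,
    normalized by the ideal (descending-relevance) ranking."""
    dcg = sum(rel / log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    idcg = sum(rel / log2(i + 2) for i, rel in enumerate(sorted(all_rels, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

Both metrics are naturally bounded in [0, 1], consistent with the normalized scoring scheme described above.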

6. Empirical Findings from Model Evaluation

Evaluation of 46 long-context LLMs (3B–1T parameters, context ≤1M tokens) on LongBench Pro yields three key results:

  • Context Optimization Surpasses Parameter Scaling: Extending a model’s effective context length provides larger performance gains than increasing parameter count; e.g., Qwen3-4B-Instruct-2507 (256K context) outperforms Qwen3-8B (128K context).
  • Effective vs. Claimed Context Length and Cross-Lingual Misalignment: Many models nominally support long contexts (>100K tokens) but effective comprehension degrades near the upper limit. Models show English/Chinese performance divergence (GPT, Claude favor English; GLM, Kimi favor Chinese), though this gap narrows in high-tier models.
  • Reasoning Paradigm Impact: Chain-of-thought ("thinking") prompts systematically improve results, especially in models natively trained with stepwise reasoning. Forcing chain-of-thought in models without native support produces marginal or negative improvements. Mixed-thinking models (hybrid fast + deep reasoning modes) offer Pareto-optimal trade-offs between latency and output quality (Chen et al., 6 Jan 2026).

7. Limitations and Directions for Future Research

LongBench Pro advances the evaluation of long-context LLMs, yet several challenges remain:

  • Verification of extremely long context outputs is cognitively demanding, even with the collaborative annotation model.
  • Effective context length continues to lag well behind claimed capabilities; for example, a model with a 256K-token window may leverage only ~120K tokens effectively.
  • Cross-lingual performance alignment is incomplete, underscoring the need for improved multilingual training and evaluation methodologies.
  • Suggested future research includes recursive critique-of-critique (“meta-verification”) pipelines, expansion to contexts beyond 256K tokens and multimodal documents, and development of automated difficulty calibration and adaptive sampling strategies during evaluation.

LongBench Pro establishes a robust, scalable, bilingual benchmark for the evaluation of long-context LLM comprehension and reasoning, providing a foundation for further advances in extreme-length, cross-lingual, and multitask NLP models (Chen et al., 6 Jan 2026).
