- The paper presents LegalRikai, a benchmark tailored to assess LLMs on long-context Japanese corporate legal tasks through four hierarchically structured editing tasks.
- It integrates human expert and automated evaluations to measure metrics like coverage, accuracy, and structural consistency in real-world legal workflows.
- Results indicate that state-of-the-art LLMs suffer up to a 50% drop in structure and impact capture metrics, underscoring challenges in abstract, globally consistent document editing.
LegalRikai: An Open Benchmark for Complex Japanese Corporate Legal Tasks
Motivation and Benchmark Design
LegalRikai is a domain-specific benchmark constructed to close the gap between prevailing legal QA/classification datasets and real-world Japanese corporate legal workflows. The authors, working with legal professionals under attorney supervision, evaluate LLM capabilities along dimensions that mirror actual corporate legal practice: long-context, structured document editing rather than fragment-level legal reasoning.
The benchmark formalizes four tasks of graded complexity: AmendExp (amendment explanation and impact summarization), StatRev (statute-driven contract revision), ReqRev (contract revision from explicit counterparty requirements), and RiskRev (contract revision for risk mitigation under abstract instructions). Each task models true-to-practice demands, such as tracking interrelated statute/contract changes and enforcing structural and stylistic consistency in outputs.
Each task carries bespoke evaluation metrics with granular aspects assessable by humans or LLMs, including coverage, accuracy, relevance, instruction following, change precision, structural consistency, terminology accuracy, and contractual phrasing appropriateness. Notably, the benchmark uses long-context inputs (statutes and full contracts spanning roughly 6,500–30,000 characters), so that models' attention and reference-tracking abilities are adequately stress-tested.
Evaluation Methodology
LegalRikai deploys both human and automated evaluation. Outputs from GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1 are scored by legal experts using fine-grained rubrics. Automated evaluation employs the same LLMs as judges, with meta-prompts eliciting structured, criterion-targeted judgments. This design enables measurement of inter-rater agreement (Cohen's κ), rank correlation (Spearman's ρ), and absolute deviation (MAE) between expert and LLM scoring.
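As a concrete sketch, the three agreement statistics named above can be computed directly over paired score vectors. The rating scale and the score data below are illustrative placeholders, not values from the paper:

```python
# Agreement statistics between expert and LLM-as-judge rubric scores:
# Cohen's kappa (chance-corrected exact agreement), Spearman's rho
# (rank correlation, with average ranks for ties), and MAE.
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

def spearman_rho(a, b):
    """Spearman rank correlation; tied values share their average rank."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def mae(a, b):
    """Mean absolute error between two score vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Illustrative expert vs. LLM-judge ratings on an assumed 1-5 rubric scale.
expert = [4, 3, 5, 2, 4, 3, 5, 1]
judge = [4, 3, 4, 2, 5, 3, 5, 2]
print(cohen_kappa(expert, judge), spearman_rho(expert, judge), mae(expert, judge))
```

In practice one would use `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score` rather than hand-rolled versions; they are inlined here only to keep the sketch self-contained.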
Comparisons also cover model scale, with additional results for GPT-5-mini and GPT-5-nano quantifying how sharply performance degrades on these complex tasks relative to general legal/QA benchmarks.
Principal Results
Task-Dependent Performance: In ReqRev (explicit counterparty instructions), all LLMs achieve high instruction-following and consistency scores, plausibly because precise instructions constrain the possible error modes and curb over-editing. In contrast, StatRev and RiskRev, which require statutory mapping or abstract risk identification, reveal significant variation: given abstract input, models frequently over-modify, hallucinate, or break structure and logic despite producing plausible local edits. This introduces legal risk and review overhead in practical settings.
Model-Specific Findings: Gemini 2.5 Pro ranks best in factual accuracy and change precision; Claude Opus 4.1 maintains superior document/format consistency; GPT-5 performs solidly across the board but does not stand out in any single specialty. All models handle terminology nearly perfectly when terms must be changed explicitly, but underperform on context-sensitive word choice and formal phrasing.
Scale Sensitivity: Unlike general-purpose QA benchmarks (e.g., GPQA Diamond), where scaling down from GPT-5 to GPT-5-nano incurs less than a 15% performance drop, LegalRikai's document-wide tasks suffer disproportionately, with up to a 50% drop in structure- and impact-capture metrics. This supports the claim that real-world contract editing requires not only knowledge coverage but persistent, high-capacity context integration, in sharp contrast to fragment-level benchmarks.
Automated vs Human Judgment: LLM-based evaluation correlates strongly with human expert judgment on content-centric, linguistically grounded aspects (e.g., coverage and explicit instruction following, ρ > 0.5 in AmendExp). However, automated scoring poorly tracks expert decisions on structure and technical nuance (e.g., cross-article coherence, legal connotation of synonyms). Averaging over multiple evaluator models narrows but does not eliminate this gap. Automated evaluation is therefore adequate for filtering and rough scoring, but inadequate for final validation of legal outputs without human supervision.
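The multi-judge averaging mentioned above can be sketched in a few lines; the judge names and per-aspect scores here are hypothetical placeholders, not the paper's actual evaluator setup:

```python
# Ensemble of LLM judges: each judge returns one rubric score per aspect,
# and the ensemble score per aspect is the mean across judges.
def ensemble_scores(per_judge):
    """per_judge: {judge_name: {aspect: score}} -> {aspect: mean score}."""
    aspects = next(iter(per_judge.values())).keys()
    return {a: sum(scores[a] for scores in per_judge.values()) / len(per_judge)
            for a in aspects}

judges = {
    "judge_a": {"coverage": 4, "structure": 2},
    "judge_b": {"coverage": 5, "structure": 3},
}
print(ensemble_scores(judges))  # {'coverage': 4.5, 'structure': 2.5}
```

The mean dampens individual judge bias on content-centric aspects, which is consistent with the observation above that averaging narrows, but does not close, the structural-judgment gap.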
Implications for Legal NLP and LLM Deployment
From a practical standpoint, the findings demonstrate that current SOTA LLMs, regardless of underlying architecture, cannot be considered reliable stand-alone editors for globally consistent, abstract-instruction-driven contract revision in Japanese law. In high-risk workflows (e.g., post-amendment cross-referential document updates, risk-led revision), model over-editing and format drift present substantial latent liabilities.
For theoretical research on LLMs, the observed scale sensitivity in long-context structured editing underscores a key open challenge: scalable representation and manipulation of deeply structured legal text, especially where global inter-clause dependencies and technical register must be maintained. Improvements in long-context encoding, memory persistence, and legal-domain constrained decoding are essential for progress.
The benchmark’s formal rubric and task design generalize to other languages/legal systems and offer a principled framework for future cross-jurisdictional comparison and few-shot/fine-tuning study.
Limitations and Future Directions
LegalRikai currently comprises 100 samples (25 per task) and focuses exclusively on closed-source SOTA models. Expanding the dataset and evaluating open-source LLMs is necessary for statistical robustness and to inform instruction- and data-centric interventions. Additionally, while the current benchmark covers canonical contract and amendment-driven workflows, extension to other legal document families (e.g., litigation pleadings, in-house opinions) should yield further insight into LLM deficiencies and adaptation strategies.
The framework's language- and jurisdiction-independent design enables prospective comparative evaluation of civil- versus common-law systems, and would facilitate future research on jurisdiction adaptation and specialized fine-tuning of legal models.
Conclusion
LegalRikai: Open Benchmark (2512.11297) rigorously evaluates LLMs on complex, document-level Japanese corporate legal tasks, revealing domain-specific model vulnerabilities that are not captured by conventional benchmarks. The research demonstrates both the necessity of domain- and workflow-specific evaluation and the critical limits of current LLMs for nuanced, globally consistent legal editing. The work serves as both a practical tool for the legal NLP research community and a technical demonstration of open challenges in deploying LLMs for high-stakes, structured domain applications.