Self-Verification is All You Need To Pass The Japanese Bar Examination
Abstract: Despite rapid advances in LLMs, achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true–false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam-level competence. In this paper, we present a self-verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format-faithful supervision and consistency verification, and suggest that carefully designed single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks. Our dataset and code are publicly available.
Explain it Like I'm 14
What is this paper about?
This paper shows a way to train an AI so it can pass the multiple-choice part of the Japanese bar exam without changing the exam’s original format or rules. The key idea is simple: have one model answer each question and then quickly “check its own work” before submitting the final answer. This “self-verification” step, combined with training on questions that look exactly like the real exam, helped the AI reach a passing score.
What did the researchers want to find out?
They focused on a few clear questions:
- Can an AI pass the Japanese bar exam when the questions and scoring are kept exactly the same as in the real test?
- Is it better to train on the exam in its original format rather than turning each question into a bunch of easy true/false items?
- Does a simple “answer-then-check” (self-verification) step help more than fancier methods like teams of multiple AIs (multi-agent systems)?
How did they do it? (Methods in simple terms)
Think of the Japanese bar exam like a combination lock. Each question has several statements (like A, B, C). You must judge each one correctly and follow strict answer rules (for example, write “112” where 1 means “correct” and 2 means “incorrect”). If you get even one part wrong or write the answer in the wrong format, you can lose a lot of points.
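The lock analogy can be made concrete. A minimal sketch of a strict-format check for composite answers like "112" (hypothetical; the paper does not publish its validator):

```python
import re

def is_valid_composite(answer: str, num_statements: int) -> bool:
    """Check that a composite answer such as '112' has exactly one digit
    per statement and uses only the allowed codes:
    1 = "correct", 2 = "incorrect"."""
    return bool(re.fullmatch(r"[12]{%d}" % num_statements, answer))
```

For three statements, `is_valid_composite("112", 3)` passes, while `"113"` (illegal code) and `"1122"` (wrong length) fail: even a knowledgeable answer scores nothing if it violates the format.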
Here’s their approach:
- Keep the original format: They built a training set from real past exams (2019–2023) that looks just like the actual test, including the strict answer styles and the way points are given. They tested on the 2024 exam to make it realistic.
- Train one model end-to-end: The model learned to read a full, real-style question and produce the exact kind of answer the exam asks for (like “212” or “3” for multiple-choice).
- Add self-verification at test time: After the model gives an answer, it takes one more quick “second look” at the question and its own answer. If it spots a clear mistake—like a format slip or one wrong statement in an otherwise correct set—it fixes it. If everything looks right, it leaves the answer alone. You can think of this like checking your homework before turning it in.
- Compare with other strategies: They also tried:
- Decomposed training (turning each question into separate true/false mini-questions).
- Multi-agent systems (a team of different AI “roles,” like a retriever, checker, summarizer, and final answerer).
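The answer-then-check loop described above can be sketched as a thin wrapper around any LLM call. Here `ask_model` and the prompt wording are illustrative placeholders, not the authors' actual prompts:

```python
def answer_with_self_verification(question: str, ask_model) -> str:
    """Two-pass inference: answer first, then a conservative check.

    ask_model(prompt) -> str is any LLM call; both passes use the
    same model, only the prompt changes.
    """
    # Pass 1: produce an answer in the exam's strict format.
    draft = ask_model(
        f"{question}\nOutput only the answer in the required format."
    )
    # Pass 2: show the model its own answer and ask it to correct
    # only if it finds a clear inconsistency; otherwise keep it.
    return ask_model(
        f"{question}\nProposed answer: {draft}\n"
        "If this answer is clearly inconsistent with the statements "
        "or the required format, output the corrected answer. "
        "Otherwise output the proposed answer unchanged."
    )
```

The key design choice is conservatism: the second pass is instructed to leave plausible answers alone, so it mostly fixes format slips and obvious contradictions rather than second-guessing everything.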
About scoring: The real exam uses a strict point system with partial credit. Importantly, several statements may be grouped together for a few points. One wrong answer can reduce the group's points; two wrong answers can drop the whole group to zero, so being "mostly right" can still score nothing if the group rules aren't met.
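That grouping rule (stated elsewhere in the paper as: a group of 3 or more statements worth n points scores n - 2 with one wrong answer) can be sketched as a scoring function. The handling of groups smaller than three is an assumption here:

```python
def grade_group(judgments: list, gold: list, points: int) -> int:
    """Partial credit for a group of statements worth `points`.

    All correct -> full points; exactly one wrong (in a group of 3+)
    -> points - 2; two or more wrong -> 0, per the rule described
    in the paper.
    """
    wrong = sum(1 for pred, ans in zip(judgments, gold) if pred != ans)
    if wrong == 0:
        return points
    if wrong == 1 and len(gold) >= 3:
        return max(points - 2, 0)
    return 0
```

So a 3-point group answered [1, 2, 2] against gold [1, 1, 2] earns 1 point, while two misjudgments wipe out the whole group.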
What did they find, and why is it important?
Main results:
- Passing on the real scale: Their single model, trained on the original format and using self-verification, scored 96 out of 175 on the 2024 exam, clearing the passing score of 93. In other words, it passed under the actual rules.
- Self-verification helps across the board: No matter which model they tried, adding the quick “check your own answer” step improved performance.
- Original-format training beats decomposition: Models trained on simplified true/false versions did not perform well on the real exam format. Simplifying the questions made learning easier, but it didn’t teach the AI how to handle the tricky “all parts must fit together” nature of real exam questions.
- Multi-agent teams didn’t help here: Splitting the work among several AI agents added complexity and actually hurt performance on this tightly structured task. Mistakes tended to build up across the different steps.
Why this matters:
- It shows that careful training on the real problem (not a simplified version) plus a smart self-check can be more powerful than bigger, more complicated systems.
- It also shows that many AI mistakes on hard tests come from consistency and formatting—things a quick self-check can fix—rather than a total lack of knowledge.
What does this mean for the future? (Implications)
- For high-stakes, rule-heavy tests (like professional exams), keeping the original question style during training is crucial. It teaches the AI to think about the whole problem at once and follow exact instructions.
- A simple, well-aligned approach (one model + self-verification) can beat more complex setups (like multi-agent teams) when the task requires strict consistency.
- This strategy could help build more reliable AI for other exams or structured tasks.
- Important limits: This work only covers the multiple-choice part, not the essay part that requires long, careful arguments. Also, passing a test doesn’t make an AI a real lawyer. Human oversight and caution are still necessary.
In short: Train on the real thing, have the model check its own work, and you can get strong, reliable performance—even on very demanding, format-heavy exams.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, formulated to be actionable for future research:
- Generalization beyond a single test year: Results are reported only on the 2024 (R6) exam. Evaluate on multiple unseen years (e.g., R3, R4, R5 as held-out test sets) to assess stability across cohorts and distribution shifts.
- Statistical robustness: Experiments are repeated three times but lack confidence intervals, variance analyses, and formal significance testing. Quantify effect sizes and statistical reliability (e.g., bootstrap CIs, paired tests).
- Pretraining data contamination: GPT-4.1 may have seen past exam questions during pretraining. Conduct contamination checks and evaluate on truly unseen or embargoed questions to rule out memorization.
- Model dependence: The approach is tested only on GPT-4.1. Assess portability to diverse models (open-source and proprietary; Japanese-native vs multilingual; different parameter scales) to establish method generality.
- Dataset scale vs performance: The dataset has 460 questions; no scaling analysis is provided. Investigate how performance scales with more format-faithful data (e.g., adding older exams) and identify data-efficiency regimes.
- Ablations on training signals: It is unclear whether gains are driven by format-faithful supervision, the verification prompt, or both. Perform controlled ablations (format-faithful vs decomposed; with/without verification; varying prompt strictness).
- Verification depth and strategy: Only a single verification pass is used. Explore iterative verification, debate-style internal verification, confidence-calibrated corrections, and programmatic consistency checks.
- Error taxonomy: No granular analysis of error types is provided. Build an error taxonomy (formatting errors, single-constituent misjudgments, statute misinterpretations, precedent misapplications) to target method improvements.
- Calibration and uncertainty: The model cannot abstain or signal uncertainty. Measure calibration (e.g., probability of correctness under self-verification) and test whether confidence-guided acceptance improves scoring.
- Robustness to format variants: Prompts enforce strict output formats, but the approach is not stress-tested against new or altered exam formats (e.g., roman numerals, reordered statements, hybrid formats). Benchmark robustness to formatting perturbations.
- Handling label noise and ambiguity: Some items can be ambiguous or precedent-sensitive. Quantify label noise, disagreement across expert annotations, and sensitivity to changes in judicial interpretations.
- Temporal currency of legal knowledge: The model may rely on outdated precedents. Evaluate time-aware performance (e.g., post-2024 statutory changes) and integrate retrieval of updated statutes/case law to maintain currency.
- External knowledge integration: The pipeline avoids external tools and retrieval. Test whether statute/precedent retrieval, citation grounding, or verification against authoritative sources yields gains without format violations.
- Compute and latency costs: Self-verification adds an extra forward pass, but inference cost and latency are unreported. Measure throughput, latency, and cost across exam scales to assess deployability.
- Section-level robustness: While section scores meet thresholds, analyze per-topic variability within constitutional, civil, and criminal law (e.g., property, procedure, contracts, evidence) to identify weak subdomains.
- Free-response generalization: The approach targets multiple-choice only. Design and evaluate a format-faithful method for essay questions, including citation control, structured argumentation, and rubric-aligned scoring.
- Cross-jurisdiction transfer: It is unknown whether format-faithful self-verification generalizes to other bar exams (e.g., Korea, China) with multi-proposition constraints. Conduct cross-jurisdiction experiments.
- Non-legal multi-proposition tasks: Test whether the approach benefits other domains requiring joint consistency (e.g., medical boards, engineering certification) to validate broader utility.
- Prompt sensitivity and reproducibility: Results hinge on carefully engineered prompts. Quantify sensitivity to prompt variations, document exact prompt templates, and test automated prompt optimization.
- Training details and reproducibility: Provide explicit fine-tuning parameters (objective, batch sizes, learning rates, epochs, sampling, temperature, decoding) and release scripts to enable independent replication.
- Point-based grading fidelity: The partial-credit implementation is described but not audited. Validate the scoring logic against official rubrics across all question types and edge cases.
- Multi-agent design space: Only one four-stage pipeline is evaluated. Explore alternative architectures (debate, voting, hierarchical controllers, shared scratchpads, iterative aggregation) and measure failure modes systematically.
- Interaction failures in multi-agent systems: Identify where errors propagate (retrieval mismatch, abstraction drift, verification filtering). Instrument agents with diagnostics and design coordination mechanisms to mitigate compounding errors.
- Mechanistic explanation of gains: The claim that self-verification elicits latent knowledge is not empirically validated. Use controlled synthetic tasks, probing, and attention/representation analyses to test this hypothesis.
- Fairness and accessibility: Fine-tuning GPT-4.1 may not be accessible to all researchers. Provide an open-source reproduction path (e.g., instruct-tuned Llama/Qwen) and compare results to address equity and openness.
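Several items above (statistical robustness in particular) call for bootstrap confidence intervals over per-question scores; a minimal percentile-bootstrap sketch, using only the standard library:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Reporting such an interval alongside the headline score (e.g., over per-question correctness across the three repeated runs) would make the pass/fail claim statistically interpretable.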
Practical Applications
Practical Applications Derived from the Paper’s Findings
Below are actionable, real-world applications that leverage the paper’s contributions: (1) format-faithful dataset construction for tightly constrained exams; (2) a single-model, answer-conditioned self-verification pass that enforces strict output formats and improves global consistency; and (3) empirical guidance showing that single-model + verification outperforms multi-agent pipelines for exam-style, multi-proposition reasoning.
Immediate Applications
- Education (legal/standardized test prep) — Bar exam tutor and scorer for Japan
- Use a fine-tuned, format-faithful model with self-verification to deliver practice, strict-format answer capture, and official-scale scoring (including partial credit rules).
- Tools/products/workflows:
- “Exam-scale Scorer” engine that mirrors authentic point allocations and partial credit.
- “Strict-format Answer UI” that only accepts valid composites (e.g., 11221).
- “Self-Verify Inference Wrapper” that runs an answer pass followed by a conservative verification pass.
- Dependencies/assumptions: Access to up-to-date past questions and legal precedent; a strong Japanese LLM and fine-tuning capability; exam IP/licensing compliance; explicit disclosure that this is a study aid, not legal advice.
- Education (cross-exam EdTech) — Port to other multi-proposition MC exams
- Rapidly replicate the approach for exams with joint-evaluation constraints (e.g., JP civil service, CPA, some medical board MCQs, US MBE-style composites).
- Tools/products/workflows: “Format-faithful Dataset Builder” scripts; modular scoring schemas for different partial-credit rules; prompt packs enforcing strict outputs.
- Dependencies/assumptions: Public availability of past items; domain-language LLM competence; careful mapping of each exam’s native scoring and format.
- Software/LLM Ops — Structured-output reliability layer
- Adopt the paper’s strict-format prompting and self-verification as a generic guardrail for any pipeline that must return rigid formats (e.g., JSON/XML/CSV, checklist digits, combinatorial labels).
- Tools/products/workflows: A drop-in “Verify-then-Commit” middleware that:
- Re-prompts the same model to validate adherence to schema and to correct only when clearly inconsistent.
- Reduces brittle post-hoc normalization heuristics.
- Dependencies/assumptions: Base model must be competent enough to evaluate its own outputs; domain schemas must be explicit and unambiguous.
- Legal and Compliance (enterprise) — Policy/contract checklist copilot
- Use the single-model + self-verification pattern to check multi-factor policy compliance questions, RFP compliance matrices, or contract clause checklists where a single wrong factor invalidates an option.
- Tools/products/workflows: “Checklist Consistency Checker” that maps policy criteria to composite judgments and enforces output constraints.
- Dependencies/assumptions: Human-in-the-loop review for all compliance outcomes; current, authoritative policy libraries; careful calibration to avoid false assurance.
- Finance (reporting and controls) — Pre-checker for regulatory filings
- Apply joint-consistency verification for reporting templates with interdependent conditions (e.g., if A is true, B must be ≤ threshold, else C becomes mandatory).
- Tools/products/workflows: “Reg-Form Validator” using strict schemas and a verification pass to reduce formatting/logic errors before human submission.
- Dependencies/assumptions: Exact regulatory templates and rules codified; human sign-off; jurisdiction-specific validation logic maintained as rules change.
- Government/Policy (evaluation and procurement) — Format-faithful AI benchmarks
- Use the dataset design principles to evaluate AI systems under authentic exam/task formats rather than decomposed proxies, for procurement or standard-setting.
- Tools/products/workflows: “Authentic-Format Evaluation Harness” that scores models on native rules (no reformatting), with transparent partial credit.
- Dependencies/assumptions: Access to representative, licensed test sets; clear governance on how benchmark results inform decisions.
- Academia (research and pedagogy) — Benchmarks and methods for structured reasoning
- Build and share authentic-format benchmarks for other domains (law, medicine, finance) to study multi-proposition reasoning and strict-format adherence.
- Tools/products/workflows: Reusable SFT/verification prompts; open-source scorers; ablation suites comparing single-model vs multi-agent under authentic scoring.
- Dependencies/assumptions: Curated datasets with accurate, up-to-date answers; IRB/IP considerations for exam content; reproducible evaluation pipelines.
- Daily life (forms and applications) — Multi-section form-filling assistant
- Use self-verification to catch inconsistent or format-invalid entries across interdependent fields in tax, visa, scholarship, or benefits applications.
- Tools/products/workflows: “Consistency Pass” that flags and minimally corrects entries prior to submission, while logging changes.
- Dependencies/assumptions: Clear jurisdictional rules and schemas; user consent and privacy protections; disclaimers and human confirmation.
- Product strategy (LLM platform teams) — Complexity reduction vs multi-agent pipelines
- Replace fragile multi-agent stacks with a strong single-model + verification for strictly constrained tasks, reducing inference paths and coordination failures.
- Tools/products/workflows: Architectural pattern docs; cost/latency models showing one extra forward pass is cheaper and more reliable than multi-agent chains.
- Dependencies/assumptions: The task is analogous to exam-style composite judgments; base model has sufficient latent knowledge; thorough A/B validation.
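The "Verify-then-Commit" middleware mentioned above can be sketched for JSON outputs. `ask_model` and `validate` are placeholders for any LLM call and schema check; the retry prompt wording is illustrative:

```python
import json

def verify_then_commit(prompt, ask_model, validate, max_retries=1):
    """Ask for structured output; re-prompt the same model if the
    result fails parsing or validation, instead of patching it with
    brittle post-hoc normalization heuristics."""
    out = ask_model(prompt)
    for _ in range(max_retries):
        try:
            parsed = json.loads(out)
            if validate(parsed):
                return parsed  # commit: parses and passes the schema
        except json.JSONDecodeError:
            pass
        # Re-prompt with the invalid output so the model can self-correct.
        out = ask_model(
            f"{prompt}\nYour previous output was invalid:\n{out}\n"
            "Return only valid JSON matching the required schema."
        )
    return json.loads(out)  # raises if still unparseable after retries
```

This mirrors the paper's pattern: one extra pass by the same model, triggered only when the output is demonstrably invalid, rather than a separate repair pipeline.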
Long-Term Applications
- Legal EdTech — Full exam coverage including essay (論文式)
- Extend format-faithful + self-verification to structured essay grading: citation checks, factor tests, and internal consistency of multi-issue analyses.
- Tools/products/workflows: “Structured Argument Verifier” that cross-checks claims against cited statutes/cases and issue spotting rubrics.
- Dependencies/assumptions: Retrieval-grounding for up-to-date law; rubric-aligned scoring data; robust citational verification; much deeper evaluation design.
- Courts and Law Firms — Drafting assistants with multi-factor legal tests
- AI assistants that apply multi-prong doctrinal tests (e.g., proportionality, reasonableness) and self-verify conclusions against enumerated factors and precedent constraints.
- Tools/products/workflows: “Factor-Test Copilot” with explicit factor schemas; verification that conclusions don’t contradict earlier factor findings.
- Dependencies/assumptions: Continuous legal updates; high-stakes validation and audit logs; professional oversight; jurisdictional customization.
- Policy and Certification — Standardized, authentic-format AI competence exams
- Sector regulators could adopt format-faithful benchmarks to certify AI systems for specific uses (e.g., document triage, intake screening) under real scoring rules.
- Tools/products/workflows: “Authentic-Use Certification Suite” with versioned test banks and pass/fail criteria tied to domain safety thresholds.
- Dependencies/assumptions: Governance around test leakage and model overfitting; test refresh cycles; transparent reporting.
- Cross-jurisdiction legal tutoring — Multilingual expansion
- Build authentic-format datasets and scoring for other jurisdictions (Korea, China, EU), delivering localized bar/civil-service tutors with self-verification.
- Tools/products/workflows: Localization pipeline for legal sources; jurisdiction-specific scoring engines; multilingual prompts.
- Dependencies/assumptions: High-quality translations; licensed content; strong base models in target languages; ongoing maintenance as laws evolve.
- Enterprise decision engines — Hybrid symbolic-LLM consistency layers
- Combine self-verifying LLM outputs with constraint solvers to guarantee satisfaction of hard rules in underwriting, safety checklists, clinical pathways.
- Tools/products/workflows: “LLM + Constraints Orchestrator” that (1) proposes, (2) self-verifies, then (3) validates via a rules engine before approval.
- Dependencies/assumptions: Formalized rule sets; integrations with MDM/knowledge sources; clear escalation to human adjudicators.
- Government services — End-to-end e-filing copilots with legal checks
- Agents that complete complex administrative filings and self-verify internal consistency, alerting users to conflicts and missing conditions before submission.
- Tools/products/workflows: “E-filing Copilot” embedded in portals; dynamic checklists; minimal-change correction policies to prevent over-editing.
- Dependencies/assumptions: Secure data handling; agency API integrations; clear liability and consent frameworks; human confirmation steps.
- Privacy-first deployments — On-prem or on-device self-verifying models
- For sensitive domains (healthcare, finance, legal), deploy smaller fine-tuned models with the verification pattern within controlled environments.
- Tools/products/workflows: Distillation/fine-tuning toolchains; lightweight verification prompts; model monitoring for drift and legal updates.
- Dependencies/assumptions: Sufficient on-prem compute; domain-adapted models reaching competence thresholds; update pipelines.
- Safety and governance research — Error taxonomy and guardrail design
- Systematic study of when self-verification corrects vs. cements errors; design of “only-correct-if-clearly-inconsistent” policies as a governance primitive.
- Tools/products/workflows: Evaluation suites for global-consistency failures; datasets capturing subtle format vs. knowledge errors; policy templates.
- Dependencies/assumptions: Broad, diverse testbeds; shared reporting standards; community benchmarks.
Notes on feasibility and assumptions across applications:
- The method assumes access to a strong pretrained model with latent domain knowledge; weaker models may not benefit as much from self-verification.
- Authentic-format datasets are pivotal; transferring to new domains requires careful replication of native formats and scoring.
- Legal and regulatory contexts demand human oversight, provenance tracking, and timely updates to statutes/case law.
- The verification pass adds a small but non-zero inference cost; cost-benefit depends on baseline error rates and the severity of format/consistency errors.
Glossary
Below is an alphabetical list of domain-specific terms from the paper, each with a concise definition and a verbatim usage example.
- acquisition of counterfeit currency: A criminal offense involving obtaining or possessing counterfeit money with intent to use. "In this case, the crime of acquisition of counterfeit currency is not established."
- agent diversity: Variation in the training or configuration of multiple agents to increase functional differences. "We further find that increasing agent diversity through independent fine-tuning does not improve performance and instead exacerbates coordination failures."
- agentic behavior: The coordinated, autonomous behavior exhibited by multiple interacting agents. "Shared representations appear to be crucial for effective agentic behavior in multi-agent setting."
- alteration of a private document with a seal: A crime involving unauthorized modification of a sealed private document. "In this case, the crime of alteration of a private document with a seal is established."
- answer-conditioned self-verification: A verification step where the model reassesses its own answer to correct errors. "Our approach combines supervised fine-tuning with answer-conditioned self-verification."
- authentic exam format: The original, unmodified structure of the examination used for evaluation. "Our work directly addresses this gap by evaluating models on the authentic exam format and scale."
- chain-of-thought prompting: Prompting that elicits step-by-step reasoning traces from a model. "Particularly when chain-of-thought prompting is enabled."
- combinatorial rules: Constraints requiring selection under combinations of multiple conditions. "Constrained selection under combinatorial rules."
- conceptual concurrence: A legal doctrine where multiple crimes are considered concurrently under a single conceptual framework. "In this case, fraud and uttering counterfeit currency are established, and the two crimes are in conceptual concurrence."
- consistency verification: A method to check and enforce coherence of a model’s output against task constraints. "Our results highlight the importance of format-faithful supervision and consistency verification."
- decomposed propositions: Breaking complex questions into independent statements for separate evaluation. "This reinforces the point that performing well on decomposed propositions does not automatically translate to equivalent performance on actual exam format and scale."
- decomposition-based supervision: Training approach that uses simplified, decomposed tasks rather than the original complex format. "We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision."
- exact-match accuracy: Metric that counts an answer as correct only if it exactly matches the gold output. "We report exact-match accuracy as well as the official examination point score."
- format-faithful supervision: Training that strictly preserves the original task’s format and constraints. "Our results highlight the importance of format-faithful supervision and consistency verification."
- format memorization: Overfitting to output patterns or formats without genuine reasoning. "Note that its performance gain cannot be attributed to format memorization or answer pattern learning."
- format-specific fine-tuning: Fine-tuning tailored to the exact output and structure required by the task. "In short, format-specific fine-tuning teaches the model how to exploit internal knowledge that would otherwise remain dormant."
- forgery of a private document with a seal: A crime involving creating a counterfeit sealed private document. "In this case, the crime of forgery of a private document with a seal is established."
- forgery of a valuable security: A crime involving falsifying financial instruments or securities. "In this case, the crime of forgery of a valuable security is established."
- free-response (論文式): The essay-style component of the exam requiring structured legal argumentation. "As future work, we aim to extend this approach to the free-response (論文式) portion of the exam."
- global decision consistency: Ensuring that local judgments align coherently to produce a consistent overall decision. "Preserving the native question format during training appears critical for enabling models to align local legal knowledge with global decision consistency."
- joint-consistency constraints: Requirements that multiple parts of an answer remain mutually consistent under strict rules. "The final reasoning agent must reconcile under strict formatting and joint-consistency constraints."
- joint decision structure: Exam design where multiple statements must be evaluated together to form a single decision. "Our dataset preserves the original joint decision structure."
- joint evaluation: Assessing several propositions or statements together under a single set of constraints. "Examinees must jointly evaluate multiple statements and select correct combinations under rigid answer constraints."
- knowledge abstraction: Extracting generalizable legal principles from specific examples or cases. "Retrieval, verification, knowledge abstraction, and final answering."
- legal text entailment: Determining whether one legal text logically follows from another. "Legal text entailment."
- multi-agent architectures: Systems in which multiple LLM agents collaborate or debate to produce an answer. "We also investigate more complex inference strategies, including multi-agent architectures."
- multi-agent pipeline: A sequential system of specialized agents handling different sub-tasks. "We implemented a multi-agent pipeline in which distinct agents are responsible for retrieval, verification, knowledge abstraction, and final answering."
- normalization schemes: Methods to standardize varied model outputs into a consistent format. "Which eradicates the necessity for normalization schemes to account for various output formats."
- official exam grading scheme: The formal scoring rules used by the examination, including partial credit. "Accuracy denotes exact-match answer accuracy, while scores follow the official exam grading scheme with partial credit."
- partial credit scheme: A scoring method awarding fractional points based on the number of correct subparts. "Partial credit scheme works as following; when 3 or more questions are grouped together with n points, getting one question wrong results in n-2 points."
- proposition-level supervision: Training on individual statements rather than the whole composite question. "These findings suggest that while proposition-level supervision may improve performance on simplified benchmarks, it does not necessarily transfer to evaluation settings that require holistic reasoning."
- role-specific prompts: Prompts tailored to the distinct roles of agents in a pipeline. "Each agent was fine-tuned separately on the same training data but with role-specific prompts."
- self-verification: Having a model re-evaluate and potentially correct its own answer. "We introduce a verification step in which the model re-evaluates its own predicted answer."
- strict answer format: Output constraints requiring answers to be given in a precise, specified format. "【Strict answer format】Output only the answer. Do not include any reasons, explanations, or symbols."
- task reformulation: Changing the original task structure to simplify training or evaluation. "Many rely on task reformulation, simplified supervision, or indirect evaluation metrics."
- true/false judgments: Binary decisions indicating whether statements are correct or incorrect. "Decomposing such questions into simpler true–false judgments."
- true/false reformulation: Converting complex questions into independent binary statements for training. "The true/false reformulation introduces an implicit shift in task distribution."
- uttering counterfeit currency: A crime involving passing or using counterfeit money. "Uttering counterfeit currency are established."
- verification-oriented behavior: Model behavior prompted to check and correct answers conservatively. "Under a different prompt that induces verification-oriented behavior."
- zero-shot: Model evaluation without any in-context examples. "We examine both zero-shot and few-shot setting."
- few-shot: Model evaluation with a small number of in-context examples. "We examine both zero-shot and few-shot setting."