
Automatic Legal Writing Evaluation of LLMs

Published 29 Apr 2025 in cs.CL and cs.AI | (2504.21202v1)

Abstract: Despite the recent advances in LLMs, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating LLMs on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigate whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark -- containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations -- are publicly available.

Summary

  • The paper introduces oab-bench, a 105-question benchmark built from recent editions of the Brazilian Bar Examination, on which LLMs both generate and evaluate legal exam responses, achieving notable alignment with human scoring.
  • It employs a dual-role design in which LLMs act as both candidates and evaluators, leveraging the exam's comprehensive evaluation guidelines.
  • The study finds that Claude-3.5 Sonnet excels with an average score of 7.93, underscoring the potential of LLMs in automating subjective legal assessments.

The paper "Automatic Legal Writing Evaluation of LLMs" (2504.21202) presents a novel benchmark, oab-bench, designed to evaluate the performance of LLMs in the legal domain, specifically focusing on the Brazilian Bar Examination. This work addresses the complexity of assessing open-ended responses, a task characteristic of the legal domain due to its subjective nature.

Introduction and Motivation

The challenges associated with the automated evaluation of legal writing are multifaceted, primarily due to the open-ended and interpretative nature of legal tasks. Prior benchmarks such as LegalBench have laid the groundwork by incorporating interdisciplinary collaboration to represent real-world legal practice tasks, but they often still depend on human evaluators for open-ended responses. oab-bench aims to fill this gap by offering a structured, comprehensive evaluation framework that leverages LLMs as automated judges.

Methodological Approach

Benchmark Design: oab-bench

oab-bench is derived from the Brazilian Bar Examination (OAB), the national examination required for law graduates to practice in Brazil. The benchmark comprises 105 questions spanning seven areas of law, each accompanied by comprehensive evaluation guidelines, reference materials, and a detailed score distribution table. This design supports rigorous, consistent evaluation aligned with human examiner protocols.
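
To make this structure concrete, here is a minimal sketch of how a single benchmark item might be represented. The field names (`area`, `question`, `guidelines`, `score_distribution`) and the sample values are illustrative assumptions; the released benchmark defines its own schema.

```python
from dataclasses import dataclass, field

@dataclass
class OabBenchItem:
    """Hypothetical shape of one oab-bench question (illustrative field
    names -- the released benchmark defines its own schema)."""
    area: str                # one of the seven areas of law, e.g. "Criminal Law"
    question: str            # the open-ended exam question
    guidelines: str          # evaluation guidelines used by human examiners
    score_distribution: dict[str, float] = field(default_factory=dict)
    # rubric items (e.g. legal theses) mapped to the points each is worth

item = OabBenchItem(
    area="Criminal Law",
    question="Draft the appropriate remedy for the case described below...",
    guidelines="Award points for identifying the correct remedy and its legal basis.",
    score_distribution={"correct remedy": 3.0, "legal basis": 4.0, "argumentation": 3.0},
)
assert abs(sum(item.score_distribution.values()) - 10.0) < 1e-9  # graded out of 10
```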

Dual Role of LLMs: Candidate and Judge

The study assigns LLMs two roles: as candidates generating answers to exam questions and as judges evaluating those answers. In the candidate role, prompts mimic the exam conditions faced by human examinees. In the judge role, LLMs assign scores based on detailed evaluation criteria, such as legal argumentation quality and alignment with the provided guidelines.
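
A minimal sketch of the two roles follows, assuming a generic `call_llm(model, prompt)` helper as a stand-in for whatever API client is actually used; the prompt wording is illustrative, not the paper's.

```python
def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a call to a hosted LLM via whatever SDK is in use."""
    raise NotImplementedError

def answer_as_candidate(model: str, question: str) -> str:
    # Role 1: the LLM sits the exam and writes an open-ended legal answer.
    prompt = (
        "You are a candidate in the Brazilian Bar Examination. "
        "Answer the following question as you would in the exam:\n\n" + question
    )
    return call_llm(model, prompt)

def score_as_judge(judge: str, question: str, answer: str,
                   guidelines: str, rubric: dict[str, float]) -> float:
    # Role 2: the LLM grades the answer against the official guidelines
    # and score distribution table, returning a score out of 10.
    prompt = (
        "You are an examiner. Using the evaluation guidelines and score "
        "distribution below, grade this answer out of 10. Reply with only "
        f"the numeric score.\n\nQuestion:\n{question}\n\n"
        f"Guidelines:\n{guidelines}\n\nRubric: {rubric}\n\nAnswer:\n{answer}"
    )
    return float(call_llm(judge, prompt))
```

This mirrors the reference-guided grading modes noted in the glossary: the judge sees the official guidelines and rubric, not just the answer.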

Evaluation Pipeline

The evaluation pipeline employs frontier models such as OpenAI's o1 as automated judges of the model-generated responses. When evaluating approved exams, the judge-assigned scores showed a strong correlation with human examiners' scores, suggesting LLMs' potential as automated evaluators of legal writing.
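
As a sketch of how such agreement can be quantified, the snippet below computes the Pearson correlation and the Mean Absolute Error (the metric named in the glossary) between judge-assigned and human scores. The score arrays are fabricated for illustration only, not data from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Fabricated scores for illustration only -- not data from the paper.
human_scores = np.array([7.5, 6.0, 8.2, 5.5, 9.0])
judge_scores = np.array([7.8, 6.2, 8.0, 5.9, 8.7])

r, p = pearsonr(human_scores, judge_scores)         # linear agreement
mae = np.mean(np.abs(human_scores - judge_scores))  # Mean Absolute Error

print(f"Pearson r = {r:.3f} (p = {p:.3f}), MAE = {mae:.2f} points")
```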

Results and Findings

The evaluations revealed that Claude-3.5 Sonnet outperformed the other LLMs, achieving an average score of 7.93 out of 10 and passing all 21 exams. The study also shows that LLM judges can align substantially with human scoring, achieving a notable correlation in their assessments. This result underscores the growing role of LLMs in automating complex, subjective evaluation tasks within specialized domains.

Figure 1: Performance of each model across different areas of law.
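
A simple way to reproduce this kind of aggregation from per-exam scores is sketched below, assuming a pass mark of 6 out of 10 (the usual OAB written-phase threshold; treat it as an assumption here). The scores are fabricated for illustration.

```python
from statistics import mean

PASS_MARK = 6.0  # assumed written-phase pass threshold (out of 10)

# Fabricated per-exam scores for one model; the benchmark covers 21 exams.
scores = {"exam_38": 8.1, "exam_39": 7.6, "exam_40": 8.2}

avg = mean(scores.values())
passed = sum(s >= PASS_MARK for s in scores.values())
print(f"average = {avg:.2f}/10, passed {passed}/{len(scores)} exams")
```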

Implications and Future Directions

The implications of this research extend to the broader application of LLMs in automating subjective assessments in specialized fields such as law. The demonstrated ability of LLMs to approximate human judgment in legal writing evaluation points to significant cost and resource savings. Future work could explore improving LLM evaluation accuracy through domain-specific fine-tuning and the incorporation of additional interpretative data.

Conclusion

The paper makes a significant contribution to the landscape of LLM evaluation on specialized, open-ended tasks. By introducing and validating oab-bench, and by exploring LLMs' dual roles as candidates and evaluators, this work sets a precedent for further exploration of AI capabilities in complex, subjective assessment contexts.

Glossary

  • AlpacaEval: A benchmark used to assess chatbot-style open-ended performance via model comparisons. "chatbot benchmarks such as AlpacaEval"
  • Analytical grading: An evaluation approach that awards points for specific components (e.g., presentation, argumentation, legal basis) rather than only the final result. "The OAB exam grading is analytical."
  • Benchmark saturation: The phenomenon where established benchmarks no longer differentiate model performance because scores approach the ceiling. "concerns about benchmark saturation"
  • Brazilian Bar Examination: The national licensing exam (OAB) required to practice law in Brazil. "The Brazilian Bar Examination meets these requirements."
  • Chain-of-thought prompting: A prompting technique that elicits step-by-step reasoning from LLMs before final answers. "chain of thought prompting"
  • Commented answer: The official explanatory answer key detailing expected legal reasoning and acceptable argumentation. "commented answer"
  • Criminal Procedure Code: The Brazilian legal code governing criminal procedural rules. "Criminal Procedure Code"
  • Criminal review: A judicial remedy that allows revision of a criminal conviction after judgment. "Criminal review is admissible even after the death of the convicted person"
  • Data contamination: The unintended overlap between evaluation data and a model’s training corpus, risking biased assessments. "data contamination"
  • Extinction of punishability: A legal concept where criminal liability is extinguished (e.g., by the death of the convicted person). "extinction of punishability"
  • FGV (Fundação Getúlio Vargas): The Brazilian institution that organizes the OAB exam. "Fundação Getúlio Vargas (FGV)"
  • Frontier models: The most advanced, high-capability LLMs at the cutting edge of performance. "frontier models like OpenAI's o1 achieve a strong correlation with human scores"
  • Inter-annotator agreement: A measure of consistency among human evaluators assessing the same items. "inter-annotator agreement"
  • Jurisprudence: Case law and judicial precedents used to support legal arguments. "applying relevant legislation and jurisprudence to the presented case"
  • LegalBench: A large, interdisciplinary legal NLP benchmark reflecting real-world legal tasks. "LegalBench"
  • Legal theses: Specific legal arguments or doctrinal positions evaluated as items within the rubric. "items representing legal theses"
  • LLM judge: An LLM used as an automated evaluator to grade open-ended responses. "LLM judge"
  • Machine-readable format: A structured representation enabling automated parsing and analysis. "machine-readable format"
  • Mean Absolute Error (MAE): An error metric equal to the average absolute difference between predicted and true values. "Mean Absolute Error (MAE)"
  • Multi-turn judge prompt: An evaluation prompt that considers prior turns to assess interdependent sub-answers. "multi-turn judge prompt"
  • oab-bench: A benchmark derived from OAB exams for evaluating LLMs on legal writing tasks. "We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam."
  • Open-ended legal QA: Legal question answering tasks requiring free-form, reasoned responses. "open-ended legal QA"
  • Pact of San José, Costa Rica: A regional human rights treaty (American Convention on Human Rights). "the Pact of San José, Costa Rica"
  • peça prático-profissional: The practical-professional legal essay component of the OAB exam. "peça prático-profissional"
  • Position bias: Systematic bias in evaluations due to the ordering or placement of options/responses. "position bias"
  • Principle of non-transferability of punishment: Legal principle stating criminal penalties do not transfer to others (e.g., heirs). "principle of non-transferability of punishment"
  • Reasoning model: An LLM designed to perform extended internal reasoning before producing an answer. "a reasoning model released by OpenAI"
  • Reference-guided grading modes: Evaluation modes where the judge uses official references or guidelines to assess answers. "reference-guided grading modes"
  • Retrieval-Augmented Generation (RAG): A method that augments generation with retrieved documents to improve factuality and alignment. "Retrieval-Augmented Generation (RAG)"
  • Score distribution table: The official rubric specifying how points are allocated across parts of an answer. "score distribution table"
  • Standardized exams: Tests with fixed formats and scoring procedures used to compare performance across cohorts and models. "standardized exams"
  • UltraFeedback: A post-training dataset providing preference signals to improve LLM alignment via reinforcement learning. "UltraFeedback"
