LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Published 20 Aug 2023 in cs.CL, cs.AI, and cs.CY (arXiv:2308.11462v1)

Abstract: The advent of LLMs and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

Summary

  • The paper introduces LegalBench, a novel benchmark designed to assess diverse legal reasoning abilities in large language models.
  • It details a methodology of handcrafted tasks by legal experts covering six types of reasoning, ensuring practical relevance.
  • Empirical evaluations of 20 LLMs reveal performance gaps in handling nuanced legal contexts, guiding future improvements in AI legal applications.

Introduction

The paper "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in LLMs" introduces a unique benchmark specifically designed to assess the capacity of LLMs in executing various forms of legal reasoning. The benchmark, LegalBench, spans a wide array of tasks inspired by practical and theoretical challenges present in legal contexts. This effort represents a concerted interdisciplinary collaboration involving legal professionals who tailormade these tasks to highlight relevant legal reasoning capabilities.

Benchmark Design and Structure

LegalBench comprises 162 tasks spanning six distinct forms of legal reasoning. The tasks were handcrafted by subject matter experts to ensure practical applicability and relevance to the legal field. Each task is also mapped to established legal frameworks for describing legal reasoning, giving legal practitioners and LLM developers a shared vocabulary and thereby supporting cross-disciplinary dialogue and collaboration.
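
To make the task structure concrete, the following is a minimal sketch of how one LegalBench task could be loaded for inspection. It assumes the benchmark is distributed through the Hugging Face datasets hub under the identifier nguha/legalbench, with one configuration per task (the task name abercrombie, a trademark-distinctiveness classification task, is used here as an example); treat these identifiers as assumptions rather than confirmed details.

```python
# Minimal sketch: inspecting a single LegalBench task.
# Assumes the benchmark is hosted on the Hugging Face hub as
# "nguha/legalbench" with one configuration per task (e.g., "abercrombie").
from datasets import load_dataset

task = load_dataset("nguha/legalbench", "abercrombie")

# Each example pairs an input text with an expert-assigned label.
print(task["test"][0])
```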

Empirical Evaluation and Results

The paper presents a comprehensive empirical evaluation of 20 LLMs, both open-source and commercial, using the LegalBench tasks. Performance varies substantially across tasks, underscoring the difficulty of legal reasoning, which often hinges on complex, context-specific nuances. The findings highlight both the potential of LLMs to automate or assist legal work and the gaps that remain to be addressed.
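
A per-task accuracy loop consistent with this kind of evaluation might look like the sketch below. The query_model callable is a hypothetical stand-in for whichever LLM API is under test, and the "input"/"answer" field names are assumptions about the task schema, not confirmed details of the paper's harness.

```python
# Hypothetical sketch of a per-task accuracy loop; `query_model` stands in
# for any LLM API call, and the "input"/"answer" field names are assumed.
def evaluate_task(examples, query_model) -> float:
    correct = 0
    for ex in examples:
        prediction = query_model(ex["input"]).strip().lower()
        if prediction == ex["answer"].strip().lower():
            correct += 1
    return correct / len(examples)

# Usage: compute a score per task, then compare models across tasks.
# scores = {name: evaluate_task(t["test"], my_model) for name, t in tasks.items()}
```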

Implications

The introduction of LegalBench holds substantial implications for both the AI and legal communities. For AI researchers, it provides a structured methodology for evaluating and improving the legal reasoning capabilities of LLMs. For legal practitioners, it offers a platform for contributing directly to the evolution of AI tools tailored to legal needs. The evaluation's insights also serve as guiding points for developing more advanced LLMs capable of handling the intricacies of legal reasoning.

Future Directions

The ongoing development of LegalBench reflects the evolving interface between AI and law. As LLMs continue to improve, integrating deeper contextual and ethical reasoning into these models remains a pivotal area of research. The paper suggests that future iterations of LegalBench could incorporate more intricate reasoning challenges and evaluate models across more diverse legal contexts and jurisdictions, further bridging the gap between AI capabilities and real-world legal applications.

Conclusion

LegalBench represents a significant step toward systematically evaluating LLMs' legal reasoning abilities. Its collaborative, interdisciplinary approach to benchmark design offers a pathway for advancing AI's role in legal domains, with the potential to drive innovations that meet the rigorous demands of legal reasoning and decision-making. The contribution both deepens our understanding of LLMs' competencies and fosters closer integration of AI solutions into legal practice.

Knowledge Gaps

Below is a concrete list of gaps the paper leaves unresolved, paired with actionable directions for future work:

  • Benchmark coverage of jurisdictions and practice areas is unclear; add tasks spanning diverse areas (e.g., criminal, administrative, immigration, IP/patents, family law) and state/federal variations to assess breadth.
  • Absence of multilingual evaluation; incorporate non-English legal corpora (leveraging resources like MultiLegalPile and LEXTREME) to test cross-language legal reasoning and translation fidelity of legal concepts.
  • Limited assessment of real-world workflows; create tasks that require document search, retrieval of statutes/cases/regulations, cite-checking, and updating based on current law rather than closed-book reasoning.
  • Unclear handling of legal ambiguity and indeterminacy; encode multiple acceptable answers, measure argumentation quality (e.g., IRAC structure, doctrinal support), and evaluate reasoning persuasiveness under disagreement.
  • Ground-truth validity and reliability not documented; report labeling protocols, sources of authoritative answers, inter-annotator agreement among legal experts, and resolution of contested items.
  • Missing evaluation of citation competence; add tasks that require correct Bluebook-formatted citations, accurate pin cites, and verifiable authorities to detect fabricated or misapplied citations.
  • Lack of long-context and multi-document reasoning tests; design tasks requiring cross-referencing lengthy cases, statutes, and regulatory provisions, including intra-document cross-citations and legislative history.
  • No temporal robustness assessment; introduce time-aware tasks that require applying law as of specific dates, test performance under statutory/regulatory updates, and define a maintenance plan for benchmark freshness.
  • Data contamination risks are not quantified; audit and document potential pretraining exposure to benchmark items, and include “fresh” out-of-pretraining tasks to measure true generalization.
  • Prompt sensitivity and robustness are underexplored; systematically vary prompts (e.g., IRAC templates, chain-of-thought, role instructions), measure stability of results, and publish standardized prompt sets and ablation studies.
  • Faithfulness of model rationales not measured; evaluate whether generated explanations accurately cite and apply legal rules, and penalize post hoc or fabricated reasoning steps.
  • Safety, ethics, and harm not benchmarked; add tasks to detect unsafe, biased, or confidentially risky advice, measure calibration and appropriate abstention, and assess compliance with professional responsibility standards.
  • Human-in-the-loop effectiveness is untested; run user studies with lawyers to measure time saved, error correction rates, oversight burdens, and workflow integration across drafting, review, and research tasks.
  • Tool-augmented capabilities are not compared; create parallel tracks for retrieval-augmented models, legal search integrations, calculators (e.g., damages, tax), and citation-checkers to quantify gains from tools.
  • Effect of legal-domain fine-tuning vs general foundation models needs analysis; conduct controlled experiments on domain-adaptive pretraining/fine-tuning and report scaling trends and transfer across legal subdomains.
  • Evaluation metrics are narrow; add calibration (e.g., Brier score), selective prediction/abstention, robustness to adversarial inputs, and cost-aware metrics reflecting the severity of legal errors (a minimal sketch of the first two appears after this list).
  • Reproducibility constraints (model/API drift) are not addressed; publish deterministic evaluation protocols (seeds, temperatures, decoding settings), fallback prompts, and model versioning to ensure comparability over time.
  • Dataset licensing, privacy, and compliance are not detailed; provide datasheets specifying licensing, personal data handling, redactions, and permissible uses for academic and commercial contexts.
  • Multimodal legal materials are missing; include PDFs/scanned documents, tables, exhibits, signatures, and forms to assess models’ ability to handle non-textual and structured content common in practice.
  • Precedent-based analogical reasoning is underrepresented; design tasks requiring analogizing/distinguishing cases, identifying controlling versus persuasive authority, and mapping fact patterns to holdings.
  • Statutory interpretation doctrines are not stress-tested; add controlled tasks on textualism, purposivism, canons (e.g., ejusdem generis), and conflicting interpretive approaches to evaluate doctrinal consistency.
  • Cross-jurisdiction conflict-of-law scenarios are absent; introduce tasks on Erie doctrine, preemption, choice-of-law, and forum-selection to assess complex inter-jurisdictional reasoning.
  • Performance on edge cases and exceptions is unknown; curate rare or exception-heavy scenarios (e.g., narrow statutory carve-outs, equitable defenses) to test brittleness and coverage gaps.
  • External validity to real outcomes is unclear; correlate benchmark scores with practical proxies (e.g., bar exam sections, moot court judging, brief quality ratings, contract review error rates) to validate utility.
  • Maintenance and governance of the collaborative benchmark are unspecified; define processes for task addition, expert review, versioning, deprecation, and community contributions to keep the benchmark current and trustworthy.
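
As referenced in the metrics bullet above, the following is a minimal sketch of two of the suggested metrics: a Brier score for calibration and a simple selective-prediction accuracy. It assumes the evaluated model reports a confidence in [0, 1] for each prediction; the function names and interface are illustrative, not from the paper.

```python
# Minimal sketch of calibration and abstention metrics, assuming each
# prediction carries a model-reported confidence in [0, 1] and `outcomes`
# marks correctness as 1 (right) or 0 (wrong).

def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def selective_accuracy(confidences, outcomes, threshold: float = 0.8):
    """Accuracy on the subset where the model is confident enough to answer."""
    answered = [(c, o) for c, o in zip(confidences, outcomes) if c >= threshold]
    if not answered:
        return None, 0.0  # the model abstained on every example
    accuracy = sum(o for _, o in answered) / len(answered)
    coverage = len(answered) / len(outcomes)
    return accuracy, coverage

# Lower Brier scores indicate better calibration; selective accuracy is
# reported together with coverage (the fraction of examples answered).
print(brier_score([0.9, 0.6, 0.8], [1, 0, 1]))         # ~0.137
print(selective_accuracy([0.9, 0.6, 0.8], [1, 0, 1]))  # (1.0, 0.667)
```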

Glossary

  • Access to justice: The ability of individuals, especially those with limited resources, to obtain legal help and resolve disputes through the legal system. "How AI can improve access to justice"
  • Adjudication: The formal process by which a court or administrative body resolves legal disputes and issues decisions. "Artificial Intelligence for Adjudication: The Social Security Administration and AI Governance"
  • Adversarialism: A legal system structure where opposing parties present their cases to a neutral decision-maker, emphasizing contest between sides. "Legal Tech, Civil Procedure, and the Future of Adversarialism"
  • Civil Procedure: The body of rules governing how civil lawsuits are initiated, conducted, and resolved in courts. "Legal Tech, Civil Procedure, and the Future of Adversarialism"
  • Due diligence: A thorough investigation and risk assessment conducted before transactions or legal actions, especially in corporate contexts. "Harnessing Machine Learning for Due Diligence: Realizing the Possibilities"
  • eDiscovery: The process of identifying, collecting, and producing electronically stored information for litigation or investigations. "Epiq Launches Pre-Built NLP Model Strategy For eDiscovery"
  • Ejusdem generis: A canon of statutory interpretation meaning “of the same kind,” where general terms following specific ones are limited to the same class. "ejusdem generis"
  • Force majeure: A contract clause that excuses performance due to extraordinary events beyond the parties’ control (e.g., pandemics, natural disasters). "Pandemics and Force Majeure: How can AI help you?"
  • IRAC: A structured method for legal analysis: Issue, Rule, Application, Conclusion. "Legal Reasoning? It's All About IRAC"
  • Legal analytics: The application of data analysis and AI to legal data to derive insights for practice and decision-making. "Artificial intelligence and legal analytics: new tools for law practice in the digital age"
  • Legal judgment prediction: The task of forecasting court decisions or legal outcomes from case data using computational methods. "A survey on legal judgment prediction: Datasets, metrics, models and challenges"
  • Legal reasoning: The process of applying legal rules and principles to facts to reach judgments or arguments. "what types of legal reasoning can LLMs perform?"
  • Legal Tech: Technologies designed to support legal practice, research, and access to legal services. "Natural Language Processing in Legal Tech"
  • Litigation: The process of resolving disputes through the court system, including filing and prosecuting lawsuits. "The litigation state"
  • Merger agreement: A legally binding contract detailing the terms and conditions of a corporate merger. "MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding"
  • Precedent: Prior judicial decisions that guide the reasoning and outcomes in subsequent similar cases. "Precedent and Analogy in Legal Reasoning"
  • Private enforcement: Enforcement of laws or regulations by private parties (e.g., through lawsuits), rather than by government agencies. "Private Enforcement in the States"
  • Statutory reasoning: Interpreting and applying statutes to particular cases using legal interpretive methods. "Can GPT-3 perform statutory reasoning?"
  • Successor liability: A corporate law doctrine where a successor company may be held liable for obligations of its predecessor. "From policy confusion to doctrinal clarity: successor liability from the perspective of big data"
  • Terms of service: Standard-form contracts governing the use of online platforms and services. "CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service"
  • Textualism: A legal interpretive philosophy that focuses on the ordinary meaning of statutory text over legislative intent or purpose. "textualism"
  • USPTO: United States Patent and Trademark Office, the federal agency responsible for patents and trademarks. "The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications"
