LEGAL-BERT: Legal NLP Transformers
- LEGAL-BERT is a family of BERT-based transformers pre-trained on extensive legal corpora to capture specialized legal language and terminology.
- The models use techniques such as further pre-training and vocabulary injection to address tasks like legal reasoning, statutory interpretation, and entity recognition.
- Empirical evaluations show LEGAL-BERT variants outperform standard BERT models, achieving higher macro-F1 scores and accuracy on legal NLP benchmarks.
LEGAL-BERT is a family of BERT-based transformer models specifically designed for NLP tasks in the legal domain. These models are trained or further pre-trained on large corpora of legal texts to capture the sublanguage, terminology, and stylistic conventions distinctive to statutes, case law, contracts, and judicial opinions. LEGAL-BERT provides a foundation for a variety of downstream applications, including legal document classification, named entity recognition (NER), and the automated analysis of judicial reasoning. Multiple research efforts have expanded upon and evaluated LEGAL-BERT models, establishing their empirical superiority over standard pre-trained architectures on legal NLP benchmarks (Chalkidis et al., 2020; Thalken et al., 2023; Khan, 2021).
1. Corpus Construction and Domain Adaptation
LEGAL-BERT models rely on legal-domain corpora that cover diverse jurisdictions and document types. The canonical LEGAL-BERT corpus comprises approximately 12 GB of English legal text (~355,000 documents) drawn from:
- EU legislation (EUR-Lex): 61,826 documents, 1.9 GB
- UK legislation (legislation.gov.uk): 19,867 documents, 1.4 GB
- ECJ cases: 19,867 documents, 0.6 GB
- ECHR cases: 12,554 documents, 0.5 GB
- US court cases: 164,141 documents, 3.2 GB
- US contracts (SEC-EDGAR): 76,366 documents, 3.9 GB
Legal corpora are often minimally pre-processed. For specialized tasks such as statutory-interpretation classification, filtering by keyword co-occurrence and up-sampling of paragraphs containing domain-typical terminology (e.g., “plain meaning,” “legislative history”) is used to target conceptually salient content (Thalken et al., 2023, Chalkidis et al., 2020).
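The keyword-based filtering and up-sampling described above can be sketched as follows. This is a minimal illustration only; `DOMAIN_TERMS` and the up-sampling factor are hypothetical placeholders, not the actual settings used by Thalken et al.:

```python
import random

# Hypothetical terminology list; the actual keyword sets are task-specific.
DOMAIN_TERMS = ["plain meaning", "legislative history"]

def select_paragraphs(paragraphs, terms=DOMAIN_TERMS, upsample=3, seed=0):
    """Keep every paragraph, but repeat (up-sample) those containing
    domain-typical terminology so they are over-represented."""
    rng = random.Random(seed)
    selected = []
    for p in paragraphs:
        text = p.lower()
        hits = sum(t in text for t in terms)
        # Paragraphs with at least one domain term are duplicated.
        copies = upsample if hits > 0 else 1
        selected.extend([p] * copies)
    rng.shuffle(selected)
    return selected
```

In practice such a filter targets conceptually salient paragraphs before annotation or continued pre-training.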
2. Pre-training Strategies and Model Variants
Three principal domain adaptation strategies for BERT in LEGAL-BERT research are:
- Out-of-the-box BERT: Direct fine-tuning of the original BERT model on legal tasks without domain adaptation.
- Further pre-training (FP): Continued pre-training of BERT-base on legal data, preserving the original architecture and vocabulary, but updating contextual knowledge for the legal sublanguage.
- Pre-training from scratch (SC): Initialization of a new BERT-base model trained exclusively on legal texts, often with a dedicated vocabulary learned via SentencePiece. For LEGAL-BERT-SC, the vocabulary size is 30,000 subwords, and the base architecture comprises 12 layers, 768 hidden size, 12 attention heads, totaling ~110 million parameters (Chalkidis et al., 2020).
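As a sanity check on the ~110 million figure, the dominant weight matrices of the BERT-base configuration can be tallied. This is a rough estimate; biases, LayerNorm parameters, and the pooler (roughly 1% of the total) are omitted:

```python
def bert_base_params(vocab_size=30000, hidden=768, layers=12,
                     ffn=3072, max_positions=512):
    """Approximate parameter count for a BERT-base configuration."""
    # Token + position + segment (2 types) embedding tables.
    embed = (vocab_size + max_positions + 2) * hidden
    # Per layer: Q, K, V, and output projections, each hidden x hidden.
    attn = 4 * hidden * hidden
    # Per layer: two feed-forward matrices, hidden x ffn and ffn x hidden.
    ffn_w = 2 * hidden * ffn
    return embed + layers * (attn + ffn_w)
```

With a 30,000-subword vocabulary this comes out near 108 million, consistent with the ~110 million parameters cited above once biases and LayerNorm are included.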
The standard pre-training objectives, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), are retained:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p\!\left(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\right), \qquad \mathcal{L}_{\mathrm{NSP}} = -\log p\!\left(y \mid s_1, s_2\right)$$

where $\mathcal{M}$ denotes the set of masked token positions, and $(s_1, s_2)$ is a candidate sentence pair with binary label $y$ indicating whether the sentences are consecutive.
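The MLM term reduces to a cross-entropy computed only at masked positions; a minimal NumPy sketch, assuming logits produced by any encoder:

```python
import numpy as np

def mlm_loss(logits, targets, mask_positions):
    """Cross-entropy over masked positions only, as in the MLM objective.

    logits: (seq_len, vocab_size) unnormalized scores
    targets: (seq_len,) true token ids
    mask_positions: indices of the tokens that were masked out
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the true tokens at masked slots.
    nll = -log_probs[mask_positions, targets[mask_positions]]
    return nll.mean()
```

With uniform logits over a vocabulary of size V, the loss is log V, the expected chance-level baseline.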
3. Fine-tuning, Hyperparameters, and Vocabulary Integration
LEGAL-BERT fine-tuning departs from the Devlin et al. (2019) recipe by expanding the hyperparameter grid:
- Learning rate: a wider grid than the canonical 2e-5 to 5e-5 range, including lower values
- Batch size: 4, 8, 16, 32
- Dropout: 0.1 or 0.2 (classification head sometimes up to 0.3)
- Early stopping on validation loss, no fixed epoch cap
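The early-stopping criterion from this recipe can be sketched as a simple patience loop; `train_one_epoch` is a hypothetical callback standing in for one fine-tuning epoch that returns the validation loss:

```python
def train_with_early_stopping(train_one_epoch, patience=3, max_epochs=1000):
    """Run epochs until validation loss stops improving; there is no
    fixed epoch cap in practice (max_epochs is a safety bound for the
    sketch). Returns the best epoch and its validation loss."""
    best, best_epoch, since_best = float("inf"), -1, 0
    for epoch in range(max_epochs):
        val_loss = train_one_epoch(epoch)
        if val_loss < best:
            best, best_epoch, since_best = val_loss, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best
```

In a grid search, this loop runs once per hyperparameter configuration and the configuration with the lowest best validation loss is kept.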
Task-specific output heads are attached directly to the [CLS] token for classification tasks or token-level representations for NER (linear+CRF) (Chalkidis et al., 2020, Khan, 2021).
Additional adaptations include domain vocabulary injection: expanding the WordPiece vocabulary with frequently observed legal terms (e.g., adding 555 new terms) and randomly initializing associated embeddings. This modification improves representation of multi-word legal phrases and, when coupled with domain pre-training, produces synergistic gains (Khan, 2021).
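The vocabulary-injection step can be illustrated on a toy vocabulary. This is a sketch of the strategy, not Khan's actual implementation; with Hugging Face transformers the analogous calls are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`:

```python
import numpy as np

def inject_vocab(vocab, embeddings, new_terms, seed=0):
    """Append new legal terms to a WordPiece-style vocab and grow the
    embedding matrix with randomly initialized rows."""
    rng = np.random.default_rng(seed)
    vocab = dict(vocab)  # do not mutate the caller's vocab
    added = [t for t in new_terms if t not in vocab]
    for term in added:
        vocab[term] = len(vocab)
    # New rows drawn from N(0, 0.02), matching BERT's initializer scale.
    new_rows = rng.normal(0.0, 0.02, size=(len(added), embeddings.shape[1]))
    return vocab, np.vstack([embeddings, new_rows])
```

The injected embeddings start random and acquire legal meaning during subsequent pre-training or fine-tuning, which is why vocabulary injection pairs well with domain pre-training.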
4. Empirical Performance on Legal NLP Tasks
LEGAL-BERT is consistently evaluated against general-purpose BERT, further domain-pretrained variants, and alternative architectures (e.g., XLNet, T5):
- Legal Reasoning Classification: On the three-way FORMAL/GRAND/NONE task for paragraph-level statutory-interpretation reasoning (U.S. Supreme Court), LEGAL-BERT-SC achieves macro-F1 = 0.70 (precision = 0.70, recall = 0.71), outperforming BERT-base (macro-F1 = 0.68), DistilBERT (0.67), T5-base (0.64), and all prompt-only LMs (GPT-4 with codebook prompt, macro-F1 = 0.22; few-shot GPT-4, 0.45; FLAN-T5 ≈0.19; Llama-2-chat ≈0.20) (Thalken et al., 2023).
- Legal Opinions Classification: On majority-dissent classification, Legal-Vocab-BERT attains the best test accuracy (95.28%, +1.13 percentage points over BERT-Base). Domain-specific pretraining (Legal-BERT) yields a +0.93 point gain, while vocabulary augmentation alone yields +0.20 points. Combined, these techniques are mildly synergistic. XLNet achieves only +0.06 over BERT-Base and is 50% slower (Khan, 2021).
- Legal NER and Multi-label Classification: LEGAL-BERT-SC matches or exceeds all other variants, with observed micro-F1 improvements across EURLEX57K multi-label classification (+0.2%), ECHR binary (+0.8%) and multi-label (+2.5%) classification, and Contracts-NER (+1.1–1.8% across entity types) (Chalkidis et al., 2020).
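The macro- and micro-F1 figures quoted throughout can be reproduced from per-class counts; a self-contained sketch for the single-label case:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels):
    """Macro-F1 (unweighted mean of per-class F1) and micro-F1
    (computed from pooled true/false positive counts)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class_f1 = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class_f1) / len(labels)
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = tp_all / (tp_all + fp_all)
    micro_r = tp_all / (tp_all + fn_all)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    return macro, micro
```

Note that macro-F1 weights rare classes (e.g., GRAND) equally with frequent ones, which is why it is preferred for the imbalanced three-way reasoning task; for single-label classification, micro-F1 coincides with accuracy.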
5. Specialized Applications: Judicial Reasoning and Legal Periodization
LEGAL-BERT-SC, fine-tuned on expert-annotated Supreme Court data, enables quantitative jurisprudence analysis:
- A dataset of 15,860 opinions (1870–2014) is processed to derive annual formality scores—the fraction of interpretive paragraphs labeled FORMAL or GRAND.
- Historical transitions captured by LEGAL-BERT-SC mirror canonical legal-historical periodizations:
- 1870–1910: Prevalence of formalism
- 1910–1937: Formalism declines
- 1937–1980: Grand style dominance
- Post-1980: Resurgence of formalism
These results align with existing qualitative accounts but afford more granular, data-driven assessments of doctrinal evolution. The paragraph-centric approach imposes constraints, as broader legal context beyond isolated text fragments is not modeled (Thalken et al., 2023).
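The annual formality score described above reduces to a per-year fraction; a minimal sketch, assuming labeled paragraphs arrive as (year, label) pairs:

```python
from collections import defaultdict

def annual_formality(paragraphs):
    """Fraction of interpretive paragraphs labeled FORMAL or GRAND per
    year; `paragraphs` is an iterable of (year, label) pairs with
    labels in {"FORMAL", "GRAND", "NONE"}."""
    counts = defaultdict(lambda: [0, 0])  # year -> [formal_or_grand, total]
    for year, label in paragraphs:
        counts[year][1] += 1
        if label in ("FORMAL", "GRAND"):
            counts[year][0] += 1
    return {year: hit / total for year, (hit, total) in sorted(counts.items())}
```

Plotting this series over 1870 to 2014 yields the periodization trajectory summarized above.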
6. Limitations and Future Directions
Several limitations are identified:
- Inter-annotator agreement is only moderate (Krippendorff's $\alpha$), even among domain experts, underscoring the inherent subjectivity of jurisprudential classification tasks.
- Current taxonomy (three labels) restricts granularity; future efforts could expand interpretive categories or integrate discourse-level features.
- Prompt engineering for generative LMs yields poor performance compared to fine-tuned LEGAL-BERT, indicating the necessity of annotation-intensive methods for highly specialized tasks.
Future directions include semi-supervised learning with larger silver-standard datasets, integration of cross-paragraph or document-level features, and development of richer legal label taxonomies (Thalken et al., 2023).
7. Comparative Table: LEGAL-BERT Variants and Performance
| Model Variant | Domain Adaptation Method | Test Accuracy / Macro-F1 |
|---|---|---|
| BERT-Base-Cased | General-domain, no further adaptation | 94.15% (Legal Opinion Task) / 0.68 (Statutory Interpretation) |
| Legal-BERT | Domain-specific pre-training | 95.08% (+0.93 vs BERT-Base) |
| Vocab-BERT | Legal vocabulary injection | 94.35% (+0.20) |
| Legal-Vocab-BERT | Pre-training + vocab injection | 95.28% (+1.13) |
| LEGAL-BERT-SC | Pre-trained from scratch, +legal vocab (for some tasks) | macro-F1 = 0.70 (Legal Reasoning) |
| XLNet-Base-Cased | General-domain, autoregressive | 94.21% (+0.06 vs BERT-Base) |
| T5-base | General-domain seq2seq | Macro-F1 = 0.64 |
| GPT-4 (prompt) | Prompted LLM, no fine-tuning | Macro-F1 = 0.22–0.45 |
LEGAL-BERT and its variants set state-of-the-art benchmarks across multiple legal NLP tasks, demonstrating that in-domain pre-training and legal vocabulary adaptation are both necessary and effective for robust, high-fidelity language understanding in the legal field (Chalkidis et al., 2020; Khan, 2021; Thalken et al., 2023).