
LEGAL-BERT: Legal NLP Transformers

Updated 8 February 2026
  • LEGAL-BERT is a family of BERT-based transformers pre-trained on extensive legal corpora to capture specialized legal language and terminology.
  • The models use techniques such as further pre-training and vocabulary injection to address tasks like legal reasoning, statutory interpretation, and entity recognition.
  • Empirical evaluations show LEGAL-BERT variants outperform standard BERT models, achieving higher macro-F1 scores and accuracy on legal NLP benchmarks.

LEGAL-BERT is a family of BERT-based transformer models specifically designed for NLP tasks in the legal domain. These models are trained or further pre-trained on large corpora of legal texts to capture the sublanguage, terminology, and stylistic conventions distinctive to statutes, case law, contracts, and judicial opinions. LEGAL-BERT provides a foundation for a variety of downstream applications, including legal document classification, named entity recognition (NER), and the automated analysis of judicial reasoning. Multiple research efforts have expanded upon and evaluated LEGAL-BERT models, establishing their empirical superiority over standard pre-trained architectures in legal NLP benchmarks (Chalkidis et al., 2020; Thalken et al., 2023; Khan, 2021).

1. Corpus Construction and Domain Adaptation

LEGAL-BERT models rely on legal-domain corpora that cover diverse jurisdictions and document types. The canonical LEGAL-BERT corpus comprises approximately 12 GB of English legal text (~355,000 documents) drawn from:

  • EU legislation (EUR-Lex): 61,826 documents, 1.9 GB
  • UK legislation (legislation.gov.uk): 19,867 documents, 1.4 GB
  • ECJ cases: 19,867 documents, 0.6 GB
  • ECHR cases: 12,554 documents, 0.5 GB
  • US court cases: 164,141 documents, 3.2 GB
  • US contracts (SEC-EDGAR): 76,366 documents, 3.9 GB

Legal corpora are often minimally pre-processed. For specialized tasks such as statutory-interpretation classification, filtering by keyword co-occurrence and up-sampling of paragraphs containing domain-typical terminology (e.g., “plain meaning,” “legislative history”) is used to target conceptually salient content (Thalken et al., 2023, Chalkidis et al., 2020).
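The filtering and up-sampling step described above can be sketched as follows. The keyword list, the up-sampling factor, and the simple any-keyword match (standing in for a full co-occurrence filter) are illustrative assumptions, not values taken from the cited papers.

```python
import random

# Illustrative domain-typical terms; the actual term lists used in the
# statutory-interpretation work are an assumption here.
KEYWORDS = {"plain meaning", "legislative history", "canon of construction"}

def contains_keyword(paragraph: str) -> bool:
    """True if the paragraph mentions any domain-typical term."""
    text = paragraph.lower()
    return any(kw in text for kw in KEYWORDS)

def filter_and_upsample(paragraphs, factor=3, seed=0):
    """Keep every paragraph, but repeat keyword-bearing ones `factor`
    times so conceptually salient content is over-represented."""
    rng = random.Random(seed)
    out = []
    for p in paragraphs:
        out.extend([p] * (factor if contains_keyword(p) else 1))
    rng.shuffle(out)
    return out

corpus = [
    "The court applied the plain meaning of the statute.",
    "Procedural posture of the case is summarized below.",
]
sampled = filter_and_upsample(corpus)
# The keyword-bearing paragraph now appears `factor` times; the other once.
```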

2. Pre-training Strategies and Model Variants

Three principal domain adaptation strategies for BERT in LEGAL-BERT research are:

  1. Out-of-the-box BERT: Direct fine-tuning of the original BERT model on legal tasks without domain adaptation.
  2. Further pre-training (FP): Continued pre-training of BERT-base on legal data, preserving the original architecture and vocabulary, but updating contextual knowledge for the legal sublanguage.
  3. Pre-training from scratch (SC): Initialization of a new BERT-base model trained exclusively on legal texts, often with a dedicated vocabulary learned via SentencePiece. For LEGAL-BERT-SC, the vocabulary size is 30,000 subwords, and the base architecture comprises 12 layers, 768 hidden size, 12 attention heads, totaling ~110 million parameters (Chalkidis et al., 2020).

The standard pre-training objectives—Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)—are retained:

$$L_{MLM} = -\sum_{i \in M} \log P(x_i \mid x_{\setminus M}), \qquad L_{NSP} = -\left[ I_{IsNext} \log P(\mathrm{IsNext} \mid A, B) + I_{NotNext} \log P(\mathrm{NotNext} \mid A, B) \right],$$

where $M$ denotes the set of masked token positions and $(A, B)$ is a candidate sentence pair.
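As a concrete check of the MLM term, the loss over a set of masked positions is simply the summed negative log-probability the model assigns to the true tokens at those positions. The probabilities below are invented for illustration.

```python
import math

def mlm_loss(probs_of_true_tokens):
    """L_MLM = -sum_i log P(x_i | x_without_masked) over masked positions i."""
    return -sum(math.log(p) for p in probs_of_true_tokens)

# Suppose the model assigns these probabilities to the true tokens at
# three masked positions (illustrative numbers):
loss = mlm_loss([0.9, 0.5, 0.25])
# -(log 0.9 + log 0.5 + log 0.25) ≈ 2.185
```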

3. Fine-tuning, Hyperparameters, and Vocabulary Integration

LEGAL-BERT fine-tuning departs from the original Devlin et al. (2019) recipe by expanding the hyperparameter grid:

  • Learning rate: $1\times10^{-5}$ to $5\times10^{-5}$
  • Batch size: 4, 8, 16, 32
  • Dropout: 0.1 or 0.2 (classification head sometimes up to 0.3)
  • Early stopping on validation loss, no fixed epoch cap

Task-specific output heads are attached directly to the [CLS] token for classification tasks or token-level representations for NER (linear+CRF) (Chalkidis et al., 2020, Khan, 2021).
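The expanded hyperparameter grid above can be enumerated straightforwardly. The four discrete learning-rate values are an assumed discretization of the $1\times10^{-5}$ to $5\times10^{-5}$ range; the papers do not specify the exact step.

```python
from itertools import product

# Assumed discretization of the 1e-5..5e-5 learning-rate range.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [4, 8, 16, 32]
dropouts = [0.1, 0.2]  # the classification head may additionally use 0.3

grid = [
    {"lr": lr, "batch_size": bs, "dropout": dr}
    for lr, bs, dr in product(learning_rates, batch_sizes, dropouts)
]
# 4 * 4 * 2 = 32 configurations; each would be trained with early
# stopping on validation loss rather than a fixed epoch cap.
```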

Additional adaptations include domain vocabulary injection: expanding the WordPiece vocabulary with frequently observed legal terms (e.g., adding 555 new terms) and randomly initializing associated embeddings. This modification improves representation of multi-word legal phrases and, when coupled with domain pre-training, produces synergistic gains (Khan, 2021).
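A toy illustration of why vocabulary injection helps: before injection, a greedy WordPiece-style tokenizer fragments a legal term into several subwords; after the term is added as a single vocabulary entry, it is kept whole. The miniature vocabulary below is invented for the example; with HuggingFace transformers the analogous step is `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, with the new embedding rows randomly initialized.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece segmentation of one word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:  # no subword matched: the word cannot be segmented
            return ["[UNK]"]
    return pieces

# Miniature vocabulary (invented for illustration).
vocab = {"est", "##op", "##pel", "##el"}
before = wordpiece("estoppel", vocab)         # fragmented into subwords
vocab_injected = vocab | {"estoppel"}         # inject the legal term
after = wordpiece("estoppel", vocab_injected)  # kept as one token
```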

4. Empirical Evaluation

LEGAL-BERT is consistently evaluated against general-purpose BERT, further domain-pretrained variants, and alternative architectures (e.g., XLNet, T5):

  • Legal Reasoning Classification: On the three-way FORMAL/GRAND/NONE task for paragraph-level statutory-interpretation reasoning (U.S. Supreme Court), LEGAL-BERT-SC achieves macro-F1 = 0.70 (precision = 0.70, recall = 0.71), outperforming BERT-base (macro-F1 = 0.68), DistilBERT (0.67), T5-base (0.64), and all prompt-only LMs (GPT-4 with codebook prompt, macro-F1 = 0.22; few-shot GPT-4, 0.45; FLAN-T5 ≈0.19; Llama-2-chat ≈0.20) (Thalken et al., 2023).
  • Legal Opinions Classification: On majority-dissent classification, Legal-Vocab-BERT attains the best test accuracy (95.28%, +1.13 percentage points over BERT-Base). Domain-specific pretraining (Legal-BERT) yields a +0.93 point gain, while vocabulary augmentation alone yields +0.20 points. Combined, these techniques are mildly synergistic. XLNet achieves only +0.06 over BERT-Base and is 50% slower (Khan, 2021).
  • Legal NER and Multi-label Classification: LEGAL-BERT-SC matches or exceeds all other variants, with observed micro-F1 improvements across EURLEX57K multi-label classification (+0.2%), ECHR binary (+0.8%) and multi-label (+2.5%) classification, and Contracts-NER (+1.1–1.8% across entity types) (Chalkidis et al., 2020).
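Macro-F1, the headline metric in these comparisons, averages per-class F1 with equal weight, so a minority class such as GRAND counts as much as the majority NONE class. A minimal implementation, standing in for `sklearn.metrics.f1_score(..., average="macro")`, with invented labels for illustration:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Invented gold and predicted labels for the three-way task:
y_true = ["FORMAL", "GRAND", "NONE", "NONE", "FORMAL"]
y_pred = ["FORMAL", "NONE",  "NONE", "NONE", "GRAND"]
score = macro_f1(y_true, y_pred)
```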

5. Quantitative Analysis of Judicial Reasoning

LEGAL-BERT-SC, fine-tuned on expert-annotated Supreme Court data, enables quantitative jurisprudence analysis:

  • A dataset of 15,860 opinions (1870–2014) is processed to derive annual formality scores—the fraction of interpretive paragraphs labeled FORMAL or GRAND.
  • Historical transitions captured by LEGAL-BERT-SC mirror canonical legal-historical periodizations:
    • 1870–1910: Prevalence of formalism
    • 1910–1937: Formalism declines
    • 1937–1980: Grand style dominance
    • Post-1980: Resurgence of formalism

These results align with existing qualitative accounts but afford more granular, data-driven assessments of doctrinal evolution. The paragraph-centric approach imposes constraints, as broader legal context beyond isolated text fragments is not modeled (Thalken et al., 2023).
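The annual formality score described above is just the fraction of a year's interpretive paragraphs labeled FORMAL or GRAND. A minimal sketch, with a tiny invented sample in place of the 15,860-opinion dataset:

```python
from collections import defaultdict

def formality_by_year(labeled_paragraphs):
    """labeled_paragraphs: iterable of (year, label) pairs, where label is
    one of 'FORMAL', 'GRAND', 'NONE'. Returns {year: fraction of that
    year's paragraphs labeled FORMAL or GRAND}."""
    counts = defaultdict(lambda: [0, 0])  # year -> [formal_or_grand, total]
    for year, label in labeled_paragraphs:
        counts[year][0] += label in ("FORMAL", "GRAND")
        counts[year][1] += 1
    return {y: fg / total for y, (fg, total) in counts.items()}

# Invented sample of classifier outputs:
sample = [(1900, "FORMAL"), (1900, "NONE"), (1900, "FORMAL"),
          (1950, "GRAND"), (1950, "NONE")]
scores = formality_by_year(sample)
```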

6. Limitations and Future Directions

Several limitations are identified:

  • Inter-annotator agreement is moderate (Krippendorff's $\alpha \approx 0.63$) even among domain experts, underscoring the inherent subjectivity of jurisprudential classification tasks.
  • Current taxonomy (three labels) restricts granularity; future efforts could expand interpretive categories or integrate discourse-level features.
  • Prompt engineering for generative LMs yields poor performance compared to fine-tuned LEGAL-BERT, indicating the necessity of annotation-intensive methods for highly specialized tasks.

Future directions include semi-supervised learning with larger silver-standard datasets, integration of cross-paragraph or document-level features, and development of richer legal label taxonomies (Thalken et al., 2023).

| Model Variant | Domain Adaptation Method | Test Accuracy / Macro-F1 |
|---|---|---|
| BERT-Base-Cased | General-domain, no further adaptation | 94.15% (Legal Opinion task) / 0.68 macro-F1 (statutory interpretation) |
| Legal-BERT | Domain-specific pre-training | 95.08% (+0.93 vs. BERT-Base) |
| Vocab-BERT | Legal vocabulary injection | 94.35% (+0.20) |
| Legal-Vocab-BERT | Pre-training + vocabulary injection | 95.28% (+1.13) |
| LEGAL-BERT-SC | Pre-trained from scratch (+legal vocab for some tasks) | 0.70 macro-F1 (legal reasoning) |
| XLNet-Base-Cased | General-domain, autoregressive | 94.21% (+0.06 vs. BERT-Base) |
| T5-base | General-domain seq2seq | 0.64 macro-F1 |
| GPT-4 (prompt) | Prompted LLM, no fine-tuning | 0.22–0.45 macro-F1 |

LEGAL-BERT and its variants establish state-of-the-art benchmarks for multiple legal NLP tasks, demonstrating the necessity and efficacy of in-domain pre-training and legal vocabulary adaptation in the pursuit of robust, high-fidelity language understanding in the legal field (Chalkidis et al., 2020, Khan, 2021, Thalken et al., 2023).
