
LEGAL-BERT-FP: Domain-Adapted Legal NLP

Updated 8 February 2026
  • LEGAL-BERT-FP is a specialized BERT model adapted through continued pre-training on extensive, multi-jurisdictional legal corpora.
  • It employs additional MLM and NSP steps on ~12 GB of legal data, leading to systematic performance gains on legal downstream tasks.
  • The model retains the original BERT-base vocabulary and architecture, ensuring seamless integration into existing NLP pipelines.

LEGAL-BERT-FP is a specialized adaptation of the BERT transformer architecture designed to serve legal NLP applications through domain-specific continued pre-training. Unlike generic BERT models trained on broad corpora, LEGAL-BERT-FP employs further masked language modeling (MLM) and next sentence prediction (NSP) on large legal text corpora, allowing the model to better encode domain-relevant semantics and terminology. The model retains the original BERT-base vocabulary and architecture, focusing all adaptation into additional pre-training steps on a curated multi-jurisdictional legal dataset. The result is a model that consistently shows improved performance on downstream legal tasks compared to base BERT, without the need for vocabulary changes or architectural modifications (Chalkidis et al., 2020).

1. Model Overview and Domain Adaptation Strategy

LEGAL-BERT-FP continues pre-training from the BERT-base-uncased checkpoint. It is exposed to ~12 GB of in-domain data comprising:

  • EU legislation (1.9 GB)
  • UK legislation (1.4 GB)
  • European Court of Justice (ECJ) cases (0.6 GB)
  • European Court of Human Rights (ECHR) cases (0.5 GB)
  • US court cases (3.2 GB)
  • US contracts (3.9 GB)

Pre-training involves up to 500,000 additional steps using the standard BERT objectives: MLM and NSP. LEGAL-BERT-FP retains the 12-layer, 768-hidden-unit, 12-head architecture and 110 million parameters of bert-base-uncased, maintaining compatibility with downstream pipelines and existing infrastructure. The vocabulary remains unaltered from the original model, distinguishing LEGAL-BERT-FP from variants that rebuild tokenization on legal data (Chalkidis et al., 2020).
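For reference, the corpus composition listed above can be tallied to confirm the ~12 GB figure (sizes in GB, taken directly from the bullet list):

```python
# In-domain pre-training corpus sizes in GB (Chalkidis et al., 2020)
corpus_gb = {
    "EU legislation": 1.9,
    "UK legislation": 1.4,
    "ECJ cases": 0.6,
    "ECHR cases": 0.5,
    "US court cases": 3.2,
    "US contracts": 3.9,
}

total_gb = round(sum(corpus_gb.values()), 1)  # 11.5 GB, i.e. ~12 GB in round figures
```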

2. Pre-training and Optimization Objectives

LEGAL-BERT-FP optimizes the canonical BERT losses during domain adaptation:

  • Masked Language Modeling (MLM):

$$L_{\text{MLM}} = -\sum_{i=1}^{n} \log P(x_i \mid x_{\setminus i})$$

where $x_i$ is the masked token and $x_{\setminus i}$ is the input sequence with $x_i$ masked out.
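The MLM loss sums negative log-probabilities over the masked positions only. A minimal, dependency-free sketch, with illustrative probabilities standing in for model outputs:

```python
import math

def mlm_loss(masked_token_probs):
    """Negative log-likelihood over masked positions:
    L_MLM = -sum_i log P(x_i | rest of sequence)."""
    return -sum(math.log(p) for p in masked_token_probs)

# Illustrative probabilities the model assigns to the true masked tokens
probs = [0.9, 0.6, 0.3]
loss = mlm_loss(probs)  # lower when the model is more confident in the true tokens
```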

  • Next Sentence Prediction (NSP):

$$L_{\text{NSP}} = -\sum_{j=1}^{m} \left[ y_j \log P(\text{IsNext} \mid s_j) + (1 - y_j) \log P(\text{NotNext} \mid s_j) \right]$$

During task-specific fine-tuning, LEGAL-BERT-FP switches to the relevant cross-entropy or sequence-labeling objective (e.g., CRF negative log-likelihood for named entity recognition).
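The NSP term is a binary cross-entropy over sentence pairs. A minimal stdlib sketch, again with illustrative probabilities in place of model outputs:

```python
import math

def nsp_loss(labels, p_is_next):
    """Binary cross-entropy for NSP, where y_j = 1 means IsNext and
    P(NotNext | s_j) = 1 - P(IsNext | s_j)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, p_is_next)
    )

labels = [1, 0, 1]            # 1 = IsNext, 0 = NotNext
p_is_next = [0.8, 0.1, 0.7]   # model's predicted P(IsNext | s_j)
loss = nsp_loss(labels, p_is_next)
```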

3. Fine-Tuning Protocol

All LEGAL-BERT-FP fine-tuning experiments use the Hugging Face / Google BERT codebase defaults, with targeted deviations to improve task performance:

  • Learning rate (AdamW): $\{1 \times 10^{-5}, 2 \times 10^{-5}, 3 \times 10^{-5}, 4 \times 10^{-5}, 5 \times 10^{-5}\}$
  • Batch size: $\{4, 8, 16, 32\}$
  • Dropout (task head): $\{0.1, 0.2\}$
  • Weight decay: 0.01 (default AdamW)
  • Warm-up: 10% of total steps (default linear schedule)
  • Maximum sequence length: 512
  • Early stopping based on validation loss (no fixed maximum epochs)
  • Remaining optimizer hyperparameters $(\beta_1, \beta_2, \epsilon)$ as in Devlin et al.

Expanding the tuning grid beyond prior conventions is central: the authors explicitly recommend experimenting across a wide range of batch sizes and learning rates and relying on early stopping rather than heuristic epoch counts (Chalkidis et al., 2020).
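The expanded grid above can be searched exhaustively. A sketch of that loop, where `evaluate` is a hypothetical stand-in for a full fine-tune-and-validate run (here a toy surrogate so the sketch runs end to end):

```python
import itertools

def evaluate(lr, batch_size, dropout):
    """Hypothetical stand-in for fine-tuning with a config and scoring on
    validation data (with early stopping inside each run). This toy surrogate
    simply peaks at lr=3e-5, batch_size=8, dropout=0.1."""
    return -(abs(lr - 3e-5) * 1e5 + abs(batch_size - 8) / 8 + dropout)

LEARNING_RATES = [1e-5, 2e-5, 3e-5, 4e-5, 5e-5]  # AdamW
BATCH_SIZES = [4, 8, 16, 32]
DROPOUTS = [0.1, 0.2]                            # task head

best_score, best_cfg = float("-inf"), None
for lr, bs, dr in itertools.product(LEARNING_RATES, BATCH_SIZES, DROPOUTS):
    score = evaluate(lr, bs, dr)
    if score > best_score:
        best_score, best_cfg = score, (lr, bs, dr)
```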

4. Benchmark Results

LEGAL-BERT-FP’s performance is benchmarked against BERT-base (both default and expanded tuning) and the domain-trained-from-scratch LEGAL-BERT-SC model across several datasets:

| Task | bert-base (def.) | bert-base (tuned) | LEGAL-BERT-FP | LEGAL-BERT-SC |
|---|---|---|---|---|
| eurlex57k micro-F1 | 89.1% | 90.4% | 90.6% | 90.8% |
| ECHR binary accuracy | — | 89.2% | 90.0% | 90.2% |
| ECHR multi-label micro-F1 | — | 82.4% | 84.9% | 85.1% |
| contracts-NER (header) F1 | — | 91.2% | 93.0% | 93.3% |
| contracts-NER (dispute) F1 | — | 85.6% | 87.2% | 87.5% |
| contracts-NER (lease) F1 | — | 80.5% | 81.6% | 81.9% |

LEGAL-BERT-FP yields systematic improvements (+0.2 to +2.5 points) in micro-F1 and accuracy across tasks. LEGAL-BERT-SC, which rebuilds vocabulary and pre-trains from scratch, delivers slightly higher scores, but LEGAL-BERT-FP achieves strong gains with substantially lower resource requirements and faster deployment (Chalkidis et al., 2020).
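The quoted +0.2 to +2.5 range can be recomputed from the benchmark scores, reading the gains as LEGAL-BERT-FP over tuned bert-base (that pairing is an assumption consistent with the quoted range):

```python
# (bert-base tuned, LEGAL-BERT-FP) scores in % from the benchmark table
results = {
    "eurlex57k micro-F1":         (90.4, 90.6),
    "ECHR binary accuracy":       (89.2, 90.0),
    "ECHR multi-label micro-F1":  (82.4, 84.9),
    "contracts-NER (header) F1":  (91.2, 93.0),
    "contracts-NER (dispute) F1": (85.6, 87.2),
    "contracts-NER (lease) F1":   (80.5, 81.6),
}

gains = {task: round(fp - tuned, 1) for task, (tuned, fp) in results.items()}
# The smallest and largest gains bracket the +0.2 to +2.5 range quoted above
```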

5. Best Practices and Model Selection Criteria

Guidelines for effective use of LEGAL-BERT-FP include:

  1. Expanded hyper-parameter search: Always extend beyond Devlin et al.'s grid (e.g., include lower learning rates/batch sizes, higher dropout, and leverage early stopping).
  2. Variant comparison: Evaluate out-of-the-box BERT, LEGAL-BERT-FP (continued pre-training for ≥100K steps), and from-scratch LEGAL-BERT-SC on held-out perplexity.
  3. Weight decay and learning rate schedules: Employ default values (0.01 decay, 10% warmup).
  4. Monitor in-domain MLM/NSP perplexity: Drops of ≥3 points typically predict significant downstream performance improvements.
  5. Resource-constrained scenarios: Consider LEGAL-BERT-SMALL for limited GPU/memory environments; it matches larger variants within 0.3–0.5 points while training ~4× faster.
  6. Benchmark sharing: Release fine-tuned model checkpoints and evaluation scripts to promote replicability across the legal NLP community (Chalkidis et al., 2020).
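Guideline 4 can be operationalized by tracking held-out MLM perplexity before and after continued pre-training. A sketch, where the 2.2/1.6 mean-loss figures are hypothetical illustrations rather than reported numbers:

```python
import math

def perplexity(mean_mlm_loss_per_token):
    """Perplexity is the exponential of the mean MLM negative log-likelihood."""
    return math.exp(mean_mlm_loss_per_token)

# Hypothetical held-out losses: before vs. after continued legal pre-training
ppl_before = perplexity(2.2)
ppl_after = perplexity(1.6)
drop = ppl_before - ppl_after  # a drop of >= 3 points suggests downstream gains
```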

6. Context and Comparative Perspectives

LEGAL-BERT-FP contrasts with other legal-domain BERT adaptations, such as those that incorporate custom tokenizations or entirely distinct pre-training regimens. While LEGAL-BERT-SC and LEGAL-Vocab-BERT (Khan, 2021) further rebuild the vocabulary and restart training, LEGAL-BERT-FP restricts adaptation to additional MLM/NSP, enabling efficient and reproducible upgrades to existing BERT infrastructures without increasing parameter count or requiring specialized tokenizers. Compared to parameter-efficient variants using prefix or prompt tuning (Li et al., 2022), LEGAL-BERT-FP achieves robust performance by adapting all model weights through traditional unsupervised pre-training, as opposed to fine-tuning a small prefix. This approach yields strong calibration and empirical metrics, suggesting continued domain pre-training remains a high-reward, low-engineering-cost avenue for legal NLP.

7. Significance and Practical Implications

LEGAL-BERT-FP provides a production-ready, scalable, and empirically validated solution for practitioners and researchers seeking state-of-the-art results on legal text classification, named entity recognition, and related tasks. Its architecture and vocabulary consistency with BERT-base facilitates integration into pre-existing NLP pipelines, while continued in-domain pre-training delivers precision gains essential for high-stakes legal applications. LEGAL-BERT-FP’s design and empirical validation have influenced best practices in legal-domain NLP, establishing a template for domain adaptation of pretrained language models via targeted unsupervised pre-training (Chalkidis et al., 2020).
