Caselaw Access Project (CAP)
- The Caselaw Access Project is a digital repository that aggregates over six million U.S. court opinions with comprehensive metadata for legal research and AI applications.
- It employs high-volume OCR digitization and a RESTful API to ensure accurate, computationally accessible legal documents for NLP tasks.
- CAP underpins datasets like CiteCaseLAW, enabling advanced citation detection and improved performance of machine learning models in legal NLP.
The Caselaw Access Project (CAP) is a large-scale digitization and aggregation initiative that provides comprehensive access to United States court opinions, accompanied by extensive metadata and robust licensing for research applications. It serves as foundational infrastructure for data-driven legal research, NLP, and downstream AI applications focused on case law analysis.
1. Scope and Content of the Caselaw Access Project
CAP aggregates the full text of over six million U.S. court opinions originating from federal, state, and territorial courts, representing 61 jurisdictions in total. The integrated corpus draws primarily on the Harvard Law School Collection and the Fastcase Collection, both subject to high-volume OCR digitization. Each opinion is enriched with metadata comprising the court of origin, date, reporter volume, page information, OCR confidence scores, and, when available, a partial list of cited cases.
Documents are structured for computational accessibility, rendering them suitable for downstream annotation, corpus construction, and machine learning experiments in legal text mining. CAP exposes data through a RESTful API (https://case.law/), with full-bulk downloads available to academic users under a community-use license, subject to non-commercial use and attribution requirements. Commercial entities must negotiate separate licensing terms (Khatri et al., 2023).
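The API access pattern described above can be sketched as follows. This is a minimal illustration, not official client code: the endpoint and parameter names (`search`, `jurisdiction`, `full_case`, the `Token` authorization header) follow the public v1 API documentation, but should be verified against https://case.law/ before use.

```python
# Hedged sketch of querying the CAP REST API /cases/ endpoint.
# Uses only the standard library; an API key (free for registered
# users) raises the unauthenticated rate limits.
import json
from urllib import parse, request

CASES_ENDPOINT = "https://api.case.law/v1/cases/"

def build_query(search=None, jurisdiction=None, full_case=False, page_size=20):
    """Assemble query parameters for the /cases/ endpoint."""
    params = {"page_size": page_size}
    if search:
        params["search"] = search
    if jurisdiction:
        params["jurisdiction"] = jurisdiction
    if full_case:
        params["full_case"] = "true"   # include the full opinion text
    return params

def fetch_cases(params, api_key=None):
    """Issue the GET request and return the decoded JSON payload."""
    url = CASES_ENDPOINT + "?" + parse.urlencode(params)
    req = request.Request(url)
    if api_key:
        req.add_header("Authorization", f"Token {api_key}")
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (network access required):
#   payload = fetch_cases(build_query(search="habeas corpus",
#                                     jurisdiction="us", full_case=True))
#   for case in payload["results"]:
#       print(case["name_abbreviation"], case["decision_date"])
```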
2. Preprocessing and Structuring for Large-Scale NLP: CiteCaseLAW
Khatri et al. (2023) leverage CAP to automatically compile the CiteCaseLAW dataset for citation-worthiness detection in the legal domain. The workflow proceeds as follows:
- The "Opinion" section of each case serves as the principal text source.
- Preprocessing removes footnotes, page headers/footers, non-ASCII tokens, and OCR-induced hyphenation artifacts. Abbreviation normalization (e.g., "Art." → "Article") addresses domain-specific idiosyncrasies critical for accurate sentence boundary detection.
- Empirical testing shows that off-the-shelf splitters such as spaCy, NLTK, and SegTok yield ≤ 50% correct splits on legal text. The adopted solution combines pySBD with custom "golden rules" and manual correction of problematic abbreviations, validated to deliver 99.2% accurate splits on a 1,000-document sample.
- Citation detection targets two prevalent formats: "versus-style" (e.g., "Smith v. Jones") and "reporter-style" (volume, reporter abbreviation, page, year), identified via regular expressions and replaced by a placeholder token "[CITATION_SPAN]." Sentences are then labeled into four categories, of which only type 1 ("not-cite," label 0) and type 4 ("cite," label 1) are retained, yielding a binary labeled corpus optimal for supervised learning.
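The citation-masking and labeling step above can be sketched with regular expressions. The patterns here are illustrative assumptions, not the authors' published expressions; real Bluebook citations are far more varied than these two templates cover.

```python
# Hedged sketch of citation detection and [CITATION_SPAN] masking.
import re

# "versus-style": capitalized party names joined by "v." (e.g., Smith v. Jones)
VERSUS_RE = re.compile(
    r"\b[A-Z][\w.'&-]*(?:\s[A-Z][\w.'&-]*)*\s+v\.?\s+[A-Z][\w.'&-]*(?:\s[A-Z][\w.'&-]*)*"
)
# "reporter-style": volume, reporter abbreviation, page, optional year
# (e.g., "410 U.S. 113 (1973)" or "123 F.2d 456")
REPORTER_RE = re.compile(
    r"\b\d{1,4}\s+[A-Z][A-Za-z.]*(?:\d+d)?\.?\s+\d{1,4}(?:\s+\(\d{4}\))?"
)

def mask_citations(sentence):
    """Replace each detected citation with the placeholder token."""
    masked = VERSUS_RE.sub("[CITATION_SPAN]", sentence)
    return REPORTER_RE.sub("[CITATION_SPAN]", masked)

def label_sentence(sentence):
    """Binary label: 1 ('cite') if any citation is present, else 0 ('not-cite')."""
    has_cite = VERSUS_RE.search(sentence) or REPORTER_RE.search(sentence)
    return 1 if has_cite else 0
```

In a production pipeline these patterns would need to handle parallel citations, short-form citations (e.g., "Id."), and OCR noise, which is precisely why the preprocessing stage above matters.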
3. Dataset Characteristics and Statistics
The resulting CiteCaseLAW corpus comprises 178,459,203 sentences drawn from 5,548,618 processed documents, with an average of approximately 32 sentences per document. Of these, 10,487,177 sentences (5.87%) are labeled as citation-worthy (label 1), and the remaining 167,972,026 (94.13%) as non-citation (label 0). Three canonical splits—Small (1M), Medium (10M), and Large (178M)—are distributed, with class balance maintained at approximately 5.87% positive in each.
| Version | Total sentences | Cite-worthy (%) |
|---|---|---|
| Small | 1,000,000 | 58,909 (5.89%) |
| Medium | 10,000,000 | 586,999 (5.87%) |
| Large | 178,459,203 | 10,487,177 (5.87%) |
This scale and class distribution enable robust model training and evaluation, obviating the need for costly manual annotation.
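Preserving the ~5.87% positive rate across the Small and Medium splits amounts to sampling each class separately. A sketch of that per-class subsampling (the function and sampling scheme are illustrative assumptions, not the authors' released code):

```python
# Hedged sketch: stratified subsampling that keeps the full corpus's
# positive rate in a smaller split.
import random

def stratified_sample(sentences, labels, n_total, positive_rate, seed=0):
    """Draw n_total examples with the requested share of label-1 items."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(n_total * positive_rate)   # e.g., 0.0587 * 1_000_000
    n_neg = n_total - n_pos
    idx = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(idx)
    return [(sentences[i], labels[i]) for i in idx]
```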
4. Modeling Frameworks and Performance Metrics
CiteCaseLAW supports benchmark comparisons across a spectrum of machine learning architectures, all trained with the binary cross-entropy loss

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c \in \{0,1\}} y_{i,c}\,\log \hat{y}_{i,c},$$

where $y_{i,c}$ denotes the one-hot label and $\hat{y}_{i,c}$ is the softmax probability assigned to class $c$ for sentence $i$.
Models evaluated include:
- Logistic Regression with TF-IDF features (baseline)
- CRNN combining convolutional and biLSTM layers
- Transformer trained from scratch (6 layers, sparse attention)
- Longformer for long-sequence handling
- BERT-Base fine-tuned
- LEGAL-BERT (a BERT/RoBERTa variant pre-trained on legal corpora)
- LEGAL-BERT augmented with Positive-Unlabeled (PU) learning to address subjective label noise
Hyperparameters (learning rate, decay, warmup steps, epochs) were optimized per model via Bayesian/random search.
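The binary cross-entropy objective named above reduces, in the two-class case, to a few lines of plain Python. This is an illustrative rendering only; the paper's models compute it over softmax outputs inside standard deep learning frameworks.

```python
# Minimal binary cross-entropy, averaged over N examples.
import math

def binary_cross_entropy(y_true, y_prob):
    """y_true holds 0/1 labels; y_prob holds the model's probability
    for the positive ('cite') class of each sentence."""
    assert len(y_true) == len(y_prob)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)
```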
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Logistic Regression | 77.85 | 75.77 | 76.79 |
| CRNN | 76.54 | 74.72 | 74.93 |
| Transformer | 72.42 | 84.25 | 77.89 |
| Longformer | 87.10 | 86.02 | 86.56 |
| BERT | 87.73 | 86.56 | 87.14 |
| LEGAL-BERT | 87.64 | 87.20 | 87.42 |
| LEGAL-BERT + PU | 84.17 | 92.86 | 88.30 |
The LEGAL-BERT + PU model achieves the highest F1 score (88.3%). PU learning substantially increases recall (from 87.2% to 92.9%) at a modest cost in precision, supporting its effectiveness in managing label noise. Error analysis suggests that false negatives persist especially in formulaic prose where the need for a citation is implicit, while false positives frequently occur on policy-like statements that the model deems citation-worthy (Khatri et al., 2023).
5. Impact on Legal NLP and Downstream Applications
The application of CAP to create CiteCaseLAW demonstrates the potential for large-scale, high-fidelity labeled corpora in the legal domain, eliminating dependence on manual annotation. Prior to CiteCaseLAW, no public dataset existed for automatic citation-worthiness detection in caselaw. Pre-training or fine-tuning transformer architectures on in-domain CAP data, as in LEGAL-BERT, produces significant improvements over models trained exclusively on general-domain corpora.
Key recommendations from Khatri et al. include extending modeling to multi-class citation intent (e.g., distinguishing supportive from distinguishing citations), exploiting CAP’s metadata (jurisdiction, year, judge), and developing advanced legal-assistive systems such as citation recommendation frameworks. The granular jurisdictional metadata could enable analyses of temporal and regional citation practices.
A plausible implication is that CAP can serve as the backbone for next-generation legal writing and research tools, from automated citation flagging within drafting software to comprehensive, context-aware case law recommender systems (Khatri et al., 2023).
6. Future Prospects and Research Directions
Prospective research trajectories include:
- Enriching citation analysis by annotating for intent or case function;
- Integrating jurisdictional and temporal signals from CAP metadata into neural architectures;
- Systematic evaluation of pre-training and adaptation using CAP versus general text;
- Exploring regional and historical evolution of U.S. citation practices utilizing the jurisdictional breadth of CAP.
Continued development and opening of the CAP corpus hold promise for scalable, empirically driven advances in legal informatics and the broader domain of legal AI.