Caselaw Access Project (CAP)

Updated 10 February 2026
  • The Caselaw Access Project is a digital repository that aggregates over six million U.S. court opinions with comprehensive metadata for legal research and AI applications.
  • It employs high-volume OCR digitization and a RESTful API to ensure accurate, computationally accessible legal documents for NLP tasks.
  • CAP underpins datasets like CiteCaseLAW, enabling advanced citation detection and improved performance of machine learning models in legal NLP.

The Caselaw Access Project (CAP) is a large-scale digitization and aggregation initiative that provides comprehensive access to United States court opinions, accompanied by extensive metadata and robust licensing for research applications. It serves as foundational infrastructure for data-driven legal research, NLP, and downstream AI applications focused on case law analysis.

1. Scope and Content of the Caselaw Access Project

CAP aggregates the full text of over six million U.S. court opinions originating from federal, state, and territorial courts, representing 61 jurisdictions in total. The integrated corpus draws primarily on the Harvard Law School Collection and the Fastcase Collection, both subject to high-volume OCR digitization. Each opinion is enriched with metadata comprising the court of origin, date, reporter volume, page information, OCR confidence scores, and, when available, a partial list of cited cases.

Documents are structured for computational accessibility, rendering them suitable for downstream annotation, corpus construction, and machine learning experiments in legal text mining. CAP exposes data through a RESTful API (https://case.law/), with full-bulk downloads available to academic users under a community-use license, subject to non-commercial use and attribution requirements. Commercial entities must negotiate separate licensing terms (Khatri et al., 2023).
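Consuming the API amounts to requesting case records and flattening the metadata fields listed above. A minimal sketch of the parsing step, using a hypothetical JSON payload whose field names are illustrative assumptions rather than a guaranteed schema:

```python
import json

# Hypothetical JSON payload shaped like a CAP cases-API response;
# field names here are illustrative assumptions, not a guaranteed schema.
sample_response = json.loads("""
{
  "count": 2,
  "results": [
    {"id": 1, "name_abbreviation": "Smith v. Jones",
     "decision_date": "1985-06-01",
     "court": {"name": "Illinois Supreme Court"},
     "citations": [{"cite": "107 Ill. 2d 95"}]},
    {"id": 2, "name_abbreviation": "Doe v. Roe",
     "decision_date": "1990-01-15",
     "court": {"name": "California Court of Appeal"},
     "citations": [{"cite": "218 Cal. App. 3d 1"}]}
  ]
}
""")

# Flatten the metadata fields the project exposes (court, date, citation).
rows = [
    (case["name_abbreviation"],
     case["decision_date"],
     case["court"]["name"],
     case["citations"][0]["cite"])
    for case in sample_response["results"]
]
for name, date, court, cite in rows:
    print(f"{name} | {date} | {court} | {cite}")
```

In practice the payload would come from an authenticated HTTP request against the API rather than an inline string; the parsing logic is the same.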

2. Preprocessing and Structuring for Large-Scale NLP: CiteCaseLAW

Khatri et al. (2023) leverage CAP to automatically compile the CiteCaseLAW dataset for citation-worthiness detection in the legal domain. The workflow proceeds as follows:

  • The "Opinion" section of each case serves as the principal text source.
  • Preprocessing removes footnotes, page headers/footers, non-ASCII tokens, and OCR-induced hyphenation artifacts. Abbreviation normalization (e.g., "Art." → "Article") addresses domain-specific idiosyncrasies critical for accurate sentence boundary detection.
  • Empirical testing shows that off-the-shelf splitters such as spaCy, NLTK, and SegTok yield ≤ 50% correct splits on legal text. The adopted solution pairs pySBD with custom "golden rules" and manual correction of problematic abbreviations, validated at 99.2% accurate splits on a 1,000-document sample.
  • Citation detection targets two prevalent formats: "versus-style" (e.g., "Smith v. Jones") and "reporter-style" (volume, reporter abbreviation, page, year), identified via regular expressions and replaced by a placeholder token "[CITATION_SPAN]." Sentences are then labeled into four categories, of which only type 1 ("not-cite," label 0) and type 4 ("cite," label 1) are retained, yielding a binary labeled corpus optimal for supervised learning.
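The citation-masking step above can be sketched with regular expressions. The patterns below are simplified, illustrative approximations of the two formats; the exact expressions used to build CiteCaseLAW are more elaborate:

```python
import re

# Simplified approximations of the two citation formats:
# "versus-style" (Smith v. Jones) and "reporter-style" (410 U.S. 113 (1973)).
VERSUS_STYLE = re.compile(
    r"\b[A-Z][\w.'&-]*(?:\s[A-Z][\w.'&-]*)*\s+v\.\s+[A-Z][\w.'&-]*(?:\s[A-Z][\w.'&-]*)*"
)
REPORTER_STYLE = re.compile(
    r"\b\d+\s+[A-Z][A-Za-z0-9.\s]{0,20}?\s?\d+(?:\s*\(\d{4}\))?"
)

def mask_citations(sentence):
    """Replace citation spans with a placeholder and derive the binary label."""
    masked, n_v = VERSUS_STYLE.subn("[CITATION_SPAN]", sentence)
    masked, n_r = REPORTER_STYLE.subn("[CITATION_SPAN]", masked)
    return masked, int((n_v + n_r) > 0)  # 1 = cite, 0 = not-cite

cite, label_cite = mask_citations(
    "The rule was settled in Smith v. Jones, 410 U.S. 113 (1973)."
)
plain, label_plain = mask_citations("The court denied the motion to dismiss.")
print(cite)                      # both spans collapse to [CITATION_SPAN]
print(label_cite, label_plain)   # 1 0
```

Sentences in which no pattern fires keep label 0; any match yields label 1, producing the binary corpus described above.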

3. Dataset Characteristics and Statistics

The resulting CiteCaseLAW corpus comprises 178,459,203 sentences drawn from 5,548,618 processed documents, with an average of approximately 32 sentences per document. Of these, 10,487,177 sentences (5.87%) are labeled as citation-worthy (label 1), and the remaining 167,972,026 (94.13%) as non-citation (label 0). Three canonical splits—Small (1M), Medium (10M), and Large (178M)—are distributed, with class balance maintained at approximately 5.87% positive in each.

Version | Total sentences | Cite-worthy (%)
Small   | 1,000,000       | 58,909 (5.89%)
Medium  | 10,000,000      | 586,999 (5.87%)
Large   | 178,459,203     | 10,487,177 (5.87%)

This scale and class distribution enable robust model training and evaluation, obviating the need for costly manual annotation.
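Drawing the Small and Medium versions while holding the positive rate near 5.87% is a standard stratified-sampling step; a sketch with synthetic labels standing in for the real corpus:

```python
import random

random.seed(0)

# Synthetic stand-in for the corpus: 1 = cite-worthy, 0 = not-cite,
# at roughly the 5.87% positive rate reported for CiteCaseLAW.
corpus = [1] * 587 + [0] * 9413  # 10,000 sentences, 5.87% positive

def stratified_sample(labels, n):
    """Sample n indices while preserving the positive/negative ratio."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(n * len(pos) / len(labels))
    picked = random.sample(pos, n_pos) + random.sample(neg, n - n_pos)
    random.shuffle(picked)
    return picked

small = stratified_sample(corpus, 1000)
rate = sum(corpus[i] for i in small) / len(small)
print(f"positive rate in subsample: {rate:.4f}")
```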

4. Modeling Frameworks and Performance Metrics

CiteCaseLAW supports benchmark comparisons across a spectrum of machine learning architectures, all trained with the binary cross-entropy loss

L = -\sum_{i \in \{0,1\}} y_i \log p_i

where y_i denotes the one-hot label for class i and p_i is the corresponding softmax probability.
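For a single sentence the sum reduces to the negative log-probability assigned to the true class; a minimal worked instance of the formula:

```python
import math

def binary_cross_entropy(y, p):
    """L = -sum_i y_i * log(p_i) for a one-hot label y over two classes."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

# A cite-worthy sentence (label 1) predicted with probability 0.9:
y = [0.0, 1.0]          # one-hot label
p = [0.1, 0.9]          # softmax probabilities
loss = binary_cross_entropy(y, p)
print(f"{loss:.4f}")    # -log(0.9) ≈ 0.1054
```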

Models evaluated include:

  • Logistic Regression with TF-IDF features (baseline)
  • CRNN combining convolutional and biLSTM layers
  • Transformer trained from scratch (6 layers, sparse attention)
  • Longformer for long-sequence handling
  • BERT-Base fine-tuned
  • LEGAL-BERT (a BERT/RoBERTa variant pre-trained on legal corpora)
  • LEGAL-BERT augmented with Positive-Unlabeled (PU) learning to address subjective label noise

Hyperparameters (learning rate, decay, warmup steps, epochs) were optimized per model via Bayesian/random search.

Model               | Precision | Recall | F1
Logistic Regression | 77.85     | 75.77  | 76.79
CRNN                | 76.54     | 74.72  | 74.93
Transformer         | 72.42     | 84.25  | 77.89
Longformer          | 87.10     | 86.02  | 86.56
BERT                | 87.73     | 86.56  | 87.14
LEGAL-BERT          | 87.64     | 87.20  | 87.42
LEGAL-BERT + PU     | 84.17     | 92.86  | 88.30
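The F1 column is the harmonic mean of precision and recall, so the table's best row can be reproduced directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# LEGAL-BERT + PU row from the table above:
print(round(f1(84.17, 92.86), 2))  # 88.3
```

Note how PU learning trades a few points of precision for a large recall gain, which the harmonic mean still rewards with the top overall score.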

The LEGAL-BERT + PU model achieves the highest F1 score (88.3%). PU learning substantially increases recall (from 87.2% to 92.9%), validating its effectiveness in managing label noise. Error analysis suggests that false negatives persist especially in formulaic prose where citation needs are implicit, while false positives frequently occur with policy-like statements deemed citation-worthy (Khatri et al., 2023).

5. Significance and Impact

The application of CAP to create CiteCaseLAW demonstrates the potential of large-scale, high-fidelity labeled corpora in the legal domain, eliminating dependence on manual annotation. Prior to CiteCaseLAW, no public dataset existed for automatic citation-worthiness detection in case law. Pre-training or fine-tuning transformer architectures on in-domain CAP data, as in LEGAL-BERT, produces significant improvements over models trained exclusively on general-domain corpora.

Key recommendations from Khatri et al. include extending modeling to multi-class citation intent (e.g., distinguishing supportive from distinguishing citations), exploiting CAP’s metadata (jurisdiction, year, judge), and developing advanced legal-assistive systems such as citation recommendation frameworks. The granular jurisdictional metadata could enable analyses of temporal and regional citation practices.

A plausible implication is that CAP can serve as the backbone for next-generation legal writing and research tools, from automated citation flagging within drafting software to comprehensive, context-aware case law recommender systems (Khatri et al., 2023).

6. Future Prospects and Research Directions

Prospective research trajectories include:

  • Enriching citation analysis by annotating for intent or case function;
  • Integrating jurisdictional and temporal signals from CAP metadata into neural architectures;
  • Systematic evaluation of pre-training and adaptation using CAP versus general text;
  • Exploring regional and historical evolution of U.S. citation practices utilizing the jurisdictional breadth of CAP.

Continued development and open distribution of the CAP corpus hold promise for scalable, empirically driven advances in legal informatics and the broader domain of legal AI.
