
CLERC: Case Law Evaluation Retrieval Corpus

Updated 10 February 2026
  • The CLERC dataset is a comprehensive legal resource comprising U.S. federal opinions and structured multi-view annotations that support both information retrieval and retrieval-augmented generation tasks.
  • It is constructed from the Caselaw Access Project, featuring 1.84M documents, 23.7M passages, and extensive citation data to enable precise legal analysis.
  • CLERC advances legal AI by benchmarking citation retrieval and generation performance while highlighting challenges like OCR noise and citation masking errors.

The CLERC (Case Law Evaluation Retrieval Corpus) dataset is a large-scale resource to support research in legal information retrieval (IR) and retrieval-augmented text generation (RAG) with a specific focus on legal case opinions and the practice of precedent citation in downstream analytical writing. Developed by leveraging the Caselaw Access Project (CAP), CLERC is designed to benchmark and accelerate advances in legal AI, particularly in systems supporting legal professionals in finding, citing, and analyzing case law with high factual and citation accuracy (Hou et al., 2024).

1. Source Corpus and Dataset Construction

CLERC is constructed atop the CAP dataset, comprising 1.84 million U.S. federal court opinions dated up to September 21, 2021, with a mean of 11.54 citations per document and over 20.7 million unique case-to-case references. The dataset is structured across four principal "views":

Dataset view       Instances      Average length
CLERC/doc          1.84 million   ~4,500 words (concatenated opinions)
CLERC/passage      23.7 million   350 words (sliding window, 175-word overlap)
CLERC/queries      105,000        300 words (windowed around a masked citation)
CLERC/generation   6,000          ~200 characters (analytical paragraphs)
Retrieval queries are further annotated for the presence of direct quotations of the cited text and for whether only the central citation is masked or all citations in the context are masked. For IR tasks, 105,000 query passages were extracted with ≈ 327,000 (query, positive, negative) triples designated for training fine-tuned retrievers, and 2,851 indirect-single queries reserved for evaluation. For RAG, 6,000 analytical paragraphs containing a minimum of two citations each were sampled from the final third of opinions, with 1,000 forming a fixed test set. Citation density increases markedly toward the end of opinions, peaking at 7.9 citations per 100 words in the last decile.
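The 350-word sliding window with 175 words of overlap described above can be sketched in a few lines. This is an illustrative re-implementation (the function name and whitespace tokenization are assumptions), not CLERC's actual preprocessing code:

```python
def sliding_window_passages(text, window=350, stride=175):
    """Split a document into overlapping fixed-length word windows.

    Mirrors the CLERC/passage scheme (350-word windows, 175-word
    stride) as described in the paper; illustrative only.
    """
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)] if words else []
    passages = []
    # Each window starts `stride` words after the previous one,
    # so consecutive passages share `window - stride` words.
    for start in range(0, len(words) - stride, stride):
        passages.append(" ".join(words[start:start + window]))
    return passages
```

A 700-word opinion yields three passages under this scheme (starting at words 0, 175, and 350), with each adjacent pair sharing 175 words.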

2. Task Definitions and Evaluation Metrics

CLERC defines two major tasks: Information Retrieval (IR) and Retrieval-Augmented Generation (RAG).

Information Retrieval:

Given a collection of documents or passages D = {d_1, d_2, …, d_N}, a query q (the concatenated left/right context around a masked citation, q = [ℓ; r]), and ground-truth cited passage(s) R, the aim is to produce a ranked list R̂ such that R is retrieved efficiently. Evaluation metrics include:

  • Recall@k = |R̂@k ∩ R| / |R|
  • nDCG@k, using standard graded-relevance discounting.
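These two metrics can be sketched directly from the definitions above; the nDCG variant shown assumes binary relevance with the standard log2 discount (production evaluations typically use a toolkit such as pytrec_eval):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Recall@k = |top-k of ranked list ∩ R| / |R|."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k with the standard 1/log2(rank+1) discount."""
    rel = set(relevant)
    # DCG over the top-k ranked documents (rank i is discounted by log2(i+2)).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in rel)
    # Ideal DCG: all relevant documents ranked first.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0
```

For example, a ranking that places both gold passages in the top two positions scores nDCG@2 = 1.0, while one that retrieves a single gold passage at rank 2 out of two scores Recall@2 = 0.5.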

Retrieval-Augmented Generation:

Given context paragraphs P_{<t} = (p_1, …, p_{t-1}) and retrieved passages R_t = {r_1, …, r_k} (text from relevant cases), the model must generate a target paragraph ŷ_t that emulates the style of, and cites cases as in, the gold paragraph p_t with gold citations C_r. Generation is evaluated with:

  1. ROUGE-1, ROUGE-2, and ROUGE-L F1.
  2. BARTScore for overall text quality.
  3. Citation Recall (CR), Citation Precision (CP), and Citation False-Positive rate (CFP):
    • CR = |{c ∈ ŷ_t} ∩ C_r| / |C_r|
    • CP = |{c ∈ ŷ_t} ∩ C_r| / |{c ∈ ŷ_t}|
    • CFP = 1 − CP
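Treating the citations extracted from the generated and gold paragraphs as sets, the three citation metrics follow directly from these formulas. A minimal sketch (the function name and the convention of returning 0 for empty sets are assumptions):

```python
def citation_scores(predicted, gold):
    """CR, CP, and CFP over citation sets.

    `predicted` holds the citations extracted from the generated
    paragraph ŷ_t; `gold` holds C_r from the reference paragraph.
    """
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    cr = hits / len(gold) if gold else 0.0
    cp = hits / len(predicted) if predicted else 0.0
    # CFP = 1 - CP: the fraction of generated citations not in C_r.
    return {"CR": cr, "CP": cp, "CFP": 1.0 - cp}
```

For instance, a generation citing three cases of which one appears among two gold citations scores CR = 0.5, CP = 1/3, and CFP = 2/3.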

3. Benchmark Model Performance

Zero-shot IR methods demonstrate low recall in the CLERC setting. BM25 achieves Recall@1000 ≈ 48.3%. Dense bi-encoders (BGE, E5-v2, Contriever-MSMarco) attain only 41–43% Recall@1000. Late-interaction models (ColBERTv2, Jina-ColBERT) perform worst (<17% Recall@1000), likely because they struggle to model CLERC's long queries.
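The BM25 baseline scores documents with term-frequency saturation (k1) and length normalization (b). The following is a minimal pure-Python re-implementation for illustration; real CLERC-scale runs use an inverted-index engine such as Pyserini/Lucene rather than scoring every passage:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenized docs against a query with BM25.

    Minimal sketch with default Okapi parameters (k1=1.5, b=0.75);
    no stemming, stopwording, or indexing.
    """
    toks = [d.split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n  # average document length
    df = Counter()                          # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.split():
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            # Saturated term frequency, normalized by document length.
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```

Documents that never mention a query term score zero, and repeated mentions increase the score with diminishing returns, which is the behavior the k1 saturation term provides.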

Fine-tuned retrievers benefit substantially from CLERC's supervision:

  • BERT-DPR trained on CLERC/passage: 63.1% Recall@1000.
  • LegalBERT-DPR (statutory pretraining, CLERC fine-tuning): 68.5% Recall@1000, nDCG@10 = 14.7.

RAG evaluations with instruct-tuned LLMs (Mistral-7B, Gemma-7B, Llama-3-8B, GPT-4o) supplied with the full text of retrieved citations exhibit the following:

Model        ROUGE-1 (w/ refs)   Citation Recall   Citation Precision   CFP (w/ refs)
GPT-4o       26.82               89.9%             52.8%                6.4%
Llama-3-8B   25.16               62.6%             33.4%                4.6%
Mistral-7B   23.78               42.7%             32.7%                5.3%
Gemma-7B     18.33               37.2%             36.6%                4.3%

Providing full cited passages improves CR (by an average of 68%) and CP (by roughly 16×), and substantially reduces hallucination (CFP down by 87%). However, none of the open-source models surpass 63% recall on gold citations, and hallucinations remain problematic. GPT-4o achieves the highest overall scores but still exhibits a false-positive citation rate exceeding 6%, indicating persistent challenges with spurious or hallucinated citations (Hou et al., 2024).

4. Data Quality, Splits, and Limitations

CLERC offers a multi-granular resource with 1.84 million long case documents, 23.7 million retrieval passages, 105,000 annotated queries, and 6,000 real-world analytical paragraph targets. Corpus-specific challenges include OCR noise inherited from the original CAP data and imperfect query-citation masking, with an error rate of approximately 12% in masking the central citation. These artifacts can degrade both retrieval and citation accuracy, suggesting the need for improved preprocessing or curated alignments.

The dataset currently spans only federal opinions; expanding coverage to state and statutory legal texts is identified as a future direction for broader applicability. The domain adaptation bottleneck is apparent: off-the-shelf models markedly underperform relative to those fine-tuned on CLERC, emphasizing the need for domain-specific training. Citation metrics (CR, CP, CFP) do not capture analytic or argumentative quality in full. More robust factuality metrics and human-in-the-loop validation are recommended to attain higher analytical soundness.

5. Applications and Significance

CLERC models core subtasks faced by legal professionals: accurate precedent retrieval and synthesizing multi-source text into analytical argumentation. Practical downstream applications include:

  • Precedent retrieval, analogous to Westlaw or Lexis systems.
  • Citation-suggestion engines that operate in real time during document drafting.
  • Automated generation of first-draft analytical paragraphs, leveraging context and retrieved precedent text for further expert editing.

By establishing robust evaluation pipelines and fine-grained, citation-centered benchmarks, CLERC quantifies the persistent gap between current model capabilities and legal drafting expectations. The dataset's focus on both retrieval and citation-faithful generation foregrounds the importance of minimizing hallucinations—a failure modality with critical implications for legal responsibility and compliance.

6. Future Research Directions

Benchmarking on CLERC highlights substantial headroom for legal AI, especially for long-context retrieval, advanced domain adaptation, and factually accurate, citation-grounded generation. Research is needed to address hallucination control, more nuanced citation evaluation, and the integration of factuality-checking frameworks such as FactScore. Enhancements in data curation (e.g., improved OCR, more precise citation masking) and broader corpus scope are expected to support next-generation systems with higher analytical fidelity and real-world reliability in legal workflows (Hou et al., 2024).

References

Hou et al. (2024). CLERC: Case Law Evaluation Retrieval Corpus.
