Skim-Aware Contrastive Learning Framework
- The paper presents a self-supervised, skimming-inspired approach, the Chunk Prediction Encoder (CPE), that enhances long-document embeddings by selectively masking chunks and applying a contrastive, NLI-based loss.
- It leverages hierarchical transformer and sparse-attention (Longformer) encoders to maintain cross-chunk dependencies at reduced computational cost compared to traditional full-document encoding.
- Empirical results demonstrate significant improvements on legal and biomedical benchmarks, including 2–3 percentage point gains over state-of-the-art end-to-end fine-tuned models.
The Skim-Aware Contrastive Learning Framework, referred to in the original work as the "Chunk Prediction Encoder" (CPE), is a self-supervised, skimming-inspired approach to producing efficient, semantically rich embeddings of long documents. Drawing inspiration from human reading strategies, the framework selectively masks a chunk of a document, then employs a contrastive, natural language inference (NLI)-based loss to guide learning. This yields document embeddings that retain cross-chunk dependencies at lower computational cost than prior contrastive approaches. CPE has demonstrated substantial gains on tasks involving lengthy legal and biomedical texts (Abro et al., 30 Dec 2025).
1. Model Architecture and Document Encoding
The framework is implemented in two alternative architectures: a hierarchical transformer encoder and a variant utilizing sparse-attention (Longformer).
Hierarchical Transformer Encoder
A long document $d$ is divided into contiguous "chunks" $c_1, \dots, c_K$, each containing up to $L$ tokens and preceded by a [CLS] token. A shared transformer encoder $f_\theta$ (e.g., BERT, RoBERTa, LegalBERT, ClinicalBioBERT) is applied to each chunk, producing a vector representation:

$$h_i = f_\theta(c_i) \in \mathbb{R}^d, \qquad i = 1, \dots, K$$

The document representation $h_d$ is derived by aggregating chunk vectors through mean pooling or element-wise max pooling:

$$h_d = \frac{1}{K} \sum_{i=1}^{K} h_i \qquad \text{or} \qquad h_d = \max_{1 \le i \le K} h_i$$
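As a concrete sketch (not the authors' code), the chunk-then-pool encoding can be expressed as follows; `encode_chunk` is a toy stand-in for a real transformer encoder's [CLS] output, while the 128-token chunk size follows the paper:

```python
CHUNK_SIZE = 128  # tokens per chunk, following the paper's configuration

def chunk_document(tokens, chunk_size=CHUNK_SIZE):
    """Split a token sequence into contiguous chunks of up to chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def encode_chunk(chunk):
    """Stand-in for the shared transformer encoder's [CLS] vector.
    Here: a toy 2-dimensional embedding built from simple token statistics."""
    n = max(len(chunk), 1)
    return [sum(chunk) / n, float(max(chunk))]

def mean_pool(vectors):
    """Aggregate chunk vectors by element-wise mean."""
    k = len(vectors)
    return [sum(v[j] for v in vectors) / k for j in range(len(vectors[0]))]

def max_pool(vectors):
    """Aggregate chunk vectors by element-wise max."""
    return [max(v[j] for v in vectors) for j in range(len(vectors[0]))]

# Example: a "document" of 300 token ids -> 3 chunks -> one document vector.
doc = list(range(300))
chunk_vecs = [encode_chunk(c) for c in chunk_document(doc)]
doc_vec_mean = mean_pool(chunk_vecs)
doc_vec_max = max_pool(chunk_vecs)
```

In practice each chunk vector would come from the shared encoder's [CLS] position; the pooling step is exactly as above.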
Sparse-Attention (Longformer) Encoder
Alternatively, the entire text (up to 4,096 tokens) is encoded by a single Longformer. For contrastive learning, two segments are assembled: the "reference" text $d \setminus c$ (the document with one chunk $c$ removed) and the sampled chunk $c$ itself. Each passes through the Longformer, yielding [CLS] vectors $h_{d \setminus c}$ and $h_c$ in $\mathbb{R}^d$.
2. Skimming-Inspired Sampling and Masking
To model human-like skimming, the framework performs random chunk removal. For each document $d$, a chunk $c$ is randomly selected and removed to form the "masked" document $d \setminus c$. The pair $(d \setminus c, c)$ serves as a positive (entailment-like) example, since $c$ originates from $d$. Negative chunks are drawn randomly from other documents within the same mini-batch, explicitly excluding chunks from $d$. For a mini-batch of $N$ documents, this yields $N$ positive and $N(N-1)$ negative pairs.
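The sampling scheme can be sketched in a few lines (hypothetical helper names, not the authors' implementation; here the in-batch negatives are the removed chunks of the other documents):

```python
import random

def mask_random_chunk(chunks, rng):
    """Remove one randomly selected chunk; return (masked_document, removed_chunk)."""
    j = rng.randrange(len(chunks))
    return chunks[:j] + chunks[j + 1:], chunks[j]

def build_pairs(batch, rng):
    """For a mini-batch of N documents (each a list of chunks), build
    N positive pairs and N*(N-1) in-batch negative pairs."""
    masked_removed = [mask_random_chunk(doc, rng) for doc in batch]
    positives, negatives = [], []
    for i, (masked_i, chunk_i) in enumerate(masked_removed):
        positives.append((masked_i, chunk_i))
        for j, (_, chunk_j) in enumerate(masked_removed):
            if j != i:  # chunks from the anchor's own document are excluded
                negatives.append((masked_i, chunk_j))
    return positives, negatives

rng = random.Random(0)
batch = [[f"d{i}_c{k}" for k in range(4)] for i in range(3)]  # 3 docs, 4 chunks each
pos, neg = build_pairs(batch, rng)  # 3 positives, 3*2 = 6 negatives
```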
3. NLI-Based Contrastive Learning Objective
CPE treats chunk prediction as a proxy NLI task, distinguishing between entailment (positive) and contradiction (negative) relations, operationalized using the InfoNCE loss with cosine similarity $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\| \, \|v\|}$. The loss over a batch of $N$ documents is given by:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(h_{d_i \setminus c_i}, h_{c_i}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_{d_i \setminus c_i}, h_{c_j}) / \tau\right)}$$

Here, $\tau$ is the temperature (set to 1). This loss aligns positive pairs (the masked remainder and the removed chunk from the same document) and repels negative pairs (the masked remainder and chunks from other documents in the batch).
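The objective can be sketched in plain Python (a toy re-implementation for illustration, assuming each anchor's negatives are the removed chunks of the other documents; not the authors' code):

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v) = u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(masked_vecs, chunk_vecs, tau=1.0):
    """InfoNCE over a batch: masked_vecs[i] embeds document i with one chunk
    removed, chunk_vecs[i] embeds that removed chunk. The positive for anchor i
    sits at index i; all other chunks in the batch act as negatives."""
    n = len(masked_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(masked_vecs[i], chunk_vecs[j]) / tau for j in range(n)]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        total += -(logits[i] - log_denom)  # negative log-softmax at the positive
    return total / n

# Toy batch: each masked document is most similar to its own removed chunk,
# so the loss falls below the chance level of log(N) = log(2).
masked = [[1.0, 0.0], [0.0, 1.0]]
chunks = [[0.9, 0.1], [0.1, 0.9]]
loss = info_nce(masked, chunks)
```

Minimizing this loss pulls each masked remainder toward its own removed chunk and pushes it away from chunks of the other documents in the batch.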
4. Training Configuration and Computational Efficiency
The CPE framework employs the AdamW optimizer. Pre-training is performed for 3 epochs (batch size 4), and downstream multilayer perceptron (MLP) classification training for 20 epochs (batch size 16). Documents are truncated or padded to 4,096 tokens (2,048 for EURLEX) and partitioned into 32 or 16 chunks of 128 tokens (64 for short documents). Only the masked remainder and one chunk are encoded per contrastive step, reducing memory and compute relative to SimCSE, which requires dual full-document encoding.
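A back-of-the-envelope token count illustrates the efficiency argument (a simplification assuming chunk-wise encoding; exact FLOPs and memory depend on the attention pattern):

```python
def tokens_encoded_per_step(num_chunks, chunk_len):
    """Tokens pushed through the encoder per contrastive step.
    CPE encodes the masked remainder (K-1 chunks) plus the one removed chunk;
    SimCSE-style training encodes two full-document views."""
    cpe = (num_chunks - 1) * chunk_len + chunk_len
    simcse = 2 * num_chunks * chunk_len
    return cpe, simcse

# 4,096-token documents split into 32 chunks of 128 tokens, as in the paper.
cpe_tokens, simcse_tokens = tokens_encoded_per_step(32, 128)
```

Under these assumptions SimCSE encodes twice as many tokens per step, consistent with the observed gap in pre-training wall-clock time.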
Empirically, CPE pre-training on ECHR requires approximately 3 hours on a Quadro RTX 8000, compared to roughly 4.5 hours for SimCSE or ESimCSE. Downstream MLP training is under 10 minutes per epoch (Abro et al., 30 Dec 2025).
5. Evaluation Setup and Empirical Results
Datasets
CPE has been evaluated on several long document classification datasets:
| Dataset | Type | Domain | Mean Document Length |
|---|---|---|---|
| ECHR | Multi-label | Human rights law | ~2,050 |
| SCOTUS | Single-label | US Supreme Court opinions | ~8,000 |
| EURLEX | Multi-label | European legal texts | ~1,400 |
| MIMIC-III | Multi-label | Clinical notes (ICD-9 codes) | ~3,200 |
| BioASQ | Multi-label | Biomedical abstracts | ~300 |
Baselines
Comparisons were drawn against: frozen pre-trained model embeddings + MLP, SimCSE/ESimCSE at document level + MLP, and end-to-end fine-tuned hierarchical models (Hi-LegalBERT, LSG, LegalLongformer, HAT).
Performance Metrics
Reported metrics include macro-averaged F1 and micro-averaged F1, with macro-F1 highlighted in key results.
Macro-F1 Results (representative)
| Dataset (encoder) | Frozen Emb + MLP | SimCSE | ESimCSE | CPE |
|---|---|---|---|---|
| ECHR-BERT | 36.3 | 48.7 | 50.8 | 54.8 |
| SCOTUS-LegalBERT | 44.9 | 57.6 | 56.1 | 59.5 |
| EURLEX-RoBERTa | 21.9 | 35.1 | 33.8 | 41.9 |
| MIMIC-ClinicalBioBERT | 55.8 | 62.7 | 60.4 | 63.9 |
The Longformer-based CPE also outperformed its SimCSE/ESimCSE analogues (e.g., ECHR macro-F1: 35.9 → 48.9).
End-to-end fine-tuning (LegalBERT with CPE chunking and a 2-layer aggregator) yielded further improvements:
- ECHR: 66.1 macro/72.6 micro (2–3 pp gain over Hi-LegalBERT & LegalLongformer)
- SCOTUS: 67.3 macro/77.5 micro (1–2 pp gain)
Chunk size ablations indicated that 128-token chunks optimized the tradeoff between local and contextual information; larger chunk sizes (256/512) degraded performance. On shorter documents (BioASQ), CPE still improved macro-F1 by ~4 points compared to vanilla ClinicalBioBERT.
Visualizations (t-SNE) confirmed that CPE-based embeddings were more tightly clustered and semantically coherent than those from SimCSE (Abro et al., 30 Dec 2025).
6. Theoretical and Practical Implications
By emulating human skimming—i.e., omitting a section and requiring the model to relate it to the remaining context—CPE compels the encoder to focus on document-level, cross-chunk dependencies. This produces internal representations sensitive to global document structure. The technique avoids the high computational cost of encoding all possible chunk pairs or entire documents multiple times, as in SimCSE/ESimCSE, while preserving or exceeding their representational quality. A plausible implication is that skimming-based contrastive learning could generalize to other tasks involving structured, lengthy sequential data.
7. Limitations and Observed Tradeoffs
Empirical results show that optimal chunk size is critical: too large and the model misses fine detail, too small and global context is lost. The benefit of CPE is most pronounced on long, structured texts. On short documents, improvements remain positive but are reduced in magnitude. All claims and measured statistics are bounded to the legal and biomedical benchmarks tested; generalizability to other domains remains an open direction. Resource efficiency is noted in both compute time and memory, especially during contrastive pre-training (Abro et al., 30 Dec 2025).