Skim-Aware Contrastive Learning Framework
- The paper presents a self-supervised, skimming-inspired approach, the Chunk Prediction Encoder (CPE), that enhances long-document embeddings by selectively masking chunks and applying a contrastive, NLI-based loss.
- It leverages hierarchical transformer and sparse-attention (Longformer) encoders to maintain cross-chunk dependencies at reduced computational cost compared to traditional full-document encoding.
- Empirical results demonstrate significant improvements on legal and biomedical benchmarks, including 2–3 percentage point gains over state-of-the-art end-to-end fine-tuned models.
The Skim-Aware Contrastive Learning Framework, referred to in the original work as the "Chunk Prediction Encoder" (CPE), is a self-supervised, skimming-inspired approach to producing efficient, semantically rich embeddings of long documents. Drawing inspiration from human reading strategies, the framework selectively masks a chunk of a document, then employs a contrastive, natural language inference (NLI)-based loss to guide learning. This yields document embeddings that retain cross-chunk dependencies at lower computational cost than prior contrastive approaches. CPE has demonstrated substantial gains on tasks involving lengthy legal and biomedical texts (Abro et al., 30 Dec 2025).
1. Model Architecture and Document Encoding
The framework is implemented in two alternative architectures: a hierarchical transformer encoder and a variant utilizing sparse-attention (Longformer).
Hierarchical Transformer Encoder
A long document $d$ is divided into contiguous "chunks" $c_1, \dots, c_K$, each containing up to $L$ tokens and preceded by a [CLS] token. A shared transformer encoder $f_\theta$ (e.g., BERT, RoBERTa, LegalBERT, ClinicalBioBERT) is applied to each chunk, producing a vector representation:

$$h_i = f_\theta(c_i) \in \mathbb{R}^d, \qquad i = 1, \dots, K$$

The document representation $h_d$ is derived by aggregating chunk vectors through mean pooling or element-wise max pooling:

$$h_d = \frac{1}{K} \sum_{i=1}^{K} h_i \qquad \text{or} \qquad h_d = \max_{1 \le i \le K} h_i$$
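As a concrete sketch (not the authors' code), the chunk-then-pool encoding can be expressed as follows; `encode_chunk` is a toy stand-in for a real transformer encoder's [CLS] output, while the 128-token chunk size follows the paper:

```python
CHUNK_SIZE = 128  # tokens per chunk, following the paper's configuration

def chunk_document(tokens, chunk_size=CHUNK_SIZE):
    """Split a token sequence into contiguous chunks of up to chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def encode_chunk(chunk):
    """Stand-in for the shared transformer encoder's [CLS] vector.
    Here: a toy 2-dimensional embedding built from simple token statistics."""
    n = max(len(chunk), 1)
    return [sum(chunk) / n, float(max(chunk))]

def mean_pool(vectors):
    """Aggregate chunk vectors by element-wise mean."""
    k = len(vectors)
    return [sum(v[j] for v in vectors) / k for j in range(len(vectors[0]))]

def max_pool(vectors):
    """Aggregate chunk vectors by element-wise max."""
    return [max(v[j] for v in vectors) for j in range(len(vectors[0]))]

# Example: a "document" of 300 token ids -> 3 chunks -> one document vector.
doc = list(range(300))
chunk_vecs = [encode_chunk(c) for c in chunk_document(doc)]
doc_vec_mean = mean_pool(chunk_vecs)
doc_vec_max = max_pool(chunk_vecs)
```

In practice each chunk vector would come from the shared encoder's [CLS] position; the pooling step is exactly as above.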
Sparse-Attention (Longformer) Encoder
Alternatively, the entire text (up to 4,096 tokens) is encoded by a single Longformer. For contrastive learning, two segments are assembled: the "reference" text $d \setminus c$ (the document with one chunk $c$ removed) and the sampled chunk $c$ itself. Each passes through the Longformer, yielding [CLS] vectors $h_{d \setminus c}$ and $h_c$ in $\mathbb{R}^d$.
2. Skimming-Inspired Sampling and Masking
To model human-like skimming, the framework performs random chunk removal. For each document $d$, a chunk $c$ is randomly selected and removed to form the "masked" document $d \setminus c$. The pair $(d \setminus c, c)$ serves as a positive (entailment-like) example, since $c$ originates from $d$. Negative chunks are drawn randomly from other documents within the same mini-batch, explicitly excluding chunks from $d$. For a mini-batch of $N$ documents, this yields $N$ positive and $N(N-1)$ negative pairs.
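The sampling scheme can be sketched in a few lines (hypothetical helper names, not the authors' implementation; here the in-batch negatives are the removed chunks of the other documents):

```python
import random

def mask_random_chunk(chunks, rng):
    """Remove one randomly selected chunk; return (masked_document, removed_chunk)."""
    j = rng.randrange(len(chunks))
    return chunks[:j] + chunks[j + 1:], chunks[j]

def build_pairs(batch, rng):
    """For a mini-batch of N documents (each a list of chunks), build
    N positive pairs and N*(N-1) in-batch negative pairs."""
    masked_removed = [mask_random_chunk(doc, rng) for doc in batch]
    positives, negatives = [], []
    for i, (masked_i, chunk_i) in enumerate(masked_removed):
        positives.append((masked_i, chunk_i))
        for j, (_, chunk_j) in enumerate(masked_removed):
            if j != i:  # chunks from the anchor's own document are excluded
                negatives.append((masked_i, chunk_j))
    return positives, negatives

rng = random.Random(0)
batch = [[f"d{i}_c{k}" for k in range(4)] for i in range(3)]  # 3 docs, 4 chunks each
pos, neg = build_pairs(batch, rng)  # 3 positives, 3*2 = 6 negatives
```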
3. NLI-Based Contrastive Learning Objective
CPE treats chunk prediction as a proxy NLI task, distinguishing between entailment (positive) and contradiction (negative) relations, operationalized using the InfoNCE loss with cosine similarity $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\| \, \|v\|}$. The loss over a batch of $N$ documents is given by:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(h_{d_i \setminus c_i}, h_{c_i}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_{d_i \setminus c_i}, h_{c_j}) / \tau\right)}$$

Here, $\tau$ is the temperature (set to 1). This loss aligns positive pairs (the masked remainder and the removed chunk from the same document) and repels negative pairs (the masked remainder and chunks from other documents in the batch).
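The objective can be sketched in plain Python (a toy re-implementation for illustration, assuming each anchor's negatives are the removed chunks of the other documents; not the authors' code):

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v) = u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(masked_vecs, chunk_vecs, tau=1.0):
    """InfoNCE over a batch: masked_vecs[i] embeds document i with one chunk
    removed, chunk_vecs[i] embeds that removed chunk. The positive for anchor i
    sits at index i; all other chunks in the batch act as negatives."""
    n = len(masked_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(masked_vecs[i], chunk_vecs[j]) / tau for j in range(n)]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        total += -(logits[i] - log_denom)  # negative log-softmax at the positive
    return total / n

# Toy batch: each masked document is most similar to its own removed chunk,
# so the loss falls below the chance level of log(N) = log(2).
masked = [[1.0, 0.0], [0.0, 1.0]]
chunks = [[0.9, 0.1], [0.1, 0.9]]
loss = info_nce(masked, chunks)
```

Minimizing this loss pulls each masked remainder toward its own removed chunk and pushes it away from chunks of the other documents in the batch.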
4. Training Configuration and Computational Efficiency
The CPE framework employs the AdamW optimizer. Pre-training is performed for 3 epochs (batch size 4), and downstream multilayer perceptron (MLP) classification training for 20 epochs (batch size 16). Documents are truncated or padded to 4,096 tokens (2,048 for EURLEX) and partitioned into 32 or 16 chunks of 128 tokens (64 for short documents). Only the masked remainder and one chunk are encoded per contrastive step, reducing memory and compute relative to SimCSE, which requires dual full-document encoding.
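A back-of-the-envelope token count illustrates the efficiency argument (a simplification assuming chunk-wise encoding; exact FLOPs and memory depend on the attention pattern):

```python
def tokens_encoded_per_step(num_chunks, chunk_len):
    """Tokens pushed through the encoder per contrastive step.
    CPE encodes the masked remainder (K-1 chunks) plus the one removed chunk;
    SimCSE-style training encodes two full-document views."""
    cpe = (num_chunks - 1) * chunk_len + chunk_len
    simcse = 2 * num_chunks * chunk_len
    return cpe, simcse

# 4,096-token documents split into 32 chunks of 128 tokens, as in the paper.
cpe_tokens, simcse_tokens = tokens_encoded_per_step(32, 128)
```

Under these assumptions SimCSE encodes twice as many tokens per step, consistent with the observed gap in pre-training wall-clock time.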
Empirically, CPE pre-training on ECHR requires approximately 3 hours on a Quadro RTX 8000, compared to roughly 4.5 hours for SimCSE or ESimCSE. Downstream MLP training is under 10 minutes per epoch (Abro et al., 30 Dec 2025).
5. Evaluation Setup and Empirical Results
Datasets
CPE has been evaluated on several long document classification datasets:
| Dataset | Type | Domain | Mean Document Length |
|---|---|---|---|
| ECHR | Multi-label | Human rights law | ~2,050 |
| SCOTUS | Single-label | US Supreme Court opinions | ~8,000 |
| EURLEX | Multi-label | European legal texts | ~1,400 |
| MIMIC-III | Multi-label | Clinical notes (ICD-9 codes) | ~3,200 |
| BioASQ | Multi-label | Biomedical abstracts | ~300 |
Baselines
Comparisons were drawn against: frozen pre-trained model embeddings + MLP, SimCSE/ESimCSE at document level + MLP, and end-to-end fine-tuned hierarchical models (Hi-LegalBERT, LSG, LegalLongformer, HAT).
Performance Metrics
Reported metrics include macro-averaged F1 and micro-averaged F1, with macro-F1 highlighted in key results.
Macro-F1 Results (representative)
| Dataset (encoder) | Frozen Emb + MLP | SimCSE | ESimCSE | CPE |
|---|---|---|---|---|
| ECHR-BERT | 36.3 | 48.7 | 50.8 | 54.8 |
| SCOTUS-LegalBERT | 44.9 | 57.6 | 56.1 | 59.5 |
| EURLEX-RoBERTa | 21.9 | 35.1 | 33.8 | 41.9 |
| MIMIC-ClinicalBioBERT | 55.8 | 62.7 | 60.4 | 63.9 |
The Longformer-based CPE also outperformed its SimCSE/ESimCSE analogues (e.g., ECHR macro-F1: 35.9 → 48.9).
End-to-end fine-tuning (LegalBERT with CPE chunking and a 2-layer aggregator) yielded further improvements:
- ECHR: 66.1 macro/72.6 micro (2–3 pp gain over Hi-LegalBERT & LegalLongformer)
- SCOTUS: 67.3 macro/77.5 micro (1–2 pp gain)
Chunk size ablations indicated that 128-token chunks optimized the tradeoff between local and contextual information; larger chunk sizes (256/512) degraded performance. On shorter documents (BioASQ), CPE still improved macro-F1 by ~4 points compared to vanilla ClinicalBioBERT.
Visualizations (t-SNE) confirmed that CPE-based embeddings were more tightly clustered and semantically coherent than those from SimCSE (Abro et al., 30 Dec 2025).
6. Theoretical and Practical Implications
By emulating human skimming—i.e., omitting a section and requiring the model to relate it to the remaining context—CPE compels the encoder to focus on document-level, cross-chunk dependencies. This produces internal representations sensitive to global document structure. The technique avoids the high computational cost of encoding all possible chunk pairs or entire documents multiple times, as in SimCSE/ESimCSE, while preserving or exceeding their representational quality. A plausible implication is that skimming-based contrastive learning could generalize to other tasks involving structured, lengthy sequential data.
7. Limitations and Observed Tradeoffs
Empirical results show that optimal chunk size is critical: too large and the model misses fine detail, too small and global context is lost. The benefit of CPE is most pronounced on long, structured texts. On short documents, improvements remain positive but are reduced in magnitude. All claims and measured statistics are bounded to the legal and biomedical benchmarks tested; generalizability to other domains remains an open direction. Resource efficiency is noted in both compute time and memory, especially during contrastive pre-training (Abro et al., 30 Dec 2025).