Citation Importance-Aware Contrastive Learning
- The paper’s main contribution is integrating citation importance into contrastive learning by weighting citation-contextual features and normalizing the resulting scores into a probability distribution over each paper's references.
- It introduces an adaptive sampling strategy by selecting positives proportional to citation weights and hard negatives from low-importance citations, refining document embedding discrimination.
- Empirical evaluations show improved classification, ranking, and topic coherence in large-scale scientometric tasks compared to established baseline models.
The citation importance-aware contrastive learning framework is a methodology for learning high-quality document representations by systematically incorporating citation heterogeneity—i.e., the varying significance of citation links—into the contrastive learning process. Such frameworks fundamentally reshape the treatment of citation information in scholarly document embeddings, enabling finer-grained science mapping and related downstream tasks. This approach builds on contrastive objectives but adapts sampling and loss structures to account for the relative importance of individual citation links, leading to improved discrimination between meaningful and perfunctory citations and, as a result, more functionally relevant representations for large-scale scientometric tasks (Liang et al., 15 Dec 2025).
1. Quantifying Citation Importance
A central advance of the framework is the operationalization of citation importance. Each directed citation from a citing paper $i$ to a cited paper $j$ is assigned a scalar weight $w_{ij}$ reflecting its importance. The construction of $w_{ij}$ is based on a weighted combination of standardized citation-contextual features:
- $x^{\text{intro}}_{ij}$: citation count of $j$ in the INTRODUCTION of $i$.
- $x^{\text{res}}_{ij}$: citation count in the RESULTS section.
- $x^{\text{disc}}_{ij}$: citation count in DISCUSSION & CONCLUSION.
- $x^{\text{self}}_{ij}$: a binary self-citation flag.
Each feature is min–max normalized. The final importance score is

$$w_{ij} = \sum_{k} \alpha_k\, \tilde{x}^{(k)}_{ij},$$

where $\tilde{x}^{(k)}_{ij}$ is the $k$-th min–max-normalized feature. The empirically determined weights $\alpha_k$ are calculated using the entropy-weight method. The raw $w_{ij}$ values are normalized as follows for every anchor $i$:

$$\hat{w}_{ij} = \frac{w_{ij}}{\sum_{j' \in \mathcal{C}(i)} w_{ij'}},$$

where $\mathcal{C}(i)$ is the set of documents cited by $i$.
This normalization generates a probability distribution over references, emphasizing heterogeneity in citation importance (Liang et al., 15 Dec 2025).
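The scoring pipeline above can be sketched as follows. This is a minimal illustration assuming the standard formulation of the entropy-weight method; the toy feature matrix and function names are illustrative, not taken from the paper.

```python
import numpy as np

def entropy_weights(X, eps=1e-12):
    """Standard entropy-weight method: features that vary more across
    citations receive larger weights. X: (n_citations, n_features)."""
    # Column-wise proportions of each feature.
    P = X / (X.sum(axis=0, keepdims=True) + eps)
    n = X.shape[0]
    # Shannon entropy per feature, normalized by log(n) to [0, 1].
    E = -(P * np.log(P + eps)).sum(axis=0) / np.log(n)
    d = 1.0 - E                       # degree of diversification
    return d / d.sum()

def importance_scores(X):
    """Min-max normalize each feature, combine with entropy weights,
    and normalize to a probability distribution over the references."""
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)
    w = Xn @ entropy_weights(X)       # scalar importance per citation
    return w / w.sum()                # distribution over anchor's references

# Toy anchor with 4 references x 4 features
# (intro count, results count, discussion count, self-cite flag).
X = np.array([[3, 1, 2, 0],
              [1, 0, 0, 1],
              [0, 2, 1, 0],
              [1, 0, 0, 0]], dtype=float)
p = importance_scores(X)
```

Because the weights are data-driven, sparse features (such as the self-citation flag) can receive large entropy weights on tiny samples; at corpus scale this effect washes out.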
2. Importance-Aware Contrastive Sampling
The framework modifies the standard contrastive sampling paradigm by leveraging citation importance to guide the selection of positives and negatives. Given an anchor document $i$, its cited documents $\mathcal{C}(i)$ are sorted by $\hat{w}_{ij}$. The approach samples a fixed number of triplets per anchor, strategically partitioned between:
- Positives: Sampled from the cited documents in $\mathcal{C}(i)$, with selection likelihood proportional to the normalized importance $\hat{w}_{ij}$.
- Hard negatives: Sampled from the lowest-importance citations in $\mathcal{C}(i)$, with selection probability decreasing in $\hat{w}_{ij}$, focusing the encoder on distinguishing subtle semantic differences.
- Easy negatives: Sampled uniformly at random from non-cited documents.
- Easy negatives: Sampled uniformly at random from non-cited documents.
This adaptive sampling compels the model to distinguish fine-grained importance among citations, using low-importance citations as informative hard negatives and forcing separation in embedding space.
The positive and hard-negative sampling distributions are those induced by the normalized importance scores as described above. The number of triplets per anchor and the split among positive, hard-negative, and easy-negative samples are hyperparameters; their typical values are reported in (Liang et al., 15 Dec 2025).
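A sketch of the sampling step for one anchor. It assumes, for illustration, that the hard-negative pool is the lowest-importance half of the citations and that hard-negative probability decreases linearly in importance; the paper's exact pool definition and sample counts are hyperparameters not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_triplets(cited_ids, p_importance, corpus_ids, n_pos, n_hard, n_easy):
    """Importance-aware sampling of triplet components for one anchor.

    cited_ids:    ids of documents cited by the anchor
    p_importance: normalized importance distribution over cited_ids
    corpus_ids:   ids of all documents (easy negatives from non-cited)
    """
    cited_ids = np.asarray(cited_ids)
    p = np.asarray(p_importance)

    # Positives: proportional to normalized citation importance.
    positives = rng.choice(cited_ids, size=n_pos, p=p)

    # Hard negatives: restrict to the lowest-importance half of the
    # citations, then sample with probability decreasing in importance.
    order = np.argsort(p)
    low = order[: max(1, len(order) // 2)]
    q = 1.0 - p[low]
    q = q / q.sum()
    hard_negs = rng.choice(cited_ids[low], size=n_hard, p=q)

    # Easy negatives: uniform over non-cited documents.
    non_cited = np.setdiff1d(corpus_ids, cited_ids)
    easy_negs = rng.choice(non_cited, size=n_easy, replace=True)
    return positives, hard_negs, easy_negs

pos, hard, easy = sample_triplets([10, 11, 12, 13],
                                  [0.5, 0.3, 0.15, 0.05],
                                  np.arange(100),
                                  n_pos=5, n_hard=3, n_easy=4)
```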
3. Contrastive Triplet Margin Objective
Embeddings ($h_a$, $h_p$, $h_n$ for the anchor, positive, and negative) are computed via mean-pooling of SciBERT outputs for title + abstract. The core objective is the triplet margin loss:

$$\mathcal{L} = \max\bigl(0,\; d(h_a, h_p) - d(h_a, h_n) + m\bigr),$$

where $d(\cdot,\cdot)$ is the distance between embeddings and $m$ is the margin. Optionally, each triplet can be re-weighted by the importance of its positive, closely tying gradient strength to citation importance. However, in the core implementation, importance mainly modulates the sampling pipeline rather than direct loss weighting (Liang et al., 15 Dec 2025).
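The margin objective is compact enough to state directly; the sketch below assumes Euclidean distance (the choice of metric is an assumption here) and uses NumPy rather than the actual training framework.

```python
import numpy as np

def triplet_margin_loss(h_a, h_p, h_n, margin=1.0):
    """Margin-based triplet loss: the positive must be closer to the
    anchor than the negative by at least `margin`, else loss accrues."""
    d_pos = np.linalg.norm(h_a - h_p, axis=-1)
    d_neg = np.linalg.norm(h_a - h_n, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

# A well-separated triplet incurs zero loss; a violated one does not.
a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])   # close to anchor
n = np.array([[5.0, 0.0]])   # far from anchor
loss_ok = triplet_margin_loss(a, p, n, margin=1.0)   # 0.1 - 5.0 + 1.0 < 0 -> 0.0
loss_bad = triplet_margin_loss(a, n, p, margin=1.0)  # 5.0 - 0.1 + 1.0 -> 5.9
```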
4. Model Architecture and Training Protocol
The underlying encoder is SciBERT (12 layers, hidden size 768). Inputs consist of the concatenated title and abstract, with mean pooling to produce fixed-size vectors. Training is conducted with AdamW (weight decay 0.01, 10% linear warmup; learning rate as reported in the paper), batch size 8 with gradient accumulation 4, for 2 epochs. Triplets are constructed per anchor as described above, with a fixed share allocated to hard negatives. No temperature parameter (as in NT-Xent) is used; all separation is margin-based (Liang et al., 15 Dec 2025).
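The mean-pooling step that turns token-level encoder outputs into one fixed-size document vector can be illustrated as follows; this is a NumPy stand-in for the actual SciBERT tensors, with illustrative shapes and names.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mask-aware mean pooling over the token axis.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    return summed / counts

# Toy check: the padded position must not affect the pooled vector.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])  # last token is pad
mask = np.array([[1, 1, 0]])
v = mean_pool(emb, mask)
```

Masking before averaging matters: naive `.mean(axis=1)` would let padding tokens distort the document vector.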
5. Empirical Evaluation and Benchmarking
Evaluation is multifaceted, targeting both document representation and structural science mapping. The training dataset is the Elsevier full-text collection (4.34 million papers, 67.5 million citations; 1.2 million triplets).
- SciDocs evaluation (226,743 docs; four tasks: classification, citation prediction, user-activity prediction, recommendation):
- F1 (classification), MAP, nDCG (ranking), P@1 (recommendation)
- Results (average over tasks): SPECTER 80.0; Importance-aware model 81.3 (+1.3 pp); SciNCL 81.8; E5-base-v2 80.2. Strongest improvements observed in classification (MAG F1 82.0→83.3; MeSH F1 86.4→89.8) and ranking metrics.
- PubMed science mapping (2.94 million docs; clustering by Leiden; MeSH similarity coherence):
- DC baseline: 1.000; SPECTER2: 1.236; SciNCL: 1.172; Importance-aware: 1.321 (~+7.0% vs SPECTER2, +32.1% vs DC).
Among the compared models, the framework achieves the strongest MeSH-based topic coherence and the largest gains on the classification tasks (Liang et al., 15 Dec 2025).
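For concreteness, one of the ranking metrics used above (nDCG, in its standard log2-discounted form) can be computed as below; this is a generic implementation, not the paper's evaluation code.

```python
import numpy as np

def ndcg(relevances, k=None):
    """nDCG for one ranked list: DCG with a log2 position discount,
    normalized by the DCG of the ideal (relevance-sorted) ordering."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts[: len(ideal)]).sum()
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])   # already ideally ordered
```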
6. Large-Scale Science Mapping Application
Application of the fine-tuned encoder to 33.23 million Web of Science papers (2000–2022), connecting each document to its 20 nearest neighbors by cosine similarity, yields a large, connected graph (515 million edges). Notable graph statistics include a clustering coefficient of 0.16 (vs. 0.02 for direct-citation) and an average shortest-path of approximately 7.9. The resultant science map supports over 101,000 detected topics (using Leiden clustering and UMAP projection). Overlays by disciplinary field, interdisciplinarity, and publication year reveal clear disciplinary structure, concentration of interdisciplinary research at boundaries, and, notably, the emergence of central pandemic-related research clusters during 2020–2022.
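The nearest-neighbor graph construction can be sketched in miniature. This brute-force version assumes cosine similarity over L2-normalized embeddings; at 33 million documents the real pipeline would need an approximate-nearest-neighbor index (an assumption on our part, not a detail stated above).

```python
import numpy as np

def knn_cosine_edges(E, k=20):
    """Connect each document to its k nearest neighbors by cosine
    similarity. Brute force: O(n^2), fine only for small n."""
    # L2-normalize rows so cosine similarity reduces to a dot product.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = En @ En.T
    np.fill_diagonal(S, -np.inf)           # exclude self-loops
    nbrs = np.argsort(-S, axis=1)[:, :k]   # top-k most similar per row
    return {i: set(row) for i, row in enumerate(nbrs)}

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 8))               # 50 toy "document embeddings"
edges = knn_cosine_edges(E, k=20)
```

The resulting directed k-NN edge lists are what a community-detection step (here, Leiden clustering) would consume to delineate topics.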
Case analyses confirm that the learned embeddings recover both non-cited but semantically nearest papers (pseudo-relevance feedback) and emergent topics lacking citation connectivity (e.g., AI-chatbot pandemic resilience), clearly illustrating the mapping power granted by the framework’s treatment of citation heterogeneity (Liang et al., 15 Dec 2025).
7. Relationship to Related Contrastive Learning Paradigms
Compared to prior models such as SciNCL (Ostendorff et al., 2022), which utilize continuous citation embedding spaces to guide positive and negative sampling for contrastive objectives, the citation importance-aware framework introduces direct citation-importance quantification and ties importance to sampling probability as well as hard-negative generation. Unlike CitationSum (Luo et al., 2023), which uses full-text semantic similarity to weigh citation graph edges for summarization graph contrastive learning, the importance-aware framework operates solely with citation metadata and document structure (section location, frequency, self-citation). The explicit encoding of heterogeneity and the sampling of low-importance citations as hard negatives distinguish this framework's methodological contribution and empirical effectiveness.