Graph Contrastive PU Learning
- The paper introduces a framework that integrates Positive-Unlabeled (PU) learning with graph contrastive pre-training to correct sampling bias.
- It leverages the InfoNCE loss as an estimator of semantic similarity, using confidence-based weighting to mine true positives.
- Empirical results demonstrate improved IID and OOD representation quality, validating the method’s effectiveness on multiple benchmarks.
Graph Contrastive PU Learning (GCL-PU) is a principled framework that integrates Positive-Unlabeled (PU) learning into graph contrastive pre-training. Traditional graph contrastive learning (GCL) relies on augmentation-induced positive pairs and treats all remaining pairs as negatives, which introduces substantial sampling bias by mislabeling many semantically similar pairs as negatives. By treating GCL as a PU learning task and leveraging the InfoNCE loss as a means for estimating the posterior probability of semantic similarity, GCL-PU sidesteps this bias and enables semantically guided self-supervision that has been empirically validated to improve both in-distribution (IID) and out-of-distribution (OOD) representation quality (Wang et al., 7 May 2025).
1. Problem Formulation: PU Learning in Graph Contrastive Pre-training
Let $(v_i, v_j)$ denote a pair of nodes from a graph, forming a contrastive sample. Two binary labels are defined: a semantic label $y_{ij} \in \{0, 1\}$, where $y_{ij} = 1$ if and only if $(v_i, v_j)$ is truly semantically similar (i.e., “positive”), and an observed label $s_{ij} \in \{0, 1\}$, where $s_{ij} = 1$ if and only if $(v_i, v_j)$ is labeled as positive by augmentation.
In traditional GCL:
- $s_{ij} = 1$ (labeled positives: augmented pairs)
- $s_{ij} = 0$ (treated as negatives, with unknown semantics)
Viewing this as a PU problem:
- $s_{ij} = 1$ (labeled positives)
- $s_{ij} = 0$ (unlabeled, containing both true positives $y_{ij} = 1$ and true negatives $y_{ij} = 0$)
A critical issue is that semantically similar, non-augmented pairs ($y_{ij} = 1$, $s_{ij} = 0$) are treated as negatives, thus driving apart representations of genuinely similar nodes. This suggests a fundamental mismatch between the observable and semantic structures in GCL when not accounting for the positive-unlabeled nature of the data (Wang et al., 7 May 2025).
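The mismatch can be made concrete with a toy sketch (the node classes, pair indices, and variable names below are illustrative, not from the paper): the semantic label y is hidden, and only augmentation-induced pairs receive the observed label s = 1.

```python
from itertools import combinations

# Hypothetical ground truth: nodes 0-2 share class "A", nodes 3-5 share class "B".
semantic_class = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
# Only these pairs are observed as positive (s = 1) via augmentation.
augmented_pairs = {(0, 1), (3, 4)}

pairs = []
for i, j in combinations(range(6), 2):
    y = int(semantic_class[i] == semantic_class[j])  # true semantic label (hidden)
    s = int((i, j) in augmented_pairs)               # observed label from augmentation
    pairs.append(((i, j), y, s))

# True positives hiding in the "negative" pool: y = 1 but s = 0.
hidden_positives = [p for (p, y, s) in pairs if y == 1 and s == 0]
print(hidden_positives)  # [(0, 2), (1, 2), (3, 5), (4, 5)]
```

Standard InfoNCE would push every pair in `hidden_positives` apart, which is exactly the sampling bias the PU view exposes.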
2. InfoNCE as a Positive-Unlabeled Estimator
InfoNCE, the prevailing contrastive loss, computes similarity between encoded node pairs $(z_i, z_j)$ with a temperature $\tau$:

$$\mathrm{sim}(z_i, z_j) = \exp\!\left(z_i^{\top} z_j / \tau\right)$$

The pairwise probability is then normalized:

$$p_{ij} = \frac{\mathrm{sim}(z_i, z_j)}{\sum_{k \neq i} \mathrm{sim}(z_i, z_k)}$$
Originally, InfoNCE is justified as modeling a density ratio

$$f(v_j, v_i) \propto \frac{p(v_j \mid v_i)}{p(v_j)}$$

and, specifically for GCL, $\mathrm{sim}(z_i, z_j) \propto p(v_j \mid v_i)/p(v_j)$. The Invariance-of-Order (IOD) assumption asserts that this value preserves the ordering of the semantic posterior $P(y_{ij} = 1 \mid v_i, v_j)$. Therefore, $p_{ij}$ and $P(y_{ij} = 1 \mid v_i, v_j)$ induce the same ranking over pairs, meaning that normalized InfoNCE similarity is proportional (in order) to the posterior probability of true semantic similarity (Wang et al., 7 May 2025).
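A minimal sketch of this estimator, assuming cosine similarity on normalized embeddings and a temperature tau (standard InfoNCE choices; the function name and defaults below are illustrative):

```python
import numpy as np

def infonce_probs(Z, tau=0.5):
    """Row-normalized InfoNCE probabilities p_ij from embeddings Z (n x d)."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine similarity via dot products
    logits = (Z @ Z.T) / tau
    np.fill_diagonal(logits, -np.inf)                 # exclude self-pairs from the sum
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 8))
P = infonce_probs(Z)
# Under the IOD assumption, ranking candidates k by P[i, k] ranks them by the
# semantic-similarity posterior for anchor i.
```

Each row of `P` is a distribution over candidate partners for anchor i, which is what makes thresholding it usable as a positive-mining criterion.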
3. Semantically Guided InfoNCE (IFL-GCL) Loss
The classic InfoNCE loss averages negative log-likelihood over the labeled positives $\mathcal{P} = \{(i, j) : s_{ij} = 1\}$:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{|\mathcal{P}|} \sum_{(i, j) \in \mathcal{P}} \log p_{ij}$$
After mining semantically similar pairs from the unlabeled pool ($\mathcal{P}_u = \{(i, k) : s_{ik} = 0,\ p_{ik} > \theta\}$, for threshold $\theta$), the positive set is expanded to $\mathcal{P}' = \mathcal{P} \cup \mathcal{P}_u$. For each labeled positive, mined positives sharing the same anchor node are introduced with confidence weights:

$$w_{ik} = \frac{p_{ik}}{\sum_{(i, k') \in \mathcal{P}_u} p_{ik'}}$$
The corrected local loss combines all positives for each anchor $i$, weighted by a global factor $\alpha$ and the normalized confidence:

$$\ell_i = \log p_{ij} + \alpha \sum_{(i, k) \in \mathcal{P}_u} w_{ik} \log p_{ik}$$

The full semantically guided (IFL-GCL) loss:

$$\mathcal{L}_{\mathrm{IFL}} = -\frac{1}{|\mathcal{P}|} \sum_{(i, j) \in \mathcal{P}} \ell_i$$
Where:
- $\log p_{ij}$: likelihood of true augmented positives
- $\mathcal{P}_u$: mined unlabeled positives
- $\alpha$: global weighting factor
- $w_{ik}$: normalized similarity as confidence weight (Wang et al., 7 May 2025)
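A hedged sketch of the corrected loss (the threshold and global weight follow the text above; the per-anchor normalization of the confidence weights and all names are assumptions of this sketch, not the paper's implementation):

```python
import numpy as np

def ifl_loss(P, labeled, theta=0.2, alpha=0.5):
    """P: (n, n) normalized InfoNCE probabilities; labeled: augmented pairs (s = 1)."""
    labeled_set = set(labeled)
    total = 0.0
    for i, j in labeled:
        term = np.log(P[i, j])                   # true augmented positive
        # Mine unlabeled pairs sharing anchor i whose confidence exceeds theta.
        mined = [k for k in range(P.shape[0])
                 if k != i and (i, k) not in labeled_set and P[i, k] > theta]
        if mined:
            w = P[i, mined] / P[i, mined].sum()  # normalized confidence weights
            term += alpha * float(w @ np.log(P[i, mined]))
        total += term
    return -total / len(labeled)

# Toy probability matrix: each anchor's strongest partner is its augmented pair,
# with one semantically similar pair (probability 0.3) hiding in the unlabeled pool.
P = np.array([[0.0, 0.6, 0.3, 0.1],
              [0.6, 0.0, 0.1, 0.3],
              [0.3, 0.1, 0.0, 0.6],
              [0.1, 0.3, 0.6, 0.0]])
loss = ifl_loss(P, labeled=[(0, 1), (2, 3)])
print(round(loss, 4))  # 1.1128
```

With `theta=0.2`, the pairs (0, 2) and (2, 0) are mined and contribute with weight 1 each, so the loss also rewards pulling those pairs together rather than treating them as negatives.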
4. Learning Algorithm: IFL-GCL
The IFL-GCL framework is instantiated as follows:
- Generate two augmented graph views.
- Initialize model parameters and split labeled/unlabeled pairs.
- Warm-up using standard InfoNCE.
- Repeatedly:
- Compute similarity for all unlabeled pairs, mining positives above a threshold.
- Resample positives and negatives from updated sets.
- Build the corrected loss from both true and mined positives, applying their confidence-based weights.
- Update model parameters over several optimization steps.
- Return the trained encoder parameters.
This routine integrates continual mining and re-weighting of unlabeled samples, aligning training dynamics with the underlying semantic structure (Wang et al., 7 May 2025).
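The routine can be sketched end-to-end on a toy problem, under loud simplifications: a linear map stands in for the GNN encoder, finite differences stand in for autograd, the warm-up phase is skipped, and positives are re-mined on every step rather than on a schedule. All names, shapes, and hyperparameters below are illustrative.

```python
import numpy as np

def encode(W, X):
    Z = X @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # normalized embeddings

def probs(Z, tau=0.5):
    logits = (Z @ Z.T) / tau
    np.fill_diagonal(logits, -np.inf)                    # exclude self-pairs
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ifl_loss(W, X, labeled, theta=0.3, alpha=0.5):
    P = probs(encode(W, X))                              # re-mine on every evaluation
    total = 0.0
    for i, j in labeled:
        term = np.log(P[i, j])
        mined = [k for k in range(len(X))
                 if k != i and (i, k) not in labeled and P[i, k] > theta]
        if mined:
            w = P[i, mined] / P[i, mined].sum()          # confidence weights
            term += alpha * float(w @ np.log(P[i, mined]))
        total += term
    return -total / len(labeled)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))          # toy node features from two augmented views
W = 0.1 * rng.normal(size=(4, 3))    # linear "encoder" parameters
labeled = {(0, 1), (2, 3), (4, 5)}   # augmentation-induced positive pairs

history = [ifl_loss(W, X, labeled)]
for step in range(30):
    # Finite-difference gradient of the corrected loss (sketch only).
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        dW = np.zeros_like(W)
        dW[idx] = 1e-5
        grad[idx] = (ifl_loss(W + dW, X, labeled)
                     - ifl_loss(W - dW, X, labeled)) / 2e-5
    W = W - 0.05 * grad              # parameter update
    history.append(ifl_loss(W, X, labeled))
```

In practice the encoder would be a GNN trained with automatic differentiation, warmed up under standard InfoNCE before mining begins, and with mining and resampling performed per outer iteration as the listed steps describe.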
5. Empirical Results and Practical Implications
The benchmark evaluation includes node classification tasks (via linear probe and fine-tuning) on commonly used datasets for IID and OOD generalization: Cora, PubMed, CiteSeer, WikiCS, Computers, Photo, and the GOOD series (Twitch, Cora, CBAS). Baseline methods include DGI, COSTA, BGRL, MVGRL, GBT, GRACE, and GCA. IFL-GCL is applied atop GRACE (IFL-GR) and GCA (IFL-GC).
- IFL-GR and IFL-GC achieve best or second-best accuracy on all nine benchmarks.
- Relative to their base methods, GRACE and GCA, average IID gains are 0.5–1.5%, with OOD gains up to 9.05%.
- Supplementing features with LLM-derived representations (Llama3.2, Qwen2.5) yields further improvements of 0.3–1.3%, confirming that more informative initial semantics facilitate superior positive mining.
- These findings validate the systemic gains from semantically guided positive reweighting and indicate the broader impact of PU learning for graph self-supervision (Wang et al., 7 May 2025).
6. Significance and Outlook
By recasting GCL as a PU learning problem and exploiting InfoNCE as a proxy estimator for semantic similarity, the IFL-GCL method systematically corrects augmentation-induced sampling bias, enabling robust mining of semantically rich positives. The approach is empirically established to provide consistent improvements across diverse graph representation benchmarks and demonstrates compatibility with recent advances in LLM-based semantic enhancement. A plausible implication is that further research may extend such PU-driven correction strategies to other self-supervised paradigms where observed and semantic label spaces are systematically misaligned (Wang et al., 7 May 2025).