Pseudo-Labeled Graph Condensation
- Pseudo-Labeled Graph Condensation is a technique that synthesizes compact graphs using latent pseudo-labels to preserve essential node embeddings for efficient GNN training.
- It employs self-supervised learning and pseudo-label guided replay to optimize representation matching even in noisy or label-scarce environments.
- Empirical evaluations show that PLGC achieves near-supervised performance on tasks like node classification and link prediction while significantly reducing the training graph size.
Pseudo-Labeled Graph Condensation (PLGC) is a graph dataset reduction paradigm that generates small, information-preserving synthetic graphs by leveraging pseudo-labels, enabling efficient graph neural network (GNN) training in both supervised and label-free, noisy, or weakly-labeled settings. PLGC encompasses self-supervised methods (as in "PLGC: Pseudo-Labeled Graph Condensation" (Nandy et al., 15 Jan 2026)) and pseudo-label-guided replay condensation for continual learning (as deployed in PUMA (Liu et al., 2023)). This entry outlines the principal methodologies and theoretical foundations of PLGC, presents detailed algorithmic procedures, analyzes empirical outcomes, and discusses implementation practices and limitations.
1. Motivation and Problem Setting
Graph condensation replaces a massive, costly-to-train graph $\mathcal{G} = (A, X, Y)$ with $N$ nodes by a compact synthetic graph $\mathcal{S} = (A', X', Y')$ with $N' \ll N$ nodes, such that a GNN trained on $\mathcal{S}$ preserves the predictive and representational statistics of $\mathcal{G}$. Classical supervised condensation requires dense, reliable ground-truth labels, optimizing the bilevel objective

$$\min_{\mathcal{S}} \mathcal{L}\big(\mathrm{GNN}_{\theta_{\mathcal{S}}}(A, X),\, Y\big) \quad \text{s.t.} \quad \theta_{\mathcal{S}} = \arg\min_{\theta} \mathcal{L}\big(\mathrm{GNN}_{\theta}(A', X'),\, Y'\big).$$

However, real-world graphs frequently exhibit label scarcity, inconsistency, and noise. Under such conditions, supervised condensation misaligns class-conditional statistics, causing overfitting and poor generalization. PLGC reorients the condensation paradigm:
- In the self-supervised variant (Nandy et al., 15 Jan 2026), condensation proceeds without ground-truth labels $Y$, constructing latent pseudo-labels $C = \{c_1, \dots, c_K\}$ (prototypes of node embeddings) and node-to-prototype assignments $Q$, matched by representation statistics.
- In continual learning (Liu et al., 2023), pseudo-labels with high confidence are dynamically generated for unlabeled nodes, expanding the set of condensation targets and improving distributional matching.
This strategy enables PLGC to remain robust and informative when ground-truth annotations are unreliable or absent.
2. Methodological Foundations
PLGC consists of two primary algorithmic phases: latent pseudo-label construction and condensed-graph optimization (Nandy et al., 15 Jan 2026).
A. Pseudo-Label Construction
Pseudo-labels represent prototype centroids $C = \{c_1, \dots, c_K\}$ in embedding space, each assigned to graph nodes through a balanced assignment matrix $Q \in \mathbb{R}_{+}^{N \times K}$. The assignment ensures equitable representation: each prototype receives approximately $N/K$ of the total assignment mass. Under random augmentations $t \sim \mathcal{T}$, node embeddings $z_i = f_\theta(t(\mathcal{G}))_i$ are computed, then soft assignments $q_i$ are produced by solving a balanced entropy-regularized linear program (Sinkhorn-Knopp scaling). For each batch node, a swapped-assignment view-prediction loss aligns embeddings across views. The joint objective is backpropagated to update both the encoder $f_\theta$ and the prototypes $C$.
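The balanced assignment step above can be sketched in a few lines of NumPy. This is an illustrative Sinkhorn-Knopp implementation in the style of SwAV-type swapped prediction, not the paper's exact code; the function and parameter names are assumptions:

```python
import numpy as np

def sinkhorn_assignments(scores, eps=0.05, n_iters=5):
    """Balanced soft assignment of N nodes to K prototypes via
    Sinkhorn-Knopp scaling. `scores` is the N x K node-prototype
    similarity matrix; `eps` is the entropy-regularization temperature.
    Alternating row/column normalization equalizes prototype mass, so
    each prototype receives roughly N/K of the total assignment."""
    Q = np.exp(scores / eps).T              # K x N transport plan
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # balance prototype rows
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)   # renormalize node columns
        Q /= N
    return (Q * N).T                        # N x K; each row sums to 1

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4))            # 8 nodes, 4 prototypes
Q = sinkhorn_assignments(scores)
```

In the swapped-prediction loss, the assignments $q_i$ computed from one augmented view then serve as targets for the softmax predictions of the other view.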
B. Condensed Graph Optimization
After prototype convergence, synthetic features $X'$ are optimized to ensure each condensed node's embedding approximates its prototype:

$$\min_{X'} \sum_{j=1}^{N'} \big\| f_\theta(A', X')_j - c_{\pi(j)} \big\|_2^2,$$

where $\pi(j)$ denotes the prototype assigned to condensed node $j$. In practice, the adjacency $A'$ is often fixed or omitted (empty/identity), focusing optimization on $X'$.
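With a frozen encoder, the feature-matching step reduces to a least-squares fit of the synthetic features to the prototypes. A minimal NumPy sketch, assuming a linear encoder $z = xW$ and an edge-free condensed graph (both simplifications; the paper's encoder is a GNN):

```python
import numpy as np

def condense_features(prototypes, W, lr=0.3, steps=5000, seed=0):
    """Optimize synthetic features X' so each condensed node's embedding
    X'_j @ W matches its assigned prototype c_j. The encoder is frozen;
    here it is a single linear map W for illustration, and adjacency is
    omitted (edge-free condensed graph)."""
    rng = np.random.default_rng(seed)
    K, _ = prototypes.shape
    Xp = rng.normal(scale=0.1, size=(K, W.shape[0]))
    for _ in range(steps):
        resid = Xp @ W - prototypes          # K x d_emb residuals
        Xp -= lr * resid @ W.T               # grad of 0.5*||resid||_F^2
    return Xp

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 8)) / 4.0           # frozen "encoder"
C = rng.normal(size=(5, 8))                  # 5 prototypes, one per node
Xp = condense_features(C, W)
err = np.linalg.norm(Xp @ W - C)             # near zero after convergence
```

With a GNN encoder, the same loop simply backpropagates the residual through the (fixed) message-passing layers instead of `W`.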
C. Pseudo-Label Guided Condensation in Continual Learning
Within PUMA (Liu et al., 2023), PLGC operates over a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \dots$:
- Initial condensation uses available true labels to generate a preliminary condensed graph via class-conditional maximum-mean-discrepancy (MMD) loss between propagated real and synthetic features.
- Pseudo-label expansion trains a classifier on the replay buffer to infer pseudo-labels for unlabeled nodes, selecting those with softmax confidence above a threshold $\tau$.
- Refined condensation incorporates the expanded label set into a repeated condensation loop, producing a replay buffer that facilitates efficient, edge-free continual training.
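The confidence-gated selection in the pseudo-label expansion step can be sketched as follows (NumPy; `expand_pseudo_labels` is a hypothetical helper and the threshold value is illustrative):

```python
import numpy as np

def expand_pseudo_labels(logits, tau=0.9):
    """Select high-confidence pseudo-labels for unlabeled nodes:
    softmax the classifier logits and keep only predictions whose
    max probability exceeds the confidence threshold tau."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    conf = p.max(axis=1)
    keep = conf >= tau
    return np.flatnonzero(keep), p.argmax(axis=1)[keep]

logits = np.array([[8.0, 0.0, 0.0],    # confident -> class 0
                   [1.0, 1.2, 0.9],    # ambiguous -> dropped
                   [0.0, 0.0, 6.0]])   # confident -> class 2
idx, labels = expand_pseudo_labels(logits, tau=0.9)
```

The selected `(idx, labels)` pairs are then added to the condensation targets for the refined pass.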
3. Formal Optimization Objectives and Algorithms
Self-supervised PLGC (Nandy et al., 15 Jan 2026)—for a given unlabeled graph $\mathcal{G} = (A, X)$ and compression ratio $r$:
- Set the condensed size $N' = \lceil rN \rceil$ and the number of prototypes $K = N'$.
- Alternate:
- Pseudo-label learning: for each batch, sample augmentations, compute embeddings, solve Sinkhorn assignments, apply loss, update encoder/prototypes.
- Condensation: holding encoder and prototypes fixed, optimize $X'$ via the representation-matching objective.
PUMA's PLGC module (Liu et al., 2023)—for labeled data:
- One-time feature propagation: $\bar{X} = \hat{A}^{k} X$, where $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ is the self-loop-normalized adjacency.
- Wide MLP embeddings: $H = \mathrm{MLP}_\theta(\bar{X})$ for real nodes, $H' = \mathrm{MLP}_\theta(X')$ for synthetic nodes.
- Task-wise condensation loss: a class-conditional MMD objective

$$\mathcal{L}_{\mathrm{cond}} = \mathbb{E}_{\theta} \sum_{c \in \mathcal{Y}} r_c \,\Big\| \tfrac{1}{|V_c|} \textstyle\sum_{v \in V_c} H_v - \tfrac{1}{|V'_c|} \textstyle\sum_{v' \in V'_c} H'_{v'} \Big\|_2^2,$$

where $r_c$ is the proportion of class $c$. This is optimized over the synthetic features $X'$, constrained by class ratios and the memory budget.
Both frameworks employ edge-free condensed graphs to accelerate memory replay, training downstream models via MLPs.
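A compact NumPy sketch of PUMA-style one-time propagation and the class-conditional mean-matching (MMD) loss; the helper names are assumptions, and the random wide-MLP embedding step is elided (the loss is shown directly on given embeddings):

```python
import numpy as np

def propagate(A, X, k=2):
    """One-time symmetric propagation X_bar = A_hat^k X with self-loops,
    A_hat = D^{-1/2}(A + I)D^{-1/2}. Done once up front, so the
    condensation loop itself needs no message passing."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))
    Xb = X
    for _ in range(k):
        Xb = A_hat @ Xb
    return Xb

def class_mmd(H, y, Hp, yp, n_classes):
    """Class-conditional mean-embedding matching between real
    embeddings (H, y) and synthetic embeddings (Hp, yp), weighted by
    class frequency."""
    loss = 0.0
    for c in range(n_classes):
        m_real = H[y == c].mean(axis=0)
        m_syn = Hp[yp == c].mean(axis=0)
        loss += (y == c).mean() * np.sum((m_real - m_syn) ** 2)
    return loss

# Perfectly matched class means give zero loss:
H = np.array([[0., 0.], [2., 0.], [1., 3.]]); y = np.array([0, 0, 1])
Hp = np.array([[1., 0.], [1., 3.]]); yp = np.array([0, 1])
loss = class_mmd(H, y, Hp, yp, n_classes=2)   # -> 0.0
```

In the full method the loss is averaged over randomly initialized wide MLPs applied to `propagate(A, X)` and to the synthetic features.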
4. Theoretical Foundations
PLGC incorporates rigorous guarantees for prototype concentration and assignment fidelity (Nandy et al., 15 Jan 2026). Under sub-Gaussian latent structure and cluster separability:
- Prototype concentration: each learned prototype $\hat{c}_k$ concentrates around its true cluster mean $\mu_k$ at rate $O(1/\sqrt{n_k})$ in the cluster size $n_k$ (the standard sub-Gaussian rate).
- Interior-point recovery: all nodes sufficiently close to their true cluster mean $\mu_k$ remain correctly assigned.
- Separation: for large enough cluster sizes $n_k$, distinct prototypes remain well-separated.
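Schematically, and under the stated sub-Gaussian assumption, the concentration guarantee takes the standard form below; the constant $C$ and the exact dependence on the embedding dimension $d$ and prototype count $K$ are illustrative, not quoted from the paper:

```latex
% Prototype concentration (schematic): for cluster k with n_k member
% nodes, sub-Gaussian noise parameter \sigma, and true mean \mu_k,
\|\hat{c}_k - \mu_k\|_2 \;\le\; C\,\sigma\sqrt{\frac{d + \log K}{n_k}}
\quad \text{with high probability,}
% so prototypes whose true means satisfy
% \min_{k \neq l}\|\mu_k - \mu_l\|_2 \gg 2C\sigma\sqrt{(d + \log K)/n_k}
% remain well-separated, and interior points are correctly assigned.
```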
A plausible implication is that, even in the complete absence of ground-truth labels, the synthetic condensed graph will preserve the latent geometry and feature/structural statistics critical for downstream tasks.
In pseudo-label guided approaches (Liu et al., 2023), including high-confidence inferred labels expands the class coverage for matching, further improving alignment between synthetic and real distributions.
5. Empirical Performance and Practical Implementation
PLGC is evaluated on node classification and link prediction across both transductive and inductive graphs (Cora, Citeseer, Ogbn-Arxiv, Flickr, Reddit) (Nandy et al., 15 Jan 2026). Key results:
- On clean-label datasets, PLGC is within 1% of best supervised methods and exceeds all self-supervised baselines by up to 10 points.
- Under label noise (flip ratio up to $0.7$), supervised methods degrade by up to 30pp, while PLGC degrades by only about 4pp (see the table below), outperforming baselines by 15–25 points.
- For multi-source graphs, supervised baselines collapse, while PLGC maintains performance within 5 points of clean conditions.
- Link prediction AUROC is similarly robust to noise and source heterogeneity.
PUMA's continual-learning PLGC, with its condensed replay buffer, achieves state-of-the-art accuracy and backward transfer on class-incremental task protocols (Liu et al., 2023), substantially outpacing regularization, sampling-based replay, and previous condensation frameworks. Condensation and retraining times are on the order of minutes for large graphs (e.g., ~3 min for 170K nodes), far below those of naive replay alternatives.
Hyper-parameter settings and ablations in both references indicate stability across budget ratios, numbers of prototypes, augmentation strengths, Sinkhorn temperatures, and learning rates.
| Dataset | PLGC Accuracy (clean) | PLGC Degradation (noise=0.7) | Best Supervised Baseline |
|---|---|---|---|
| Cora | 81.6% | −4.5pp | GEOM (83.6%) |
| — | 88.3% | −4.2pp | GCond (86.4%) |
| Products | 74% | −3.7pp | CaT (71%) (Liu et al., 2023) |
6. Advantages, Limitations, and Implementation Recommendations
Advantages:
- Label-free condensation for completely unlabeled graphs.
- Noise robustness: avoids overfitting to spurious annotation errors in both supervised and self-supervised scenarios.
- Label efficiency: minimal annotations suffice for downstream fine-tuning.
- Multi-source extensibility: naturally condenses and integrates heterogeneous subgraphs.
- Task transferability: condensed graphs function across node classification, link prediction, and graph-level tasks with minimal modification.
Limitations:
- Adjacency structure is not learned explicitly; if downstream tasks are sensitive to edge topology, explicit adjacency matching may be required.
- Alternating self-supervised training entails nontrivial computational overhead; however, condensation remains orders of magnitude faster than retraining on the full graph for every hyperparameter configuration.
Implementation tips:
- Use PyTorch Geometric or DGL for encoder sharing and algorithmic flexibility.
- Match the number of prototypes $K$ to the desired compression ($K = N'$), ensuring cluster sizes remain large enough for statistical concentration.
- Apply standard graph augmentation (0.1–0.2 edge drop, 10% feature masking) for invariant prototype learning.
- Employ Sinkhorn temperature $0.05$–$0.2$ and batch sizes $256$–$1024$ per assignment round.
- Optimize pseudo-labels for 200 epochs and condensed features for 100 steps.
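The augmentation recipe from the tips above can be sketched as follows (NumPy; `augment` is an illustrative helper, not a library API):

```python
import numpy as np

def augment(edge_index, X, p_edge=0.15, p_feat=0.10, seed=None):
    """Stochastic graph augmentation for invariant prototype learning:
    drop each edge with probability p_edge and zero-mask each feature
    entry with probability p_feat."""
    rng = np.random.default_rng(seed)
    keep = rng.random(edge_index.shape[1]) >= p_edge
    ei = edge_index[:, keep]                 # edge dropping
    mask = rng.random(X.shape) >= p_feat     # feature masking
    return ei, X * mask

edge_index = np.array([[0, 1, 2, 3],
                       [1, 2, 3, 0]])        # toy 4-cycle, 4 directed edges
X = np.ones((4, 3))
ei, Xa = augment(edge_index, X, seed=0)
```

Two independent calls per batch yield the two views whose assignments are swapped in the prototype loss.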
In summary, Pseudo-Labeled Graph Condensation provides a unified paradigm for efficient, robust, and minimally supervised graph reduction methods, yielding synthetic datasets that preserve latent geometric and predictive information under adverse labeling conditions and facilitating scalable GNN training for both static and continual learning scenarios (Nandy et al., 15 Jan 2026, Liu et al., 2023).