Pseudo-Labeled Graph Condensation
- Pseudo-Labeled Graph Condensation is a technique that synthesizes compact graphs using latent pseudo-labels to preserve essential node embeddings for efficient GNN training.
- It employs self-supervised learning and pseudo-label guided replay to optimize representation matching even in noisy or label-scarce environments.
- Empirical evaluations show that PLGC achieves near-supervised performance on tasks like node classification and link prediction while significantly reducing the training graph size.
Pseudo-Labeled Graph Condensation (PLGC) is a graph dataset reduction paradigm that generates small, information-preserving synthetic graphs by leveraging pseudo-labels, enabling efficient graph neural network (GNN) training in both supervised and label-free, noisy, or weakly-labeled settings. PLGC encompasses self-supervised methods (as in "PLGC: Pseudo-Labeled Graph Condensation" (Nandy et al., 15 Jan 2026)) and pseudo-label-guided replay condensation for continual learning (as deployed in PUMA (Liu et al., 2023)). This entry outlines the principal methodologies and theoretical foundations of PLGC, presents detailed algorithmic procedures, analyzes empirical outcomes, and discusses implementation practices and limitations.
1. Motivation and Problem Setting
Graph condensation replaces a massive, costly-to-train graph $\mathcal{G} = (A, X, Y)$ with $N$ nodes by a compact synthetic graph $\mathcal{S} = (A', X', Y')$ with $N' \ll N$ nodes, such that a GNN trained on $\mathcal{S}$ preserves the predictive and representational statistics of $\mathcal{G}$. Classical supervised condensation requires dense, reliable ground-truth labels, optimizing the bilevel objective

$$\min_{\mathcal{S}} \mathcal{L}\big(\mathrm{GNN}_{\theta_{\mathcal{S}}}(A, X),\, Y\big) \quad \text{s.t.} \quad \theta_{\mathcal{S}} = \arg\min_{\theta} \mathcal{L}\big(\mathrm{GNN}_{\theta}(A', X'),\, Y'\big).$$

However, real-world graphs frequently exhibit label scarcity, inconsistency, and noise. Under such conditions, supervised condensation misaligns class-conditional statistics, causing overfitting and poor generalization. PLGC reorients the condensation paradigm:
- In the self-supervised variant (Nandy et al., 15 Jan 2026), condensation proceeds without ground-truth labels $Y$, constructing latent pseudo-labels $C = \{c_1, \dots, c_K\}$ (prototypes of node embeddings) and node-to-prototype assignments $Q$, matched by representation statistics.
- In continual learning (Liu et al., 2023), pseudo-labels with high confidence are dynamically generated for unlabeled nodes, expanding the set of condensation targets and improving distributional matching.
This strategy enables PLGC to remain robust and informative when ground-truth annotations are unreliable or absent.
2. Methodological Foundations
PLGC consists of two primary algorithmic phases: latent pseudo-label construction and condensed-graph optimization (Nandy et al., 15 Jan 2026).
A. Pseudo-Label Construction
Pseudo-labels represent prototype centroids $C = \{c_1, \dots, c_K\}$ in embedding space, each assigned to graph nodes through a balanced assignment matrix $Q \in \mathbb{R}_{+}^{N \times K}$. The assignment ensures equitable representation: each prototype receives approximately $N/K$ of the total assignment mass. Under random augmentations $t \sim \mathcal{T}$, node embeddings $z_i = f_\theta(t(\mathcal{G}))_i$ are computed, then soft assignments $q_i$ are produced by solving a balanced entropy-regularized linear program (Sinkhorn-Knopp scaling). For each batch node, a swapped-assignment view-prediction loss aligns embeddings across views. The joint objective is backpropagated to update both the encoder $f_\theta$ and the prototypes $C$.
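The balanced assignment step above can be sketched in a few lines of NumPy. This is an illustrative Sinkhorn-Knopp implementation in the style of SwAV-type swapped prediction, not the paper's exact code; the function and parameter names are assumptions:

```python
import numpy as np

def sinkhorn_assignments(scores, eps=0.05, n_iters=5):
    """Balanced soft assignment of N nodes to K prototypes via
    Sinkhorn-Knopp scaling. `scores` is the N x K node-prototype
    similarity matrix; `eps` is the entropy-regularization temperature.
    Alternating row/column normalization equalizes prototype mass, so
    each prototype receives roughly N/K of the total assignment."""
    Q = np.exp(scores / eps).T              # K x N transport plan
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # balance prototype rows
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)   # renormalize node columns
        Q /= N
    return (Q * N).T                        # N x K; each row sums to 1

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4))            # 8 nodes, 4 prototypes
Q = sinkhorn_assignments(scores)
```

In the swapped-prediction loss, the assignments $q_i$ computed from one augmented view then serve as targets for the softmax predictions of the other view.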
B. Condensed Graph Optimization
After prototype convergence, synthetic features $X'$ are optimized to ensure each condensed node's embedding approximates its prototype:

$$\min_{X'} \sum_{j=1}^{N'} \big\| f_\theta(A', X')_j - c_{\pi(j)} \big\|_2^2,$$

where $\pi(j)$ denotes the prototype assigned to condensed node $j$. In practice, the adjacency $A'$ is often fixed or omitted (empty/identity), focusing optimization on $X'$.
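With a frozen encoder, the feature-matching step reduces to a least-squares fit of the synthetic features to the prototypes. A minimal NumPy sketch, assuming a linear encoder $z = xW$ and an edge-free condensed graph (both simplifications; the paper's encoder is a GNN):

```python
import numpy as np

def condense_features(prototypes, W, lr=0.3, steps=5000, seed=0):
    """Optimize synthetic features X' so each condensed node's embedding
    X'_j @ W matches its assigned prototype c_j. The encoder is frozen;
    here it is a single linear map W for illustration, and adjacency is
    omitted (edge-free condensed graph)."""
    rng = np.random.default_rng(seed)
    K, _ = prototypes.shape
    Xp = rng.normal(scale=0.1, size=(K, W.shape[0]))
    for _ in range(steps):
        resid = Xp @ W - prototypes          # K x d_emb residuals
        Xp -= lr * resid @ W.T               # grad of 0.5*||resid||_F^2
    return Xp

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 8)) / 4.0           # frozen "encoder"
C = rng.normal(size=(5, 8))                  # 5 prototypes, one per node
Xp = condense_features(C, W)
err = np.linalg.norm(Xp @ W - C)             # near zero after convergence
```

With a GNN encoder, the same loop simply backpropagates the residual through the (fixed) message-passing layers instead of `W`.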
C. Pseudo-Label Guided Condensation in Continual Learning
Within PUMA (Liu et al., 2023), PLGC operates over a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \dots$:
- Initial condensation uses available true labels to generate a preliminary condensed graph via class-conditional maximum-mean-discrepancy (MMD) loss between propagated real and synthetic features.
- Pseudo-label expansion trains a classifier on the replay buffer to infer pseudo-labels for unlabeled nodes, selecting those with softmax confidence above a threshold $\tau$.
- Refined condensation incorporates the expanded label set into a repeated condensation loop, producing a replay buffer that facilitates efficient, edge-free continual training.
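The confidence-gated selection in the pseudo-label expansion step can be sketched as follows (NumPy; `expand_pseudo_labels` is a hypothetical helper and the threshold value is illustrative):

```python
import numpy as np

def expand_pseudo_labels(logits, tau=0.9):
    """Select high-confidence pseudo-labels for unlabeled nodes:
    softmax the classifier logits and keep only predictions whose
    max probability exceeds the confidence threshold tau."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    conf = p.max(axis=1)
    keep = conf >= tau
    return np.flatnonzero(keep), p.argmax(axis=1)[keep]

logits = np.array([[8.0, 0.0, 0.0],    # confident -> class 0
                   [1.0, 1.2, 0.9],    # ambiguous -> dropped
                   [0.0, 0.0, 6.0]])   # confident -> class 2
idx, labels = expand_pseudo_labels(logits, tau=0.9)
```

The selected `(idx, labels)` pairs are then added to the condensation targets for the refined pass.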
3. Formal Optimization Objectives and Algorithms
Self-supervised PLGC (Nandy et al., 15 Jan 2026)—for a given unlabeled graph $\mathcal{G} = (A, X)$ and compression ratio $r$:
- Set the condensed size $N' = \lceil rN \rceil$ and the number of prototypes $K = N'$.
- Alternate:
- Pseudo-label learning: for each batch, sample augmentations, compute embeddings, solve Sinkhorn assignments, apply loss, update encoder/prototypes.
- Condensation: holding encoder and prototypes fixed, optimize $X'$ via the representation-matching objective.
PUMA's PLGC module (Liu et al., 2023)—for labeled data:
- One-time feature propagation: $\bar{X} = \hat{A}^{k} X$, where $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ is the self-loop-normalized adjacency.
- Wide MLP embeddings: $H = \mathrm{MLP}_\theta(\bar{X})$ for real nodes, $H' = \mathrm{MLP}_\theta(X')$ for synthetic nodes.
- Task-wise condensation loss: a class-conditional MMD objective

$$\mathcal{L}_{\mathrm{cond}} = \mathbb{E}_{\theta} \sum_{c \in \mathcal{Y}} r_c \,\Big\| \tfrac{1}{|V_c|} \textstyle\sum_{v \in V_c} H_v - \tfrac{1}{|V'_c|} \textstyle\sum_{v' \in V'_c} H'_{v'} \Big\|_2^2,$$

where $r_c$ is the proportion of class $c$. This is optimized over the synthetic features $X'$, constrained by class ratios and the memory budget.
Both frameworks employ edge-free condensed graphs to accelerate memory replay, training downstream models via MLPs.
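A compact NumPy sketch of PUMA-style one-time propagation and the class-conditional mean-matching (MMD) loss; the helper names are assumptions, and the random wide-MLP embedding step is elided (the loss is shown directly on given embeddings):

```python
import numpy as np

def propagate(A, X, k=2):
    """One-time symmetric propagation X_bar = A_hat^k X with self-loops,
    A_hat = D^{-1/2}(A + I)D^{-1/2}. Done once up front, so the
    condensation loop itself needs no message passing."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))
    Xb = X
    for _ in range(k):
        Xb = A_hat @ Xb
    return Xb

def class_mmd(H, y, Hp, yp, n_classes):
    """Class-conditional mean-embedding matching between real
    embeddings (H, y) and synthetic embeddings (Hp, yp), weighted by
    class frequency."""
    loss = 0.0
    for c in range(n_classes):
        m_real = H[y == c].mean(axis=0)
        m_syn = Hp[yp == c].mean(axis=0)
        loss += (y == c).mean() * np.sum((m_real - m_syn) ** 2)
    return loss

# Perfectly matched class means give zero loss:
H = np.array([[0., 0.], [2., 0.], [1., 3.]]); y = np.array([0, 0, 1])
Hp = np.array([[1., 0.], [1., 3.]]); yp = np.array([0, 1])
loss = class_mmd(H, y, Hp, yp, n_classes=2)   # -> 0.0
```

In the full method the loss is averaged over randomly initialized wide MLPs applied to `propagate(A, X)` and to the synthetic features.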
4. Theoretical Foundations
PLGC incorporates rigorous guarantees for prototype concentration and assignment fidelity (Nandy et al., 15 Jan 2026). Under sub-Gaussian latent structure and cluster separability:
- Prototype concentration: each learned prototype $\hat{c}_k$ concentrates around its true cluster mean $\mu_k$ at rate $O(1/\sqrt{n_k})$ in the cluster size $n_k$ (the standard sub-Gaussian rate).
- Interior-point recovery: all nodes sufficiently close to their true cluster mean $\mu_k$ remain correctly assigned.
- Separation: for large enough cluster sizes $n_k$, distinct prototypes remain well-separated.
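Schematically, and under the stated sub-Gaussian assumption, the concentration guarantee takes the standard form below; the constant $C$ and the exact dependence on the embedding dimension $d$ and prototype count $K$ are illustrative, not quoted from the paper:

```latex
% Prototype concentration (schematic): for cluster k with n_k member
% nodes, sub-Gaussian noise parameter \sigma, and true mean \mu_k,
\|\hat{c}_k - \mu_k\|_2 \;\le\; C\,\sigma\sqrt{\frac{d + \log K}{n_k}}
\quad \text{with high probability,}
% so prototypes whose true means satisfy
% \min_{k \neq l}\|\mu_k - \mu_l\|_2 \gg 2C\sigma\sqrt{(d + \log K)/n_k}
% remain well-separated, and interior points are correctly assigned.
```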
A plausible implication is that, even in the complete absence of ground-truth labels, the synthetic condensed graph will preserve the latent geometry and feature/structural statistics critical for downstream tasks.
In pseudo-label guided approaches (Liu et al., 2023), including high-confidence inferred labels expands the class coverage for matching, further improving alignment between synthetic and real distributions.
5. Empirical Performance and Practical Implementation
PLGC is evaluated on node classification and link prediction across both transductive and inductive graphs (Cora, Citeseer, Ogbn-Arxiv, Flickr, Reddit) (Nandy et al., 15 Jan 2026). Key results:
- On clean-label datasets, PLGC is within 1% of best supervised methods and exceeds all self-supervised baselines by up to 10 points.
- Under label noise (flip ratio up to $0.7$), supervised methods degrade by up to 30pp, while PLGC degrades by only about 4pp (see the table below), outperforming baselines by 15–25 points.
- For multi-source graphs, supervised baselines collapse, while PLGC maintains performance within 5 points of clean conditions.
- Link prediction AUROC is similarly robust to noise and source heterogeneity.
PUMA's continual-learning PLGC, with its condensed replay buffer, achieves state-of-the-art accuracy and backward transfer on class-incremental task protocols (Liu et al., 2023), substantially outpacing regularization, sampling-based replay, and previous condensation frameworks. Condensation and retraining times are on the order of minutes for large graphs (e.g., ~3 min for 170K nodes), far below those of naive replay alternatives.
Hyper-parameter settings and ablations in both references indicate stability across budget ratios, numbers of prototypes, augmentation strengths, Sinkhorn temperatures, and learning rates.
| Dataset | PLGC Accuracy (clean) | PLGC Degradation (noise=0.7) | Best Supervised Baseline |
|---|---|---|---|
| Cora | 81.6% | −4.5pp | GEOM (83.6%) |
| — | 88.3% | −4.2pp | GCond (86.4%) |
| Products | 74% | −3.7pp | CaT (71%) (Liu et al., 2023) |
6. Advantages, Limitations, and Implementation Recommendations
Advantages:
- Label-free condensation for completely unlabeled graphs.
- Noise robustness: avoids overfitting to spurious annotation errors in both supervised and self-supervised scenarios.
- Label efficiency: minimal annotations suffice for downstream fine-tuning.
- Multi-source extensibility: naturally condenses and integrates heterogeneous subgraphs.
- Task transferability: condensed graphs function across node classification, link prediction, and graph-level tasks with minimal modification.
Limitations:
- Adjacency structure is not learned explicitly; if downstream tasks are sensitive to edge topology, explicit adjacency matching may be required.
- Alternating self-supervised training entails nontrivial computational overhead; however, condensation remains orders of magnitude faster than retraining on the full graph for every hyperparameter configuration.
Implementation tips:
- Use PyTorch Geometric or DGL for encoder sharing and algorithmic flexibility.
- Match the number of prototypes $K$ to the desired compression ($K = N'$), ensuring cluster sizes remain large enough for statistical concentration.
- Apply standard graph augmentation (0.1–0.2 edge drop, 10% feature masking) for invariant prototype learning.
- Employ Sinkhorn temperature $0.05$–$0.2$ and batch sizes $256$–$1024$ per assignment round.
- Optimize pseudo-labels for 200 epochs and condensed features for 100 steps.
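The augmentation recipe from the tips above can be sketched as follows (NumPy; `augment` is an illustrative helper, not a library API):

```python
import numpy as np

def augment(edge_index, X, p_edge=0.15, p_feat=0.10, seed=None):
    """Stochastic graph augmentation for invariant prototype learning:
    drop each edge with probability p_edge and zero-mask each feature
    entry with probability p_feat."""
    rng = np.random.default_rng(seed)
    keep = rng.random(edge_index.shape[1]) >= p_edge
    ei = edge_index[:, keep]                 # edge dropping
    mask = rng.random(X.shape) >= p_feat     # feature masking
    return ei, X * mask

edge_index = np.array([[0, 1, 2, 3],
                       [1, 2, 3, 0]])        # toy 4-cycle, 4 directed edges
X = np.ones((4, 3))
ei, Xa = augment(edge_index, X, seed=0)
```

Two independent calls per batch yield the two views whose assignments are swapped in the prototype loss.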
In summary, Pseudo-Labeled Graph Condensation provides a unified paradigm for efficient, robust, and minimally supervised graph reduction methods, yielding synthetic datasets that preserve latent geometric and predictive information under adverse labeling conditions and facilitating scalable GNN training for both static and continual learning scenarios (Nandy et al., 15 Jan 2026, Liu et al., 2023).