Representation-Aligned Guidance (REPA-G)
- Representation-Aligned Guidance (REPA-G) is a method that aligns a diffusion model’s internal representations with high-quality pretrained encoder features to improve convergence and control.
- It employs explicit regularization using losses like MSE and cosine similarity to enforce both spatial structure and semantic fidelity during training and inference.
- Empirical results across image, molecular, and protein domains demonstrate that REPA-G significantly accelerates training while reducing FID and enhancing output quality.
Representation-Aligned Guidance (REPA-G) is a methodology for integrating pretrained feature representations into the training and inference of generative diffusion models. By enforcing alignment between the internal representations of a diffusion model and those from powerful, pretrained visual encoders, REPA-G provides an inductive bias that accelerates convergence, improves fidelity, and enables flexible semantic and spatial control during both sampling and inverse-problem solving.
1. Principles and Formal Definition
Representation-Aligned Guidance is defined as the explicit regularization or inference-time steering of diffusion model hidden states toward features produced by frozen, high-quality encoders such as DINOv2, CLIP, or VAEs. At each denoising step, the diffusion model’s intermediate features (patch tokens, latent maps, or layer outputs) are compared to the corresponding features extracted from the reference image or from a proxy via a pretrained vision encoder. The principal alignment losses are:
- Mean-squared error (MSE) between projected diffusion features and teacher features (Singh et al., 11 Dec 2025).
- Cosine similarity loss or normalized cross-entropy (NT-Xent) on patch-token features (Yu et al., 2024).
- Inverse-problem REPA guidance, formulated as a feature divergence or MMD-minimization in embedding space, using the proxy features as target (Sfountouris et al., 21 Nov 2025).
For patch-wise alignment using external representations, the loss takes the form

$$\mathcal{L}_{\text{REPA}} = -\frac{1}{N}\sum_{n=1}^{N} \cos\big(z^{*}_{n},\, h_{\phi}(h_n)\big)$$

where $z^{*}_{n}$ are the external encoder's patch features, $h_{\phi}$ is a learned MLP projection, and $h_n$ are the intermediate diffusion features.
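A minimal pure-Python sketch of this patch-wise cosine alignment loss (function names are illustrative, not taken from any cited implementation):

```python
import math

def cosine(u, v):
    # cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def repa_alignment_loss(teacher_patches, projected_patches):
    # negative mean cosine similarity over aligned patch tokens;
    # teacher_patches play the role of z*_n, projected_patches the
    # role of the MLP-projected diffusion features h_phi(h_n)
    sims = [cosine(z, p) for z, p in zip(teacher_patches, projected_patches)]
    return -sum(sims) / len(sims)
```

Perfect alignment yields a loss of -1; orthogonal features yield 0.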
2. Global Semantic Information vs. Spatial Structure
A key finding is that standard practice—selecting the strongest classifier-accuracy encoder as the REPA target—is suboptimal. Instead:
Global semantic information: Measured by linear probing accuracy (ImageNet-1K), corresponds to the encoder’s overall discriminative ability.
Spatial structure: Defined as pairwise similarity between patch tokens, quantified using metrics such as LDS (local-vs-distant similarity), CDS (center-distance similarity), SRSS (spatial relative self-similarity), and RMSC (relative mean spatial correlation) (Singh et al., 11 Dec 2025).
Empirical analysis over 27 vision encoders and multiple model scales reveals that the spatial metrics correlate strongly with generation FID (high Pearson correlation), while global linear-probing accuracy is only weakly correlated. This indicates that spatial organization of the feature space, not overall discriminative ability, is the primary determinant of generation quality in REPA-guided diffusion models.
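To illustrate what a local-vs-distant spatial metric measures, the following is a hypothetical LDS-style score, not the papers' exact definition: it contrasts the mean cosine similarity between nearby patch tokens against that between distant ones, so spatially coherent feature maps score higher.

```python
import math
from itertools import combinations

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def local_vs_distant(patches, radius=1):
    # patches: {(row, col): feature vector} on a 2D patch grid.
    # Returns mean local cosine similarity minus mean distant cosine
    # similarity, using Chebyshev distance <= radius as "local".
    local, distant = [], []
    for (p, u), (q, v) in combinations(patches.items(), 2):
        d = max(abs(p[0] - q[0]), abs(p[1] - q[1]))
        (local if d <= radius else distant).append(_cos(u, v))
    return sum(local) / len(local) - sum(distant) / len(distant)
```

A feature map where neighbors are similar and far-apart patches differ produces a positive score.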
3. Architectures and Alignment Mechanisms
The REPA-G framework utilizes both training-phase and inference-phase mechanisms:
- Training-phase regularization: Adds an auxiliary alignment loss at one or more diffusion layers, using an MLP or convolutional projection head to match to the external target.
- Vanilla REPA: 3-layer MLP projection for feature alignment.
- iREPA: Replaces MLP with a local 2D convolution (kernel size 3, padding 1) and introduces spatial normalization of teacher features, boosting local token contrast while suppressing “global overlay” artifacts (Singh et al., 11 Dec 2025).
- VAE-REPA: Uses bottleneck features from a pretrained VAE for efficient, intrinsic alignment without external encoder calls (Wang et al., 25 Jan 2026).
- Inference-time guidance: Injects gradients (potential terms) or representation predictors during the diffusion or flow-based sampling process.
- Test-time REPA-G: Steers denoising via gradients of a similarity potential between current latent representations and reference (target) features (Sereyjol-Garros et al., 3 Feb 2026, Sfountouris et al., 21 Nov 2025).
- Training-free guidance: Projects noisy latents to predicted clean embeddings using a frozen MLP, providing semantic anchors during sampling (Zu et al., 30 Jan 2026).
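Schematically, an inference-time guidance step can be viewed as a gradient update on a feature-matching potential. The sketch below assumes a toy linear encoder `E` so the gradient can be written in closed form; real systems differentiate through a frozen deep encoder:

```python
def guided_denoise_step(x_pred, E, z_star, lam=0.5):
    # One hypothetical guidance update: nudge the current denoised
    # estimate x_pred so its features E @ x_pred move toward the
    # reference features z_star. E is a list of rows (a toy linear map).
    n = len(x_pred)
    feats = [sum(row[j] * x_pred[j] for j in range(n)) for row in E]
    resid = [f - z for f, z in zip(feats, z_star)]
    # gradient of 0.5 * ||E x - z*||^2 w.r.t. x is E^T (E x - z*)
    grad = [sum(E[i][j] * resid[i] for i in range(len(E))) for j in range(n)]
    return [xp - lam * g for xp, g in zip(x_pred, grad)]
```

With `E` the identity, each step moves the estimate a fraction `lam` of the way toward the target features.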
Pseudocode for training-phase REPA loss:
```python
z = V(x)                                    # VAE latent of the clean image x
eps = Normal(0, 1).sample_like(z)           # Gaussian noise
t = Uniform(0, 1).sample(batch_size)        # diffusion time
x_t = alpha(t) * z + sigma(t) * eps         # noised latent
_, h_k = model(x_t, t, return_hidden_at=layer_k)  # features at layer k
z_star = E(x)                               # frozen external-encoder features
p = g(h_k)                                  # projection head (MLP or conv)
L_repa = -cos(p, z_star).mean()             # alignment loss
L_diff = MSE(eps, model(x_t, t)[0])         # standard diffusion loss
L = L_diff + lam * L_repa                   # lam: alignment weight
L.backward(); optimizer.step()
```
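The spatial normalization that iREPA applies to teacher features can be sketched as per-channel standardization over spatial positions; the exact normalization in the paper may differ, so treat this as an illustrative approximation:

```python
import math

def spatial_normalize(feat_map):
    # feat_map: 2D grid of patch feature vectors, feat_map[row][col] -> list[float].
    # Normalize each channel to zero mean / unit variance across spatial
    # positions, boosting local token contrast relative to any global offset.
    patches = [v for row in feat_map for v in row]
    dim = len(patches[0])
    out = [[list(v) for v in row] for row in feat_map]
    for c in range(dim):
        vals = [p[c] for p in patches]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((x - mean) ** 2 for x in vals) / len(vals)) or 1.0
        for row in out:
            for v in row:
                v[c] = (v[c] - mean) / std
    return out
```

Removing the shared per-channel offset suppresses the "global overlay" component so that alignment focuses on spatial variation between tokens.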
4. Implementation Enhancements and Recipe Variants
Several enhancements and recipe modifications have demonstrated significant gains in both training efficiency and generative quality:
- iREPA spatial transfer: Convolutional projection and spatial normalization consistently accelerate convergence (20–40% speedup) and reduce FID by 5–25% across architectures (REPA, REPA-E, Meanflow, JiT) and pretrained encoders. Minimal code changes (4 lines) suffice for adoption (Singh et al., 11 Dec 2025).
- Multimodal alignment: REPA-G aligns model features to both image and synthetic auxiliary modality (e.g., text caption from VLM), using a joint loss and multimodal curriculum (Wang et al., 11 Jul 2025).
- Curriculum phase-in: Scheduling diffusion and representation losses—starting with harder representation learning before gradually increasing diffusion loss weighting—improves learning and convergence (Wang et al., 11 Jul 2025).
- Intrinsic alignment: VAE-REPA reuses stable, noise-free VAE features for built-in guidance, improving generation metrics and matching external-encoder solutions with just 4% extra GFLOPs (Wang et al., 25 Jan 2026).
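A curriculum phase-in of the kind described above might look like the following schedule; the linear ramp and the `ramp_frac` value are assumptions, not the papers' exact recipe:

```python
def curriculum_weights(step, total_steps, ramp_frac=0.3):
    # Hypothetical phase-in schedule: the representation-alignment loss
    # carries full weight from the start, while the diffusion loss ramps
    # linearly to full weight over the first ramp_frac of training.
    ramp_steps = max(1, int(ramp_frac * total_steps))
    w_diff = min(step / ramp_steps, 1.0)
    w_repa = 1.0
    return w_diff, w_repa
```

The total objective at each step would then be `w_diff * L_diff + w_repa * L_repa`, so representation learning dominates early training.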
5. Empirical Results Across Domains
REPA-G and its variants have achieved substantial improvements across image, molecular, and protein domains:
| Model / Recipe | Training Iters / Epochs | FID / Quality | Speedup / Other Gains |
|---|---|---|---|
| SiT-XL/2 (Vanilla) | 7M | FID=8.3 | Baseline |
| SiT-XL/2 + REPA | 400K | FID=7.9 | 17.5× speedup |
| SiT-XL/2 + iREPA | 60–80K | FID=7.9 | 20–40% faster than REPA |
| SiT-XL/2 + REPA-G | 300K | FID=8.2 | 23.3× faster than baseline (Wang et al., 11 Jul 2025) |
| REPA-E + iREPA | (diverse encoders) | FID -15–25% | Applies to DINOv2-B, DINOv3-B, WebSSL-1B, PE-G |
| VAE-REPA | 1M–4M | FID=6.6 | Matches/ext. encoder, +7× speedup (Wang et al., 25 Jan 2026) |
| REPA-G (protein folding) | 50 epochs | scRMSD↓, pLDDT↑ | 3.6× epochs vs. base (Wang et al., 11 Jul 2025) |
| REPA-G (inverse) | 20–50 steps | LPIPS↓, FID↓ | 2–4× step reduction, sharper structure (Sfountouris et al., 21 Nov 2025) |
In addition, test-time REPA-G yields flexible semantic and spatial control:
- ImageNet 256×256: FID 22.5 (uncond.) → 12.1 (λ=3.0) with REPA-G; IS improves from 52.0 → 59.3; diversity is maintained.
- COCO-aesthetics: better CLIP-FID and higher user-study preference for REPA-G vs. classifier-free guidance (Sereyjol-Garros et al., 3 Feb 2026).
- Training-free projector: REPA-XL/2 (no CFG) FID 6.82 → 3.34; SiT-XL/2 FID 9.35 → 6.17; SiT-XL/2 + CFG FID 2.15 → 2.08 (Zu et al., 30 Jan 2026).
6. Theoretical Foundation
Theoretical justification for REPA-G centers on its effect of tilting the sampling distribution or loss surface via a perceptual feature prior. Specifically:
- The addition of a potential term $V$ tilts the sampling distribution toward one proportional to $p(x)\,e^{-V(x)}$, where $V$ penalizes dissimilarity to the reference features; convergence to the intended manifold can be proven under regularity conditions (Sereyjol-Garros et al., 3 Feb 2026).
- Feature-space divergence minimization: REPA regularization reduces Maximum Mean Discrepancy in the embedding space of the external encoder, directly enhancing perceptual fidelity (Sfountouris et al., 21 Nov 2025).
- Feature contraction: each REPA update with a sufficiently small step size $\eta$ provably contracts the internal diffusion representations toward those of the reference, i.e., $\|h^{(t+1)} - h^{*}\| \le \gamma\,\|h^{(t)} - h^{*}\|$ with $\gamma < 1$ (Sfountouris et al., 21 Nov 2025).
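The contraction property can be checked numerically in the simplest case, a gradient step on the squared feature distance, where the contraction factor is exactly $1-\eta$; this toy setting is an illustration, not the papers' general result:

```python
def repa_update(h, h_star, eta=0.25):
    # gradient step on 0.5 * ||h - h*||^2: moves internal features toward
    # the reference; each step contracts the distance by a factor (1 - eta)
    return [hi + eta * (si - hi) for hi, si in zip(h, h_star)]

def distance(u, v):
    # Euclidean distance between feature vectors
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Iterating the update shrinks the feature distance geometrically, matching the stated contraction bound with $\gamma = 1 - \eta$.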
7. Implications, Guidelines, and Limitations
The empirical findings and theoretical results suggest several design and implementation principles for REPA-G:
- Quantify and maximize local spatial structure (patch-wise similarity, LDS/CDS/SRSS/RMSC metrics), not just global semantics (Singh et al., 11 Dec 2025).
- Integrate inductive biases preserving locality, such as convolutional projections or spatial normalization (Singh et al., 11 Dec 2025).
- Deploy multimodal and multi-scale alignment for more expressive control, leveraging synthetic modalities and multi-layer features (Wang et al., 11 Jul 2025, Sereyjol-Garros et al., 3 Feb 2026).
- For inverse problems, utilize proxy reconstructions as feature targets, maintaining perceptual consistency with fewer steps (Sfountouris et al., 21 Nov 2025).
- Schedule alignment objectives carefully, with phase-in curricula for robust representation learning (Wang et al., 11 Jul 2025).
- For diffusion transformers, prefer early-layer alignment for optimal expressivity; deep-layer alignment can hinder final semantic refinement (Wang et al., 25 Jan 2026).
- Batch normalization and consistent feature normalization are crucial for stable training and inference (Sereyjol-Garros et al., 3 Feb 2026).
Limitations include dependency on pretrained encoder quality and domain, memory overhead for storing features, and potential interference when aligning in late transformer layers. Adaptive scheduling and layer selection remain areas for further research. REPA-G is extensible to non-image domains (video, molecules, proteins), with results confirmed in protein inverse folding and 3D molecule generation (Wang et al., 11 Jul 2025).
In summary, Representation-Aligned Guidance establishes a robust paradigm for enhancing diffusion-based generation, inverse problem solving, and controlled synthesis through explicit alignment to pretrained feature spaces. By prioritizing spatial structure over raw classifier accuracy and leveraging lightweight architectural modifications, REPA-G yields state-of-the-art convergence and fidelity across architectures, domains, and sampling regimes (Singh et al., 11 Dec 2025; Yu et al., 2024; Wang et al., 11 Jul 2025; Sereyjol-Garros et al., 3 Feb 2026; Wang et al., 25 Jan 2026; Sfountouris et al., 21 Nov 2025; Zu et al., 30 Jan 2026).