
Representation-Aligned Guidance (REPA-G)

Updated 5 February 2026
  • Representation-Aligned Guidance (REPA-G) is a method that aligns a diffusion model’s internal representations with high-quality pretrained encoder features to improve convergence and control.
  • It employs explicit regularization using losses like MSE and cosine similarity to enforce both spatial structure and semantic fidelity during training and inference.
  • Empirical results across image, molecular, and protein domains demonstrate that REPA-G significantly accelerates training while reducing FID and enhancing output quality.

Representation-Aligned Guidance (REPA-G) is a methodology for integrating pretrained feature representations into the training and inference of generative diffusion models. By enforcing alignment between the internal representations of a diffusion model and those from powerful pretrained visual encoders, REPA-G provides an inductive bias that accelerates convergence, improves fidelity, and enables flexible semantic and spatial control during both sampling and inverse problem solving.

1. Principles and Formal Definition

Representation-Aligned Guidance is defined as the explicit regularization or inference-time steering of diffusion model hidden states toward features produced by frozen, high-quality encoders such as DINOv2, CLIP, or VAEs. At each denoising step, the diffusion model’s intermediate features (patch tokens, latent maps, or layer outputs) are compared to the corresponding features extracted from the reference image or from a proxy via a pretrained vision encoder. The principal alignment losses are:

For patch-wise alignment using external representations:

$$\mathcal{L}_{\mathrm{REPA}}(\theta, \phi) = -\,\mathbb{E}_{x,\epsilon,t}\left[ \frac{1}{N} \sum_{n=1}^{N} \mathrm{sim}\big( z^*_n,\; g_\phi(f_\theta(x_t, t))_n \big) \right]$$

Here, $z^*_n$ are the external encoder's patch features, $g_\phi$ is a learned MLP projection head, and $f_\theta(x_t, t)$ are the intermediate diffusion features at noise level $t$.
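As a concrete sketch of this loss, the following PyTorch snippet computes the negative mean patch-wise cosine similarity between projected diffusion features and frozen-encoder features. The three-layer projection head and all dimensions are illustrative placeholders, not the exact configuration from any specific paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repa_loss(diffusion_feats, target_feats, proj):
    """Negative mean patch-wise cosine similarity between projected
    diffusion features and frozen-encoder patch features.

    diffusion_feats: (B, N, D_model) intermediate tokens f_theta(x_t, t)
    target_feats:    (B, N, D_enc) patch features z* from the frozen encoder
    proj:            trainable head g_phi mapping D_model -> D_enc
    """
    p = proj(diffusion_feats)                           # (B, N, D_enc)
    sim = F.cosine_similarity(p, target_feats, dim=-1)  # (B, N)
    return -sim.mean()

# Example with random tensors (shapes are illustrative assumptions)
B, N, D_model, D_enc = 2, 256, 1152, 768
proj = nn.Sequential(
    nn.Linear(D_model, D_model), nn.SiLU(),
    nn.Linear(D_model, D_model), nn.SiLU(),
    nn.Linear(D_model, D_enc),
)
loss = repa_loss(torch.randn(B, N, D_model), torch.randn(B, N, D_enc), proj)
```

During training this scalar is added to the standard denoising objective with a weighting coefficient, as shown in the pseudocode of Section 3.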

2. Global Semantic Information vs. Spatial Structure

A key finding is that standard practice—selecting the strongest classifier-accuracy encoder as the REPA target—is suboptimal. Instead:

Global semantic information: Measured by linear probing accuracy (ImageNet-1K), corresponds to the encoder’s overall discriminative ability.

Spatial structure: Defined as pairwise similarity between patch tokens, quantified using metrics such as LDS (local-vs-distant similarity), CDS (center-distance similarity), SRSS (spatial relative self-similarity), and RMSC (relative mean spatial correlation) (Singh et al., 11 Dec 2025).

Empirical analysis over 27 vision encoders and multiple model scales reveals that the spatial metrics correlate strongly with generation FID (Pearson $|r| > 0.85$), while global linear-probe accuracy is only weakly correlated ($|r| = 0.26$). This indicates that spatial organization of the feature space, not discriminative power, is the primary determinant of generation quality for REPA-guided diffusion models.
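All of the spatial metrics above are statistics computed over an encoder's patch-token self-similarity matrix. The sketch below builds that matrix and a simplified local-vs-distant similarity score as an illustrative proxy; it is not the exact definition of LDS, CDS, SRSS, or RMSC from the paper:

```python
import torch
import torch.nn.functional as F

def patch_self_similarity(tokens):
    """Pairwise cosine similarity between an encoder's patch tokens.

    tokens: (N, D) patch features for one image.
    Returns an (N, N) similarity matrix.
    """
    t = F.normalize(tokens, dim=-1)
    return t @ t.T

# Illustrative proxy on a 16x16 token grid: mean similarity of spatially
# near patches minus that of distant patches (a simplified local-vs-distant
# score, hypothetical rather than the paper's formula).
tokens = torch.randn(256, 768)            # placeholder encoder output
S = patch_self_similarity(tokens)
grid = 16
idx = torch.arange(256)
dist = (idx[:, None] // grid - idx[None, :] // grid).abs() + \
       (idx[:, None] % grid - idx[None, :] % grid).abs()   # Manhattan distance
local = S[(dist > 0) & (dist <= 2)].mean()
distant = S[dist > 8].mean()
local_vs_distant = (local - distant).item()
```

For a well-structured encoder, nearby patches should be measurably more similar than distant ones, so this score is positive on real features.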

3. Architectures and Alignment Mechanisms

The REPA-G framework utilizes both training-phase and inference-phase mechanisms:

  • Training-phase regularization: Adds an auxiliary alignment loss at one or more diffusion layers, using an MLP or convolutional projection head to match to the external target.
    • Vanilla REPA: 3-layer MLP projection for feature alignment.
    • iREPA: Replaces MLP with a local 2D convolution (kernel size 3, padding 1) and introduces spatial normalization of teacher features, boosting local token contrast while suppressing “global overlay” artifacts (Singh et al., 11 Dec 2025).
    • VAE-REPA: Uses bottleneck features from a pretrained VAE for efficient, intrinsic alignment without external encoder calls (Wang et al., 25 Jan 2026).
  • Inference-time guidance: Injects gradients (potential terms) or representation predictors during the diffusion or flow-based sampling process.
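The iREPA projection described above can be sketched as follows: a 2D convolution (kernel 3, padding 1) over the token grid replaces the token-wise MLP, and teacher features are spatially normalized. The normalization shown here (per-image centering and rescaling across tokens) is one plausible reading; the paper's exact formulation may differ, and all dimensions are placeholders:

```python
import torch
import torch.nn as nn

class ConvProjector(nn.Module):
    """iREPA-style local projection head: a 2D convolution over the
    token grid instead of a token-wise MLP."""
    def __init__(self, d_model, d_enc, grid):
        super().__init__()
        self.grid = grid
        self.conv = nn.Conv2d(d_model, d_enc, kernel_size=3, padding=1)

    def forward(self, tokens):  # tokens: (B, N, d_model), N = grid * grid
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = self.conv(x)                        # local spatial mixing
        return x.flatten(2).transpose(1, 2)     # back to (B, N, d_enc)

def spatially_normalize(teacher):
    """Assumed spatial normalization of teacher features: remove the
    per-image mean token and rescale, boosting local token contrast."""
    centered = teacher - teacher.mean(dim=1, keepdim=True)
    return centered / (centered.std(dim=1, keepdim=True) + 1e-6)

proj = ConvProjector(d_model=1152, d_enc=768, grid=16)
out = proj(torch.randn(2, 256, 1152))
target = spatially_normalize(torch.randn(2, 256, 768))
```

The convolution mixes only a 3×3 token neighborhood, which is what transfers local spatial structure from the teacher rather than a global overlay.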

Pseudocode for training-phase REPA loss:

# One REPA training step (PyTorch-style pseudocode; V, E, g, alpha, sigma,
# model, lam, layer_k, and optimizer are defined elsewhere)
z = V(x)                                     # encode image x with the frozen VAE V
epsilon = torch.randn_like(z)                # Gaussian noise
t = torch.rand(batch_size)                   # diffusion times ~ Uniform(0, 1)
x_t = alpha(t) * z + sigma(t) * epsilon      # noised latent at time t
eps_pred, hidden_k = model(x_t, t, return_hidden_at=layer_k)  # one forward pass
z_star = E(x)                                # patch features from frozen encoder E
p = g(hidden_k)                              # trainable projection head g_phi
L_repa = -F.cosine_similarity(p, z_star, dim=-1).mean()  # alignment term
L_diff = F.mse_loss(eps_pred, epsilon)       # standard denoising objective
L = L_diff + lam * L_repa                    # lam weights the alignment loss
L.backward(); optimizer.step()
(Yu et al., 2024)

4. Implementation Enhancements and Recipe Variants

Several enhancements and recipe modifications have demonstrated significant gains in both training efficiency and generative quality:

  • iREPA spatial transfer: Convolutional projection and spatial normalization consistently accelerate convergence (20–40% speedup) and reduce FID by 5–25% across architectures (REPA, REPA-E, Meanflow, JiT) and pretrained encoders. Fewer than 4 changed lines of code suffice for adoption (Singh et al., 11 Dec 2025).
  • Multimodal alignment: REPA-G aligns model features to both image and synthetic auxiliary modality (e.g., text caption from VLM), using a joint loss and multimodal curriculum (Wang et al., 11 Jul 2025).
  • Curriculum phase-in: Scheduling diffusion and representation losses—starting with harder representation learning before gradually increasing diffusion loss weighting—improves learning and convergence (Wang et al., 11 Jul 2025).
  • Intrinsic alignment: VAE-REPA reuses stable, noise-free VAE features for built-in guidance, improving generation metrics and matching external-encoder solutions with just 4% extra GFLOPs (Wang et al., 25 Jan 2026).
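The curriculum phase-in above can be sketched as a simple loss-weight schedule: keep the representation-alignment loss on from the start and ramp the diffusion-loss weight in linearly. The schedule shape and warmup fraction below are assumptions for illustration, not the exact recipe from the paper:

```python
def loss_weights(step, total_steps, warmup_frac=0.25):
    """Illustrative curriculum: emphasize representation alignment early,
    then ramp the diffusion-loss weight linearly toward 1.0."""
    warmup = warmup_frac * total_steps
    w_diff = min(1.0, step / warmup)   # diffusion loss phased in
    w_repa = 1.0                       # alignment loss kept on throughout
    return w_diff, w_repa

# Early in training the total loss is dominated by the alignment term:
w_diff, w_repa = loss_weights(step=1000, total_steps=100_000)
# total_loss = w_diff * L_diff + w_repa * L_repa
```

With these placeholder values, `w_diff` is still small (0.04) at step 1000, so representation learning leads before the denoising objective takes over.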

5. Empirical Results Across Domains

REPA-G and its variants have achieved substantial improvements across image, molecular, and protein domains:

Model / Recipe             | Training Iters / Epochs | FID / Quality     | Speedup / Other Gains
---------------------------|-------------------------|-------------------|----------------------------------------------------
SiT-XL/2 (vanilla)         | 7M iters                | 8.3               | baseline
SiT-XL/2 + REPA            | 400K iters              | 7.9               | 17.5× training speedup
SiT-XL/2 + iREPA           | 60–80K iters            | 7.9               | 20–40% faster than REPA
SiT-XL/2 + REPA-G          | 300K iters              | 8.2               | 23.3× faster than baseline (Wang et al., 11 Jul 2025)
REPA-E + iREPA             | (diverse encoders)      | FID −15–25%       | applies to DINOv2-B, DINOv3-B, WebSSL-1B, PE-G
VAE-REPA                   | 1M–4M iters             | 6.6               | matches external encoders, 7× speedup (Wang et al., 25 Jan 2026)
REPA-G (protein folding)   | 50 epochs               | scRMSD↓, pLDDT↑   | 3.6× fewer epochs vs. base (Wang et al., 11 Jul 2025)
REPA-G (inverse problems)  | 20–50 steps             | LPIPS↓, FID↓      | 2–4× step reduction, sharper structure (Sfountouris et al., 21 Nov 2025)

In addition, test-time REPA-G yields flexible semantic and spatial control:
  • ImageNet 256×256: FID 22.5 (unconditional) → 12.1 (λ = 3.0) with REPA-G; IS improves from 52.0 → 59.3 with diversity maintained.
  • COCO-aesthetics: improved CLIP-FID and stronger user-study preference for REPA-G over classifier-free guidance (Sereyjol-Garros et al., 3 Feb 2026).
  • Training-free projector: REPA-XL/2 (no CFG) FID 6.82 → 3.34; SiT-XL/2 FID 9.35 → 6.17; SiT-XL/2 + CFG FID 2.15 → 2.08 (Zu et al., 30 Jan 2026).

6. Theoretical Foundation

Theoretical justification for REPA-G centers on its effect of tilting the sampling distribution or loss surface via a perceptual feature prior. Specifically:

  • Adding a potential term tilts sampling toward a distribution proportional to $\exp(-\lambda \mathcal{L}_{\mathrm{rep}})$; convergence to the intended manifold can be proven under regularity conditions (Sereyjol-Garros et al., 3 Feb 2026).
  • Feature-space divergence minimization: REPA regularization reduces Maximum Mean Discrepancy in the embedding space of the external encoder, directly enhancing perceptual fidelity (Sfountouris et al., 21 Nov 2025).
  • Feature contraction: each REPA update with small $\lambda$ provably contracts the internal diffusion representations toward those of the reference, i.e., $\|h^{\mathrm{REPA}}_t - h^*\|_F \leq C_1 \|h_t - h^*\|_F + C_2\big(\sqrt{\mathrm{ApproxErr}} + \sqrt{\mathrm{MisREPA}}\big)$ with $C_1 < 1$ (Sfountouris et al., 21 Nov 2025).
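The potential-term tilting from the first bullet can be sketched as a guided sampling step: differentiate a representation loss with respect to the current sample and subtract its scaled gradient from the unguided update. Both `base_update` and `rep_loss_fn` below are placeholders standing in for a real denoiser update and a real feature-alignment loss:

```python
import torch

def guided_step(x_t, base_update, rep_loss_fn, lam):
    """Sketch of inference-time representation guidance: add the gradient
    of a representation loss as a potential term, tilting sampling toward
    exp(-lam * L_rep). Hypothetical interfaces for illustration."""
    x = x_t.detach().requires_grad_(True)
    loss = rep_loss_fn(x)
    grad, = torch.autograd.grad(loss, x)
    return base_update(x_t) - lam * grad   # steer the sample down the loss

# Toy usage: pull a sample toward a target feature vector
target = torch.ones(4)
x_next = guided_step(
    torch.zeros(4),
    base_update=lambda x: x,                        # identity "denoiser" for demo
    rep_loss_fn=lambda x: ((x - target) ** 2).sum(),
    lam=0.1,
)
```

Each step moves the sample a small distance toward lower representation loss, which is exactly the sampling-distribution tilt described above.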

7. Implications, Guidelines, and Limitations

The empirical findings and theoretical results above suggest practical guidelines: choose encoder targets with strong spatial structure rather than maximal classifier accuracy, prefer local (convolutional) projection heads, and phase losses in via a curriculum.

Limitations include dependency on pretrained encoder quality and domain, memory overhead for storing features, and potential interference when aligning in late transformer layers. Adaptive scheduling and layer selection remain open research directions. REPA-G extends to non-image domains (video, molecules, proteins), with results confirmed in protein inverse folding and 3D molecule generation (Wang et al., 11 Jul 2025).


In summary, Representation-Aligned Guidance establishes a robust paradigm for enhancing diffusion-based generation, inverse problem solving, and controlled synthesis through explicit alignment to pretrained feature spaces. By prioritizing spatial structure over mere classifier accuracy and leveraging lightweight architectural modifications, REPA-G yields state-of-the-art convergence and fidelity across architectures, domains, and sampling regimes (Singh et al., 11 Dec 2025; Yu et al., 2024; Wang et al., 11 Jul 2025; Sereyjol-Garros et al., 3 Feb 2026; Wang et al., 25 Jan 2026; Sfountouris et al., 21 Nov 2025; Zu et al., 30 Jan 2026).
