- The paper presents Frequency-Decoupled Guidance (FDG) which separates guidance into low- and high-frequency components to improve sample fidelity and diversity.
- It employs frequency decomposition techniques, such as Laplacian pyramids, to address the trade-offs between global structure and fine details in diffusion models.
- Empirical evaluations demonstrate significant improvements in FID, recall, and prompt alignment across various models, validating FDG's effectiveness.
Guidance in the Frequency Domain for High-Fidelity Diffusion Sampling at Low CFG Scales
The paper "Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales" (2506.19713) presents a systematic analysis and practical enhancement of classifier-free guidance (CFG) in diffusion models by decomposing the guidance signal into frequency components. The authors introduce Frequency-Decoupled Guidance (FDG), a plug-and-play modification to CFG that applies distinct guidance strengths to low- and high-frequency components, thereby improving sample fidelity and diversity, especially at low guidance scales.
Motivation and Analysis
CFG is a widely adopted technique in conditional diffusion models, interpolating between conditional and unconditional model predictions to improve sample quality and prompt alignment. However, standard CFG applies a uniform guidance scale across all frequency components, leading to a well-known trade-off: high guidance scales improve fidelity and prompt alignment but reduce diversity and introduce oversaturation, while low guidance scales preserve diversity but yield blurry, low-quality samples.
The authors analyze the effect of CFG in the frequency domain, leveraging linear and invertible transforms such as Laplacian pyramids or wavelet decompositions. Their empirical findings are:
- Low-frequency guidance primarily controls global structure and prompt alignment, but excessive scaling in this band reduces diversity and causes oversaturation.
- High-frequency guidance enhances visual details and fidelity, with minimal impact on diversity or global structure.
This decomposition reveals that the adverse effects of high CFG scales are predominantly due to over-amplification of low-frequency components, while the benefits for detail and sharpness are attributable to high-frequency guidance.
Frequency-Decoupled Guidance (FDG)
Building on this insight, the authors propose FDG, which applies separate guidance scales to low- and high-frequency components of the CFG signal. The method is implemented as follows:
- Decompose the conditional and unconditional model predictions into low- and high-frequency components using a frequency transform (e.g., Laplacian pyramid).
- Apply distinct guidance scales: use a conservative scale for low frequencies (to preserve diversity and avoid oversaturation) and a higher scale for high frequencies (to enhance detail).
- Reconstruct the guided prediction by inverting the frequency transform.
- Proceed with the standard diffusion sampling step using the modified prediction.
This approach requires only minor modifications to the standard CFG sampling loop and introduces negligible computational overhead. The method is compatible with any pretrained diffusion model and does not require retraining or fine-tuning.
Pseudocode
High-level pseudocode for the FDG guidance step is as follows:
def fdg_guidance(pred_cond, pred_uncond, low_scale, high_scale,
                 freq_decompose, freq_recompose):
    # Decompose predictions into low- and high-frequency bands
    cond_low, cond_high = freq_decompose(pred_cond)
    uncond_low, uncond_high = freq_decompose(pred_uncond)
    # Apply a separate guidance scale in each band
    guided_low = uncond_low + low_scale * (cond_low - uncond_low)
    guided_high = uncond_high + high_scale * (cond_high - uncond_high)
    # Recompose the guided prediction
    return freq_recompose(guided_low, guided_high)
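The freq_decompose / freq_recompose pair can be any linear, invertible split. As an illustration only (not the paper's exact transform), the sketch below uses a single-level Laplacian-style split in NumPy, where the low band is a separable box blur and the high band is the residual; the kernel size is an arbitrary choice:

```python
import numpy as np

def _blur(x, k=5):
    # Separable box blur: a simple illustrative low-pass filter
    kernel = np.ones(k) / k
    for axis in range(x.ndim):
        x = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, x)
    return x

def freq_decompose(x):
    # Low band = blurred signal, high band = residual detail
    low = _blur(x)
    return low, x - low

def freq_recompose(low, high):
    # The split is exactly invertible: low + high == x
    return low + high
```

Because the high band is defined as the residual, reconstruction is exact by construction, so the only approximation FDG introduces is the per-band rescaling itself.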
Empirical Results
The authors conduct extensive experiments on class-conditional and text-to-image diffusion models, including EDM2, DiT-XL/2, Stable Diffusion 2.1, SDXL, and Stable Diffusion 3. Key findings include:
- Consistent improvement in FID and recall across all tested models and samplers, indicating better sample quality and diversity.
- Superior prompt alignment and detail at low guidance scales, as measured by CLIP Score and by human-preference metrics such as ImageReward.
- Compatibility with fast/distilled samplers (e.g., SDXL-Lightning), where standard CFG often degrades output quality.
- Improved text rendering in generated images, particularly for models where high guidance scales typically cause artifacts.
Quantitative results show that FDG outperforms standard CFG on FID, recall, and prompt alignment metrics, often by substantial margins. For example, on DiT-XL/2, FDG achieves an FID of 5.33 versus 9.31 for CFG, and recall of 0.65 versus 0.54.
Implementation Considerations
- Frequency decomposition: The method is robust to the choice of decomposition (Laplacian pyramid or wavelet transform), provided the transform meaningfully separates low and high frequencies.
- Parameter selection: The low-frequency guidance scale should be set conservatively (e.g., 1–2), while the high-frequency scale can be set higher (e.g., 5–10), depending on the model and task.
- Computational cost: The additional cost is negligible compared to the overall sampling process, as frequency transforms are efficient and the main bottleneck remains the neural network forward passes.
- Integration: FDG can be implemented as a wrapper around the model’s prediction step, making it easy to integrate into existing diffusion pipelines.
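To illustrate the wrapper pattern from the last point, the sketch below wraps a generic predictor that returns conditional and unconditional outputs; make_fdg_predictor is a hypothetical helper name, and the model, decompose, and recompose arguments are placeholders for the user's own components:

```python
import numpy as np

def make_fdg_predictor(model_fn, low_scale, high_scale,
                       freq_decompose, freq_recompose):
    """Wrap a (x_t, t, cond) -> (pred_cond, pred_uncond) predictor with FDG.

    The returned callable is a drop-in replacement for the guided
    prediction used by a standard CFG sampling loop.
    """
    def predict(x_t, t, cond):
        pred_cond, pred_uncond = model_fn(x_t, t, cond)
        c_low, c_high = freq_decompose(pred_cond)
        u_low, u_high = freq_decompose(pred_uncond)
        # Conservative scale on the low band, stronger scale on the high band
        g_low = u_low + low_scale * (c_low - u_low)
        g_high = u_high + high_scale * (c_high - u_high)
        return freq_recompose(g_low, g_high)
    return predict
```

Note that when low_scale == high_scale the wrapper reduces exactly to standard CFG, which makes it easy to sanity-check an integration before tuning the two scales separately.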
Theoretical and Practical Implications
The frequency-domain analysis of CFG provides a principled explanation for the observed trade-offs in diffusion model sampling. By decoupling guidance across frequency bands, FDG enables high-fidelity, diverse, and prompt-aligned generation at low guidance scales, which was previously unattainable with standard CFG. This has several implications:
- Improved generative quality: Applications requiring both diversity and detail (e.g., creative content generation, data augmentation) benefit from FDG without retraining.
- Better control: The method allows fine-grained control over the trade-off between structure, detail, and diversity, which can be tuned per application.
- Compatibility: FDG is compatible with other guidance and diversity-enhancing techniques (e.g., CADS, APG), and can be combined for further gains.
Future Directions
The paper identifies several avenues for further research:
- Adaptive or learned frequency scaling: Automatically adjusting guidance scales per sample or timestep could further improve results.
- Extension to other modalities: The approach may generalize to audio, video, or 3D generative models where frequency decomposition is meaningful.
- Integration with training: Incorporating frequency-aware objectives during model training could yield additional benefits.
Conclusion
This work provides a rigorous analysis and practical solution to the limitations of classifier-free guidance in diffusion models. By leveraging frequency-domain decomposition, FDG achieves high-fidelity, diverse, and prompt-aligned generation at low guidance scales, with minimal implementation complexity and broad applicability. The method is poised to become a standard enhancement for conditional diffusion model sampling, and its underlying insights may inform future developments in generative modeling and controllable synthesis.