
FBSDiff++: Advanced Frequency-Controlled I2I

Updated 3 February 2026
  • FBSDiff++ is a plug-and-play framework that employs frequency-band substitution in the DCT spectral domain for explicit control in text-driven image-to-image translation.
  • It uses AdaFBS to adaptively handle varying resolutions and aspect ratios while achieving an 8.9× inference speed boost compared to its predecessor.
  • Empirical evaluations demonstrate that FBSDiff++ matches or surpasses state-of-the-art methods in quality, controllability, and text-image alignment.

FBSDiff++ is an advanced plug-and-play framework for highly efficient and controllable text-driven image-to-image (I2I) translation based on large-scale text-to-image (T2I) diffusion models. Building upon its predecessor FBSDiff, FBSDiff++ leverages frequency-band substitution of diffusion features—primarily in the DCT spectral domain—to enable explicit, interpretable, and fine-grained control of the guidance imparted by a source image, without requiring any model fine-tuning or modification of core neural network weights. The method delivers significant improvements in inference speed, adaptability to arbitrary image resolutions and aspect ratios, and functionality for localized and style-specific editing, while matching or surpassing state-of-the-art (SOTA) baselines in quality, controllability, and efficiency (Gao et al., 27 Jan 2026, Gao et al., 2024).

1. Theoretical Foundations: Frequency Domain Control in Diffusion Models

The foundational insight in FBSDiff++ is that the latent features (output by, e.g., a UNet in an LDM backbone such as Stable Diffusion) encode different visual factors at distinct frequency bands:

  • Low frequencies: encode global appearance and coarse structure (style, color scheme)
  • Mid frequencies: encode layout information (position of semantic regions)
  • High frequencies: encode fine details and contours

FBSDiff++ operationalizes this by decomposing intermediate latent feature maps using the 2D Discrete Cosine Transform (DCT). Given a feature map $z_t \in \mathbb{R}^{h \times w \times c}$ at any diffusion step $t$, the DCT maps each channel to a spectral domain. Frequency bands are separated using binary masks with thresholds (e.g., $t_{lp}, t_{mp_1}, t_{mp_2}, t_{hp}$ based on $i+j$ index sums), defining low-, mid-, and high-pass regions. Substituting only selected bands permits appearance-, layout-, or contour-guided translation, respectively, without retraining or modifying attention mechanisms (Gao et al., 27 Jan 2026, Gao et al., 2024).
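As an illustration of these band definitions, the following numpy sketch builds binary masks from $i+j$ index sums; the threshold values here are illustrative, not the paper's defaults:

```python
import numpy as np

def band_mask(h, w, t_low, t_high):
    """Binary mask selecting DCT coefficients whose index sum i+j falls in
    [t_low, t_high) -- one frequency band of an h x w spectral map."""
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    s = i + j  # small index sums = low frequencies, large = high frequencies
    return ((s >= t_low) & (s < t_high)).astype(np.float32)

# Low-, mid-, and high-pass masks partition the 64x64 spectrum
M_low = band_mask(64, 64, 0, 10)
M_mid = band_mask(64, 64, 10, 40)
M_high = band_mask(64, 64, 40, 64 + 64)
```

Substituting with `M_low` transfers global appearance, `M_mid` layout, and `M_high` contours, per the band roles listed above.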

Latent manipulation proceeds as follows at diffusion step $t$:

  1. Compute the 2D DCT of the reconstruction feature $\hat z_t$ and the sampling feature $\tilde z_t$, yielding spectra $\hat S$ and $\tilde S$
  2. Mask-band substitution: $\hat S \odot M + \tilde S \odot (1-M)$, where $M$ encodes the desired frequency band
  3. Apply the inverse DCT to obtain the updated $\tilde z_t$

The operation is performed during a calibration phase controlled by $\lambda$, the fraction of early diffusion steps during which frequency band substitution (FBS) is active.
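The three steps above can be sketched with numpy and SciPy's orthonormal DCT; the function name and array shapes are our assumptions, not the paper's API:

```python
import numpy as np
from scipy.fft import dctn, idctn

def fbs_step(z_rec, z_samp, M):
    """One frequency-band substitution: copy the masked DCT band of the
    reconstruction feature into the sampling feature, channel by channel.
    z_rec, z_samp: (h, w, c) latent features; M: (h, w) binary band mask."""
    S_rec = dctn(z_rec, axes=(0, 1), norm="ortho")    # step 1: forward 2D DCT
    S_samp = dctn(z_samp, axes=(0, 1), norm="ortho")
    S_new = S_rec * M[..., None] + S_samp * (1.0 - M[..., None])  # step 2
    return idctn(S_new, axes=(0, 1), norm="ortho")    # step 3: inverse DCT
```

With an all-ones mask the sampling feature is fully replaced by the reconstruction feature; with an all-zeros mask it is untouched, so intermediate masks interpolate between the two trajectories band by band.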

2. FBSDiff++ Algorithmic Structure and Comparative Architecture

FBSDiff and FBSDiff++ operate entirely in the latent space of pretrained T2I diffusion models, exemplified by Stable Diffusion v1.5, leaving the backbone weights untouched. The process may be summarized:

  • Initialization: Invert the reference image $x$ via DDIM inversion under the null text into $z_{T_{inv}}$ ($T_{inv}=1000$ for FBSDiff; $T=50$ for FBSDiff++)
  • Parallel Trajectories:
    • Reconstruction (from $z_{T_{inv}}$ under the null text, providing guidance features)
    • Sampling (from random noise under target prompt, for image generation)
  • Frequency Band Substitution: At each step, FBS merges prescribed frequency bands from the reconstruction trajectory into the sampling trajectory, for the first $\lambda T$ steps ($\lambda \approx 0.45$ by default).
  • Decoding: The final latent $\tilde z_0$ is decoded to the output image with the decoder $D(\cdot)$.

FBSDiff++ eliminates the costly reconstruction trajectory by caching the inverted trajectory and reusing its feature maps in a stackwise fashion, realizing an $8.9\times$ increase in inference speed (from 85 s to 9.6 s on an A100 for 50 sampling steps) (Gao et al., 27 Jan 2026).
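The caching/popping idea can be illustrated as follows; all names are assumptions standing in for the released implementation, with the denoiser and FBS supplied as callables. The inversion pass pushes every intermediate latent onto a stack, and sampling pops them in reverse order as guidance features, so no second reconstruction trajectory is ever run:

```python
def invert_and_cache(z0, T, inv_step):
    """DDIM-style inversion stub: returns the stack of cached latents."""
    cache, z = [], z0
    for t in range(1, T + 1):
        z = inv_step(z, t)
        cache.append(z)       # push
    return cache              # cache[-1] corresponds to z_T

def sample_with_cache(z_T, cache, T, lam, samp_step, fbs):
    """Pop cached latents stackwise; apply FBS for the first lam*T steps."""
    z = z_T
    for i, t in enumerate(range(T, 0, -1)):
        z_guid = cache.pop()  # reuse inversion feature; no re-denoising
        z = samp_step(z, t)
        if i < lam * T:
            z = fbs(z_guid, z)
    return z
```

Because inversion and sampling share the same step count ($T=50$), each popped latent aligns with the corresponding sampling timestep.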

| Feature | FBSDiff | FBSDiff++ |
|---|---|---|
| Inversion length | 1000 steps | 50 steps (matching sampling) |
| Reconstruction path | Full trajectory | Caching/popping mechanism |
| Speed (A100, 512²) | ~85 s | 9.6 s (3.5 s inversion + 6.1 s sampling) |
| Arbitrary resolution | No | Yes (via AdaFBS) |

3. Adaptive Frequency Band Substitution (AdaFBS) and Generalization

FBSDiff++ introduces AdaFBS to handle variable input resolutions and aspect ratios, which are problematic for fixed-index 2D-DCT band definitions. AdaFBS employs cascaded 1D-DCT transforms along the spatial axes, enabling percentile-based masking (e.g., low-pass fractions $p_t$ in $[0, 100]$) independent of shape:

  • 1D-DCT along width and height: Applies DCT and thresholding per axis, then reconstructs in two dimensions sequentially.
  • Spatial masking: Selective region substitution is achieved by further masking the AdaFBS output with $M_f$ at the latent feature resolution.
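A percentile-based 1D mask of the kind AdaFBS relies on can be sketched as follows; the exact parameterization is an assumption, but the point is that the kept fraction scales with the axis length rather than using fixed indices:

```python
import numpy as np

def percentile_mask(n, p):
    """1D low-pass mask keeping the lowest p percent of DCT indices along an
    axis of length n -- shape-independent, unlike fixed i+j thresholds."""
    keep = int(round(n * p / 100.0))
    m = np.zeros(n, dtype=np.float32)
    m[:keep] = 1.0
    return m

# The same percentile yields proportionate masks for any latent shape
M_w = percentile_mask(96, 25)  # width axis of a 96-wide latent map
M_h = percentile_mask(64, 25)  # height axis of a 64-high latent map
```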

Pseudocode for AdaFBS, as adapted from (Gao et al., 27 Jan 2026):

```python
def AdaFBS(z_guid, z_samp, M_w, M_h):
    # Substitute the masked band along the width axis first...
    S_w = DCT_1D_width(z_guid)
    S_samp_w = DCT_1D_width(z_samp)
    z_samp = IDCT_1D_width(S_w * M_w + S_samp_w * (1 - M_w))
    # ...then along the height axis, completing the cascaded 1D-DCT substitution
    S_h = DCT_1D_height(z_guid)
    S_samp_h = DCT_1D_height(z_samp)
    return IDCT_1D_height(S_h * M_h + S_samp_h * (1 - M_h))
```

AdaFBS enables robust performance across 512×512, 1024×512, 512×1024, and up to 2048² images (Gao et al., 27 Jan 2026).

4. Advanced Controllability: Continuous Guidance and Functionality Extensions

The framework supports continuous and explicit modulation of:

  • Guidance type: Selecting low, mid, or high bands for appearance, layout, or contour transfer.
  • Guidance intensity: Adjusting bandwidth thresholds (or percentiles) dictating the extent of source-image influence.

For continuous control, the parameters $\lambda$ (duration of FBS), $t_{lp}$/$p_{t_{lp}}$ (low-band thresholds), and analogous high-band settings are varied dynamically. Empirical measurements (Table 2 in (Gao et al., 27 Jan 2026)) show that structure similarity decreases and CLIPSim increases as the low-band width narrows, indicating a smooth transition from reference dominance to text dominance.

FBSDiff++ further extends functionality to:

  • Localized region editing: Through spatial masks, only specific areas in the latent map are substituted, supporting fine-grained regional edits.
  • Style-specific content creation: A Spatial Transformation Pool (STP) inserted into low-band FBS randomizes structure while preserving statistics, supporting non-deterministic, structure-free stylizations.
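The localized editing described above amounts to a masked blend at latent resolution; a minimal sketch, with the function name assumed:

```python
import numpy as np

def localized_blend(z_edit, z_orig, M_f):
    """Restrict an edit to a region: keep the band-substituted features inside
    the spatial mask M_f and the original features elsewhere.
    z_edit, z_orig: (h, w, c) latents; M_f: binary (h, w) mask."""
    return z_edit * M_f[..., None] + z_orig * (1.0 - M_f[..., None])
```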

5. Empirical Evaluation: Qualitative, Quantitative, and User Studies

Comprehensive experiments on the LAION-Mini and LAION-Aesthetics datasets benchmark FBSDiff++ against a range of SOTA text-driven I2I methods—including Null-text Inversion, Pix2Pix-zero, Prompt-Tuning Inversion, StyleDiffusion, InstructPix2Pix, and DesignBooster—using standard metrics:

  • Structure Similarity (StructureSim ↑)
  • Perceptual Distance (LPIPS ↓)
  • Style Loss (AdaIN ↓)
  • Text-Image Alignment (CLIPSim ↑)
  • Aesthetic Score (NIMA/LAION ↑)

FBSDiff and FBSDiff++ place in the top four on all metrics, frequently ranking first or second. For low-frequency substitution (appearance-guided/derivative image generation), FBSDiff++ attains StructureSim = 0.988 (the highest reported), and LPIPS increases only minimally as the low-pass band narrows (Gao et al., 27 Jan 2026, Gao et al., 2024). AdaFBS ablations show that I2I correlation is lost when frequency band substitution is removed.

Scatter plot analyses (CLIPSim vs. StructureSim and LPIPS) illustrate that FBSDiff and FBSDiff++ lie closest to the ideal top-right corner (high similarity to both text and structure). FBSDiff++ maintains robustness with significantly fewer inversion steps ($T=50$ vs. $T_{inv}=1000$) and is resolution-independent.

In a user study involving 70 users across 400 queries, FBSDiff and FBSDiff++ received the highest fraction of “excellent/optimal” subjective ratings on both global and localized I2I tasks (Gao et al., 27 Jan 2026).

6. Implementation, Limitations, and Extensions

Key properties of FBSDiff++:

  • Backbone compatibility: Implemented as a plug-in for Stable Diffusion v1.5 (latent diffusion model, UNet-based) without any modification or retraining of backbone weights.
  • Sampling and inversion settings: $T=50$ for both inversion and sampling; classifier-free guidance scale $\omega=7.5$; FBS applied for the first $0.45T$ steps by default.
  • Model agnosticism: The substitution layer is independent of the UNet's detailed architecture, requiring only access to latent feature maps at each diffusion timestep.
  • Qualitative control: Adjusting FBS type and bandwidth enables practitioners to balance between structure retention and stylization; insertion of STP enables randomized, reference-style image synthesis.
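The settings listed above might be collected in a configuration object such as the following; the field names are illustrative, not the released code's API:

```python
from dataclasses import dataclass

@dataclass
class FBSDiffPPConfig:
    T: int = 50             # DDIM steps, shared by inversion and sampling
    cfg_scale: float = 7.5  # classifier-free guidance scale (omega)
    lam: float = 0.45       # fraction of early steps with FBS active
    band: str = "low"       # guidance type: "low", "mid", or "high"
```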

A notable limitation is the dependence on a reference image with meaningful content in the relevant frequency bands; if the source lacks the necessary structure or style, guidance is limited to what the reference actually contains. Another practical consideration is that, while AdaFBS largely resolves aspect-ratio and scaling issues, the effectiveness of frequency band separation may still vary at extreme or highly divergent resolutions.

7. Relation to Prior and Contemporary I2I Approaches

Prior I2I translation methods for pretrained T2I diffusion models include:

  • Fine-tuning or optimization-based: SINE, Imagic, Prompt-Tuning Inversion, Null-text Inversion, StyleDiffusion, and Pix2Pix-zero, which require model updates, per-step optimization, or explicit reweighting.
  • Direct plug-and-play methods: FBSDiff establishes a distinct family by using frequency-band substitution, achieving equal or superior controllability and quality without fine-tuning.

A plausible implication is that frequency-domain substitution provides a more interpretable and granular means of guiding diffusion sampling than prior approaches based on attention map manipulation or textual prompt engineering. FBSDiff++ generalizes this paradigm to practical, high-throughput, real-world applications by eliminating computational bottlenecks and supporting diverse editing tasks (Gao et al., 27 Jan 2026, Gao et al., 2024).
