Latent and Attention Mixing with Schedulers (LAMS)
- LAMS is a methodology that dynamically interpolates between latent states and attention maps using scheduler-controlled weights.
- It employs time-varying mixing strategies to balance structure preservation and edit flexibility in diffusion-based image editing.
- In Transformer models, LAMS (via Alternating Sparse Attention) reduces memory usage while efficiently integrating local and global dependencies.
Latent and Attention Mixing with Schedulers (LAMS) refers to a class of methods that improve fidelity and controllability in generative models by interpolating, under scheduler control, between multiple internal representations of a neural network, chiefly latent states and attention maps. The approach has proven effective both in diffusion-based image editing, for precise content-preserving manipulations, and in Transformers, for efficient long-context modeling. In this context, LAMS denotes both the explicit mixing techniques developed for real-image editing with diffusion models (Fu et al., 6 Jan 2026) and the scheduler-based alternation of latent and attention branches in sequence models (Hu et al., 2 Nov 2025). Each uses time- or layer-varying mixing coefficients and scheduler functions to modulate information flow.
1. Motivation and Problem Space
Text-to-image (T2I) diffusion models, particularly for real-image editing, face a strict trade-off: edits must be visually aligned with new prompts (editability) while remaining structurally faithful to the source (fidelity). Traditional inversion-based pipelines invert the original image into a latent code (e.g., DDIM inversion) and apply Prompt-to-Prompt (P2P) attention hacks for editing; however, these methods fail to reliably preserve structure, as only the final inversion latent is utilized, discarding intermediate cues crucial for spatial consistency.
In sequence modeling, attention-based Transformers must efficiently capture both local and global dependencies while managing memory and compute. Fixed-pattern sparse attention methods often inadequately propagate long-range dependencies and incur significant key-value (KV) cache requirements. Alternating latent and attention strategies, scheduled across layers, can facilitate broad context integration while economizing cache footprint (Hu et al., 2 Nov 2025).
2. Mathematical Formalism in Diffusion Models
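LAMS in diffusion-based editing introduces a time-varying, dual-branch mixing strategy: it uses the intermediate latents and attention maps collected during inversion and merges them with the corresponding edited states at each denoising step. Writing the inversion-path latents and attention maps as $z_t^{\mathrm{inv}}$ and $A_t^{\mathrm{inv}}$, their edit-path counterparts as $z_t^{\mathrm{edit}}$ and $A_t^{\mathrm{edit}}$, and taking the scheduler weight to multiply the inversion branch, the procedure for $T$ total denoising steps can be summarized as follows:
- Record: $\{z_t^{\mathrm{inv}}\}_{t=0}^{T}$ and $\{A_t^{\mathrm{inv}}\}_{t=1}^{T}$ during inversion.
- For each step $t = T, \dots, 1$ in editing:
  - Attention Mixing: $\tilde{A}_t = w_t^{A}\, A_t^{\mathrm{inv}} + (1 - w_t^{A})\, A_t^{\mathrm{edit}}$
  - Latent Mixing: $\tilde{z}_{t-1} = w_t^{z}\, z_{t-1}^{\mathrm{inv}} + (1 - w_t^{z})\, z_{t-1}^{\mathrm{edit}}$
- Mixing weights $w_t^{A}$, $w_t^{z}$ are generated by independent schedulers, parameterized by start/end values, a decay type, and a decay horizon: $w_t = \mathrm{Sched}(t;\, w_{\mathrm{start}}, w_{\mathrm{end}}, \mathrm{type}, T_{\mathrm{decay}})$, with decay types that include stepped, linear, and logistic.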
This scheduler-guided mixing enables early diffusion steps to closely track the original image's structure, gradually transferring control to the editing trajectory as denoising proceeds (Fu et al., 6 Jan 2026).
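As a minimal sketch of this mixing rule, assume each mix is a convex combination in which the scheduler weight multiplies the inversion branch. The helper name `mix` and the flat-list representation of latents and attention maps are illustrative choices, not taken from the source.

```python
from typing import List, Sequence


def mix(inv: Sequence[float], edit: Sequence[float], w: float) -> List[float]:
    """Element-wise convex combination: w * inv + (1 - w) * edit.

    Assumption: w is the weight on the inversion branch, so w = 1 reproduces
    the inversion state and w = 0 keeps the edited state unchanged.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("mixing weight must lie in [0, 1]")
    if len(inv) != len(edit):
        raise ValueError("inversion and edit states must have the same size")
    return [w * a + (1.0 - w) * b for a, b in zip(inv, edit)]


# Toy states standing in for a flattened latent or attention map.
inv_state = [1.0, 1.0, 1.0]
edit_state = [0.0, 2.0, 4.0]

# A weight that decays over the steps hands control from inversion to edit.
trajectory = [mix(inv_state, edit_state, w) for w in (1.0, 0.5, 0.0)]
```

With a weight of 1 the result equals the inversion state, and with a weight of 0 it equals the edited state; intermediate weights interpolate linearly between the two.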
3. Algorithmic Workflow and Integration
The principal workflow for LAMS in diffusion editing is as follows:
- Inversion: Invert the source image along the DDIM inversion path with the source prompt, collecting the intermediate latents and attention maps.
- Scheduler Construction: Compute the attention and latent mixing weights for every step from the defined scheduler functions.
- Edit Loop: For each denoising step:
  - Compute or reconstruct both the original and the edit attention.
  - Mix the edit and inversion attentions via the scheduler weights and use the mixed attention in Prompt-to-Prompt editing.
  - Denoise the latent with the mixed-attention-modulated U-Net.
  - Mix the output latent with the inversion latent under scheduler control.
- Masking (Optional): For region-specific edits, apply a spatial mask on the mixing step.
- Style Transfer (Optional with LoRA): Load LoRA weights after inversion, then proceed with the mixing edit loop unmodified.
The inferred effect is a tunable balance between structure preservation and edit scope, adjustable through scheduler hyperparameters.
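The workflow above can be sketched in plain Python under several assumptions. The callables `edit_attention` and `denoise_step` are hypothetical stand-ins for the Prompt-to-Prompt attention computation and the U-Net denoising step, not the API of any library. Latents and attention maps are flat lists. The masking rule (scheduler-mixed latent inside the mask, inversion latent outside) is one plausible reading, since the source does not spell it out here.

```python
from typing import Callable, List, Optional, Sequence

Vec = List[float]


def blend(inv: Sequence[float], edit: Sequence[float], w: float) -> Vec:
    """w * inv + (1 - w) * edit; w is assumed to weight the inversion branch."""
    return [w * a + (1.0 - w) * b for a, b in zip(inv, edit)]


def lams_edit_loop(
    inv_latents: Sequence[Vec],   # T + 1 latents, noisiest first
    inv_attn: Sequence[Vec],      # T attention maps, aligned with the steps
    w_attn: Sequence[float],      # per-step attention mixing weights
    w_latent: Sequence[float],    # per-step latent mixing weights
    edit_attention: Callable[[Vec, int], Vec],   # hypothetical stand-in
    denoise_step: Callable[[Vec, Vec, int], Vec],  # hypothetical stand-in
    mask: Optional[Sequence[float]] = None,      # 1 = editable region (assumed)
) -> Vec:
    z = list(inv_latents[0])  # start from the final inversion latent
    for i in range(len(inv_attn)):
        a_edit = edit_attention(z, i)
        a_mix = blend(inv_attn[i], a_edit, w_attn[i])       # attention mixing
        z_edit = denoise_step(z, a_mix, i)                  # denoise with mixed attention
        z_mix = blend(inv_latents[i + 1], z_edit, w_latent[i])  # latent mixing
        if mask is not None:
            # Assumed rule: outside the mask, fall back to the inversion latent.
            z_mix = [m * e + (1.0 - m) * s
                     for m, e, s in zip(mask, z_mix, inv_latents[i + 1])]
        z = z_mix
    return z


# Toy stand-ins so the loop can be run end to end.
def toy_attention(z: Vec, step: int) -> Vec:
    return [1.0] * len(z)


def toy_denoise(z: Vec, attn: Vec, step: int) -> Vec:
    return [0.5 * x + a for x, a in zip(z, attn)]


latents = [[4.0, 4.0], [2.0, 2.0], [1.0, 1.0]]
attn = [[0.0, 0.0], [0.0, 0.0]]

reconstruction = lams_edit_loop(latents, attn, [1.0, 1.0], [1.0, 1.0],
                                toy_attention, toy_denoise)
free_edit = lams_edit_loop(latents, attn, [0.0, 0.0], [0.0, 0.0],
                           toy_attention, toy_denoise)
masked_edit = lams_edit_loop(latents, attn, [0.0, 0.0], [0.0, 0.0],
                             toy_attention, toy_denoise, mask=[1.0, 0.0])
```

In this toy setting, weights fixed at 1 replay the inversion path exactly, weights fixed at 0 follow the unconstrained edit trajectory, and the mask confines the edit to the first element while the second tracks the inversion latents.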
4. Scheduler Design and Hyperparameters
Schedulers are critical in controlling the per-step influence of inversion versus edit-guided latents and attention. Each scheduler is defined via starting and ending weights, a duration over which the decay occurs, and a schedule type:
| Parameter | Typical Range | Function |
|---|---|---|
| Start weight | [0, 1] | Initial mixing weight at the start of the schedule |
| End weight | [0, 1] | Final mixing weight once the decay completes |
| Decay horizon | | Step at which decay completes |
| Decay type | {stepped, linear, ...} | Controls decay curve (linear/logistic, etc.) |
Separate defaults are defined for the attention scheduler and for the latent scheduler. These defaults suggest that early denoising is dominated by the inversion signal, which quickly yields to the edited trajectory for maximal edit flexibility at lower noise levels.
5. Broader Applications: Transformers with Alternating Sparse Attention
A conceptually related form of LAMS emerges in sequence modeling, specifically Alternating Sparse Attention (ASA) in Transformers (Hu et al., 2 Nov 2025). Here, layer-wise schedulers alternate between local (sliding-window + Multi-head Latent Attention) and global (compression/selective + Group-head Latent Attention) processing. Each even/odd Transformer block specializes in either local or global context, facilitating comprehensive dependency modeling and halving KV-cache requirements.
Key architectural choices:
- Odd layers: MLA-enhanced local window attention
- Even layers: GLA-enhanced global branches
- Only the relevant branch’s KV states are cached, yielding a 50% memory reduction without accuracy loss
Empirical benchmarks on LLaMA-style models demonstrate that this scheduler-driven alternation matches or exceeds dense/full or static sparse attention on common-sense reasoning and long-context understanding tasks, while improving resource efficiency.
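The layer-wise alternation and its effect on cache size can be illustrated with a small accounting sketch. It assumes 1-indexed layers, an equal per-token cache cost for the local and global branches, and a baseline that caches both branches in every layer. The function names and the abstract "cache units" are hypothetical, and real savings depend on how each latent-attention variant compresses its keys and values.

```python
from typing import List


def branch_for_layer(layer: int) -> str:
    """Odd layers run the local branch, even layers the global branch."""
    if layer < 1:
        raise ValueError("layers are 1-indexed")
    return "local" if layer % 2 == 1 else "global"


def layer_schedule(n_layers: int) -> List[str]:
    return [branch_for_layer(layer) for layer in range(1, n_layers + 1)]


def kv_cache_units(n_layers: int, seq_len: int, units_per_token: int = 1,
                   alternating: bool = True) -> int:
    """Abstract cache size: one unit per token, per cached branch, per layer.

    Assumption: without alternation, every layer keeps KV states for both
    branches; with alternation, only the scheduled branch is cached.
    """
    total = 0
    for layer in range(1, n_layers + 1):
        branches = [branch_for_layer(layer)] if alternating else ["local", "global"]
        total += len(branches) * seq_len * units_per_token
    return total


schedule = layer_schedule(4)
both_branches = kv_cache_units(4, seq_len=100, alternating=False)
alternating = kv_cache_units(4, seq_len=100, alternating=True)
savings = 1.0 - alternating / both_branches
```

Under these simplifying assumptions the alternating layout stores half as many units as the two-branch baseline, which is the sense in which caching only the relevant branch halves the footprint.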
6. Empirical Results and Ablation Analyses
In image editing (Fu et al., 6 Jan 2026):
- Models/Datasets: Evaluated on Stable Diffusion v1.5, Anything V4, using 100 COCO2017 images and paired prompts.
- Metrics: LPIPS (fidelity), CLIP Score (edit alignment), FID (realism).
- Results: LAMS-Edit with mask outperforms prior tuning-free methods (DiffEdit, Pix2Pix-Zero, Null-Text Inv + P2P, LEDITS++, PnP, PnPInv), achieving 10–20% improvements in FID and LPIPS at matched CLIP scores.
- Qualitative Performance: Edits exhibit minimal structural artifacts, with object insertions/removals and attribute changes achieved with high semantic accuracy; style transfer achieves the top user-preference scores.
- Ablations: Attention mixing alone yields coarse layout preservation; latent mixing alone restricts edits but maintains pixel-level details; their combination under scheduler control (full LAMS) achieves optimal fidelity/edit balance.
In Transformer models (Hu et al., 2 Nov 2025):
- Tasks: Common-sense reasoning, needle-in-a-haystack retrieval, long-context QA.
- Key Results: ASA outperforms NSA/GQA baselines, improves retrieval by 40 percentage points on S-NIAH-3, and reduces perplexity on long-context tasks.
7. Summary and Implications
Latent and Attention Mixing with Schedulers (LAMS) establishes a principled framework for step- or layer-wise fusion of structural and semantic cues in neural generative models. Through explicit scheduler-based mixing, it enables a tunable balance between preserving core content and enabling flexible, targeted edits—outperforming static or naïve interpolation schemes. Its adoption in both image diffusion and sequence Transformer architectures highlights the generality of scheduler-driven mixing, offering substantial empirical gains without increasing storage or computational complexity (Fu et al., 6 Jan 2026, Hu et al., 2 Nov 2025).