
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Published 27 Feb 2026 in cs.CV and cs.LG | (2602.24208v1)

Abstract: Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.

Summary

  • The paper introduces a principled caching strategy based on the denoiser’s local sensitivity, enabling dynamic and sample-specific cache reuse.
  • It leverages Jacobian norms with respect to the noisy latent and the timestep to trigger caching, thereby substantially reducing the number of expensive forward passes.
  • Empirical results show that SenCache achieves significant reductions in NFE and improves visual quality compared to traditional heuristic-based approaches.

Sensitivity-Aware Caching for Efficient Diffusion Model Inference

Overview

The paper "SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching" (2602.24208) presents a theoretically grounded framework for cache-based acceleration of diffusion model inference. Standard diffusion generators, especially video diffusion transformers, require numerous sequential denoising steps, each involving costly forward passes through large neural networks. Existing caching approaches select cache/reuse timesteps using empirical heuristics, but these lack theoretical justification, require extensive hyperparameter tuning, and are statically applied to all samples. SenCache introduces a principled, dynamic policy: caching is triggered based on the denoiser's local sensitivity to input perturbations, offering stronger guarantees and improved fidelity across a range of video diffusion models.

Sensitivity as the Foundation for Caching Decisions

SenCache's central premise is that the denoiser's local sensitivity, characterized by the Jacobian norms with respect to the noisy latent $\mathbf{x}_t$ and the timestep $t$, acts as a proxy for estimating the output change between denoising steps. The framework formalizes this as a sensitivity score:

$$S_t = \|J_x\| \, \|\Delta \mathbf{x}_t\| + \|J_t\| \, |\Delta t|,$$

where $J_x$ and $J_t$ are the Jacobians of the network output with respect to the latent and the timestep, and $\Delta \mathbf{x}_t$ and $\Delta t$ are the corresponding perturbations between steps. Cache reuse occurs whenever $S_t$ falls below a user-defined tolerance $\varepsilon$, so cache decisions adapt dynamically to each sample according to the network's local stability.
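Concretely, the per-step reuse rule can be sketched in a few lines. This is an illustrative sketch only: the function name, argument names, and the use of precomputed scalar sensitivities `alpha_x`, `alpha_t` in place of exact Jacobian norms are our assumptions, not the paper's code.

```python
import numpy as np

def should_reuse_cache(alpha_x, alpha_t, delta_x, delta_t, eps):
    """Return True if the cached denoiser output can be reused.

    alpha_x, alpha_t: precomputed sensitivity scalars standing in for
        the Jacobian norms w.r.t. the latent and the timestep.
    delta_x: change in the noisy latent since the last fresh evaluation.
    delta_t: change in the timestep since the last fresh evaluation.
    eps: user-defined caching tolerance.
    """
    # First-order score: S_t = ||J_x|| * ||dx|| + ||J_t|| * |dt|
    s_t = alpha_x * np.linalg.norm(delta_x) + alpha_t * abs(delta_t)
    return bool(s_t <= eps)
```

Lower `eps` makes reuse more conservative (fewer cache hits, higher fidelity); higher `eps` trades quality for speed.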

This avoids the pitfalls of methods that rely on single-term proxies (e.g., magnitude or time-embedding differences), which can fail when unmodeled sensitivity components are substantial. The approach is agnostic to model architecture and sampling schedule, and can be calibrated using a small, diverse video set.

Figure 1: SenCache utilizes sensitivity scores as the criterion for cache-based inference acceleration in diffusion models.

Empirical Analysis of Sensitivity Dynamics

To validate the importance of both latent and timestep sensitivities, the authors perform extensive sensitivity analyses across SiT-XL/2 checkpoints, which show substantial and varying Jacobian norms for both $\mathbf{x}_t$ and $t$ along the denoising trajectory. This substantiates the need for a joint criterion in cache decisions.


Figure 2: Left: Norm of the Jacobian w.r.t. the noisy latent. Right: Norm w.r.t. the timestep—both contribute significantly to output variation.

Calibration with only 8 diverse samples yields sensitivity estimates closely matching those obtained from a much larger dataset (4096 videos), highlighting the practical efficiency and stability of sensitivity-based caching.

Figure 3: Sensitivity profiles from calibration sets of 8 and 4096 videos; small sets suffice for accurate estimates.
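Since exact Jacobians are expensive, the paper approximates sensitivities with directional finite-difference (secant) estimates during calibration. A minimal sketch of that idea follows, assuming a callable `model(x, t)` that returns the denoiser output; the interface and names are illustrative, not the authors' API.

```python
import numpy as np

def estimate_sensitivities(model, x_t, t, sigma=1e-3, dt=1e-3, rng=None):
    """Directional finite-difference (secant) sensitivity estimates.

    `model(x, t)` is assumed to return the denoiser output as an array.
    Returns (alpha_x, alpha_t), secant proxies for the Jacobian norms
    w.r.t. the latent and the timestep at this point of the trajectory.
    """
    rng = rng or np.random.default_rng(0)
    out = model(x_t, t)
    # Secant estimate of ||J_x||: perturb the latent along a random
    # direction of norm sigma and measure the output change.
    v = rng.standard_normal(x_t.shape)
    v *= sigma / np.linalg.norm(v)
    alpha_x = np.linalg.norm(model(x_t + v, t) - out) / sigma
    # Secant estimate of ||J_t||: nudge the timestep by a small dt.
    alpha_t = np.linalg.norm(model(x_t, t + dt) - out) / dt
    return alpha_x, alpha_t
```

In the paper these estimates are averaged over a small calibration set once per model and then looked up at inference time, so calibration cost is amortized.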

Quantitative Results and Comparative Evaluation

SenCache is evaluated on Wan 2.1, CogVideoX, and LTX-Video—state-of-the-art video diffusion models—against recent caching baselines (TeaCache and MagCache). Both efficiency (NFE, Cache Ratio) and visual quality (LPIPS, SSIM, PSNR) metrics are reported. Under identical compute budgets, SenCache achieves consistently superior visual quality, especially in aggressive caching regimes, where the gap in LPIPS and SSIM widens compared to MagCache and TeaCache.

The sensitivity-aware approach also brings interpretability: prior heuristics are shown to be special cases that can either over-cache challenging samples or under-cache easy ones due to static, monolithic triggers. In contrast, SenCache’s dynamic adaptation per sample leverages actual input-output responsiveness, minimizing both over-caching and cumulative approximation error.

Figure 4: Mean absolute error between denoiser outputs at consecutive timesteps, illuminating model-specific sensitivity differences.

Ablation and Cross-Model Insights

Parameter ablations reveal sharp accuracy-efficiency trade-offs: increasing the cache lifetime $n$ and the error threshold $\varepsilon$ reduces NFE, but over-extension degrades quality because the local approximations become less accurate. Small calibration sets suffice for robust sensitivity estimation, and the sensitivity profile varies significantly across models and timesteps, suggesting that dynamic scheduling of $\varepsilon$ and the cache lifetime could further optimize the speed-quality trade-off.

Additional efficiency metrics indicate substantial wall-clock speedups and compute reductions compared to prior baselines. Visualizations show that in critical denoising regimes, certain models exhibit markedly higher sensitivity, necessitating higher tolerance for comparable NFE reductions but potentially incurring more visual artifacts.

Practical and Theoretical Implications

SenCache’s framework enhances practical deployability of diffusion generators without retraining or architecture modification, making it suitable for latency-critical applications and modalities beyond video (e.g., audio, motion). Theoretically, it provides a foundation for adaptive caching policies grounded in the model's local Lipschitz behavior, bridging the gap between empirical heuristics and robust decision-making. The approach is extensible: higher-order sensitivity estimators or learned surrogates could further refine accuracy, and per-timestep or globally optimized tolerances can be integrated for even finer control.

Conclusion

SenCache establishes a principled mechanism for cache-based diffusion model acceleration, leveraging sensitivity both to the noisy latent and timestep as the criterion for safe reuse. This results in dynamic, sample-specific caching schedules that robustly minimize approximation error, directly improving the speed-quality envelope for generative architectures. By unifying and generalizing prior approaches, the framework sets a foundation for future research on adaptive inference schedules, model-agnostic acceleration, and cross-modal diffusion system optimizations.


Explain it Like I'm 14

Overview: What this paper is about

This paper is about making video-generating AI models (called diffusion models) faster without hurting how good the videos look. These models usually need to do the same kind of heavy calculation many times in a row to create one video, which takes a long time. The authors propose a new way, called SenCache, to skip some of those calculations by smartly “reusing” earlier results when it’s safe.

What questions the paper tries to answer

  • Can we speed up video diffusion models by reusing past results without retraining the model?
  • How can we decide, in a principled way (not just guesswork), when it’s safe to reuse old results and when we must compute fresh ones?
  • Can this decision be different for each video, depending on how hard it is, instead of using a one-size-fits-all rule?

How the method works (in everyday terms)

Imagine you’re drawing a picture step by step. If the next step looks almost the same as the current one, you don’t need to redraw everything—you can copy what you already have. But you need a reliable way to tell whether the next step will look “almost the same.”

That’s what SenCache does using the idea of “sensitivity”:

  • Sensitivity means “how much the model’s output changes when its input changes a little.” Think of it like a road: if the road is smooth (low sensitivity), small steering changes won’t move you much—you can stay on cruise control. If the road is bumpy (high sensitivity), tiny moves cause big changes—you need to drive carefully.
  • In diffusion models, two things change at every step:
    • The noisy picture inside the model (the “latent”).
    • The position in the denoising process (the “timestep,” like which step number you’re on).
  • SenCache measures how sensitive the model is to both:
    • Changes in the current noisy picture.
    • Changes in the timestep.
  • It combines these into a single score that predicts how much the output will change in the next step. If the score is small (below a chosen tolerance), SenCache reuses the last result (a “cache hit”). If the score is big, it computes a fresh result.

How do they get the sensitivity numbers?

  • They “poke” the model slightly during a short setup phase: they make tiny changes to the input or timestep and see how much the output moves. This is like gently tapping something to see how wobbly it is.
  • They store these sensitivity estimates so they can be quickly looked up during video generation.
  • They also limit how many times in a row they reuse results (a safety cap) so small errors don’t build up.
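Putting these pieces together, the generation loop with sensitivity checks and a reuse cap might look roughly like this. It is a simplified Euler-style sketch: the function names, the `sens(t)` lookup, and the update rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_with_cache(model, x, timesteps, sens, eps=0.05, max_reuse=3):
    """Euler-style denoising loop with sensitivity-aware cache reuse.

    `model(x, t)` -> denoiser (velocity) output; `sens(t)` -> the
    precomputed (alpha_x, alpha_t) sensitivities for step t.
    Returns the final sample and the number of fresh forward passes.
    """
    cached = None
    last_x = last_t = None
    streak = 0          # consecutive cache reuses since last refresh
    n_fresh = 0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        refresh = cached is None or streak >= max_reuse
        if not refresh:
            ax, at = sens(t)
            # Predicted output change since the last fresh evaluation.
            score = ax * np.linalg.norm(x - last_x) + at * abs(t - last_t)
            refresh = score > eps
        if refresh:
            cached = model(x, t)          # expensive forward pass
            last_x, last_t = x, t
            streak = 0
            n_fresh += 1
        else:
            streak += 1                   # cache hit: skip the network
        x = x + (t_next - t) * cached     # Euler step with (re)used output
    return x, n_fresh
```

With the cap `max_reuse=3`, even a score of zero forces a fresh forward pass every few steps, which is the "safety cap" described above.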

Why this is different from earlier methods:

  • Earlier methods used rough rules (heuristics), like “if the time difference is small, reuse,” or “if the model’s residual is small, reuse.” These only looked at one side of the problem.
  • SenCache looks at both sides (latent and timestep) and makes a per-video, per-step decision based on a clear, math-based reason.

Main findings and why they matter

The authors tested SenCache on three strong video diffusion systems (Wan 2.1, CogVideoX, and LTX-Video). Here’s what they found:

  • It keeps video quality higher at the same speed compared to previous caching methods, or matches the speed with better quality.
  • It adapts to each video: easy parts get more reuse (faster), hard parts get fresh calculations (safer).
  • It needs no retraining and no changes to the model’s structure.
  • A small calibration set (as little as 8 videos) is enough to estimate sensitivities well.
  • There’s a simple “knob” (the tolerance) to trade speed for quality: higher tolerance → faster but riskier; lower tolerance → safer but slower.
  • There’s also a cap on consecutive reuses: too many reuses in a row can hurt quality, so they stop after a few and refresh.

These results matter because they reduce the time and compute needed to generate videos—often the biggest roadblock to using these models in real applications—while keeping the results looking good.

What this could lead to next

  • Faster, cheaper video generation in real-world tools (like video editing, animation, and content creation).
  • A general idea (sensitivity-aware caching) that could be used beyond videos, such as for audio or motion generation.
  • Future improvements could use smarter ways to estimate sensitivity or adjust the tolerance differently at different stages of generation for even better speed-quality balance.

In short

SenCache is like a smart “reuse switch” that checks how sensitive the model is before deciding whether to skip work. By paying attention to both the noisy picture and the timestep, it makes safer, smarter reuse decisions. The result: faster generation with better-looking videos, and no need to retrain the model.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of concrete gaps and unresolved questions that future work could address:

  • Lack of global error guarantees
    • No theoretical link between the per-step sensitivity threshold $S_t \le \varepsilon$ and the cumulative deviation of the final sample (or trajectory) from the full-evaluation baseline.
    • Absent analysis of how errors compound across cached runs and different ODE/SDE solvers; no guarantees on bounding global error for a given $(\varepsilon, n)$.
  • First-order approximation limitations
    • The method relies on a first-order (linear) sensitivity bound with unmodeled $\mathcal{O}(\|\Delta\mathbf{x}_t\|^2 + |\Delta t|^2)$ terms; no treatment of curvature, cross-terms, or solver-induced nonlinearity.
    • No exploration of practical second-order surrogates (e.g., Hessian-vector products, curvature proxies) that could reduce mispredictions in highly nonlinear regions.
  • Sensitivity calibration is coarse and static
    • Sensitivity is precomputed on a small calibration set (8 videos) and used as global per-$t$ scalars $(\alpha_x, \alpha_t)$; there is no per-sample or per-trajectory online adaptation of sensitivities.
    • Unclear robustness under domain shift between the calibration set and target prompts/scenes; no statistical analysis of variance, confidence intervals, or coverage for sensitivity estimates.
    • Open question: can lightweight online estimators (e.g., using internal activations, gradient-free proxies) refine $(\alpha_x, \alpha_t)$ per sample without extra forward passes?
  • Choice of norms and perceptual alignment
    • The sensitivity score uses unweighted $L_2$ norms of denoiser outputs; no justification that this aligns with perceptual degradation or temporal artifacts.
    • Unexplored alternatives: feature-space norms, perceptual weighting, spatial/temporal attention weighting, or per-channel scaling that could better predict perceptual error.
  • Mapping from sensitivity to perceptual quality
    • No quantitative analysis of correlation between $S_t$ and actual frame-level or video-level perceptual error (e.g., R², calibration curves).
    • Lack of a principled method to set $\varepsilon$ to meet a target quality metric (e.g., FVD, VBench scores) or a user-specified quality budget.
  • Limited evaluation metrics for video
    • Reported metrics (LPIPS/PSNR/SSIM) are frame-wise similarity to a reference; no assessment of temporal metrics (e.g., FVD, VBench temporal coherence, flicker/stability measures) or human preference studies.
    • No analysis of how caching affects motion smoothness, scene dynamics, or long-range temporal consistency.
  • Interaction with guidance and conditioning
    • Classifier-free guidance (CFG) scale, negative prompts, and conditional-unconditional branch interactions are not studied; these can change sensitivity and error accumulation.
    • Open: adapt $\varepsilon$ and/or $(\alpha_x, \alpha_t)$ based on CFG scale, prompt complexity, or conditioning strength.
  • Sampler and schedule generality
    • Limited exploration across samplers (Euler, Heun, DPM-Solver++, iPNDM) and noise/timestep schedules (e.g., Karras, EDM); sensitivity and caching safety may vary significantly.
    • No results for stochastic samplers (SDE/ancestral) where noise injection may invalidate reuse assumptions.
  • Hyperparameter scheduling is heuristic
    • The more conservative early-step threshold (e.g., 1% error for the first 20% of steps) is hand-set; there is no systematic or learned schedule of $\varepsilon(t)$.
    • Open: design and validate dynamic, solver-aware or content-aware schedules that optimize the speed–quality trade-off.
  • Memory footprint and systems implications
    • Storing full denoiser outputs for high-resolution, multi-frame latents can be memory intensive; memory–latency trade-offs and cache eviction policies are not quantified.
    • No exploration of output compression, mixed-precision storage, tiling, or partial-output caching to reduce memory pressure.
  • Runtime overhead and end-to-end speed
    • While NFE and cache ratios are reported, wall-clock speedups, GPU utilization, and memory bandwidth impacts are not measured; real-world latency gains are unclear.
  • Extreme operating regimes
    • Behavior under very low-step generation (e.g., ≤10 NFEs) is unexplored; caching may interact differently with aggressive solvers.
    • Long videos (hundreds/thousands of frames) and high-motion/fast-action scenarios are not stress-tested for drift and temporal artifacts.
  • Applicability beyond tested models and modalities
    • Claims of architecture/sampler/modalities agnosticism are not empirically validated for images, audio, motion, or multimodal diffusion; transferability remains open.
    • Sensitivity profiles may depend on latent dimensionality, resolution, or VAE configuration; robustness to such changes is untested.
  • Fairness and tuning of baselines
    • Baselines use “official hyperparameters”; it is unclear whether methods were re-tuned for each model/regime, potentially affecting comparative conclusions.
  • Failure mode analysis
    • No qualitative taxonomy of cases where caching introduces artifacts (e.g., sudden camera motion, complex texture emergence); diagnostic tools and automatic fail-safes are absent.
  • Combination with other accelerations
    • Interaction with quantization, pruning, distillation (LCM/consistency), and intra-timestep caching (e.g., FasterCache) is unstudied; potential synergies or conflicts are unknown.
  • Theoretical characterization of stability regions
    • No formal identification of regions in $(\mathbf{x}_t, t)$ space where caching is provably safe under specific Lipschitz or smoothness assumptions of the denoiser.
    • Open: derive solver-aware conditions under which reuse preserves contractivity/stability, and connect to Parseval/robustness constraints.
  • Calibration cost and reproducibility
    • The one-time sensitivity precomputation cost and its dependence on resolution/model size are not reported.
    • Reproducibility details (seeds, solver configs, guidance scales) for all reported metrics are incomplete, making it hard to replicate sensitivity curves or cache schedules.

Practical Applications

Below are practical, real-world applications enabled by SenCache’s sensitivity‑aware caching for diffusion model inference. Each item includes sectors, actionable workflows/products, and key assumptions or dependencies that affect feasibility.

Immediate Applications

  • Media & Entertainment (VFX, animation, post-production)
    • Actionable use cases:
    • Faster previsualization/iteration of text-to-video (T2V) scenes and storyboards.
    • “Preview vs. Final” workflows: permissive ε for creative exploration; strict ε for final renders.
    • Tools/products/workflows:
    • Plugin for DCC tools (e.g., Unreal/Unity/Blender) that exposes a “Speed–Quality” dial (ε, n).
    • Pipeline module for studio render farms to boost throughput per GPU-hour.
    • Assumptions/dependencies:
    • Access to per-step denoiser outputs and sampler timesteps.
    • Per-model calibration of sensitivity profiles (often feasible with ~8 videos).
    • Early steps need stricter ε; set n to limit drift.
  • Advertising, Marketing, and Creative Automation
    • Actionable use cases:
    • Rapid A/B generation of video creatives under fixed compute budgets.
    • Budget-aware batch generation (auto-tune ε to hit NFE or latency SLAs).
    • Tools/products/workflows:
    • “Sensitivity‑aware scheduler” in cloud T2V services (Kubernetes operators that set ε per job).
    • API parameterization for customers: “good/fast” presets mapped to ε and n.
    • Assumptions/dependencies:
    • Sampler/model compatibility (e.g., DPM-Solver settings).
    • Robust QA for brand safety—use conservative ε for critical assets.
  • Social Media and Creator Apps
    • Actionable use cases:
    • Near‑real‑time short video generation or filter effects on edge/consumer devices.
    • Faster “draft” mode for mobile editing apps; “export” with stricter settings.
    • Tools/products/workflows:
    • Mobile inference SDKs integrating SenCache with quantization/pruning.
    • On-device caching with memory caps (bounded by n and cache size).
    • Assumptions/dependencies:
    • Memory footprint for caching is acceptable on device.
    • Calibrations may need to be hardware-specific and resolution-specific.
  • Cloud Inference Platforms and MLOps (Software/Cloud)
    • Actionable use cases:
    • Increase model‑as‑a‑service throughput without retraining.
    • Dynamic QoS: adjust ε at runtime to meet latency targets when clusters are saturated.
    • Tools/products/workflows:
    • Hugging Face Diffusers or similar integration: “SenCache‑aware scheduler.”
    • Sensitivity‑profile registry per model/resolution/sampler and a calibration CLI.
    • Observability panels: S_t traces, cache-hit ratios, quality metrics vs. ε.
    • Assumptions/dependencies:
    • Need to instrument inference loops (black-box APIs may not permit caching).
    • CFG/conditional branches: ensure reuse policy is consistent within each guidance step.
  • Research and Academia
    • Actionable use cases:
    • Run larger evaluation suites (VBench, compositional benches) with lower cost.
    • Faster prototyping of samplers/schedules; ablate ε and n to quantify error budgets.
    • Tools/products/workflows:
    • Open-source codebase (GitHub) for quick drop-in to DiT-based video models.
    • Jupyter utilities to compute and visualize sensitivity profiles.
    • Assumptions/dependencies:
    • Stable sensitivity statistics across small calibration sets (shown to hold with 8 videos).
    • Must log and monitor LPIPS/PSNR/SSIM to avoid silent drifts.
  • Sustainability and FinOps (Energy sector within IT)
    • Actionable use cases:
    • Reduce NFEs to cut GPU time and energy, lowering carbon footprint and cost.
    • Tools/products/workflows:
    • “Green budget” mode: enforce ε ceilings tied to emissions/compute budgets.
    • Reporting dashboards linking cache ratio and NFE to energy consumption.
    • Assumptions/dependencies:
    • Energy models calibrated for specific hardware (A100/A800/H100, etc.).
    • Quality thresholds aligned with product QA requirements.

Long-Term Applications

  • Audio, Speech, and Music Generation (Cross‑modal diffusion)
    • Potential use cases:
    • Lower‑latency TTS, voice cloning, and music generation by sensitivity‑aware reuse in audio diffusion.
    • Tools/products/workflows:
    • Audio‑domain sensitivity calibrators; schedulers adapted to audio timesteps.
    • Assumptions/dependencies:
    • Requires extension and validation on audio architectures; sensitivity behavior may differ.
  • Robotics and Control (Diffusion Policies)
    • Potential use cases:
    • Reduce latency in diffusion‑based action planners by reusing outputs when state/time sensitivity is low.
    • Tools/products/workflows:
    • Real-time inference engines with safety caps on ε and n; online monitoring of S_t.
    • Assumptions/dependencies:
    • Strict safety constraints; worst‑case guarantees needed; rigorous evaluation under distribution shift.
  • Medical Imaging and Scientific Computing
    • Potential use cases:
    • Accelerate diffusion‑based reconstruction/denoising (e.g., MRI/CT) where iterative inference is used.
    • Tools/products/workflows:
    • Sensitivity‑aware reconstruction pipelines with clinically constrained error budgets.
    • Assumptions/dependencies:
    • Regulated domains demand extensive validation; conservative ε; per‑scanner/per‑protocol calibration.
  • AR/VR and Interactive Media
    • Potential use cases:
    • On-device or edge‑assisted generative video backdrops and scene synthesis with adaptive QoS.
    • Tools/products/workflows:
    • Dynamic ε schedules tied to frame budget and motion; “graceful degradation” in live sessions.
    • Assumptions/dependencies:
    • Tight latency bounds; memory and thermals on head‑mounted devices; robust user‑perceived quality.
  • Compiler/Runtime and Hardware Co‑Design
    • Potential use cases:
    • Sensitivity‑aware passes in compilers (TensorRT/TVM/ONNX Runtime) that decide cache/skips.
    • Accelerator support for fast sensitivity estimation and cache management.
    • Tools/products/workflows:
    • Runtime modules that maintain S_t and trigger reuse at kernel granularity.
    • Assumptions/dependencies:
    • Requires vendor collaboration; standardizing model interfaces for per‑step introspection.
  • Training‑Time Synergy: Cache‑Friendly Models
    • Potential use cases:
    • Regularize Jacobian norms during training to enlarge “safe‑to‑reuse” regions; co‑design samplers with caching in mind.
    • Tools/products/workflows:
    • Loss terms or curriculum that promote smoother denoisers; learned ε(t) schedules.
    • Assumptions/dependencies:
    • Additional training cost and data; must ensure no quality regression.
  • Standardization and Policy
    • Potential use cases:
    • Establish “error budget” reporting (ε, n, NFE, cache ratio) as part of model cards and green AI disclosures.
    • Tools/products/workflows:
    • Benchmarks linking speedups to perceptual metrics; procurement guidelines for energy‑aware generation.
    • Assumptions/dependencies:
    • Consensus on metrics and acceptable perceptual deltas; cross‑vendor comparability.
  • Synthetic Data at Scale (Autonomy, Vision, Simulation)
    • Potential use cases:
    • Cheaper generation of large video datasets for training and evaluation.
    • Tools/products/workflows:
    • Data factories with adaptive ε to meet dataset quality specs at minimal cost.
    • Assumptions/dependencies:
    • Task‑specific tolerance to visual artifacts; automated quality gating.

Notes on feasibility across applications:

  • Works best when the model exhibits stable local sensitivity; some models (e.g., Wan 2.1) tolerate reuse better than others (e.g., CogVideoX and LTX-Video need larger ε to match the same NFE reductions).
  • Requires control over the inference loop (stepwise access to x_t and t); may not be possible with opaque third‑party APIs.
  • Parameter tuning (ε for tolerance, n for max consecutive reuse) is essential; early timesteps usually need stricter thresholds.
  • Memory overhead for storing cached denoiser outputs must be budgeted, especially on edge devices.
  • Sensitivity profiles are model-, resolution-, and sampler‑specific; maintain a registry and re‑calibrate when these change.

Glossary

  • attention/MLP outputs: Intermediate representations from attention and multilayer perceptron sublayers in transformers that can be reused to save computation. "FORA reuses intermediate attention/MLP outputs across steps without retraining~\citep{Selvaraju2024FORA}."
  • cache lifetime parameter: A limit on how many consecutive steps a cached value may be reused before refreshing to prevent drift. "We first ablate the cache lifetime parameter $n$, which limits the maximum number of consecutive cache reuses before a refresh."
  • cache ratio: The percentage of denoising steps that use a cached output instead of a fresh network evaluation. "We also report NFE (number of function evaluations) and Cache Ratio (percentage of denoising steps retrieved from cache) to assess computational efficiency."
  • calibration set: A small dataset used to precompute or estimate model statistics (e.g., sensitivities) prior to inference. "These sensitivity values are computed once per model on a small calibration set and cached for use during inference."
  • classifier-free guidance: A technique that combines unconditional and conditional model predictions to steer generation without an explicit classifier. "FasterCache further shows strong redundancy between conditional and unconditional branches in classifier-free guidance and reuses them efficiently within a timestep~\citep{Lv2024FasterCache}."
  • conditional target velocity: The expected instantaneous change of the latent under the generative dynamics given current state, used in flow/ODE formulations. "For the interpolation above, the conditional target velocity is"
  • denoiser: The network component in diffusion models that predicts noise or velocity to iteratively denoise the latent. "caching reduces computation by reusing previously computed denoiser outputs across timesteps."
  • denoising steps: The sequential iterations that progressively remove noise from the latent to synthesize a sample. "Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps."
  • diffusion models: Generative models that transform noise into data through a stochastic or ODE-based denoising process. "Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps."
  • Diffusion Transformers (DiTs): Transformer architectures tailored for diffusion-based generation, often used for images and videos. "This motivated the introduction of Diffusion Transformers (DiTs)~\citep{peebles2023dit}, which now form the backbone of many state of the art text-to-video generators."
  • DPM-Solver: A specialized fast ODE solver designed for efficient diffusion model sampling. "In practice, ODE solvers (e.g., Euler or diffusion ODE solvers such as DPM-Solver \cite{lu2022dpm}) evaluate the learned field $\mathbf{v}_\theta(\mathbf{x}_t, t)$ repeatedly across timesteps"
  • error budget: An allocated tolerance for approximation error per step that guides how aggressively to cache or skip computations. "as the sensitivity threshold $\varepsilon$ maps directly to an error budget at each denoising step"
  • expert transformer modules: Specialized transformer components that can be routed or composed to improve capacity or efficiency. "Large-scale systems such as CogVideoX~\citep{yang2025cogvideox} and Wan~2.1~\citep{wan2025open} adopt DiTs with expert transformer modules and deliver strong visual quality"
  • first-order expansion: A linear approximation of function change using Jacobians/derivatives around the current point. "A first-order expansion between consecutive steps gives:"
  • finite-difference (secant) estimates: Numerical approximations of derivatives/Jacobians by evaluating function differences over small input changes. "Since computing exact sensitivities is expensive, we approximate them using directional finite-difference (secant) estimates."
  • flow matching models: Generative models that learn a velocity field to transform noise into data via deterministic flows. "Diffusion models~\citep{ho2020ddpm,song2021sde} and flow matching models ~\cite{albergo2022building,lipman2022flow} have reshaped generative modeling"
  • flow-matching view: The perspective of diffusion as learning a continuous flow/velocity field that maps data to noise (and back) via an ODE. "We adopt the flow-matching view of diffusion models \cite{albergo2023stochastic,ma2024sit}, where a data sample $\mathbf{x}_0 \sim p_{\mathrm{data}}$ is continuously transformed into a noisy variable $\mathbf{x}_t$"
  • Jacobian: The matrix of first-order partial derivatives capturing how outputs change with respect to inputs. "we compute the Jacobian and partial derivative of the denoiser output with respect to the noisy latent and the timestep, respectively:"
  • Jacobian norm: A scalar measure of sensitivity derived from the Jacobian’s norm, indicating how responsive the model is locally. "the local sensitivity is expressed through the Jacobian norm:"
  • latent drift: The change in the latent variable between steps, which can impact output differences. "These sensitivities quantify the effect of latent drift and timestep spacing on the denoiser output"
  • lexicographic minimax: An optimization criterion that prioritizes minimizing the worst-case error in a lexicographic (ordered) sense. "LeMiCa takes a different perspective and formulates cache scheduling as a global path optimization problem (lexicographic minimax) to control worst-case accumulated error across steps."
  • local Lipschitz constants: Bounds relating input perturbations to output changes in a neighborhood, derived from Jacobian norms. "Thus, the Jacobian norms act as local Lipschitz constants governing the model responsiveness to latent and timestep perturbations."
  • local sensitivity: The responsiveness of a model’s output to small input perturbations around a given point. "Our key idea is to use the denoiser’s local sensitivity—i.e., the variation of its output with respect to perturbations in the noisy latent and timestep"
  • LPIPS: A perceptual similarity metric based on deep features used to assess visual quality. "Following MagCache~\citep{ma2025magcache}, we report LPIPS~\citep{zhang2018lpips}, SSIM~\citep{wang2004ssim}, and PSNR as metrics for visual quality."
  • noisy latent: The intermediate latent variable corrupted by noise during diffusion sampling. "perturbations in the denoising inputs, i.e., the noisy latent and the timestep"
  • number of function evaluations (NFEs): The count of model evaluations performed by the sampler, largely determining inference time. "Hence, the number of function evaluations (NFEs) largely determines latency."
  • ODE solvers: Numerical integrators used to solve ordinary differential equations governing the sampling dynamics. "In practice, ODE solvers (e.g., Euler or diffusion ODE solvers such as DPM-Solver \cite{lu2022dpm}) evaluate the learned field"
  • Probability Flow ODE: An ODE formulation of generative diffusion that deterministically transports samples along a probability flow. "Flow Matching and the Probability Flow ODE."
  • PSNR: Peak Signal-to-Noise Ratio, a distortion-based metric for fidelity evaluation. "Following MagCache~\citep{ma2025magcache}, we report LPIPS~\citep{zhang2018lpips}, SSIM~\citep{wang2004ssim}, and PSNR as metrics for visual quality."
  • pyramid schedule: A multi-scale scheduling strategy (e.g., for attention maps) that exploits hierarchical redundancy. "PAB targets video DiTs by broadcasting attention maps in a pyramid schedule, exploiting the U-shaped redundancy of attention differences"
  • residual: The difference between a model’s prediction and its input (or between successive outputs), used as a reuse heuristic. "MagCache~\citep{ma2025magcache} selects reuse timesteps based on the magnitude of the residual (the difference between the model’s prediction and its input)."
  • sensitivity-aware caching: A caching approach that bases reuse decisions on measured or estimated model sensitivity. "We address this limitation with a principled sensitivity-aware caching framework."
  • Sensitivity-Aware Caching (SenCache): The proposed dynamic policy that reuses outputs only when predicted changes are below a tolerance. "we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis."
  • sensitivity score: A combined metric using latent and time sensitivities to decide whether to reuse a cached output. "such that the sensitivity score (see \Cref{eq:sensitivity-score}) falls below $\varepsilon$"
  • time embedding: A learned representation of timestep used by diffusion models to condition the denoiser. "TeaCache~\citep{liu2025teacache} builds cache-reuse rules through output residual modeling with time embedding difference"
  • timestep spacing: The step size between successive timesteps, which affects output change and reuse safety. "These sensitivities quantify the effect of latent drift and timestep spacing on the denoiser output"
  • U-Net: A convolutional encoder–decoder architecture commonly used in diffusion denoisers. "For U-Net models, DeepCache reuses high-level features across adjacent timesteps to cut redundant work with minimal quality loss~\citep{Ma2024DeepCache}."
  • velocity field: The vector field defining the instantaneous change of the latent under the generative ODE. "The marginal distribution of $\mathbf{x}_t$ evolves under a velocity field $\mathbf{v}(\mathbf{x},t)$"
  • velocity-matching objective: A training loss that matches the learned velocity field to the ground-truth target velocity. "A neural network $\mathbf{v}_\theta(\mathbf{x}_t,t)$ is trained with the standard velocity-matching objective"
  • Video-VAE: A variational autoencoder tailored for video latents, often paired with a diffusion denoiser. "LTX-Video~\citep{HaCohen2024LTXVideo} further improves efficiency by tightly coupling a Video-VAE with a DiT-based denoiser."
  • video diffusion transformers: Transformer-based diffusion architectures specialized for video generation. "This cost is especially prohibitive for modern video diffusion transformers, which contain billions of parameters"
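Several of the entries above (ODE solvers, NFEs, finite-difference secant estimates, sensitivity score) fit together into one caching loop. The sketch below is a toy Euler sampler that reuses the cached velocity whenever a scalar secant sensitivity estimate predicts the output change falls below the threshold $\varepsilon$. This is an illustrative sketch only, not the paper's algorithm: the scalar secant stand-in for the Jacobian-norm analysis and all names (`sencache_euler`, `v_theta`, `eps`) are assumptions.

```python
import numpy as np

def sencache_euler(v_theta, x, timesteps, eps):
    """Euler integration of dx/dt = v_theta(x, t) with sensitivity-aware
    caching (toy sketch). Returns the final latent and the number of
    function evaluations (NFEs) actually spent."""
    x = np.asarray(x, dtype=float)
    v_cache = None            # last fully computed velocity
    x_ref, t_ref = None, None  # inputs at which v_cache was computed
    sens = np.inf             # finite-difference (secant) sensitivity estimate
    nfe = 0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        if v_cache is not None:
            # Drift of the inputs since the last full evaluation:
            # latent drift plus timestep spacing.
            drift = np.linalg.norm(x - x_ref) + abs(t - t_ref)
            # First-order prediction of the output change; reuse if it
            # stays within the per-step error budget eps.
            reuse = sens * drift < eps
        else:
            drift, reuse = 0.0, False
        if reuse:
            v = v_cache                 # skip the expensive forward pass
        else:
            v = v_theta(x, t)           # full model evaluation
            nfe += 1
            if v_cache is not None and drift > 0:
                # Secant estimate: observed output change per unit input change.
                sens = np.linalg.norm(v - v_cache) / drift
            v_cache, x_ref, t_ref = v, np.array(x), t
        x = x + (t_next - t) * v        # Euler step of the probability-flow ODE
    return x, nfe
```

With `eps = 0` every step pays a full evaluation, while a loose `eps` collapses most steps onto the cached velocity, which is the latency/quality trade-off the threshold controls.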
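The "local Lipschitz constants" entry can also be checked numerically: to first order, the output change under a joint latent/timestep perturbation is bounded by the Jacobian spectral norm times the latent perturbation plus the time-derivative norm times the timestep gap. The elementwise function `f` below is a toy stand-in for a denoiser, chosen only so its Jacobian has a closed form.

```python
import numpy as np

# Toy stand-in for a denoiser output f(x, t); acts elementwise so that
# its Jacobian with respect to x is diagonal and easy to write down.
def f(x, t):
    return np.tanh(x) * np.exp(-t)

x, t = np.array([0.3, -0.5]), 0.7
dx, dt = np.array([1e-3, -2e-3]), 5e-4   # small latent / timestep perturbations

# Analytic Jacobian wrt x and partial derivative wrt t.
J = np.diag(np.exp(-t) / np.cosh(x) ** 2)
dfdt = -np.tanh(x) * np.exp(-t)

actual = np.linalg.norm(f(x + dx, t + dt) - f(x, t))
# First-order Lipschitz-style bound: ||J||_2 ||dx|| + ||df/dt|| |dt|.
bound = np.linalg.norm(J, 2) * np.linalg.norm(dx) + np.linalg.norm(dfdt) * abs(dt)
print(actual <= bound)  # the Jacobian norms bound the first-order output change
```

For perturbations small enough that second-order terms are negligible, `actual` stays below `bound`, which is why these norms can serve as a predictor of caching error.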
