EMA-Sink: Exponential Moving Average for Streaming Data
- EMA-Sink is a constant-memory streaming mechanism that computes exponentially weighted averages using bias-corrected recursions for both fixed- and variable-interval data.
- It underpins applications in sequence modeling, such as streaming video generation and adaptive learning rate optimization, by efficiently compressing long-horizon dependencies.
- Employing controlled decay hyperparameters and adaptive schemes like p-EMA, EMA-Sink ensures robust convergence, minimized bias, and optimal memory utilization.
The Exponential Moving Average Sink (EMA-Sink) is a constant-memory streaming mechanism for maintaining exponentially weighted averages under sequential data ingestion. EMA-Sink generalizes exponential smoothing to both fixed- and variable-interval sampling regimes, and has recently seen critical application in sequence modeling, notably in autoregressive diffusion processes for streaming video generation (Movellan, 2015, Lu et al., 4 Dec 2025). In parallel, strong stochastic convergence properties for adaptive EMA-Sink variants have been established in the ergodic theory and stochastic gradient descent literature (Köhne et al., 15 May 2025), distinguishing EMA-Sink from classical constant-decay EMA.
1. Definitions and Update Recursions
EMA-Sink maintains a running estimate of an exponentially weighted mean over time-indexed observations $X_{t_1}, X_{t_2}, \ldots$, optionally adapting to non-uniform arrival times $t_n$.
Variable-interval (Version 1)
For arrival times $t_1 < t_2 < \cdots$ with observations $X_n$ at time $t_n$, define time constant $\tau > 0$, interval $\Delta_n = t_n - t_{n-1}$, and decay factor $\alpha_n = e^{-\Delta_n/\tau}$. The EMA-Sink maintains numerators $X^{\mathrm{acc}}_n$ and denominators $w_n$:
$X^{\mathrm{acc}}_n = \alpha_n X^{\mathrm{acc}}_{n-1} + X_n, \qquad w_n = \alpha_n w_{n-1} + 1,$
yielding a bias-corrected average:
$\hat{X}_n = X^{\mathrm{acc}}_n / w_n.$
Fixed-interval (Version 2, Single-State)
For fixed step $\Delta$, define constant $\alpha = e^{-\Delta/\tau}$, and recursively update $\hat{X}_n = \alpha \hat{X}_{n-1} + (1-\alpha) X_n$. This single-state form converges asymptotically to the exponentially smoothed mean, since the implicit normalizing weight $\sum_{k=0}^{n-1} (1-\alpha)\alpha^k = 1 - \alpha^n \to 1$.
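To make the two variants concrete, here is a minimal Python sketch (function names are my own) contrasting the bias-corrected two-state recursion with the single-state form on a constant signal, where the startup bias of the latter is visible directly:

```python
import math

def two_state_ema(xs, alpha):
    # Version 1 specialized to fixed steps: separate numerator/denominator
    # accumulators give a bias-corrected estimate from the first sample on.
    x_acc = w_acc = 0.0
    out = []
    for x in xs:
        x_acc = alpha * x_acc + x
        w_acc = alpha * w_acc + 1.0
        out.append(x_acc / w_acc)
    return out

def single_state_ema(xs, alpha, init=0.0):
    # Version 2: one accumulator, biased toward `init` at startup.
    s = init
    out = []
    for x in xs:
        s = alpha * s + (1 - alpha) * x
        out.append(s)
    return out

alpha = math.exp(-1.0 / 10.0)   # fixed step Δ=1, time constant τ=10
xs = [5.0] * 20                 # constant signal of value 5
v1 = two_state_ema(xs, alpha)
v2 = single_state_ema(xs, alpha)
# v1 is unbiased from the start; v2 starts near (1-α)*5 ≈ 0.476
print(v1[0], v2[0])
```

The two-state version pays one extra scalar of state for exact bias correction; the single-state version is cheaper but approaches the smoothed mean only after several time constants.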
In streaming attention architectures, EMA-Sink operates by maintaining fixed-size "sink" vectors (projections of evicted tokens) updated with exponential decay upon window slide: $\mathrm{sink}_K \leftarrow \alpha\,\mathrm{sink}_K + (1-\alpha)\,K_{\mathrm{evict}}$ for keys, and analogously for values (Lu et al., 4 Dec 2025).
2. Effective Memory, Bias, and Hyperparameters
EMA-Sink's design provides direct control over the effective averaging window and bias.
- Effective Sample Count: For fixed $\alpha$, the variance of the smoothed sequence for i.i.d. noise of variance $\sigma^2$ is $\sigma^2 (1-\alpha)/(1+\alpha)$. Equating this to $\sigma^2 / N_{\mathrm{eff}}$ yields the "effective sample count":
$N_{\mathrm{eff}} = \frac{1+\alpha}{1-\alpha}.$
- Effective Time Window: In the small $\Delta/\tau$ regime, the real-time window relates to $\tau$ as $T_{\mathrm{eff}} = N_{\mathrm{eff}}\,\Delta \approx 2\tau$, or equivalently set $\tau \approx T_{\mathrm{eff}}/2$. For discrete time:
$N_{\mathrm{eff}} = \frac{1+\alpha}{1-\alpha} \approx \frac{2\tau}{\Delta}.$
When $\Delta \ll \tau$, the approximation $1 - \alpha \approx \Delta/\tau$ is justified.
- Bias and Initialization: Two-state EMA-Sink (Version 1) avoids startup bias, while the single-state variant (Version 2) incurs slight initial bias toward its initialization, vanishing after a few time constants. In sequence attention, a static sink (no EMA update) leads to strong bias toward initial tokens ("frame copying"), while EMA-Sink yields a dynamic compromise (Lu et al., 4 Dec 2025).
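As a quick numeric check of the window relations above, a short Python sketch (the helper name `n_eff` is my own):

```python
import math

def n_eff(alpha):
    # Effective sample count of a fixed-alpha EMA under i.i.d. noise:
    # the EMA's variance equals that of a plain average over n_eff samples.
    return (1 + alpha) / (1 - alpha)

tau, delta = 100.0, 1.0
alpha = math.exp(-delta / tau)
# Real-time window N_eff * Δ is close to 2τ = 200 when Δ << τ.
print(n_eff(alpha) * delta)
```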
Hyperparameter tuning chiefly involves $\alpha$ (equivalently $\tau$):
- $\alpha \to 1$: long-horizon memory, slow decay of history, increased lag.
- $\alpha \to 0$: rapid adaptation, minimal effective memory, high variance. Empirical ablations in streaming video show that the dynamic score is sensitive to $\alpha$, with optimal drift/motion tradeoffs depending on the application (Lu et al., 4 Dec 2025).
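The lag/variance tradeoff can be illustrated with a toy simulation (Python, using the fixed-interval single-state form; signal and parameter values are illustrative only):

```python
import random

def ema(xs, alpha, init=0.0):
    # Single-state fixed-interval EMA.
    s, out = init, []
    for x in xs:
        s = alpha * s + (1 - alpha) * x
        out.append(s)
    return out

random.seed(0)
# Signal steps from 0 to 1 halfway through, with additive Gaussian noise.
xs = [(0.0 if i < 50 else 1.0) + random.gauss(0, 0.1) for i in range(100)]
slow = ema(xs, alpha=0.99)   # long memory: smooth, but lags the step change
fast = ema(xs, alpha=0.50)   # short memory: tracks the step quickly, noisier
print(round(slow[60], 3), round(fast[60], 3))
```

Ten steps after the change, the fast EMA has essentially converged to the new level while the slow EMA has barely moved, matching the bullet points above.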
3. Theoretical Guarantees and Convergence Properties
Extensions of EMA-Sink employing time-dependent decay ($p$-EMA, with weights $\alpha_n \to 0$ decaying slowly enough that $\sum_n \alpha_n = \infty$) achieve almost sure convergence under mild mixing/autocorrelation assumptions (Köhne et al., 15 May 2025).
- Weighted SLLN: A general strong law of large numbers for weighted averages applies when variance and covariance decay meet summability requirements:
$\operatorname{Var}\left(\sum_{k=1}^n b_k X_k\right) = O(A_n^2/\psi(A_n)), \quad b_n \leq A_n/\psi(A_n).$
With $A_n = \sum_{k=1}^n b_k \to \infty$ and suitably slow decay, $A_n^{-1} \sum_{k=1}^n b_k X_k \to \mathbb{E}[X_1]$ almost surely.
- Fixed-$\alpha$ EMA Limitation: For constant decay, the weight on new samples remains bounded away from zero, preventing the averaged variance from vanishing. $p$-EMA resolves this via subharmonic decay rates. A plausible implication is that in adaptive learning rate applications, $p$-EMA estimators reliably track variance and mean without excess lag, and converge to stationary expectations of the underlying Markov chain (Köhne et al., 15 May 2025).
4. Implementation, Complexity, and Streaming Workflow
EMA-Sink is designed for efficient streaming with O(1) time and space.
- Streaming Pseudocode (Version 1):

```
state: last_time ← t0
       X_acc ← 0
       w_acc ← 0

on_receive(X_new at t_new):
    Δ ← t_new − last_time
    α ← exp(−Δ/τ)
    X_acc ← α·X_acc + X_new
    w_acc ← α·w_acc + 1
    last_time ← t_new
    X̂ ← X_acc / w_acc
    output X̂
```
- Attention Sink Update (fixed-size key/value sinks, applied on window slide):

```
sink_K = α * sink_K + (1−α) * K_evict
sink_V = α * sink_V + (1−α) * V_evict
```
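The Version 1 pseudocode above translates directly into a runnable sketch (Python; the class and method names are my own):

```python
import math

class EmaSink:
    """Variable-interval EMA-Sink (Version 1): two accumulators, bias-corrected."""

    def __init__(self, tau, t0=0.0):
        self.tau = tau          # time constant controlling the decay horizon
        self.last_time = t0
        self.x_acc = 0.0        # exponentially weighted numerator
        self.w_acc = 0.0        # exponentially weighted denominator

    def update(self, x_new, t_new):
        alpha = math.exp(-(t_new - self.last_time) / self.tau)
        self.x_acc = alpha * self.x_acc + x_new
        self.w_acc = alpha * self.w_acc + 1.0
        self.last_time = t_new
        return self.x_acc / self.w_acc   # bias-corrected running average

sink = EmaSink(tau=10.0)
for t, x in [(1.0, 2.0), (2.5, 4.0), (7.0, 3.0)]:   # irregular arrival times
    est = sink.update(x, t)
print(est)
```

Each `update` call is O(1) in time and the state is three scalars, matching the complexity claims below.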
- Complexity:
- Update: O(1) exponentiation, multiplies, and adds per step.
- Memory: O(1) per state accumulator.
- No need for historical data storage.
This design enables real-time streaming without cache growth or unbounded memory requirements, fitting applications where the time horizon N can be arbitrarily large (Movellan, 2015, Lu et al., 4 Dec 2025).
5. Applications in Sequence Modeling and Adaptive Optimization
EMA-Sink has broad applicability in time-series analysis, deep learning, and compression of long-horizon dependencies:
- Streaming Video Generation: In diffusion-based streaming architectures, EMA-Sink fuses evicted sliding-window tokens to produce sink key/value vectors. This preserves global context and recent dynamics, eliminating catastrophic drift and initial-frame copying observed in static sinks. Models employing EMA-Sink outperform both window-only and static-sink baselines, especially in dynamic score and drift metrics (Lu et al., 4 Dec 2025).
- Adaptive Step Size in SGD: $p$-EMA is directly used to average gradient norms and variances for adaptive learning rate schemes. By decaying the weighting factor to zero, the estimator's noise is suppressed and the step size is reliably adjusted, preventing explosion or stagnation (Köhne et al., 15 May 2025).
- General Streaming Statistics: EMA-Sink's constant-memory, bias-corrected updates make it broadly applicable to on-the-fly computation of means and higher moments in both fixed- and variable-rate settings (Movellan, 2015).
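For instance, streaming mean and variance can be obtained from EMA-Sink accumulators of the first two moments; a sketch under that approach (Python, helper name my own):

```python
import math

def stream_moments(stream, tau):
    # EMA-Sink accumulators for the first two moments; the weighted variance
    # then follows from E[x^2] - E[x]^2.
    m1 = m2 = w = 0.0
    last_t = None
    for x, t in stream:
        alpha = 1.0 if last_t is None else math.exp(-(t - last_t) / tau)
        m1 = alpha * m1 + x
        m2 = alpha * m2 + x * x
        w = alpha * w + 1.0
        last_t = t
    mean = m1 / w
    return mean, m2 / w - mean * mean

# A constant stream has mean 3 and variance ≈ 0 at any tau.
mean, var = stream_moments([(3.0, t) for t in range(10)], tau=5.0)
print(mean, var)
```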
6. Comparative Metrics, Ablation, and Limitations
Empirical results substantiate EMA-Sink's role in both maintaining coherence and capturing motion:
| Model Variant | Dynamic Score | Drift |
|---|---|---|
| EMA-Sink + Re-DMD (full) | 64.06 | 2.51 |
| –w/o EMA (static sink) | 35.15 | 2.65 |
| –w/o any sink (window only) | 51.56 | 5.08 |
Qualitative examples distinguish drifting (window only), frame copying (static sink), and natural transitions (EMA-Sink) (Lu et al., 4 Dec 2025).
Limitations identified include dependence on $\tau$ selection (a single value may not fit heterogeneous dynamics), loss of fine-grained detail under coarse aggregation, and the requirement that EMA-Sink be paired with semantically aware distillation (such as Re-DMD) to preserve fidelity (Lu et al., 4 Dec 2025).
A plausible implication is that adaptive or multi-sink formulations, or an $\alpha$ schedule responsive to content complexity, may mitigate loss of detail while retaining the long-horizon benefits.
7. Concluding Remarks and Outlook
EMA-Sink establishes a mathematically principled and empirically validated approach for streaming average estimation and memory compression under exponential weighting. It achieves constant-time and memory operation, direct control over memory dynamics via decay parameters, and strong convergence guarantees when using time-adaptive -EMA schedules. Recent advances in streaming generative modeling demonstrate its practical utility, while theoretical studies explicate its advantages over classical EMAs (Movellan, 2015, Lu et al., 4 Dec 2025, Köhne et al., 15 May 2025).
Future research involves refinement of context-adaptive sinks, integration with more granular semantic measures, and extension to multi-resolution and multi-modal streaming scenarios. EMA-Sink remains central to scalable, bias-resilient long-horizon estimation in both statistical and deep learning models.