
EMA-Sink: Exponential Moving Average for Streaming Data

Updated 27 January 2026
  • EMA-Sink is a constant-memory streaming mechanism that computes exponentially weighted averages using bias-corrected recursions for both fixed- and variable-interval data.
  • It underpins applications in sequence modeling, such as streaming video generation and adaptive learning rate optimization, by efficiently compressing long-horizon dependencies.
  • Employing controlled decay hyperparameters and adaptive schemes such as p-EMA, EMA-Sink provides robust convergence, reduced bias, and constant memory usage.

The Exponential Moving Average Sink (EMA-Sink) is a constant-memory streaming mechanism for maintaining exponentially weighted averages under sequential data ingestion. EMA-Sink generalizes exponential smoothing to both fixed- and variable-interval sampling regimes, and has recently seen critical application in sequence modeling, notably in autoregressive diffusion processes for streaming video generation (Movellan, 2015, Lu et al., 4 Dec 2025). In parallel, strong stochastic convergence properties for adaptive EMA-Sink variants have been established in the ergodic theory and stochastic gradient descent literature (Köhne et al., 15 May 2025), distinguishing EMA-Sink from classical constant-decay EMA.

1. Definitions and Update Recursions

EMA-Sink maintains a running estimate of an exponentially weighted mean over time-indexed observations $\{X_k\}$, optionally adapting to non-uniform arrival times.

Variable-interval (Version 1)

For arrival times $t_1 \leq t_2 \leq \dots$, define a time constant $\tau > 0$, intervals $\Delta_k = t_k - t_{k-1}$, and decay factors $\alpha_k = \exp(-\Delta_k/\tau)$. The EMA-Sink maintains a numerator $\tilde{X}_k$ and a denominator $\tilde{w}_k$:

$$\tilde{X}_k = X_k + \alpha_k \tilde{X}_{k-1}, \quad \tilde{X}_1 = X_1$$

$$\tilde{w}_k = 1 + \alpha_k \tilde{w}_{k-1}, \quad \tilde{w}_1 = 1$$

yielding the bias-corrected average $\hat{X}_k = \tilde{X}_k / \tilde{w}_k$.
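The two-state recursion can be sketched as a small Python class (an illustrative implementation, not code from the cited papers; the class and method names are assumptions):

```python
import math

class EmaSink:
    """Variable-interval, bias-corrected EMA (Version 1)."""

    def __init__(self, tau):
        self.tau = tau        # time constant controlling the decay rate
        self.x_acc = 0.0      # numerator  ~X_k
        self.w_acc = 0.0      # denominator ~w_k
        self.last_t = None    # timestamp of the previous observation

    def update(self, x, t):
        # alpha_k = exp(-Delta_k / tau); the first sample gets full weight
        alpha = 0.0 if self.last_t is None else math.exp(-(t - self.last_t) / self.tau)
        self.x_acc = x + alpha * self.x_acc
        self.w_acc = 1.0 + alpha * self.w_acc
        self.last_t = t
        return self.x_acc / self.w_acc  # bias-corrected estimate of X-hat_k
```

Because the denominator is tracked explicitly, a constant input stream is reproduced exactly from the very first sample, with no startup bias.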

Fixed-interval (Version 2, Single-State)

For a fixed step $\Delta$, define the constant $\alpha = \exp(-\Delta/\tau)$ and update recursively:

$$B_k = (1-\alpha) X_k + \alpha B_{k-1}, \quad B_1 = X_1$$

$B_k$ converges asymptotically to the bias-corrected average of Version 1, since the normalized weight $(1-\alpha)\tilde{w}_k \to 1$.
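A quick numerical check (illustrative; fixed intervals, synthetic data) that the single-state recursion agrees with the bias-corrected two-state average once the startup transient has decayed:

```python
import random

def both_versions(alpha, xs):
    """Run Version 2 (single state b) and Version 1 (x_acc / w_acc) side by side."""
    b = xs[0]                       # B_1 = X_1
    x_acc, w_acc = xs[0], 1.0       # ~X_1 = X_1, ~w_1 = 1
    for x in xs[1:]:
        b = (1 - alpha) * x + alpha * b
        x_acc = x + alpha * x_acc
        w_acc = 1.0 + alpha * w_acc
    return b, x_acc / w_acc

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(300)]
b, x_hat = both_versions(0.9, xs)   # disagreement decays like alpha^k
```

After 300 steps with $\alpha = 0.9$ the residual disagreement is on the order of $\alpha^{300}$, i.e. numerically negligible.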

In streaming attention architectures, EMA-Sink maintains fixed-size "sink" vectors (projections of evicted tokens) that are updated exponentially on each window slide: $S^i = \alpha S^{i-1} + (1-\alpha) K^{i-w}$ for keys, and analogously for values (Lu et al., 4 Dec 2025).

2. Effective Memory, Bias, and Hyperparameters

EMA-Sink's design provides direct control over the effective averaging window and bias.

  • Effective Sample Count: For fixed $\alpha$ and i.i.d. noise of variance $\sigma^2$, the variance of the smoothed sequence is $\sigma^2 (1-\alpha)/(1+\alpha)$. Equating this to $\sigma^2 / n$ yields the "effective sample count":

$$n = \frac{1+\alpha}{1-\alpha}, \quad \alpha = \frac{n-1}{n+1}$$

  • Effective Time Window: In the small-$\Delta$ regime, the real-time window $T$ relates to $\tau$ as $T \approx 2\tau$; equivalently, set $\tau = T/2$. For discrete time, substituting $n = T/\Delta$ gives:

$$\alpha = e^{-\Delta/\tau} = \frac{T/\Delta - 1}{T/\Delta + 1}$$

When $\Delta/T \ll 1$, the approximation $\tau \approx T/2$ is justified.

  • Bias and Initialization: The two-state EMA-Sink (Version 1) avoids startup bias, while the single-state variant (Version 2) incurs a slight initial bias that vanishes after several $\tau$ periods. In sequence attention, a static sink (no EMA update) is strongly biased toward initial tokens ("frame copying"), while EMA-Sink yields a dynamic compromise (Lu et al., 4 Dec 2025).
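The sample-count and time-window correspondences above can be checked numerically (a minimal sketch; the function names are assumptions):

```python
import math

def alpha_from_samples(n):
    """Decay factor giving effective sample count n: alpha = (n-1)/(n+1)."""
    return (n - 1) / (n + 1)

def samples_from_alpha(alpha):
    """Inverse map: n = (1+alpha)/(1-alpha)."""
    return (1 + alpha) / (1 - alpha)

def alpha_from_window(T, delta):
    """Discrete-time alpha for a target real-time window T at step delta."""
    n = T / delta
    return (n - 1) / (n + 1)

# n = 19 corresponds to alpha = 0.9, and back again
a = alpha_from_samples(19)
# For delta << T, this alpha matches exp(-delta/tau) with tau = T/2
a_win = alpha_from_window(10.0, 0.01)
a_exp = math.exp(-0.01 / (10.0 / 2))
```

The last two values agree to within about $10^{-9}$ here, confirming that $\tau = T/2$ is a good approximation when $\Delta/T \ll 1$.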

Hyperparameter tuning chiefly involves $\alpha$:

  • $\alpha \to 1$: long-horizon memory, slow decay of history, increased lag.
  • $\alpha \to 0$: rapid adaptation, minimal effective memory, high variance.

Empirical ablations in streaming video show increased dynamic scores with $\alpha \in [0.9, 0.99]$, with the optimal drift/motion tradeoff depending on the application (Lu et al., 4 Dec 2025).

3. Theoretical Guarantees and Convergence Properties

Extensions of EMA-Sink employing time-dependent decay ($p$-EMA with $\alpha_n = C/(n+1)^p$, $p \in (\tfrac{1}{2}, 1]$) achieve almost sure convergence under mild mixing/autocorrelation assumptions (Köhne et al., 15 May 2025).

  • Weighted SLLN: General strong law of large numbers for weighted averages applies when variance and covariance decay meet summability requirements:

$$\operatorname{Var}\left(\sum_{k=1}^n b_k X_k\right) = O(A_n^2/\psi(A_n)), \quad b_n \leq A_n/\psi(A_n)$$

With $b_n/\Lambda_n \to 0$ and suitably slow decay, $S_n/A_n \to \mathbb{E}[X]$ almost surely.

  • Fixed-$\alpha$ EMA Limitation: For constant decay, the weight on new samples remains bounded away from zero, preventing the averaged variance from vanishing. $p$-EMA resolves this via subharmonic decay rates. A plausible implication is that in adaptive learning rate applications, $p$-EMA estimators reliably track variance and mean without excess lag, and converge to stationary expectations of the underlying Markov chain (Köhne et al., 15 May 2025).
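The contrast can be illustrated with a minimal sketch (assumed recursion $m_n = (1-\alpha_n)m_{n-1} + \alpha_n X_n$ with $\alpha_n = C/(n+1)^p$; synthetic data, not the cited paper's exact setup):

```python
import random

def p_ema(xs, C=1.0, p=0.75):
    """p-EMA: the weight on new samples decays like C/(n+1)^p,
    so the estimator's variance vanishes as n grows."""
    m = xs[0]
    for n, x in enumerate(xs[1:], start=1):
        a = min(1.0, C / (n + 1) ** p)
        m = (1 - a) * m + a * x
    return m

def fixed_ema(xs, alpha=0.9):
    """Constant decay: the new-sample weight 1-alpha stays bounded away
    from zero, so residual variance sigma^2 (1-alpha)/(1+alpha) persists."""
    m = xs[0]
    for x in xs[1:]:
        m = alpha * m + (1 - alpha) * x
    return m

random.seed(1)
xs = [3.0 + random.gauss(0.0, 1.0) for _ in range(20000)]
m_p = p_ema(xs)        # concentrates near the true mean 3.0
m_f = fixed_ema(xs)    # retains noise of roughly sqrt((1-a)/(1+a)) ~ 0.23
```

With 20,000 unit-variance samples, the $p$-EMA estimate typically lands within a few hundredths of the true mean, while the fixed-$\alpha$ estimate keeps fluctuating at its stationary noise level.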

4. Implementation, Complexity, and Streaming Workflow

EMA-Sink is designed for efficient streaming with O(1) time and space.

  • Streaming Pseudocode (Version 1):

state: last_time ← t0
       X_acc    ← 0
       w_acc    ← 0

on_receive(X_new at t_new):
    Δ ← t_new − last_time
    α ← exp(−Δ/τ)
    X_acc ← α·X_acc + X_new
    w_acc ← α·w_acc + 1
    last_time ← t_new
    X̂ ← X_acc / w_acc
    output X̂
For attention-based window compression in autoregressive models:

sink_K = α * sink_K + (1−α) * K_evict
sink_V = α * sink_V + (1−α) * V_evict
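As a concrete sketch of the update above (pure Python; the dimension and variable names are illustrative), folding evicted key/value projections into fixed-size sink vectors on each window slide:

```python
def update_sinks(sink_K, sink_V, K_evict, V_evict, alpha=0.95):
    """One EMA-Sink update: blend the evicted key/value projection
    into the constant-size sink vectors."""
    sink_K = [alpha * s + (1 - alpha) * k for s, k in zip(sink_K, K_evict)]
    sink_V = [alpha * s + (1 - alpha) * v for s, v in zip(sink_V, V_evict)]
    return sink_K, sink_V

d = 8                           # illustrative head dimension
sink_K, sink_V = [0.0] * d, [0.0] * d
for _ in range(200):            # slide the window 200 times
    # constant evicted token: the sinks converge toward it at rate alpha^n
    sink_K, sink_V = update_sinks(sink_K, sink_V, [1.0] * d, [1.0] * d)
```

The sinks stay size $d$ no matter how many tokens are evicted, which is what makes the cache constant-memory.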

  • Complexity:
    • Update: O(1) exponentiation, multiplies, and adds per step.
    • Memory: O(1) per state accumulator.
    • No need for historical data storage.

This design enables real-time streaming without cache growth or unbounded memory requirements, fitting applications where the time horizon N can be arbitrarily large (Movellan, 2015, Lu et al., 4 Dec 2025).

5. Applications in Sequence Modeling and Adaptive Optimization

EMA-Sink has broad applicability in time-series analysis, deep learning, and compression of long-horizon dependencies:

  • Streaming Video Generation: In diffusion-based streaming architectures, EMA-Sink fuses evicted sliding-window tokens to produce sink key/value vectors. This preserves global context and recent dynamics, eliminating catastrophic drift and initial-frame copying observed in static sinks. Models employing EMA-Sink outperform both window-only and static-sink baselines, especially in dynamic score and drift metrics (Lu et al., 4 Dec 2025).
  • Adaptive Step Size in SGD: $p$-EMA is directly used to average gradient norms and variances for adaptive learning rate schemes. By decaying the weighting factor $\alpha_n$ to zero, the estimator's noise is suppressed and the step size is reliably adjusted, preventing explosion or stagnation (Köhne et al., 15 May 2025).
  • General Streaming Statistics: EMA-Sink's constant-memory, bias-corrected updates make it broadly applicable to on-the-fly computation of means and higher moments in both fixed- and variable-rate settings (Movellan, 2015).
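The adaptive step-size idea can be illustrated schematically (this is an assumed toy rule, not the algorithm of Köhne et al.): track a $p$-EMA of squared gradient norms and scale a base step by its inverse square root.

```python
class PEmaStepSize:
    """Toy adaptive step size: p-EMA estimate of E[||g||^2] with
    decaying weight a_n = C/(n+1)^p, so the estimate stops fluctuating."""

    def __init__(self, base_lr=0.1, C=1.0, p=0.75, eps=1e-8):
        self.base_lr, self.C, self.p, self.eps = base_lr, C, p, eps
        self.m = 0.0   # running p-EMA of the squared gradient norm
        self.n = 0     # step counter

    def step_size(self, grad_sq_norm):
        self.n += 1
        a = min(1.0, self.C / (self.n + 1) ** self.p)
        self.m = (1 - a) * self.m + a * grad_sq_norm
        return self.base_lr / (self.m + self.eps) ** 0.5

sched = PEmaStepSize(base_lr=0.1)
for _ in range(5000):
    lr = sched.step_size(16.0)   # constant ||g||^2 = 16, so lr settles near 0.1/4
```

Because $\alpha_n \to 0$, the estimate of the gradient norm stabilizes rather than tracking per-step noise, which is the property the bullet above relies on.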

6. Comparative Metrics, Ablation, and Limitations

Empirical results substantiate EMA-Sink's role in both maintaining coherence and capturing motion:

| Model Variant | Dynamic Score | Drift |
|---|---|---|
| EMA-Sink + Re-DMD (full) | 64.06 | 2.51 |
| w/o EMA (static sink) | 35.15 | 2.65 |
| w/o any sink (window only) | 51.56 | 5.08 |

Qualitative examples distinguish drifting (window only), frame copying (static sink), and natural transitions (EMA-Sink) (Lu et al., 4 Dec 2025).

Limitations identified include dependence on $\alpha$ selection (a single value may not fit heterogeneous dynamics), loss of fine-grained detail under coarse aggregation, and the requirement that EMA-Sink be paired with semantically aware distillation (such as Re-DMD) to preserve fidelity (Lu et al., 4 Dec 2025).

A plausible implication is that adaptive or multi-sink formulations, or an $\alpha$ schedule responsive to content complexity, may mitigate the loss of detail while retaining long-horizon benefits.

7. Concluding Remarks and Outlook

EMA-Sink establishes a mathematically principled and empirically validated approach for streaming average estimation and memory compression under exponential weighting. It achieves constant-time and constant-memory operation, direct control over memory dynamics via decay parameters, and strong convergence guarantees when using time-adaptive $p$-EMA schedules. Recent advances in streaming generative modeling demonstrate its practical utility, while theoretical studies explicate its advantages over classical EMAs (Movellan, 2015, Lu et al., 4 Dec 2025, Köhne et al., 15 May 2025).

Future research involves refinement of context-adaptive sinks, integration with more granular semantic measures, and extension to multi-resolution and multi-modal streaming scenarios. EMA-Sink remains central to scalable, bias-resilient long-horizon estimation in both statistical and deep learning models.
