
RWKV-TS: Efficient Long-Sequence Model

Updated 18 January 2026
  • RWKV-TS is a family of architectures that adapts the RWKV model with linear recurrent attention for efficient long-sequence modeling across domains like time series and language.
  • It integrates a dual-module design—Time-Mix and Channel-Mix—with stateful recurrence, learnable decay, and parallel training to optimize performance.
  • Enhanced with meta-learning and gating mechanisms, RWKV-TS achieves significant compute, memory, and latency reductions versus Transformer and CNN approaches.

RWKV-TS refers to a family of architectures derived from the Receptance Weighted Key-Value (RWKV) model, designed for efficient long-sequence modeling via linear recurrent attention mechanisms. RWKV-TS uniquely adapts the RWKV paradigm for time series, language generation, and speech modeling through modifications in recurrence, gating, and sometimes meta-learning. In a variety of domains, RWKV-TS offers competitive or superior performance versus Transformer- and CNN-based approaches, delivering substantial improvements in computational cost, memory usage, and latency (weile et al., 8 Mar 2025, Hou et al., 2024, Pan, 21 Feb 2025, An et al., 2023, Peng et al., 2023, Yueyu et al., 4 Apr 2025).

1. Architectural Foundations

The canonical RWKV block composes two main modules: Time-Mix and Channel-Mix. The Time-Mix module maintains a recurrent state and carries out an "attention-like" weighted aggregation using the WKV (weighted key-value) mechanism. Formally, for input $x_t \in \mathbb{R}^d$ at time $t$:

  • Time-Mix State Update (RWKV-7 style, as in Rimer):

\mathrm{State}_{t} = \mathrm{State}_{t-1} \left( \mathrm{diag}(w_{t}) - \hat\kappa_{t}^\top (a_{t} \odot \hat\kappa_{t}) \right) + v_{t}^\top \tilde{k}_{t} \odot a_{t}

where $w_t$, $a_t$, $v_t$, $\hat\kappa_t$, $\tilde{k}_t \in \mathbb{R}^d$ are learned gate vectors, and $\odot$ denotes the element-wise product.
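As an illustration, the state transition above can be sketched in NumPy. The $d \times d$ matrix state and the broadcasting of $a_t$ over the rank-1 write are assumptions of this sketch, not a reference implementation:

```python
import numpy as np

def time_mix_state_update(state, w, a, v, k_hat, k_tilde):
    """One RWKV-7-style Time-Mix state transition.

    Illustrative shapes: state is (d, d); all gate vectors are (d,).
    """
    # Transition matrix: diagonal decay minus a rank-1 correction.
    transition = np.diag(w) - np.outer(k_hat, a * k_hat)
    # Rank-1 write of the current (value, key) pair, gated by a
    # (a broadcasts across rows here -- an assumption of this sketch).
    write = np.outer(v, k_tilde) * a
    return state @ transition + write

rng = np.random.default_rng(0)
d = 4
state = np.zeros((d, d))
w, a, v, k_hat, k_tilde = (rng.uniform(0.1, 0.9, d) for _ in range(5))
state = time_mix_state_update(state, w, a, v, k_hat, k_tilde)
print(state.shape)
```

The diagonal-plus-rank-1 structure is what keeps the per-step cost low relative to full attention.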

  • Channel-Mix (MLP):

y_t = \phi(W_1 x_t + b_1) \odot (W_2 x_t + b_2)

with $\phi(\cdot)$ an activation function, typically ReLU.
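A minimal NumPy sketch of the Channel-Mix gate; the hidden width and the ReLU choice are illustrative:

```python
import numpy as np

def channel_mix(x, W1, b1, W2, b2):
    """Channel-Mix as a gated MLP: phi(W1 x + b1) elementwise-times (W2 x + b2)."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu(W1 @ x + b1) * (W2 @ x + b2)

d, h = 4, 8  # input and hidden widths (illustrative)
rng = np.random.default_rng(1)
y = channel_mix(rng.normal(size=d),
                rng.normal(size=(h, d)), np.zeros(h),
                rng.normal(size=(h, d)), np.zeros(h))
print(y.shape)
```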

Time-Mix also admits a Deep Equilibrium (DEQ) fixed-point formulation for implicit parallelism:

h_t = \phi \Big( W h_t + V \big( \mathrm{State}_{t-1} \left( \mathrm{diag}(w_t) - \hat\kappa_t^\top (a_t \odot \hat\kappa_t) \right) \big) + U( v_t^\top \tilde{k}_t \odot a_t ) \Big)
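A hedged sketch of solving such a fixed point by naive iteration, as suggested by the 5-iteration setting in Section 5. The state- and input-dependent terms are collapsed into a single constant vector `c`, an illustrative simplification:

```python
import numpy as np

def deq_fixed_point(W, c, num_iters=5):
    """Solve h = relu(W h + c) by naive fixed-point iteration.

    c stands in for the state- and input-dependent terms of the DEQ
    formulation (an illustrative simplification). Convergence assumes
    the map is contractive (small spectral norm of W).
    """
    h = np.zeros_like(c)
    for _ in range(num_iters):
        h = np.maximum(W @ h + c, 0.0)
    return h

rng = np.random.default_rng(2)
d = 6
W = 0.05 * rng.normal(size=(d, d))  # small spectral norm -> contraction
c = rng.normal(size=d)
h = deq_fixed_point(W, c)
residual = np.linalg.norm(h - np.maximum(W @ h + c, 0.0))
print(residual)
```

In practice, implicit-differentiation or stop-gradient tricks replace backpropagation through the inner loop.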

RWKV-TS adapts these blocks for the target domain, with recurring feature projection and residual connections. Embedding, normalization, and dropout are implemented as in standard deep sequence models.

2. Mechanisms for Efficient Long-Sequence Modeling

RWKV-TS achieves linear compute and memory complexity, $O(Td)$ for a sequence of length $T$ and hidden dimension $d$ (Hou et al., 2024, Peng et al., 2023):

  • Stateful Recurrence: Only the current state, a set of running accumulators (e.g., $a_t$, $b_t$), and the previous input are required at each time step.
  • Learnable Decay: Per-channel exponential decay, $w_j \in (0,1)$, enables selective long-term information retention, mitigating the vanishing gradient problem of classical RNNs.
  • Parallel Training via Scan Kernels: During training, efficient time-parallel scans are used to aggregate across sequence positions while maintaining the recurrence, enabling fast hardware utilization.
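The mechanisms above can be illustrated with a simplified decayed key-value scan; this is a stand-in for the full WKV aggregation, not the exact kernel:

```python
import numpy as np

def streaming_scan(keys, values, w):
    """Linear-time stateful recurrence over a (T, d) sequence.

    Per-channel decay w in (0, 1) accumulates key-value products one step
    at a time, O(d) per step and O(T*d) total -- a simplified stand-in for
    the full WKV aggregation.
    """
    T, d = keys.shape
    state = np.zeros(d)
    outputs = np.empty((T, d))
    for t in range(T):
        state = w * state + keys[t] * values[t]  # only the state is carried
        outputs[t] = state
    return outputs

rng = np.random.default_rng(3)
T, d = 128, 8
out = streaming_scan(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                     rng.uniform(0.8, 0.99, size=d))
print(out.shape)
```

Because each step depends only on the carried state, the same loop serves for streaming inference, while training can replace it with a time-parallel scan.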

This design allows for immediate token-wise outputs in streaming scenarios and supports very long contexts (tested up to $T = 4096$ in time series and $T \gg 4000$ in NLP and ASR tasks).

3. Domain-Specific Innovations and Extensions

3.1 Time-Series (RWKV-TS / Rimer)

RWKV-TS integrates meta-learning into the Time-Mix mechanism. All Time-Mix gates are functions $[w_t, a_t, v_t, \hat\kappa_t, \tilde{k}_t] = g_\theta(x_t)$, where $g_\theta$ is a meta-learner producing gates optimized for rapid task adaptation. Training uses a combined loss:

\min_{\theta} \sum_{\text{tasks}} L_{\text{task}} + \lambda L_{\text{meta}}

where $L_{\text{meta}}$ penalizes large updates to $\theta$ (e.g., via $\ell_2$ regularization). Even a small MAML-style inner-loop update per task yields a 5–10% accuracy improvement at negligible additional compute. No architectural changes beyond the addition of $\theta$ are required (weile et al., 8 Mar 2025).
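A toy sketch of this objective, assuming an $\ell_2$ meta-penalty and one SGD-style inner step per task; the helper names and the quadratic task loss are hypothetical:

```python
import numpy as np

def inner_update(theta, grad_task, alpha=0.1):
    """One MAML-style inner-loop step: adapt gate parameters to a task."""
    return theta - alpha * grad_task

def meta_objective(theta, per_task_grads, task_loss_fn, lam=0.01):
    """Sum of post-adaptation task losses plus an l2 penalty on theta
    (an assumed form of L_meta, for illustration)."""
    total = 0.0
    for g in per_task_grads:
        total += task_loss_fn(inner_update(theta, g))
    return total + lam * float(np.sum(theta ** 2))

theta = np.ones(4)
task_loss = lambda th: float(np.sum((th - 0.5) ** 2))  # toy quadratic task
grads = [2.0 * (theta - 0.5)] * 3                      # its exact gradient
val = meta_objective(theta, grads, task_loss)
print(val)
```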

3.2 Language Modeling (RWKV-TS with Convolutional Shift and Gating)

Enhancements entail:

  • Position-Aware Convolutional Shift (CS): Aggregates and convolves the last $K$ hidden states with a learnable position decay, maintaining global coherence even with small $K$ via multi-layer stacking.

h_t^{shift} = \sum_{i=1}^{K} (p_i \odot W^{cs}_i) \odot h_{t-i}^{(\ell)} + b^{cs}

  • Neurally-Gated Router (G): Dynamically gates the shifted information:

g_t = \sigma\big( W_g [h_t^{raw} \,\|\, h_{t-1}^{(\ell)}] + b_g \big)

h_t^{enhanced} = h_t^{raw} + g_t \odot h_t^{shift}

These mechanisms enhance long-range dependency modeling and syntactic adaptation, yielding marked improvements in ROUGE-L, clause boundary handling, and entity coreference (Pan, 21 Feb 2025).
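A shape-level sketch of the shift-and-gate update; all weight shapes here are assumptions of the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_shift_and_gate(h_hist, h_raw, P, Wcs, bcs, Wg, bg):
    """Position-aware convolutional shift over the last K hidden states,
    followed by a learned gate.

    Illustrative shapes: h_hist is (K, d) with the newest state first;
    P (position decays) and Wcs are (K, d); Wg is (d, 2*d).
    """
    # Weighted elementwise mix of the K previous states with position decay.
    h_shift = np.sum((P * Wcs) * h_hist, axis=0) + bcs
    # Gate conditioned on the raw state and the most recent previous state.
    g = sigmoid(Wg @ np.concatenate([h_raw, h_hist[0]]) + bg)
    return h_raw + g * h_shift

rng = np.random.default_rng(4)
K, d = 3, 8
out = conv_shift_and_gate(rng.normal(size=(K, d)), rng.normal(size=d),
                          rng.uniform(size=(K, d)), rng.normal(size=(K, d)),
                          np.zeros(d), rng.normal(size=(d, 2 * d)), np.zeros(d))
print(out.shape)
```

The residual form means the gate can shut off the shifted signal entirely, falling back to the plain RWKV state.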

3.3 Streaming ASR (RWKV-TS Transducer)

RWKV-TS for ASR uses minimally cached recurrence for streaming decoding:

  • Zero-latency: Only a left context of 1 frame is cached; no future frames are required.
  • Boundary-Aware Transducer (BAT): On-the-fly alignment via CIF restricts loss computation to active windows, reducing training memory by over 40%.

RWKV-TS matches or surpasses conformer-based transducers on Mandarin and English ASR benchmarks of up to 10k hours, with significant memory and latency reductions (An et al., 2023).

3.4 TTS (RWKVTTS)

RWKVTTS deploys RWKV-7 blocks in both text and audio embedding, processing a concatenated sequence of text and audio tokens produced by VQ-VAE. The architecture admits online/streamed generation and demonstrates high subjective quality and efficiency, rivaling transformer-based systems in production metrics (Yueyu et al., 4 Apr 2025).

4. Performance Benchmarks and Comparative Analysis

RWKV-TS models consistently outperform or match Transformer- and CNN-based baselines across domains:

| Application | Metric | Transformer | RWKV-TS (Rimer) | Improvement |
|---|---|---|---|---|
| Time series (ECL) | RMSE | 0.6488 | 0.2409 | ×2.7 |
| Time series (ETTH) | RMSE | 0.5770 | 0.0133 | ×43.3 |
| Time series (Traffic) | RMSE | 0.0055 | 0.0025 | ×2.2 |
| Time series (Weather) | RMSE | 6.1765 | 5.4311 | ×1.1 |
| Training time (ECL) | hours/epoch | 9 | 2 | ×4.5 faster |

RWKV-TS requires only ~1/23 of the parameters of Timer (1.6M vs. 37.8M), with RMSE reductions of up to 43× and a 4.5× speedup in wall-clock training (weile et al., 8 Mar 2025). In general time-series tasks, RWKV-TS delivers comparable or superior forecasting and classification accuracy while reducing compute and memory by factors of 3–80 (Hou et al., 2024).

Ablation studies in text modeling show additive benefits from convolutional shift and gating (+0.11 ROUGE-L over the baseline) at minimal (~1.7%) inference-latency overhead (Pan, 21 Feb 2025). ASR results indicate near-parity in error rates versus chunked conformer baselines, but with zero extra latency and sharply reduced memory (An et al., 2023). TTS evaluation via human ratings demonstrates equivalence to leading transformer-based systems in production quality and enjoyment, though quantitative acoustic metrics are not reported (Yueyu et al., 4 Apr 2025).

5. Implementation and Training Practices

RWKV-TS models are readily deployed on both AMD (ROCm + Triton) and NVIDIA (CUDA) GPUs, and reference implementations are publicly released (see the cited papers).

Recommended practices:

  • Normalize input features per channel (zero mean, unit variance)
  • Tune batch size and layer width for large-data scaling; use gradient accumulation as needed
  • Configure DEQ implicit blocks for 5 fixed-point iterations (no gradients through inner loop)
  • Switch between ROCm and CUDA depending on hardware
  • For language and TTS, stack multiple RWKV-7 blocks with appropriate embedding and VQ tokenization
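The first practice, per-channel input normalization, can be sketched as:

```python
import numpy as np

def normalize_per_channel(x, eps=1e-8):
    """Zero-mean, unit-variance normalization per channel for a (T, d) series.

    Statistics are computed over the time axis; eps guards against
    constant channels.
    """
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(5)
x = normalize_per_channel(rng.normal(loc=3.0, scale=2.0, size=(256, 4)))
print(x.shape)
```

For forecasting, the same training-set statistics should be reused at inference time to avoid leakage.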

All code and weights are made publicly accessible for reproducibility, but certain hyperparameters (for TTS in particular) may require inspection or contact with maintainers (weile et al., 8 Mar 2025, Yueyu et al., 4 Apr 2025).

6. Limitations, Interpretations, and Directions for Future Research

RWKV-TS inherits several strengths and current constraints:

  • Unidirectional recurrence limits imputation in time series and TTS; bidirectional extensions ("Bi-RWKV-TS") and local convolutional augmentations are plausible next steps (Hou et al., 2024, Yueyu et al., 4 Apr 2025).
  • Depth and head count remain lower than in largest transformers; future exploration of wide/deep variants is encouraged.
  • Meta-learning yields rapid adaptation but introduces additional optimization complexity.
  • In TTS, scores for Production Complexity remain low across all models, indicating a need for improved expressiveness and prosody handling.
  • Empirical results indicate robust scaling laws in model size and data regime; RWKV preserves hardware parallelism in training while enabling true streamable inference (Peng et al., 2023).

Further proposed directions include bidirectional recurrence, adaptive decay kernels, sparse adaptation in MLP layers, and broader pretraining on unlabeled sequence data.

7. Significance and Impact

RWKV-TS architectures demonstrate that meta-learned, stateful, recurrent linear attention can replace quadratic-cost transformer backbones across diverse domains, including time-series modeling, language generation, speech recognition, and synthesis. RWKV-TS achieves dramatic reductions in parameter count, training time, and inference latency, without sacrificing SOTA accuracy. The combination of linear scaling, minimal memory footprint, and flexibility in integration of gating/meta-learning mechanisms positions RWKV-TS as a highly efficient alternative to Transformer architectures in both research and real-world applications (weile et al., 8 Mar 2025, Hou et al., 2024, Pan, 21 Feb 2025, An et al., 2023, Yueyu et al., 4 Apr 2025).
