Diffusion in RWKV-7 (DIR-7)
- The paper introduces DIR-7, which integrates score-based diffusion probabilistic modeling with RWKV-7's linear-time recurrence and novel CrossWKV mechanism to achieve efficient, competitive image synthesis.
- The methodology employs a two-stage DDPM framework with recurrent WKV state updates and LoRA adaptations, ensuring constant memory usage and linear complexity compared to Transformer models.
- DIR-7 demonstrates competitive performance on benchmarks like ImageNet 256×256 while significantly reducing computational and memory costs, highlighting its scalable cross-modal fusion capabilities.
Diffusion in RWKV-7 (DIR-7) refers to the implementation of score-based diffusion probabilistic modeling within the RWKV-7 neural architecture, leveraging the linear-time Weighted Key-Value (WKV) recurrence and the novel CrossWKV cross-attention mechanism. This approach enables efficient and expressive text-conditioned and unconditional image synthesis with constant memory and linear complexity, in contrast to conventional Transformer-based diffusion pipelines. DIR-7 achieves competitive results on benchmarks such as ImageNet 256×256 and demonstrates robust cross-modal alignment capabilities.
1. Pipeline Overview and Architecture
DIR-7 follows a two-stage denoising diffusion probabilistic model (DDPM) framework, adapted for integration with RWKV-7 and CrossWKV. The major components in the DIR-7 pipeline are:
- Image and Text Tokenization: The noisy image $x_t$ is processed by a small convolutional encoder into a sequence of feature tokens. The conditioning text is encoded by a frozen CLIP model to yield token embeddings, zero-padded to a fixed length for joint processing.
- CrossWKV Layers: The concatenated image and text sequences are jointly processed through one or more CrossWKV layers. Each layer maintains a recurrent state matrix $S \in \mathbb{R}^{d_h \times d_h}$ per attention head, where $h$ is the number of heads and $d_h$ is the per-head feature dimension.
- Denoising Prediction: The output from CrossWKV layers is added residually and passed to a U-Net decoder. The network predicts the noise residual required for DDPM sampling.
- Diffusion Process: At each training iteration, the model minimizes the DDPM objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2\right],$$

where $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, $\alpha_s = 1 - \beta_s$, and the schedule for $\beta_t$ is linear in $t$.
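As a concrete illustration, this training objective can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: `ddpm_loss` and the stand-in `eps_model` callable (representing the CrossWKV + U-Net noise predictor) are hypothetical names, and the linear schedule endpoints are the standard DDPM defaults rather than values stated by the source.

```python
import numpy as np

def ddpm_loss(x0, eps_model, T=1000, rng=np.random.default_rng(0)):
    """One-sample DDPM training loss (illustrative sketch).

    eps_model(x_t, t) is a placeholder for the CrossWKV + U-Net noise
    predictor; any callable with that signature works here.
    """
    beta = np.linspace(1e-4, 0.02, T)       # linear schedule (standard DDPM defaults)
    alpha_bar = np.cumprod(1.0 - beta)      # \bar{alpha}_t = prod_s (1 - beta_s)
    t = rng.integers(0, T)                  # uniform random timestep
    eps = rng.standard_normal(x0.shape)     # target noise epsilon ~ N(0, I)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

With a trained predictor the loss approaches zero; with an untrained one it is roughly the variance of the noise.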
2. RWKV-7 and WKV State Update Mechanism
DIR-7’s core computation is built around the recurrent WKV state update, which replaces quadratic-complexity self-attention with a linear, input-adaptive recurrence per token. For each CrossWKV head, the update at time $t$ involves:
- State Transition:

$$S_t = S_{t-1}\left(\mathrm{diag}(w_t) - \hat{k}_t\, (a_t \odot \hat{k}_t)^{\top}\right) + v_t \otimes k_t,$$

where $k_t, v_t$ are key and value vectors; $w_t$ is a vector-valued decay gate; $a_t$ is a vector-valued learning rate; $\hat{k}_t$ is the normalized key; and $\otimes$ is the outer product.
- Output Computation:

$$y_t = S_t\, r_t + \lambda\, (k_t^{\top} r_t)\, v_t.$$

Here, $r_t$ is the receptance gate, and $\lambda$ is a small trainable scalar.
The state transition matrix is fully non-diagonal and input-dependent, granting expressive power beyond the complexity class $\mathsf{TC}^0$ and enabling representation of all regular languages, as demonstrated by state-tracking tasks such as permutation modeling (Xiao et al., 19 Apr 2025).
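A minimal NumPy sketch of one per-head update may clarify the recurrence. The function name `wkv_step`, the key normalization, and the fixed scalar `lam` are illustrative assumptions; the exact parameterization is defined by the paper.

```python
import numpy as np

def wkv_step(S, k, v, w, a, r, lam=0.1):
    """One per-head WKV update (sketch of the recurrence above).

    S : (d, d) state matrix; k, v, w, a, r : (d,) vectors.
    w in (0, 1) is the decay gate, a the in-context learning rate,
    r the receptance; lam stands in for the small trainable scalar.
    """
    k_hat = k / (np.linalg.norm(k) + 1e-8)            # normalized key
    # non-diagonal, input-dependent transition: decay minus a rank-1 "erase"
    G = np.diag(w) - np.outer(k_hat, a * k_hat)
    S = S @ G + np.outer(v, k)                        # write new (v ⊗ k) association
    y = S @ r + lam * (k @ r) * v                     # receptance-gated read-out
    return S, y
```

Because the transition `G` is a full matrix built from the current token, the state can be rotated and selectively erased per step, not merely decayed elementwise.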
3. CrossWKV Cross-Attention: Fusion of Modalities
CrossWKV implements a single-pass cross-attention mechanism by jointly projecting and mixing image and text representations in each recurrent step:
- Projection and Gating: Both the image features and the text embeddings are projected into the gating vectors (receptance, key, value, decay, and learning rate), each implemented as a sum of a base linear projection and a low-rank adaptation (LoRA):

$$W' x = W x + B(Ax), \qquad A \in \mathbb{R}^{r \times d},\ B \in \mathbb{R}^{d \times r},$$

with prescribed low-rank dimensions of 64, 64, 16, and 128 across the adapted projections.
- Temporal and Spatial Fusion: Temporal context is captured via a learned time-shift of adjacent token features, permitting the spatial conditioning typical of vision tasks.
- Unified Recurrent Sweep: All fusion is realized within one recurrent scan, eliminating the need for separate cross-attention modules or iterative alternation between modalities.
- Generalized Delta Rule: The WKV recurrence is abstracted as:

$$S_t = G_t\, S_{t-1} + v_t \otimes k_t,$$

with the transition matrix $G_t$ fully determined by the token and prompt inputs.
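The single-pass fusion can be sketched as one recurrent loop over the concatenated token sequence, with each gate produced by a base-plus-LoRA projection. Everything here (function names, gate names, the sigmoid squashing, and the small LoRA ranks chosen for a toy dimension) is an illustrative assumption, not the paper's exact parameterization:

```python
import numpy as np

def lora_proj(x, W, A, B):
    """Base linear projection plus a low-rank (LoRA) correction: (W + BA) x."""
    return W @ x + B @ (A @ x)

def make_lora(d, rank, rng):
    """Random (W, A, B) LoRA triple; shapes: W (d,d), A (rank,d), B (d,rank)."""
    return (rng.standard_normal((d, d)) / np.sqrt(d),
            rng.standard_normal((rank, d)) / np.sqrt(d),
            rng.standard_normal((d, rank)) / np.sqrt(rank))

def crosswkv_scan(tokens, params):
    """One recurrent sweep over the concatenated [text; image] token sequence.

    tokens : (T, d) array; params maps gate names to (W, A, B) triples.
    """
    d = tokens.shape[1]
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    S = np.zeros((d, d))                              # constant-size recurrent state
    outs = []
    for x in tokens:
        k = lora_proj(x, *params["key"])
        v = lora_proj(x, *params["value"])
        r = lora_proj(x, *params["recept"])
        w = sig(lora_proj(x, *params["decay"]))       # decay gate in (0, 1)
        a = sig(lora_proj(x, *params["lr"]))          # in-context learning rate
        k_hat = k / (np.linalg.norm(k) + 1e-8)
        G = np.diag(w) - np.outer(k_hat, a * k_hat)   # non-diagonal transition
        S = S @ G + np.outer(v, k)                    # generalized delta rule
        outs.append(S @ r)
    return np.stack(outs)
```

Note that text and image tokens flow through the same recurrence: cross-modal mixing happens implicitly through the shared state, with no separate cross-attention module.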
4. Diffusion Formulation and Sampling in DIR-7
DIR-7 retains the standard forward and reverse processes from DDPM, but routes all denoising prediction through the RWKV-7 recurrent backbone. Formally:
- Forward Process:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \qquad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,$$

with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$ and $x_0$ the clean image.
- Reverse Process (Sampling):
At test time, DIR-7 samples $x_T \sim \mathcal{N}(0, I)$ and proceeds iteratively via

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t, c)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

at each time step, using the network's prediction $\epsilon_\theta(x_t, t, c)$ produced by the CrossWKV + U-Net modules.
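This reverse process is the standard DDPM ancestral sampling loop, sketched here in NumPy with $\sigma_t = \sqrt{\beta_t}$ and a placeholder `eps_model` standing in for the CrossWKV + U-Net predictor (the schedule endpoints are standard DDPM defaults, not values from the source):

```python
import numpy as np

def ddpm_sample(eps_model, shape, T=1000, rng=np.random.default_rng(0)):
    """Ancestral DDPM sampling (sketch); eps_model(x, t) predicts the noise."""
    beta = np.linspace(1e-4, 0.02, T)       # linear schedule (standard defaults)
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    x = rng.standard_normal(shape)          # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at t = 0
        eps = eps_model(x, t)
        x = (x - (1 - alpha[t]) / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])
        x = x + np.sqrt(beta[t]) * z        # sigma_t = sqrt(beta_t)
    return x
```

Each step calls the recurrent backbone once, so per-step cost stays linear in the token count throughout sampling.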
5. Computational and Memory Complexity
The architectural design yields the following computational properties:
- Linear Complexity: Each CrossWKV layer costs $O(N\, h\, d_h^2)$ FLOPs per diffusion step for a sequence of $N$ image and text tokens (about 17.5 GFLOPs per 256×256 image for DIR-7-H), in contrast to the $O(N^2 d)$ cost of standard Transformer cross-attention (Xiao et al., 19 Apr 2025, Fei et al., 2024).
- Constant Memory: Only the $d_h \times d_h$ state matrix per head is maintained throughout diffusion. All core updates are recurrent; the model avoids storing quadratic activation tensors.
- Empirical FLOPs and Memory: For a 256×256 image, inference requires 17.5 GFLOPs (DIR-7-H) versus 50 GFLOPs for DiT-style Transformer diffusion models. GPU memory usage is 4.5 GB for DIR-7-H versus 6.5 GB for DiT-XL (both at 256×256 resolution).
- Scalability: Compute and memory scale linearly in both image dimensions and prompt length, making DIR-7 suitable for high-resolution and long-context generation (Xiao et al., 19 Apr 2025, Fei et al., 2024).
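The scaling claim can be made concrete with a back-of-envelope cost model (constants omitted, purely illustrative): doubling the token count doubles the cost of a recurrent scan but quadruples the cost of pairwise attention.

```python
def attention_flops(n_tokens, d):
    """Rough per-layer cost model (constants dropped; illustrative only):
    a CrossWKV-style scan does one (d x d) state update per token, while
    full cross-attention compares every pair of tokens."""
    linear = n_tokens * d * d         # O(N d^2): recurrent scan
    quadratic = n_tokens ** 2 * d     # O(N^2 d): pairwise attention
    return linear, quadratic
```

At 1024 tokens and $d = 64$ the two terms are already an order of magnitude apart, and the gap widens linearly with sequence length.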
6. Empirical Performance and Ablation Results
DIR-7 demonstrates competitive results on conditional and unconditional generation tasks. Key results on ImageNet 256×256:
| Model | FID ↓ | CLIP Score ↑ | FLOPs/Image | Memory (GB) |
|---|---|---|---|---|
| DIR-7-H | 2.88 | 0.33 | 17.5 GFLOPs | 4.5 |
| DiT-XL/2 | 2.27 | ≈0.35 | 50+ GFLOPs | 6.5 |
| Diffusion-RWKV | — | 0.26 | — | — |
Key observations:
- DIR-7 achieves an FID of 2.88 and a CLIP score of 0.33, approaching state-of-the-art Transformer-based models while requiring substantially less compute and memory (Xiao et al., 19 Apr 2025).
- LoRA ablations confirm the importance of the decay, learning rate (LR), and value adaptations, with FID worsening and CLIP score dropping when these are removed (e.g., FID rises from 2.88 to 3.30 and CLIP falls from 0.33 to 0.27 when the decay LoRA is ablated).
- LoRA ranks (64, 64, 16, 128) yield optimal cross-modal alignment.
This suggests DIR-7’s recurrent, state-based formulation—along with efficient LoRA-based cross-attention and constant memory—offers a favorable trade-off between expressivity and efficiency, particularly for large-scale, conditional image synthesis at high resolutions.
7. Theoretical Implications and Scientific Significance
DIR-7, by leveraging RWKV-7’s non-diagonal, input-dependent WKV transition, achieves expressivity beyond the complexity class $\mathsf{TC}^0$ and, crucially, the representational power needed for arbitrary regular languages and dynamic state tracking (Xiao et al., 19 Apr 2025). CrossWKV provides full cross-modal integration in a single recurrent scan, eliminating the quadratic bottleneck of standard attention. A plausible implication is that this framework can be generalized beyond image synthesis to other domains where cross-modal, long-context, or high-resolution modeling is computationally prohibitive for transformers. DIR-7 demonstrates empirical and theoretical advances in efficient, expressive, and scalable diffusion modeling.
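The state-tracking claim can be illustrated directly: composing a stream of permutations requires a non-diagonal, input-dependent transition of exactly the form $S_t = G_t S_{t-1} + \dots$ used by the WKV update, whereas a purely diagonal (elementwise-decay) recurrence cannot represent the composition. The helper below is hypothetical, not from the paper:

```python
import numpy as np

def track_permutation(perm_seq, n):
    """Track the running composition of permutations with a non-diagonal,
    input-dependent state transition S_t = P_t @ S_{t-1}.

    Each p in perm_seq is a permutation of range(n); the final state is
    the permutation matrix of the whole composition, tracked exactly.
    """
    S = np.eye(n)                      # identity: "nothing applied yet"
    for p in perm_seq:
        P = np.eye(n)[list(p)]         # permutation matrix for p
        S = P @ S                      # non-diagonal, input-dependent update
    return S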