
Diffusion in RWKV-7 (DIR-7)

Updated 28 December 2025
  • The paper introduces DIR-7, which integrates score-based diffusion probabilistic modeling with RWKV-7's linear-time recurrence and novel CrossWKV mechanism to achieve state-of-the-art image synthesis.
  • The methodology employs a two-stage DDPM framework with recurrent WKV state updates and LoRA adaptations, ensuring constant memory usage and linear complexity compared to Transformer models.
  • DIR-7 demonstrates competitive performance on benchmarks like ImageNet 256×256 while significantly reducing computational and memory costs, highlighting its scalable cross-modal fusion capabilities.

Diffusion in RWKV-7 (DIR-7) refers to the implementation of score-based diffusion probabilistic modeling within the RWKV-7 neural architecture, leveraging the linear-time Weighted Key-Value (WKV) recurrence and the novel CrossWKV cross-attention mechanism. This approach enables efficient and expressive text-conditioned and unconditional image synthesis with constant memory and linear complexity, in contrast to conventional Transformer-based diffusion pipelines. DIR-7 achieves state-of-the-art results on benchmarks such as ImageNet 256×256 and demonstrates robust cross-modal alignment capabilities.

1. Pipeline Overview and Architecture

DIR-7 follows a two-stage denoising diffusion probabilistic model (DDPM) framework, adapted for integration with RWKV-7 and CrossWKV. The major components in the DIR-7 pipeline are:

  • Image and Text Tokenization: A noisy image $x_t \in \mathbb{R}^{B \times H \times W \times C}$ is processed by a small convolutional encoder into features $\mathbf{x} \in \mathbb{R}^{B \times T \times D}$, with $T = H \cdot W$. The conditioning text is encoded by a frozen CLIP model to yield token embeddings $\mathbf{q} \in \mathbb{R}^{B \times L \times D_q}$, zero-padded to length $T$ for joint processing.
  • CrossWKV Layers: The concatenated image and text sequences are jointly processed through one or more CrossWKV layers. Each layer maintains a recurrent state $S_{t-1} \in \mathbb{R}^{H \times N \times N}$, i.e. one $N \times N$ matrix per attention head, where $H$ is the number of heads and $N$ is the per-head feature dimension.
  • Denoising Prediction: The output $\mathbf{o} \in \mathbb{R}^{B \times T \times D}$ from the CrossWKV layers is added residually and passed to a U-Net decoder. The network predicts the noise residual $\epsilon_\theta(x_t, q, t)$ required for DDPM sampling.
  • Diffusion Process: At each training iteration, the model minimizes the DDPM objective:

$$\mathcal{L} = \mathbb{E}_{\epsilon, x_0, q, t} \left\| \epsilon - \epsilon_\theta(x_t, q, t) \right\|_2^2,$$

where $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$, and the schedule for $\beta_t$ is linear in $[10^{-4}, 2 \times 10^{-2}]$ with $T = 1000$.
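As a concrete illustration, the training objective above can be sketched in NumPy. The `eps_theta` callable is a stand-in for the actual DIR-7 noise predictor, and all shapes are illustrative assumptions:

```python
import numpy as np

# Linear beta schedule on [1e-4, 2e-2] with T = 1000 steps, as stated above.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_loss(eps_theta, x0, q, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of L = E || eps - eps_theta(x_t, q, t) ||^2."""
    t = rng.integers(0, T)                        # t ~ Uniform{0, ..., T-1}
    eps = rng.standard_normal(x0.shape)           # target noise epsilon
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, q, t)) ** 2)

# Illustrative call with a trivial zero predictor (not the real network):
loss = ddpm_loss(lambda x_t, q, t: np.zeros_like(x_t), x0=np.zeros((8, 8)), q=None)
```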

2. RWKV-7 and WKV State Update Mechanism

DIR-7’s core computation is built around the recurrent WKV state update, which replaces quadratic-complexity self-attention with a linear, input-adaptive recurrence per token. For each CrossWKV head, the update at time $t$ involves:

  • State Transition:

$$S_t = S_{t-1} \left( \operatorname{diag}(w_t) - k_t^T (a_t \otimes k_t) \right) + v_t^T k_t$$

where $k_t, v_t$ are key and value vectors; $w_t$ is a vector-valued decay gate; $a_t$ is a vector-valued learning rate; and $\otimes$ denotes the element-wise product.

  • Output Computation:

$$y_t = r_t S_t + \left( r_t (p \otimes k_t)^T \right) v_t$$

Here, $r_t$ is the receptance gate and $p$ is a small trainable scalar.

The state transition matrix $T(x_t) = \operatorname{diag}(w_t) - k_t^T (a_t \otimes k_t)$ is fully non-diagonal and input-dependent, granting expressive power beyond the $\mathrm{TC}^0$ complexity class and enabling representation of all regular languages, as demonstrated by state-tracking tasks such as $S_5$ permutation modeling (Xiao et al., 19 Apr 2025).
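The two equations above can be realized in a few lines of NumPy. This is a minimal single-head sketch under row-vector conventions; the sigmoid gate parameterizations and the dimensions are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def wkv_step(S, r, k, v, w, a, p):
    """One recurrent WKV step for a single head:
      S_t = S_{t-1} (diag(w_t) - k_t^T (a_t * k_t)) + v_t^T k_t
      y_t = r_t S_t + (r_t (p * k_t)^T) v_t
    r, k, v, w, a: (N,) row vectors; S: (N, N); p: scalar.
    """
    transition = np.diag(w) - np.outer(k, a * k)  # non-diagonal, input-dependent T(x_t)
    S = S @ transition + np.outer(v, k)
    y = r @ S + (r @ (p * k)) * v
    return S, y

N = 4
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
S = np.zeros((N, N))
for _ in range(3):                                # recurrent scan over three tokens
    r, k, v = rng.standard_normal((3, N))
    w = sigmoid(rng.standard_normal(N))           # decay gate in (0, 1) (assumed)
    a = sigmoid(rng.standard_normal(N))           # per-channel learning rate (assumed)
    S, y = wkv_step(S, r, k, v, w, a, p=0.1)
```

Note that only `S` is carried across tokens, which is the source of the constant-memory property discussed in Section 5.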

3. CrossWKV Cross-Attention: Fusion of Modalities

CrossWKV implements a single-pass cross-attention mechanism by jointly projecting and mixing image and text representations in each recurrent step:

  • Projection and Gating: Both $\mathbf{x}$ and $\mathbf{q}$ are projected into $\mathbf{r}, \mathbf{k}, \mathbf{v}, \mathbf{w}, \mathbf{a}, \mathbf{g}$, each implemented as the sum of a base linear projection and a low-rank adaptation (LoRA):

$$W^\star = W^\star_0 + A^\star B^\star$$

with prescribed low-rank dimensions: rank 64 for $A^w, B^w$ and $A^a, B^a$; rank 16 for $A^v, B^v$; rank 128 for $A^g, B^g$.

  • Temporal and Spatial Fusion: Temporal context is captured via a time-shift feature $\delta = \mathrm{shift}(\mathbf{x}) - \mathbf{x}$, permitting the spatial conditioning typical of vision tasks.
  • Unified Recurrent Sweep: All fusion is realized within one recurrent scan, eliminating the need for separate cross-attention modules or iterative alternation between modalities.
  • Generalized Delta Rule: The WKV recurrence is abstracted as:

$$\Delta h_t = h_t - h_{t-1} = f\left( T(x_t)\, h_{t-1},\, x_t \right)$$

with $T(x_t)$ fully determined by the token and prompt inputs.
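The LoRA-augmented projections described above amount to adding a low-rank update to each base weight. A minimal sketch, where the shapes and the zero initialization of $B$ are illustrative assumptions rather than the paper's stated choices:

```python
import numpy as np

def lora_projection(x, W0, A, B):
    """Apply W* = W*_0 + A* B* to a sequence x.
    x: (T, D), W0: (D, D_out), A: (D, r), B: (r, D_out)."""
    return x @ W0 + (x @ A) @ B       # equivalent to x @ (W0 + A @ B)

D, D_out, r, T_len = 128, 128, 64, 5  # rank 64, e.g. the decay projection
rng = np.random.default_rng(0)
x = rng.standard_normal((T_len, D))
W0 = rng.standard_normal((D, D_out)) / np.sqrt(D)
A = rng.standard_normal((D, r)) / np.sqrt(D)
B = np.zeros((r, D_out))              # common LoRA init: B = 0, so adapter starts inert
out = lora_projection(x, W0, A, B)    # equals x @ W0 until B is trained
```

Computing `(x @ A) @ B` rather than materializing `W0 + A @ B` keeps the extra cost proportional to the rank `r`.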

4. Diffusion Formulation and Sampling in DIR-7

DIR-7 retains the standard forward and reverse process from DDPM, but adapts all denoising prediction to the RWKV-7 recurrent backbone. Formally:

  • Forward Process:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

with $t = 1, \ldots, T$ and $x_0$ the clean image.

  • Reverse Process (Sampling):

At test time, DIR-7 samples $x_T \sim \mathcal{N}(0, I)$ and proceeds iteratively via

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, q, t) \right) + \sqrt{\beta_t}\, z, \quad z \sim \mathcal{N}(0, I)$$

at each time step, using the network's prediction $\epsilon_\theta(x_t, q, t)$ produced by the CrossWKV and U-Net modules.
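Putting the schedule and the reverse update together, the sampling loop can be sketched as follows; the noise predictor here is a placeholder, not the actual CrossWKV + U-Net network:

```python
import numpy as np

# Same linear schedule as in training (Section 1).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(eps_theta, shape, q=None, rng=np.random.default_rng(0)):
    """DDPM ancestral sampling: x_T ~ N(0, I), then iterate the reverse update."""
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        eps = eps_theta(x, q, t)
        x = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = x + np.sqrt(betas[t]) * z                 # add noise except at the final step
    return x

# Illustrative run with a zero predictor standing in for epsilon_theta:
x0 = sample(lambda x, q, t: np.zeros_like(x), shape=(4, 4))
```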

5. Computational and Memory Complexity

The architectural design yields the following computational properties:

  • Linear Complexity: Each CrossWKV layer costs $O(B \times T \times H \times N)$ FLOPs per diffusion step ($H = 16$, $N = 64$ for DIR-7-H), in contrast to the $O(T^2)$ cost of standard Transformer cross-attention (Xiao et al., 19 Apr 2025; Fei et al., 2024).
  • Constant Memory: Only the state $S \in \mathbb{R}^{H \times N \times N}$ (one $N \times N$ matrix per head) is maintained throughout diffusion. All core updates are recurrent; the model avoids storing quadratic activation tensors.
  • Empirical FLOPs and Memory: For a $256 \times 256$ image, inference requires ~17.5 GFLOPs (DIR-7-H) versus >50 GFLOPs for DiT-style Transformer diffusion models. GPU memory usage is 4.5 GB for DIR-7-H versus 6.5 GB for DiT-XL (both on $256 \times 256$ images).
  • Scalability: Compute and memory scale linearly in both image dimensions and prompt length, making DIR-7 suitable for high-resolution and long-context generation (Xiao et al., 19 Apr 2025; Fei et al., 2024).
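The constant-memory claim is easy to sanity-check: with the DIR-7-H dimensions quoted above, the recurrent state occupies a fixed few hundred kilobytes regardless of sequence length (fp32 storage assumed):

```python
# H = 16 heads, N = 64 per-head dimension (DIR-7-H); S in R^{H x N x N}
H, N = 16, 64
state_bytes = H * N * N * 4        # 4 bytes per float32 entry
state_kib = state_bytes / 1024
print(state_kib)                   # -> 256.0 (KiB), independent of T and resolution
```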

6. Empirical Performance and Ablation Results

DIR-7 demonstrates competitive results on conditional and unconditional generation tasks. Key results on ImageNet 256×256:

Model            FID     CLIP Score   FLOPs/Image    Memory (GB)
DIR-7-H          2.88    0.33         17.5 GFLOPs    4.5
DiT-XL/2         2.27    ≈0.35        50+ GFLOPs     6.5
Diffusion-RWKV   –       0.26         –              –

Key observations:

  • DIR-7 achieves an FID of 2.88 and a CLIP score of 0.33, matching or surpassing state-of-the-art transformer-based models while requiring substantially less compute and memory (Xiao et al., 19 Apr 2025).
  • LoRA ablations confirm the importance of the decay, learning-rate (LR), and value adaptations: FID worsens and the CLIP score drops when they are removed (e.g., FID increases from 2.88 to 3.30 and CLIP drops from 0.33 to 0.27 when the decay LoRA is ablated).
  • The chosen LoRA ranks (64 for decay, 64 for LR, 16 for value, 128 for gate) yield the best cross-modal alignment.

This suggests DIR-7’s recurrent, state-based formulation—along with efficient LoRA-based cross-attention and constant memory—offers a favorable trade-off between expressivity and efficiency, particularly for large-scale, conditional image synthesis at high resolutions.

7. Theoretical Implications and Scientific Significance

DIR-7, by leveraging RWKV-7’s non-diagonal, input-dependent WKV transition, achieves expressivity beyond $\mathrm{TC}^0$ and, crucially, the representational power needed for arbitrary regular languages and dynamic state tracking (Xiao et al., 19 Apr 2025). CrossWKV provides full cross-modal integration in a single recurrent scan, eliminating the quadratic bottleneck of standard attention. A plausible implication is that this framework can be generalized beyond image synthesis to other domains where cross-modal, long-context, or high-resolution modeling is computationally prohibitive for transformers. DIR-7 demonstrates empirical and theoretical advances in efficient, expressive, and scalable diffusion modeling.
