Diffusion in RWKV-7 (DIR-7)
- The paper introduces DIR-7, which integrates score-based diffusion probabilistic modeling with RWKV-7's linear-time recurrence and novel CrossWKV mechanism to achieve efficient, competitive image synthesis.
- The methodology employs a two-stage DDPM framework with recurrent WKV state updates and LoRA adaptations, ensuring constant memory usage and linear complexity compared to Transformer models.
- DIR-7 demonstrates competitive performance on benchmarks like ImageNet 256×256 while significantly reducing computational and memory costs, highlighting its scalable cross-modal fusion capabilities.
Diffusion in RWKV-7 (DIR-7) refers to the implementation of score-based diffusion probabilistic modeling within the RWKV-7 neural architecture, leveraging the linear-time Weighted Key-Value (WKV) recurrence and the novel CrossWKV cross-attention mechanism. This approach enables efficient and expressive text-conditioned and unconditional image synthesis with constant memory and linear complexity, in contrast to conventional Transformer-based diffusion pipelines. DIR-7 achieves competitive results on benchmarks such as ImageNet 256×256 and demonstrates robust cross-modal alignment capabilities.
1. Pipeline Overview and Architecture
DIR-7 follows a two-stage denoising diffusion probabilistic model (DDPM) framework, adapted for integration with RWKV-7 and CrossWKV. The major components in the DIR-7 pipeline are:
- Image and Text Tokenization: The noisy image $x_t$ is processed by a small convolutional encoder into a sequence of feature tokens. The conditioning text is encoded by a frozen CLIP model to yield token embeddings, zero-padded to a fixed length for joint processing.
- CrossWKV Layers: The concatenated image and text sequences are jointly processed through one or more CrossWKV layers. Each layer maintains a recurrent state matrix $S \in \mathbb{R}^{d_h \times d_h}$ per attention head, where $h$ is the number of heads and $d_h$ is the per-head feature dimension.
- Denoising Prediction: The output from CrossWKV layers is added residually and passed to a U-Net decoder. The network predicts the noise residual required for DDPM sampling.
- Diffusion Process: At each training iteration, the model minimizes the DDPM objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2\right],$$

where $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, $\alpha_s = 1 - \beta_s$, and the schedule for $\beta_t$ is linear in $t$.
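As a concrete illustration, this training objective can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: `ddpm_loss` and the stand-in `eps_model` callable (representing the CrossWKV + U-Net noise predictor) are hypothetical names, and the linear schedule endpoints are the standard DDPM defaults rather than values stated by the source.

```python
import numpy as np

def ddpm_loss(x0, eps_model, T=1000, rng=np.random.default_rng(0)):
    """One-sample DDPM training loss (illustrative sketch).

    eps_model(x_t, t) is a placeholder for the CrossWKV + U-Net noise
    predictor; any callable with that signature works here.
    """
    beta = np.linspace(1e-4, 0.02, T)       # linear schedule (standard DDPM defaults)
    alpha_bar = np.cumprod(1.0 - beta)      # \bar{alpha}_t = prod_s (1 - beta_s)
    t = rng.integers(0, T)                  # uniform random timestep
    eps = rng.standard_normal(x0.shape)     # target noise epsilon ~ N(0, I)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

With a trained predictor the loss approaches zero; with an untrained one it is roughly the variance of the noise.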
2. RWKV-7 and WKV State Update Mechanism
DIR-7’s core computation is built around the recurrent WKV state update, which replaces quadratic-complexity self-attention with a linear, input-adaptive recurrence per token. For each CrossWKV head, the update at time $t$ involves:
- State Transition:

$$S_t = S_{t-1}\left(\mathrm{diag}(w_t) - \hat{k}_t\, (a_t \odot \hat{k}_t)^{\top}\right) + v_t \otimes k_t,$$

where $k_t, v_t$ are key and value vectors; $w_t$ is a vector-valued decay gate; $a_t$ is a vector-valued learning rate; $\hat{k}_t$ is the normalized key; and $\otimes$ is the outer product.
- Output Computation:

$$y_t = S_t\, r_t + \lambda\, (k_t^{\top} r_t)\, v_t.$$

Here, $r_t$ is the receptance gate, and $\lambda$ is a small trainable scalar.
The state transition matrix is fully non-diagonal and input-dependent, granting expressive power beyond the complexity class $\mathsf{TC}^0$ and enabling representation of all regular languages, as demonstrated by state-tracking tasks such as permutation modeling (Xiao et al., 19 Apr 2025).
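A minimal NumPy sketch of one per-head update may clarify the recurrence. The function name `wkv_step`, the key normalization, and the fixed scalar `lam` are illustrative assumptions; the exact parameterization is defined by the paper.

```python
import numpy as np

def wkv_step(S, k, v, w, a, r, lam=0.1):
    """One per-head WKV update (sketch of the recurrence above).

    S : (d, d) state matrix; k, v, w, a, r : (d,) vectors.
    w in (0, 1) is the decay gate, a the in-context learning rate,
    r the receptance; lam stands in for the small trainable scalar.
    """
    k_hat = k / (np.linalg.norm(k) + 1e-8)            # normalized key
    # non-diagonal, input-dependent transition: decay minus a rank-1 "erase"
    G = np.diag(w) - np.outer(k_hat, a * k_hat)
    S = S @ G + np.outer(v, k)                        # write new (v ⊗ k) association
    y = S @ r + lam * (k @ r) * v                     # receptance-gated read-out
    return S, y
```

Because the transition `G` is a full matrix built from the current token, the state can be rotated and selectively erased per step, not merely decayed elementwise.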
3. CrossWKV Cross-Attention: Fusion of Modalities
CrossWKV implements a single-pass cross-attention mechanism by jointly projecting and mixing image and text representations in each recurrent step:
- Projection and Gating: Both the image features and the text embeddings are projected into the gating vectors (receptance, key, value, decay, and learning rate), each implemented as a sum of a base linear projection and a low-rank adaptation (LoRA):

$$W' x = W x + B(Ax), \qquad A \in \mathbb{R}^{r \times d},\ B \in \mathbb{R}^{d \times r},$$

with prescribed low-rank dimensions of 64, 64, 16, and 128 across the adapted projections.
- Temporal and Spatial Fusion: Temporal context is captured via a learned time-shift of adjacent token features, permitting the spatial conditioning typical of vision tasks.
- Unified Recurrent Sweep: All fusion is realized within one recurrent scan, eliminating the need for separate cross-attention modules or iterative alternation between modalities.
- Generalized Delta Rule: The WKV recurrence is abstracted as:

$$S_t = G_t\, S_{t-1} + v_t \otimes k_t,$$

with the transition matrix $G_t$ fully determined by the token and prompt inputs.
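The single-pass fusion can be sketched as one recurrent loop over the concatenated token sequence, with each gate produced by a base-plus-LoRA projection. Everything here (function names, gate names, the sigmoid squashing, and the small LoRA ranks chosen for a toy dimension) is an illustrative assumption, not the paper's exact parameterization:

```python
import numpy as np

def lora_proj(x, W, A, B):
    """Base linear projection plus a low-rank (LoRA) correction: (W + BA) x."""
    return W @ x + B @ (A @ x)

def make_lora(d, rank, rng):
    """Random (W, A, B) LoRA triple; shapes: W (d,d), A (rank,d), B (d,rank)."""
    return (rng.standard_normal((d, d)) / np.sqrt(d),
            rng.standard_normal((rank, d)) / np.sqrt(d),
            rng.standard_normal((d, rank)) / np.sqrt(rank))

def crosswkv_scan(tokens, params):
    """One recurrent sweep over the concatenated [text; image] token sequence.

    tokens : (T, d) array; params maps gate names to (W, A, B) triples.
    """
    d = tokens.shape[1]
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    S = np.zeros((d, d))                              # constant-size recurrent state
    outs = []
    for x in tokens:
        k = lora_proj(x, *params["key"])
        v = lora_proj(x, *params["value"])
        r = lora_proj(x, *params["recept"])
        w = sig(lora_proj(x, *params["decay"]))       # decay gate in (0, 1)
        a = sig(lora_proj(x, *params["lr"]))          # in-context learning rate
        k_hat = k / (np.linalg.norm(k) + 1e-8)
        G = np.diag(w) - np.outer(k_hat, a * k_hat)   # non-diagonal transition
        S = S @ G + np.outer(v, k)                    # generalized delta rule
        outs.append(S @ r)
    return np.stack(outs)
```

Note that text and image tokens flow through the same recurrence: cross-modal mixing happens implicitly through the shared state, with no separate cross-attention module.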
4. Diffusion Formulation and Sampling in DIR-7
DIR-7 retains the standard forward and reverse processes from DDPM, but routes all denoising prediction through the RWKV-7 recurrent backbone. Formally:
- Forward Process:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \qquad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,$$

with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$ and $x_0$ the clean image.
- Reverse Process (Sampling):
At test time, DIR-7 samples $x_T \sim \mathcal{N}(0, I)$ and proceeds iteratively via

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t, c)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

at each time step, using the network's prediction $\epsilon_\theta(x_t, t, c)$ produced by the CrossWKV + U-Net modules.
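This reverse process is the standard DDPM ancestral sampling loop, sketched here in NumPy with $\sigma_t = \sqrt{\beta_t}$ and a placeholder `eps_model` standing in for the CrossWKV + U-Net predictor (the schedule endpoints are standard DDPM defaults, not values from the source):

```python
import numpy as np

def ddpm_sample(eps_model, shape, T=1000, rng=np.random.default_rng(0)):
    """Ancestral DDPM sampling (sketch); eps_model(x, t) predicts the noise."""
    beta = np.linspace(1e-4, 0.02, T)       # linear schedule (standard defaults)
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    x = rng.standard_normal(shape)          # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at t = 0
        eps = eps_model(x, t)
        x = (x - (1 - alpha[t]) / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])
        x = x + np.sqrt(beta[t]) * z        # sigma_t = sqrt(beta_t)
    return x
```

Each step calls the recurrent backbone once, so per-step cost stays linear in the token count throughout sampling.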
5. Computational and Memory Complexity
The architectural design yields the following computational properties:
- Linear Complexity: Each CrossWKV layer costs $O(N\, h\, d_h^2)$ FLOPs per diffusion step for a sequence of $N$ image and text tokens (about 17.5 GFLOPs per 256×256 image for DIR-7-H), in contrast to the $O(N^2 d)$ cost of standard Transformer cross-attention (Xiao et al., 19 Apr 2025, Fei et al., 2024).
- Constant Memory: Only the $d_h \times d_h$ state matrix per head is maintained throughout diffusion. All core updates are recurrent; the model avoids storing quadratic activation tensors.
- Empirical FLOPs and Memory: For a 256×256 image, inference requires 17.5 GFLOPs (DIR-7-H) versus 50 GFLOPs for DiT-style Transformer diffusion models. GPU memory usage is 4.5 GB for DIR-7-H versus 6.5 GB for DiT-XL (both at 256×256 resolution).
- Scalability: Compute and memory scale linearly in both image dimensions and prompt length, making DIR-7 suitable for high-resolution and long-context generation (Xiao et al., 19 Apr 2025, Fei et al., 2024).
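The scaling claim can be made concrete with a back-of-envelope cost model (constants omitted, purely illustrative): doubling the token count doubles the cost of a recurrent scan but quadruples the cost of pairwise attention.

```python
def attention_flops(n_tokens, d):
    """Rough per-layer cost model (constants dropped; illustrative only):
    a CrossWKV-style scan does one (d x d) state update per token, while
    full cross-attention compares every pair of tokens."""
    linear = n_tokens * d * d         # O(N d^2): recurrent scan
    quadratic = n_tokens ** 2 * d     # O(N^2 d): pairwise attention
    return linear, quadratic
```

At 1024 tokens and $d = 64$ the two terms are already an order of magnitude apart, and the gap widens linearly with sequence length.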
6. Empirical Performance and Ablation Results
DIR-7 demonstrates competitive results on conditional and unconditional generation tasks. Key results on ImageNet 256×256:
| Model | FID ↓ | CLIP Score ↑ | FLOPs/Image | Memory (GB) |
|---|---|---|---|---|
| DIR-7-H | 2.88 | 0.33 | 17.5 GFLOPs | 4.5 |
| DiT-XL/2 | 2.27 | ≈0.35 | 50+ GFLOPs | 6.5 |
| Diffusion-RWKV | — | 0.26 | — | — |
Key observations:
- DIR-7 achieves an FID of 2.88 and a CLIP score of 0.33, approaching state-of-the-art Transformer-based models while requiring substantially less compute and memory (Xiao et al., 19 Apr 2025).
- LoRA ablations confirm the importance of the decay, learning rate (LR), and value adaptations, with FID worsening and CLIP score dropping when these are removed (e.g., FID rises from 2.88 to 3.30 and CLIP falls from 0.33 to 0.27 when the decay LoRA is ablated).
- LoRA ranks (64, 64, 16, 128) yield optimal cross-modal alignment.
This suggests DIR-7’s recurrent, state-based formulation—along with efficient LoRA-based cross-attention and constant memory—offers a favorable trade-off between expressivity and efficiency, particularly for large-scale, conditional image synthesis at high resolutions.
7. Theoretical Implications and Scientific Significance
DIR-7, by leveraging RWKV-7’s non-diagonal, input-dependent WKV transition, achieves expressivity beyond the complexity class $\mathsf{TC}^0$ and, crucially, the representational power needed for arbitrary regular languages and dynamic state tracking (Xiao et al., 19 Apr 2025). CrossWKV provides full cross-modal integration in a single recurrent scan, eliminating the quadratic bottleneck of standard attention. A plausible implication is that this framework can be generalized beyond image synthesis to other domains where cross-modal, long-context, or high-resolution modeling is computationally prohibitive for transformers. DIR-7 demonstrates empirical and theoretical advances in efficient, expressive, and scalable diffusion modeling.
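The state-tracking claim can be illustrated directly: composing a stream of permutations requires a non-diagonal, input-dependent transition of exactly the form $S_t = G_t S_{t-1} + \dots$ used by the WKV update, whereas a purely diagonal (elementwise-decay) recurrence cannot represent the composition. The helper below is hypothetical, not from the paper:

```python
import numpy as np

def track_permutation(perm_seq, n):
    """Track the running composition of permutations with a non-diagonal,
    input-dependent state transition S_t = P_t @ S_{t-1}.

    Each p in perm_seq is a permutation of range(n); the final state is
    the permutation matrix of the whole composition, tracked exactly.
    """
    S = np.eye(n)                      # identity: "nothing applied yet"
    for p in perm_seq:
        P = np.eye(n)[list(p)]         # permutation matrix for p
        S = P @ S                      # non-diagonal, input-dependent update
    return S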