CascadedViT (CViT): Efficient Vision Transformer
- CascadedViT is a lightweight vision transformer that employs a Cascaded-Chunk Feed Forward Network and Cascaded Group Attention to reduce computational, memory, and energy demands.
- It splits feature representations into sequential chunks and groups, substantially lowering parameters and FLOPs while preserving competitive ImageNet accuracy.
- The design is ideal for resource-constrained environments like mobile devices, drones, and edge clusters, offering up to 15–20% compute savings with minimal accuracy trade-offs.
CascadedViT (CViT) is a lightweight, compute-efficient vision transformer architecture that introduces the Cascaded-Chunk Feed Forward Network (CCFFN) and combines it with a Cascaded Group Attention (CGA) mechanism to reduce computational, memory, and energy consumption while maintaining competitive recognition accuracy in computer vision tasks. CViT is positioned as a strong candidate for resource-constrained deployments such as on mobile devices and drones (Sivakumar et al., 18 Nov 2025).
1. Architectural Foundations
CViT integrates two principal architectural modules:
A. Cascaded-Chunk Feed Forward Network (CCFFN)
Given an input token matrix $X \in \mathbb{R}^{N \times D}$ (where $N$ is the number of tokens and $D$ the feature dimension), the CCFFN splits $X$ along the channel dimension into $k$ equal-sized chunks $X = [X_1, \dots, X_k]$, each $X_i \in \mathbb{R}^{N \times D/k}$. The chunks are processed sequentially using a cascaded mechanism:
- $\tilde{X}_1 = X_1$; for $i > 1$, $\tilde{X}_i = X_i + Y_{i-1}$, so each chunk's input is augmented with the previous chunk's output.
- Each chunk is passed through its own FFN: $Y_i = \sigma(\tilde{X}_i W_1^{(i)}) W_2^{(i)}$, with $W_1^{(i)} \in \mathbb{R}^{(D/k) \times r(D/k)}$, $W_2^{(i)} \in \mathbb{R}^{r(D/k) \times (D/k)}$ ($r$ the expansion ratio), and nonlinear activation $\sigma$ (e.g., ReLU).
- Outputs are concatenated: $Y = \mathrm{Concat}[Y_1, \dots, Y_k]$. The cascade structure increases effective depth without additional parameters.
Parameter and FLOP efficiency follows because each chunk FFN operates on only $D/k$ channels: a standard FFN costs $2rD^2$ parameters, while the $k$ chunk FFNs together cost $k \cdot 2r(D/k)^2 = 2rD^2/k$, so $k = 2$ halves the FFN parameters and FLOPs.
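The chunked cascade can be sketched in NumPy. This is a minimal illustration, not the reference implementation: the function name, tensor sizes, and the exact cascade rule (adding the previous chunk's output to the next chunk's input) are assumptions based on the description above.

```python
import numpy as np

def ccffn(X, weights, k):
    """Cascaded-Chunk FFN sketch: split X (N x D) into k channel chunks,
    run each through its own small FFN, and feed each chunk's output
    into the next chunk's input (the cascade)."""
    chunks = np.split(X, k, axis=1)                    # k chunks of shape (N, D/k)
    outputs, prev = [], None
    for Xi, (W1, W2) in zip(chunks, weights):
        Xi_tilde = Xi if prev is None else Xi + prev   # cascade: add previous output
        Yi = np.maximum(Xi_tilde @ W1, 0) @ W2         # per-chunk FFN with ReLU
        outputs.append(Yi)
        prev = Yi
    return np.concatenate(outputs, axis=1)             # back to (N, D)

# Illustrative sizes: 4 tokens, D=8 channels, k=2 chunks, expansion r=2
rng = np.random.default_rng(0)
N, D, k, r = 4, 8, 2, 2
d = D // k
weights = [(rng.standard_normal((d, r * d)), rng.standard_normal((r * d, d)))
           for _ in range(k)]
X = rng.standard_normal((N, D))
Y = ccffn(X, weights, k)
print(Y.shape)  # (4, 8)
```

Note that the total parameter count of the two chunk FFNs here is $2rD^2/k = 128$, half of the $2rD^2 = 256$ a monolithic FFN with the same expansion ratio would use.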
B. Cascaded Group Attention (CGA)
Inherited unchanged from EfficientViT, the CGA divides the channel features into $g$ groups, performs self-attention sequentially per group, and accumulates outputs:
- $\tilde{X}_1 = X_1$; for $j > 1$, $\tilde{X}_j = X_j + \hat{X}_{j-1}$,
- $\hat{X}_j = \mathrm{Attn}(\tilde{X}_j W_j^Q,\, \tilde{X}_j W_j^K,\, \tilde{X}_j W_j^V)$, finishing with $\hat{X} = \mathrm{Concat}[\hat{X}_1, \dots, \hat{X}_g]\, W^P$
Computational cost and peak memory scale as $1/g$ of standard full-channel attention. At each step, only one group's $Q$, $K$, and $V$ tensors are in memory, reducing buffer requirements.
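A minimal NumPy sketch of one cascaded-group-attention step follows. The per-group projection matrices `Wq`, `Wk`, `Wv` and all sizes are illustrative assumptions, and the final output projection is omitted for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cascaded_group_attention(X, Wq, Wk, Wv, g):
    """Split channels into g groups and attend sequentially, adding each
    group's output to the next group's input (the cascade). Only one
    group's Q/K/V is materialized at a time."""
    groups = np.split(X, g, axis=1)
    outs, prev = [], None
    for j, Xj in enumerate(groups):
        Xj_tilde = Xj if prev is None else Xj + prev          # cascade input
        Q, K, V = Xj_tilde @ Wq[j], Xj_tilde @ Wk[j], Xj_tilde @ Wv[j]
        A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))            # (N, N) scores, this group only
        prev = A @ V
        outs.append(prev)
    return np.concatenate(outs, axis=1)                       # (N, D); projection omitted

# Illustrative sizes: 5 tokens, D=12 channels, g=3 groups
rng = np.random.default_rng(0)
N, D, g = 5, 12, 3
d = D // g
Wq = [rng.standard_normal((d, d)) for _ in range(g)]
Wk = [rng.standard_normal((d, d)) for _ in range(g)]
Wv = [rng.standard_normal((d, d)) for _ in range(g)]
X = rng.standard_normal((N, D))
out = cascaded_group_attention(X, Wq, Wk, Wv, g)
print(out.shape)  # (5, 12)
```

Because each attention score matrix is computed over a $D/g$-wide slice and discarded before the next group runs, peak activation memory tracks a single group rather than the full channel width.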
2. Computational Complexity and Memory Analysis
A. Parameter and FLOP Savings
CViT achieves substantial reductions in both parameter count and FLOPs:
| Model | Params (M) | FLOPs (M) | Parameter Reduction | FLOP Reduction |
|---|---|---|---|---|
| EfficientViT-M5 | 12.4 | 522 | - | - |
| CViT-XL | 9.8 | 435 | 21% | 16.7% |
Relative to a standard ViT-Base model (86M params, 17G FLOPs), CViT reports parameter and FLOP reductions exceeding 80% (Sivakumar et al., 18 Nov 2025).
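Both reduction columns in the table above are simple ratios of the raw counts:

```python
# Relative savings of CViT-XL over EfficientViT-M5, from the table above.
params_m5, params_xl = 12.4, 9.8   # millions of parameters
flops_m5, flops_xl = 522, 435      # MFLOPs

param_red = (params_m5 - params_xl) / params_m5 * 100
flop_red = (flops_m5 - flops_xl) / flops_m5 * 100
print(f"{param_red:.0f}% params, {flop_red:.1f}% FLOPs")  # 21% params, 16.7% FLOPs
```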
B. Memory Efficiency
The per-chunk and per-group computation scheme reduces both live and reserved memory footprints due to the local accumulation of intermediate results and sequential processing, decreasing DRAM accesses and energy costs.
3. Energy Consumption and Empirical Resource Use
Energy efficiency was evaluated using power sampling protocols on the Apple M4 Pro GPU: for each run, sampled power draw is integrated over the inference interval and divided by the number of processed images to yield energy per image.
Empirical results:
- CViT-XL: 653 ± 16 mJ/img
- EfficientViT-M5: 675 ± 23 mJ/img
This constitutes a 3.3% reduction in energy per image. A plausible implication is that such savings, while modest per operation, are operationally significant for continuously running systems or battery-powered hardware (Sivakumar et al., 18 Nov 2025).
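A sketch of this per-image energy accounting, with made-up sample values (the sampling interval, power readings, and image count below are purely illustrative):

```python
# Per-image energy estimate from sampled GPU power readings.
def energy_per_image_mj(power_samples_w, dt_s, n_images):
    total_j = sum(p * dt_s for p in power_samples_w)  # integrate power over time
    return 1e3 * total_j / n_images                   # J -> mJ per image

# Hypothetical: 10 W sustained for 2 s (200 samples at 10 ms) across 20 images
e = energy_per_image_mj([10.0] * 200, 0.01, 20)
print(round(e))  # 1000

# Relative saving implied by the reported per-image energies (mJ/img)
saving_pct = (675 - 653) / 675 * 100
print(round(saving_pct, 1))  # 3.3
```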
4. Experimental Performance
CViT models are benchmarked on ImageNet-1K, demonstrating competitive accuracy and superior efficiency:
| Model | Top-1 (%) | FLOPs (M) | Params (M) | Energy (mJ/img) |
|---|---|---|---|---|
| CViT-L | 73.0 | 249 | 7.0 | 588 ± 42 |
| EfficientViT-M4 | 74.3 | 299 | 8.8 | 620 ± 45 |
| CViT-XL | 75.5 | 435 | 9.8 | 653 ± 16 |
| EfficientViT-M5 | 77.1 | 522 | 12.4 | 675 ± 23 |
CViT-L delivers Top-1 accuracy 2.2 percentage points higher than EfficientViT-M2 (73.0% vs. 70.8%) with a comparable Accuracy-Per-FLOP (APF) score.
5. Compute Efficiency: The APF Metric
To holistically quantify compute efficiency, the metric Accuracy-Per-FLOP (APF) is introduced, normalizing accuracy by the logarithm of the compute cost:

$$\mathrm{APF} = \frac{\text{Top-1 accuracy (\%)}}{\log_{10}(\text{FLOPs in MFLOPs})}$$
Example APF values:
| Model | Top-1 (%) | FLOPs (M) | APF |
|---|---|---|---|
| CViT-M | 69.9 | 173 | 31.2 |
| EfficientViT-M2 | 70.8 | 201 | 30.7 |
| CViT-L | 73.0 | 249 | 30.5 |
| EfficientViT-M4 | 74.3 | 299 | 30.0 |
| CViT-XL | 75.5 | 435 | 28.6 |
| EfficientViT-M5 | 77.1 | 522 | 28.4 |
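Every APF value in the table above can be reproduced by dividing Top-1 accuracy by $\log_{10}$ of the MFLOP count:

```python
import math

# Reproduce the APF column: Top-1 (%) / log10(FLOPs in MFLOPs).
models = {
    "CViT-M": (69.9, 173),
    "EfficientViT-M2": (70.8, 201),
    "CViT-L": (73.0, 249),
    "EfficientViT-M4": (74.3, 299),
    "CViT-XL": (75.5, 435),
    "EfficientViT-M5": (77.1, 522),
}
apf = {name: round(top1 / math.log10(flops_m), 1)
       for name, (top1, flops_m) in models.items()}
print(apf["CViT-M"], apf["EfficientViT-M5"])  # 31.2 28.4
```

The logarithmic denominator is what lets small models dominate this metric: doubling FLOPs adds only a constant to the denominator, so accuracy gains must outpace that to hold APF steady.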
CViT models consistently achieve state-of-the-art APF, especially at smaller to medium model sizes (Sivakumar et al., 18 Nov 2025).
6. Deployment, Practical Insights, and Future Directions
CViT represents a shift in the accuracy-efficiency trade-off for vision transformers, providing up to 15–20% savings in compute and energy for a 1–2% reduction in accuracy. Primary deployment targets include:
- Mobile phones and wearables: Energy and memory savings impact battery life and thermal management.
- Autonomous drones and UAVs: Lower per-frame inference energy extends operational time.
- Edge clusters: Lower memory and compute requirements suit tight power budgets.
Limitations include kernel-launch overhead (multiple small FFNs per transformer block can incur GPU latency penalties) and a potential minor reduction in global feature capacity due to chunking. Future work targets adaptive chunk sizing, intra-cascade convolutions, and integration of ultra-light attention heads to further optimize the trade-off between expressivity and efficiency (Sivakumar et al., 18 Nov 2025).