CascadedViT (CViT): Efficient Vision Transformer

Updated 28 January 2026
  • CascadedViT is a lightweight vision transformer that employs a Cascaded-Chunk Feed Forward Network and Cascaded Group Attention to reduce computational, memory, and energy demands.
  • It splits feature representations into sequential chunks and groups, substantially lowering parameters and FLOPs while preserving competitive ImageNet accuracy.
  • The design is ideal for resource-constrained environments like mobile devices, drones, and edge clusters, offering up to 15–20% compute savings with minimal accuracy trade-offs.

CascadedViT (CViT) is a lightweight, compute-efficient vision transformer architecture that introduces the Cascaded-Chunk Feed Forward Network (CCFFN) and combines it with a Cascaded Group Attention (CGA) mechanism to reduce computational, memory, and energy consumption while maintaining competitive recognition accuracy in computer vision tasks. CViT is positioned as a strong candidate for resource-constrained deployments such as on mobile devices and drones (Sivakumar et al., 18 Nov 2025).

1. Architectural Foundations

CViT integrates two principal architectural modules:

A. Cascaded-Chunk Feed Forward Network (CCFFN)

Given an input token matrix $X \in \mathbb{R}^{N \times d}$ (where $N$ is the number of tokens and $d$ the feature dimension), the CCFFN splits $X$ into $n$ equal-sized channel chunks $X_1, \ldots, X_n \in \mathbb{R}^{N \times (d/n)}$. The chunks are processed sequentially using a cascaded mechanism:

  • $X'_1 = X_1$; for $i > 1$, $X'_i = X_i + Y_{i-1}$
  • Each chunk is passed through its own FFN:

$$Y_i = \mathrm{FFN}_i(X'_i) = \sigma(X'_i W_{i,1}) W_{i,2}$$

with $W_{i,1} \in \mathbb{R}^{(d/n)\times(rd/n)}$, $W_{i,2} \in \mathbb{R}^{(rd/n)\times(d/n)}$, and nonlinear activation $\sigma$ (e.g., ReLU).

  • Outputs are concatenated: $\operatorname{Concat}(Y_1, \ldots, Y_n) \in \mathbb{R}^{N \times d}$. The cascade structure increases effective depth without additional parameters.

Parameter and FLOP efficiency is achieved as:

$$\text{Params}_{\text{CCFFN}} = 2rd^2/n$$

A standard FFN costs $2rd^2$ parameters, so $n = 2$ halves the FFN parameters and FLOPs.
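The chunk cascade above can be sketched in a few lines of NumPy. This is an illustrative reading of the equations (ReLU activation, toy sizes), not the authors' implementation; it also checks the $2rd^2/n$ parameter count:

```python
import numpy as np

def ccffn(X, W1, W2):
    """Cascaded-Chunk FFN: split channels into n chunks, feed each chunk's
    output into the next chunk's input, then concatenate.

    X  : (N, d) token matrix
    W1 : list of n matrices, each (d/n, r*d/n)
    W2 : list of n matrices, each (r*d/n, d/n)
    """
    n = len(W1)
    chunks = np.split(X, n, axis=1)               # X_1..X_n, each (N, d/n)
    outputs, prev = [], None
    for i, Xi in enumerate(chunks):
        Xi_in = Xi if prev is None else Xi + prev  # X'_i = X_i + Y_{i-1}
        Yi = np.maximum(Xi_in @ W1[i], 0) @ W2[i]  # Y_i = ReLU(X'_i W_1) W_2
        outputs.append(Yi)
        prev = Yi
    return np.concatenate(outputs, axis=1)         # (N, d)

# Hypothetical sizes: N=4 tokens, d=8 channels, expansion r=2, n=2 chunks.
rng = np.random.default_rng(0)
N, d, r, n = 4, 8, 2, 2
W1 = [rng.standard_normal((d // n, r * d // n)) for _ in range(n)]
W2 = [rng.standard_normal((r * d // n, d // n)) for _ in range(n)]
Y = ccffn(rng.standard_normal((N, d)), W1, W2)
assert Y.shape == (N, d)

# Parameter count matches 2*r*d^2/n (here: 2*2*64/2 = 128),
# half of the standard FFN's 2*r*d^2 = 256.
params = sum(w.size for w in W1 + W2)
assert params == 2 * r * d**2 // n
```

Note that the cascade is inherently sequential: chunk $i$ cannot start until $Y_{i-1}$ is available, which is the source of the kernel-launch overhead discussed in Section 6.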

B. Cascaded Group Attention (CGA)

Inherited unchanged from EfficientViT, the CGA divides channel features into gg groups, performs self-attention sequentially per group, and accumulates outputs:

  • $Q_k = X^{(k)} W^Q_k$, $K_k = X^{(k)} W^K_k$, $V_k = X^{(k)} W^V_k$
  • $A_k = \operatorname{softmax}\left(\frac{Q_k K_k^\top}{\sqrt{d_k}}\right) V_k$
  • $O_k = O_{k-1} + A_k$, finishing with $O_g + X$

Computational cost and peak memory scale as $1/g$ of standard attention. At each step, only one group’s $Q$, $K$, $V$ is in memory, reducing buffer requirements.
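A minimal NumPy sketch of this accumulation scheme follows. It is not the EfficientViT reference implementation; in particular, mapping each group's value projection back to the full width $d$ is an assumption made here so that the running sum $O_k$ and the final residual $O_g + X$ are shape-compatible:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cga(X, WQ, WK, WV):
    """Cascaded Group Attention sketch: channels split into g groups,
    self-attention run per group sequentially, group outputs accumulated,
    then a residual connection back to X."""
    N, d = X.shape
    g = len(WQ)
    dk = WQ[0].shape[1]                    # per-group key dimension
    groups = np.split(X, g, axis=1)        # X^(1)..X^(g), each (N, d/g)
    O = np.zeros((N, d))
    for k in range(g):
        # Only this group's Q, K, V are materialized at any one time.
        Q, K, V = groups[k] @ WQ[k], groups[k] @ WK[k], groups[k] @ WV[k]
        A = softmax(Q @ K.T / np.sqrt(dk)) @ V   # A_k, shape (N, d)
        O = O + A                                # O_k = O_{k-1} + A_k
    return O + X                                 # final residual O_g + X

# Hypothetical sizes: N=4 tokens, d=8 channels, g=2 groups, key dim dk=4.
rng = np.random.default_rng(1)
N, d, g, dk = 4, 8, 2, 4
WQ = [rng.standard_normal((d // g, dk)) for _ in range(g)]
WK = [rng.standard_normal((d // g, dk)) for _ in range(g)]
WV = [rng.standard_normal((d // g, d)) for _ in range(g)]
out = cga(rng.standard_normal((N, d)), WQ, WK, WV)
assert out.shape == (N, d)
```

Because the loop touches one group at a time, peak activation memory for $Q$, $K$, $V$ is roughly $1/g$ of a full-width attention layer, at the cost of sequential rather than parallel group execution.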

2. Computational Complexity and Memory Analysis

A. Parameter and FLOP Savings

CViT achieves substantial reductions in both parameter count and FLOPs:

| Model | Params (M) | FLOPs (M) | Parameter Reduction | FLOP Reduction |
|---|---|---|---|---|
| EfficientViT-M5 | 12.4 | 522 | – | – |
| CViT-XL | 9.8 | 435 | 21% | 16.7% |

Relative to standard ViT-Base models (≈86M params, ≈17G FLOPs), CViT reports >80% reductions (Sivakumar et al., 18 Nov 2025).

B. Memory Efficiency

The per-chunk and per-group computation scheme reduces both live and reserved memory footprints due to the local accumulation of intermediate results and sequential processing, decreasing DRAM accesses and energy costs.

3. Energy Consumption and Empirical Resource Use

Energy efficiency was evaluated using power sampling protocols on the Apple M4 Pro GPU. For each image:

$$E = P_{\text{avg}} \cdot \frac{T_{\text{total}}}{\#\text{images}}$$

Empirical results:

  • CViT-XL: $653 \pm 16$ mJ/img
  • EfficientViT-M5: $675 \pm 23$ mJ/img

This constitutes a 3.3% reduction in energy per image. A plausible implication is that such savings, while modest per operation, are operationally significant for continuously running systems or battery-powered hardware (Sivakumar et al., 18 Nov 2025).
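The per-image energy formula and the relative saving can be checked directly; the power and time arguments below are placeholders, while the mJ/img means are the reported values:

```python
def energy_per_image(p_avg_w, t_total_s, n_images):
    """E = P_avg * T_total / #images, converted from joules to millijoules."""
    return p_avg_w * t_total_s / n_images * 1000.0

# Example with hypothetical measurements: 10 W average draw,
# 65.3 s to process 1000 images -> 653 mJ/img.
assert energy_per_image(10.0, 65.3, 1000) == 653.0

# Relative saving from the reported per-image means:
cvit_xl, effvit_m5 = 653.0, 675.0            # mJ/img
saving = (effvit_m5 - cvit_xl) / effvit_m5 * 100.0
print(f"{saving:.1f}% less energy per image")  # prints "3.3% less energy per image"
```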

4. Experimental Performance

CViT models are benchmarked on ImageNet-1K, demonstrating competitive accuracy and superior efficiency:

| Model | Top-1 (%) | FLOPs (M) | Params (M) | Energy (mJ/img) |
|---|---|---|---|---|
| CViT-L | 73.0 | 249 | 7.0 | 588 ± 42 |
| EfficientViT-M4 | 74.3 | 299 | 8.8 | 620 ± 45 |
| CViT-XL | 75.5 | 435 | 9.8 | 653 ± 16 |
| EfficientViT-M5 | 77.1 | 522 | 12.4 | 675 ± 23 |

CViT-L delivers Top-1 accuracy 2.2 percentage points higher than EfficientViT-M2 (73.0% vs. 70.8%) with a comparable Accuracy-Per-FLOP (APF) score.

5. Compute Efficiency: The APF Metric

To holistically quantify compute efficiency, the metric Accuracy-Per-FLOP (APF) is introduced:

$$\mathrm{APF} = \frac{\text{Top-1 Accuracy (\%)}}{\log_{10}(\text{FLOPs (M)})}$$

Example APF values:

| Model | Top-1 (%) | FLOPs (M) | APF |
|---|---|---|---|
| CViT-M | 69.9 | 173 | 31.2 |
| EfficientViT-M2 | 70.8 | 201 | 30.7 |
| CViT-L | 73.0 | 249 | 30.5 |
| EfficientViT-M4 | 74.3 | 299 | 30.0 |
| CViT-XL | 75.5 | 435 | 28.6 |
| EfficientViT-M5 | 77.1 | 522 | 28.4 |
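The APF values in the table follow directly from the formula; a few lines of Python reproduce them from the reported Top-1 and FLOP numbers:

```python
import math

def apf(top1_pct, flops_m):
    """Accuracy-Per-FLOP: Top-1 accuracy (%) / log10(FLOPs in millions)."""
    return top1_pct / math.log10(flops_m)

# (Top-1 %, FLOPs in M) as reported for each model.
models = {
    "CViT-M":          (69.9, 173),
    "EfficientViT-M2": (70.8, 201),
    "CViT-L":          (73.0, 249),
    "EfficientViT-M4": (74.3, 299),
    "CViT-XL":         (75.5, 435),
    "EfficientViT-M5": (77.1, 522),
}
for name, (top1, flops) in models.items():
    print(f"{name:16s} APF = {apf(top1, flops):.1f}")
```

The logarithm in the denominator compresses the FLOP axis, so APF rewards models that hold accuracy while cutting compute by large multiplicative factors; this is why the smaller CViT variants score highest.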

CViT models consistently achieve state-of-the-art APF, especially at smaller to medium model sizes (Sivakumar et al., 18 Nov 2025).

6. Deployment, Practical Insights, and Future Directions

CViT represents a shift in the accuracy-efficiency trade-off for vision transformers, providing up to 15–20% savings in compute and energy for a 1–2% reduction in accuracy. Primary deployment targets include:

  • Mobile phones and wearables: Energy and memory savings impact battery life and thermal management.
  • Autonomous drones and UAVs: Lower per-frame inference energy extends operational time.
  • Edge clusters: Lower memory and compute requirements suit tight power budgets.

Limitations include kernel-launch overhead, as multiple small FFNs per transformer block can incur GPU latency penalties, and potential minor reduction in global feature capacity due to chunking. Future work is aimed at adaptive chunk sizing, intra-cascade convolutions, and integration of ultra-light attention heads to further optimize the trade-off between expressivity and efficiency (Sivakumar et al., 18 Nov 2025).
