CascadedViT (CViT): Efficient Vision Transformer
- CascadedViT is a lightweight vision transformer that employs a Cascaded-Chunk Feed Forward Network and Cascaded Group Attention to reduce computational, memory, and energy demands.
- It splits feature representations into sequential chunks and groups, substantially lowering parameters and FLOPs while preserving competitive ImageNet accuracy.
- The design is ideal for resource-constrained environments like mobile devices, drones, and edge clusters, offering up to 15–20% compute savings with minimal accuracy trade-offs.
CascadedViT (CViT) is a lightweight, compute-efficient vision transformer architecture that introduces the Cascaded-Chunk Feed Forward Network (CCFFN) and combines it with a Cascaded Group Attention (CGA) mechanism to reduce computational, memory, and energy consumption while maintaining competitive recognition accuracy in computer vision tasks. CViT is positioned as a strong candidate for resource-constrained deployments such as on mobile devices and drones (Sivakumar et al., 18 Nov 2025).
1. Architectural Foundations
CViT integrates two principal architectural modules:
A. Cascaded-Chunk Feed Forward Network (CCFFN)
Given an input token matrix $X \in \mathbb{R}^{N \times D}$ (where $N$ is the number of tokens and $D$ the feature dimension), the CCFFN splits $X$ along the channel dimension into $k$ equal-sized chunks $X = [X_1, \dots, X_k]$, each $X_i \in \mathbb{R}^{N \times D/k}$. The chunks are processed sequentially using a cascaded mechanism:
- $\tilde{X}_1 = X_1$; for $i > 1$, $\tilde{X}_i = X_i + Y_{i-1}$, so each chunk's input is augmented with the previous chunk's output.
- Each chunk is passed through its own FFN: $Y_i = \sigma(\tilde{X}_i W_1^{(i)}) W_2^{(i)}$, with $W_1^{(i)} \in \mathbb{R}^{(D/k) \times r(D/k)}$, $W_2^{(i)} \in \mathbb{R}^{r(D/k) \times (D/k)}$ ($r$ the expansion ratio), and nonlinear activation $\sigma$ (e.g., ReLU).
- Outputs are concatenated: $Y = \mathrm{Concat}[Y_1, \dots, Y_k]$. The cascade structure increases effective depth without additional parameters.
Parameter and FLOP efficiency follows because each chunk FFN operates on only $D/k$ channels: a standard FFN costs $2rD^2$ parameters, while the $k$ chunk FFNs together cost $k \cdot 2r(D/k)^2 = 2rD^2/k$, so $k = 2$ halves the FFN parameters and FLOPs.
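The chunked cascade can be sketched in NumPy. This is a minimal illustration, not the reference implementation: the function name, tensor sizes, and the exact cascade rule (adding the previous chunk's output to the next chunk's input) are assumptions based on the description above.

```python
import numpy as np

def ccffn(X, weights, k):
    """Cascaded-Chunk FFN sketch: split X (N x D) into k channel chunks,
    run each through its own small FFN, and feed each chunk's output
    into the next chunk's input (the cascade)."""
    chunks = np.split(X, k, axis=1)                    # k chunks of shape (N, D/k)
    outputs, prev = [], None
    for Xi, (W1, W2) in zip(chunks, weights):
        Xi_tilde = Xi if prev is None else Xi + prev   # cascade: add previous output
        Yi = np.maximum(Xi_tilde @ W1, 0) @ W2         # per-chunk FFN with ReLU
        outputs.append(Yi)
        prev = Yi
    return np.concatenate(outputs, axis=1)             # back to (N, D)

# Illustrative sizes: 4 tokens, D=8 channels, k=2 chunks, expansion r=2
rng = np.random.default_rng(0)
N, D, k, r = 4, 8, 2, 2
d = D // k
weights = [(rng.standard_normal((d, r * d)), rng.standard_normal((r * d, d)))
           for _ in range(k)]
X = rng.standard_normal((N, D))
Y = ccffn(X, weights, k)
print(Y.shape)  # (4, 8)
```

Note that the total parameter count of the two chunk FFNs here is $2rD^2/k = 128$, half of the $2rD^2 = 256$ a monolithic FFN with the same expansion ratio would use.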
B. Cascaded Group Attention (CGA)
Inherited unchanged from EfficientViT, the CGA divides the channel features into $g$ groups, performs self-attention sequentially per group, and accumulates outputs:
- $\tilde{X}_1 = X_1$; for $j > 1$, $\tilde{X}_j = X_j + \hat{X}_{j-1}$,
- $\hat{X}_j = \mathrm{Attn}(\tilde{X}_j W_j^Q,\, \tilde{X}_j W_j^K,\, \tilde{X}_j W_j^V)$, finishing with $\hat{X} = \mathrm{Concat}[\hat{X}_1, \dots, \hat{X}_g]\, W^P$
Computational cost and peak memory scale as $1/g$ of standard full-channel attention. At each step, only one group's $Q$, $K$, and $V$ tensors are in memory, reducing buffer requirements.
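A minimal NumPy sketch of one cascaded-group-attention step follows. The per-group projection matrices `Wq`, `Wk`, `Wv` and all sizes are illustrative assumptions, and the final output projection is omitted for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cascaded_group_attention(X, Wq, Wk, Wv, g):
    """Split channels into g groups and attend sequentially, adding each
    group's output to the next group's input (the cascade). Only one
    group's Q/K/V is materialized at a time."""
    groups = np.split(X, g, axis=1)
    outs, prev = [], None
    for j, Xj in enumerate(groups):
        Xj_tilde = Xj if prev is None else Xj + prev          # cascade input
        Q, K, V = Xj_tilde @ Wq[j], Xj_tilde @ Wk[j], Xj_tilde @ Wv[j]
        A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))            # (N, N) scores, this group only
        prev = A @ V
        outs.append(prev)
    return np.concatenate(outs, axis=1)                       # (N, D); projection omitted

# Illustrative sizes: 5 tokens, D=12 channels, g=3 groups
rng = np.random.default_rng(0)
N, D, g = 5, 12, 3
d = D // g
Wq = [rng.standard_normal((d, d)) for _ in range(g)]
Wk = [rng.standard_normal((d, d)) for _ in range(g)]
Wv = [rng.standard_normal((d, d)) for _ in range(g)]
X = rng.standard_normal((N, D))
out = cascaded_group_attention(X, Wq, Wk, Wv, g)
print(out.shape)  # (5, 12)
```

Because each attention score matrix is computed over a $D/g$-wide slice and discarded before the next group runs, peak activation memory tracks a single group rather than the full channel width.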
2. Computational Complexity and Memory Analysis
A. Parameter and FLOP Savings
CViT achieves substantial reductions in both parameter count and FLOPs:
| Model | Params (M) | FLOPs (M) | Parameter Reduction | FLOP Reduction |
|---|---|---|---|---|
| EfficientViT-M5 | 12.4 | 522 | - | - |
| CViT-XL | 9.8 | 435 | 21% | 16.7% |
Relative to a standard ViT-Base model (86M params, 17G FLOPs), CViT reports parameter and FLOP reductions exceeding 80% (Sivakumar et al., 18 Nov 2025).
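Both reduction columns in the table above are simple ratios of the raw counts:

```python
# Relative savings of CViT-XL over EfficientViT-M5, from the table above.
params_m5, params_xl = 12.4, 9.8   # millions of parameters
flops_m5, flops_xl = 522, 435      # MFLOPs

param_red = (params_m5 - params_xl) / params_m5 * 100
flop_red = (flops_m5 - flops_xl) / flops_m5 * 100
print(f"{param_red:.0f}% params, {flop_red:.1f}% FLOPs")  # 21% params, 16.7% FLOPs
```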
B. Memory Efficiency
The per-chunk and per-group computation scheme reduces both live and reserved memory footprints due to the local accumulation of intermediate results and sequential processing, decreasing DRAM accesses and energy costs.
3. Energy Consumption and Empirical Resource Use
Energy efficiency was evaluated using power sampling protocols on the Apple M4 Pro GPU: for each run, sampled power draw is integrated over the inference interval and divided by the number of processed images to yield energy per image.
Empirical results:
- CViT-XL: 653 ± 16 mJ/img
- EfficientViT-M5: 675 ± 23 mJ/img
This constitutes a 3.3% reduction in energy per image. A plausible implication is that such savings, while modest per operation, are operationally significant for continuously running systems or battery-powered hardware (Sivakumar et al., 18 Nov 2025).
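A sketch of this per-image energy accounting, with made-up sample values (the sampling interval, power readings, and image count below are purely illustrative):

```python
# Per-image energy estimate from sampled GPU power readings.
def energy_per_image_mj(power_samples_w, dt_s, n_images):
    total_j = sum(p * dt_s for p in power_samples_w)  # integrate power over time
    return 1e3 * total_j / n_images                   # J -> mJ per image

# Hypothetical: 10 W sustained for 2 s (200 samples at 10 ms) across 20 images
e = energy_per_image_mj([10.0] * 200, 0.01, 20)
print(round(e))  # 1000

# Relative saving implied by the reported per-image energies (mJ/img)
saving_pct = (675 - 653) / 675 * 100
print(round(saving_pct, 1))  # 3.3
```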
4. Experimental Performance
CViT models are benchmarked on ImageNet-1K, demonstrating competitive accuracy and superior efficiency:
| Model | Top-1 (%) | FLOPs (M) | Params (M) | Energy (mJ/img) |
|---|---|---|---|---|
| CViT-L | 73.0 | 249 | 7.0 | 588 ± 42 |
| EfficientViT-M4 | 74.3 | 299 | 8.8 | 620 ± 45 |
| CViT-XL | 75.5 | 435 | 9.8 | 653 ± 16 |
| EfficientViT-M5 | 77.1 | 522 | 12.4 | 675 ± 23 |
CViT-L delivers Top-1 accuracy 2.2 percentage points higher than EfficientViT-M2 (73.0% vs. 70.8%) with a comparable Accuracy-Per-FLOP (APF) score.
5. Compute Efficiency: The APF Metric
To holistically quantify compute efficiency, the metric Accuracy-Per-FLOP (APF) is introduced, normalizing accuracy by the logarithm of the compute cost:

$$\mathrm{APF} = \frac{\text{Top-1 accuracy (\%)}}{\log_{10}(\text{FLOPs in MFLOPs})}$$
Example APF values:
| Model | Top-1 (%) | FLOPs (M) | APF |
|---|---|---|---|
| CViT-M | 69.9 | 173 | 31.2 |
| EfficientViT-M2 | 70.8 | 201 | 30.7 |
| CViT-L | 73.0 | 249 | 30.5 |
| EfficientViT-M4 | 74.3 | 299 | 30.0 |
| CViT-XL | 75.5 | 435 | 28.6 |
| EfficientViT-M5 | 77.1 | 522 | 28.4 |
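Every APF value in the table above can be reproduced by dividing Top-1 accuracy by $\log_{10}$ of the MFLOP count:

```python
import math

# Reproduce the APF column: Top-1 (%) / log10(FLOPs in MFLOPs).
models = {
    "CViT-M": (69.9, 173),
    "EfficientViT-M2": (70.8, 201),
    "CViT-L": (73.0, 249),
    "EfficientViT-M4": (74.3, 299),
    "CViT-XL": (75.5, 435),
    "EfficientViT-M5": (77.1, 522),
}
apf = {name: round(top1 / math.log10(flops_m), 1)
       for name, (top1, flops_m) in models.items()}
print(apf["CViT-M"], apf["EfficientViT-M5"])  # 31.2 28.4
```

The logarithmic denominator is what lets small models dominate this metric: doubling FLOPs adds only a constant to the denominator, so accuracy gains must outpace that to hold APF steady.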
CViT models consistently achieve state-of-the-art APF, especially at smaller to medium model sizes (Sivakumar et al., 18 Nov 2025).
6. Deployment, Practical Insights, and Future Directions
CViT represents a shift in the accuracy-efficiency trade-off for vision transformers, providing up to 15–20% savings in compute and energy for a 1–2% reduction in accuracy. Primary deployment targets include:
- Mobile phones and wearables: Energy and memory savings impact battery life and thermal management.
- Autonomous drones and UAVs: Lower per-frame inference energy extends operational time.
- Edge clusters: Lower memory and compute requirements suit tight power budgets.
Limitations include kernel-launch overhead (multiple small FFNs per transformer block can incur GPU latency penalties) and a potential minor reduction in global feature capacity due to chunking. Future work targets adaptive chunk sizing, intra-cascade convolutions, and integration of ultra-light attention heads to further optimize the trade-off between expressivity and efficiency (Sivakumar et al., 18 Nov 2025).