
Inverted Residual Structure in CNNs

Updated 28 December 2025
  • Inverted residual structure is an architectural pattern that reverses traditional bottlenecks, placing narrow channel boundaries around an internally expanded, high-dimensional space.
  • It utilizes lightweight depthwise convolutions and a linear projection to efficiently preserve essential feature information without nonlinear distortions.
  • Variants like sandglass and attention-guided blocks build on this design to improve gradient flow, reduce parameters, and tailor architectures for specific mobile and real-time tasks.

The inverted residual structure is an architectural design pattern for convolutional neural networks that reverses the classical residual bottleneck layout, placing narrow channel counts at the block boundaries and wide, expressive expansions internally. It is a core building block of MobileNetV2 and has since influenced a range of efficient network designs for mobile and real-time applications. Key innovations include the deployment of lightweight depthwise convolutions in high-dimensional spaces, strict linearity in final projections, and decoupling of network capacity from nonlinear expressiveness. Subsequent research elaborates on information bottlenecks, gradient dynamics, and application-specific adaptations such as channel-spatial attention.

1. Canonical Inverted Residual Block: Definition and Structure

The inverted residual with linear bottleneck, introduced in MobileNetV2 (Sandler et al., 2018), is composed of three sequential operations on an input tensor $x \in \mathbb{R}^{h \times w \times k}$, with $k$ input channels and spatial size $h \times w$:

  1. Expansion (Pointwise 1×1 Convolution + ReLU6):

$$u = \mathrm{ReLU6}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(x; W_e))), \quad u \in \mathbb{R}^{h \times w \times (tk)}$$

Here, $t > 1$ is the expansion factor.

  2. Depthwise 3×3 Convolution + ReLU6:

$$v = \mathrm{ReLU6}(\mathrm{BN}(\mathrm{DepthwiseConv}_{3\times 3, s}(u; W_d))), \quad v \in \mathbb{R}^{h/s \times w/s \times (tk)}$$

Here, $s$ is the stride.

  3. Linear Projection (Pointwise 1×1 Convolution, No Activation):

$$y = \mathrm{BN}(\mathrm{Conv}_{1\times 1}(v; W_p)), \quad y \in \mathbb{R}^{h/s \times w/s \times k'}$$

A residual skip-connection is applied when $s = 1$ and $k' = k$: $\mathrm{output} = x + y$.

The critical aspect is that nonlinearity is removed from the last projection, forming a linear bottleneck which theoretically preserves the signal's essential structure.
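The three-stage layout above can be sketched in PyTorch. This is an illustrative sketch rather than the reference MobileNetV2 implementation; channel counts are arbitrary, and the ReLU6/BatchNorm placement follows the equations in this section.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual with linear bottleneck (illustrative sketch)."""
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        c_mid = t * c_in
        # Skip connection only when stride is 1 and channel counts match.
        self.use_skip = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # 1. Expansion: pointwise 1x1 conv + BN + ReLU6
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            # 2. Depthwise 3x3 conv (groups = channels) + BN + ReLU6
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            # 3. Linear projection: pointwise 1x1 conv + BN, no activation
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
```

The `groups=c_mid` argument makes the middle convolution depthwise, which is what keeps the expanded stage cheap.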

2. Theoretical Motivations: Linearity, Capacity, and Expressiveness

A linear bottleneck is essential to maintain representational power in low-dimensional outputs. Placing a nonlinearity (e.g., ReLU) after the final projection would irreversibly destroy information due to the contraction of the activation manifold in such constrained spaces, leading to degraded performance as empirically validated in ablation studies (Sandler et al., 2018). In the inverted residual structure, the main nonlinear transformations are performed in the high-dimensional, expanded space (middle of the block), where ReLU nonlinearity is mostly invertible and does not collapse significant portions of the data manifold.
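The manifold-collapse argument can be illustrated numerically: embed low-dimensional points into $n$ channels with a random linear map, apply ReLU, and measure how well a linear decoder recovers them. The sketch below is a NumPy illustration in the spirit of the MobileNetV2 experiment, not a reproduction of it; recovery degrades sharply when $n$ is small.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))  # points on a low-dimensional input manifold

def relu_roundtrip_error(x, n, rng):
    """Embed x in n dims, apply ReLU, decode with the best linear map back."""
    T = rng.standard_normal((x.shape[1], n))
    z = np.maximum(x @ T, 0.0)              # nonlinearity in the n-dim space
    W, *_ = np.linalg.lstsq(z, x, rcond=None)
    return np.mean((x - z @ W) ** 2)

err_narrow = relu_roundtrip_error(x, 2, rng)   # narrow: ReLU collapses a region
err_wide = relu_roundtrip_error(x, 32, rng)    # wide: ReLU nearly invertible
```

With only 2 channels, every point mapped negative by both random directions collapses to zero and is unrecoverable; with 32 channels, enough rectified projections survive that a linear decoder nearly inverts the ReLU.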

The inverted residual block decouples capacity from expressiveness:

  • Capacity is set by the narrow input/output bottleneck widths ($k \to k'$).
  • Expressiveness is governed by the internal expansion ratio $t$, which allows multiple channels and nonlinearities in the transformation, independent of how many features are ultimately transported through residual paths.

3. Formal Description and Computational Profile

The inverted residual block can be formulated as $F(x) = P \circ D \circ E(x)$, where:

  • $E(x)$: Expansion via $1\times 1$ conv + ReLU6,
  • $D(\cdot)$: Depthwise $3\times 3$ conv + ReLU6,
  • $P(\cdot)$: Linear projection via $1\times 1$ conv (no activation).

Parameter counts and FLOPs for spatial size $D_f$, input channels $C_{\mathrm{in}}$, expansion factor $t$, and output channels $C_{\mathrm{out}}$: $$W_{\mathrm{IR}} = C_{\mathrm{in}}\,C_{\mathrm{mid}} + 9\,C_{\mathrm{mid}} + C_{\mathrm{mid}}\,C_{\mathrm{out}}$$

$$F_{\mathrm{IR}} = D_f^2\,(C_{\mathrm{in}} C_{\mathrm{mid}} + 9\, C_{\mathrm{mid}} + C_{\mathrm{mid}} C_{\mathrm{out}})$$

where $C_{\mathrm{mid}} = t\,C_{\mathrm{in}}$ (Daquan et al., 2020).
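As a sanity check on the formulas, the counts can be computed directly. The sketch below assumes stride 1 (so all three stages share the spatial resolution $D_f$) and omits BatchNorm and bias parameters, matching the expressions above; the concrete channel sizes are illustrative.

```python
def ir_cost(c_in, c_out, t, d_f):
    """Parameters (W_IR) and multiply-adds (F_IR) of one inverted residual
    block at stride 1; BatchNorm and bias terms are omitted."""
    c_mid = t * c_in
    params = c_in * c_mid + 9 * c_mid + c_mid * c_out   # W_IR
    flops = d_f ** 2 * params                           # F_IR = D_f^2 * W_IR
    return params, flops

# Illustrative block: 24 -> 24 channels, t = 6, 56x56 feature map.
params, flops = ir_cost(c_in=24, c_out=24, t=6, d_f=56)
```

Note that the two pointwise convolutions dominate the count; the depthwise term contributes only $9 C_{\mathrm{mid}}$ weights, which is what makes operating in the expanded space affordable.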

4. Descendants and Alternatives: Sandglass Block and Attention-Guided Variants

Sandglass Block

The Sandglass block inverts the information layout: identity mapping resides in the high-dimensional space and spatial convolutions are applied to these high-dimensional features. Its structure is:

  • Initial $3\times 3$ depthwise conv (ReLU6),
  • $1\times 1$ linear bottleneck (channel reduction),
  • $1\times 1$ expansion (ReLU6),
  • Final $3\times 3$ depthwise conv (typically linear).

The shortcut bypasses high-dimensional features, theoretically alleviating information loss and improving gradient flow compared to the inverted residual block (Daquan et al., 2020). Empirical results on ImageNet classification and object detection demonstrate that sandglass structures can outperform inverted residuals at equal parameter and computation budget.
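An illustrative parameter count for the four-stage layout above, with BatchNorm omitted. Note the caveat that $t$ acts as a reduction ratio in the sandglass but an expansion ratio in the inverted residual, so the per-block figures are not a like-for-like budget comparison; the point is where each design spends its weights.

```python
def sandglass_params(c_in, c_out, t):
    """Parameters of the four-stage sandglass layout (BN omitted);
    t is the channel reduction ratio of the internal bottleneck."""
    c_mid = c_in // t
    return (9 * c_in          # initial depthwise 3x3 on wide features
            + c_in * c_mid    # 1x1 linear bottleneck (reduction)
            + c_mid * c_out   # 1x1 expansion
            + 9 * c_out)      # final depthwise 3x3 on wide features

def inverted_residual_params(c_in, c_out, t):
    """Parameters of the canonical inverted residual (t = expansion ratio)."""
    c_mid = t * c_in
    return c_in * c_mid + 9 * c_mid + c_mid * c_out

# Same 96 -> 96 interface, same factor t = 6: the sandglass keeps its
# pointwise convs at the narrow width, while its cheap depthwise convs
# act on the wide, identity-carrying features.
sg = sandglass_params(96, 96, 6)
ir = inverted_residual_params(96, 96, 6)
```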

Attention-Guided Inverted Residuals (e.g., AIR Block)

The AIR block, as elaborated in YOLO-FireAD (Pan et al., 27 May 2025), modifies the pattern by:

  • Applying an initial “compression” via $1\times 1$ conv (e.g., $r = 0.25$ reduction),
  • Depthwise $3\times 3$ conv,
  • Integrating a hybrid channel–spatial attention module (CAMT),
  • Projecting back to the original channel size via $1\times 1$ conv,
  • Adding a residual connection.

Quantitatively, the AIR block dramatically reduces parameter count and FLOPs compared to standard inverted residual blocks, e.g., in the “nano” backbone: parameter count drops from 3.01M to 1.84M (−39%), while maintaining or improving detection performance. The attention mechanism performs explicit channel and spatial gating, which has demonstrated application-specific benefits in fire detection tasks, improving detection under challenging illumination and for small objects (Pan et al., 27 May 2025).
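A structural sketch of this layout in PyTorch. Since CAMT's internals are not detailed here, a squeeze-and-excitation-style channel gate stands in for the attention module; normalization and activation placement are likewise illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Stand-in for the hybrid attention module (CAMT internals not shown
    here): a squeeze-and-excitation-style channel gate, purely illustrative."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global channel statistics
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)                        # per-channel gating

class AIRBlockSketch(nn.Module):
    """Attention-guided inverted residual, following the bullet layout above."""
    def __init__(self, c, r=0.25):
        super().__init__()
        c_mid = max(1, int(c * r))                   # initial compression
        self.body = nn.Sequential(
            nn.Conv2d(c, c_mid, 1, bias=False),      # 1x1 compression
            nn.Conv2d(c_mid, c_mid, 3, padding=1,
                      groups=c_mid, bias=False),     # depthwise 3x3
            ChannelGate(c_mid),                      # attention (stand-in)
            nn.Conv2d(c_mid, c, 1, bias=False),      # 1x1 projection back
        )

    def forward(self, x):
        return x + self.body(x)                      # residual connection
```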

5. Limitations and Theoretical Considerations

Information Loss via Linear Bottleneck

The output of the block is a low-dimensional linear projection; components of the activation orthogonal to the projection subspace are irrecoverably lost. When the expansion factor is large and the projection is aggressive (small $C_{\mathrm{out}}$), the risk of losing critical information is heightened (Daquan et al., 2020).
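This is plain linear algebra, sketched below with NumPy: when $C_{\mathrm{out}} < C_{\mathrm{mid}}$, the component of an activation orthogonal to the column space of the projection matrix produces zero output and cannot be recovered by any decoder. The dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
c_mid, c_out = 96, 16                      # wide expansion, aggressive projection
W = rng.standard_normal((c_mid, c_out))    # linear projection weights

v = rng.standard_normal(c_mid)             # high-dimensional activation
y = v @ W                                  # low-dimensional block output

# The best linear reconstruction of v from y lies in the column space of W;
# everything orthogonal to that subspace is discarded by the projection.
v_hat = y @ np.linalg.pinv(W)
lost = v - v_hat                           # nonzero, yet maps to zero output
```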

Gradient Confusion and Training Stability

Narrow bottlenecks under residual connection (as in the canonical inverted residual) can increase gradient variance, impairing stable gradient flow—an effect denoted as “gradient confusion” [(Daquan et al., 2020), citing Sankararaman et al.]. Deep skip pathways that carry only low-dimensional residuals may not adequately support optimization in very deep or overparameterized networks.

6. Practical Impacts and Empirical Performance

In the original MobileNetV2, the inverted residual structure enables strong performance-parameter trade-offs on ImageNet, COCO, and VOC, with empirical results validating theoretical design choices (e.g., linear bottleneck necessity) (Sandler et al., 2018). Later studies empirically demonstrate that alternative structures (e.g., Sandglass) may offer further performance gains or parameter reductions, especially in mobile and embedded settings:

| Model | Parameters | MAdds | Top-1 Acc. (ImageNet) |
|---|---|---|---|
| MobileNetV2 (IR) | 3.5M | 300M | 72.3% |
| Sandglass | 3.4M | 300M | 74.0% |

(Daquan et al., 2020)

AIR block deployment in YOLO-FireAD yields a parameter reduction of −39% and comparable or improved mean average precision compared to baseline networks (Pan et al., 27 May 2025). This trend demonstrates the structural flexibility and task-adaptiveness of the inverted residual principle within the broader context of efficient deep architecture design.

7. Summary and Outlook

The inverted residual structure, characterized by high-dimensional nonlinear transformation between thin bottlenecks and enabled by efficient depthwise convolutions, set a foundational template for mobile networks. Its theoretical apparatus justifies the linear output, while empirical refinements encapsulate diverse variants such as sandglass and attention-guided blocks. Ongoing research interrogates and adapts this paradigm for information retention, gradient stability, and adaptive expressiveness, substantiating its critical role in state-of-the-art lightweight and task-adaptive neural architectures (Sandler et al., 2018, Daquan et al., 2020, Pan et al., 27 May 2025).
