EfficientNetV2-S: Optimized Small CNN
- The paper introduces EfficientNetV2-S, a compact model that balances training speed, accuracy, and parameter efficiency using fused-MBConv and MBConv blocks.
- EfficientNetV2-S employs compound scaling and progressive training, enhancing convergence and performance across ImageNet and transfer learning benchmarks.
- Empirical results demonstrate strong performance on benchmarks such as CIFAR-10 and FER2013, making it well suited to resource-constrained and edge deployments.
EfficientNetV2-S is the "small" variant within the EfficientNetV2 family of convolutional neural networks, designed through a training-aware neural architecture search (NAS) to optimize for a balance of training speed, accuracy, and parameter efficiency. This model is particularly relevant for tasks where resource constraints and deployment efficiency are critical, offering substantial improvements over earlier EfficientNet and lightweight architectures. EfficientNetV2-S incorporates fused-MBConv and MBConv blocks, compound scaling, and progressive training regularization, resulting in state-of-the-art performance across a broad range of transfer learning and edge-computing benchmarks.
1. Architectural Specification
EfficientNetV2-S employs a stage-based design discovered via NAS, using a combination of convolutional block types and stage-dependent hyperparameters. The architecture consists of a stem, a sequence of fused-MBConv and MBConv stages, and a head with global average pooling and a projection to classification logits.
| Stage | Operator | Stride | Channels | Repeats |
|---|---|---|---|---|
| 0 | Conv3×3 | 2 | 24 | 1 |
| 1 | Fused-MBConv (e=1, k=3) | 1 | 24 | 2 |
| 2 | Fused-MBConv (e=4, k=3) | 2 | 48 | 4 |
| 3 | Fused-MBConv (e=4, k=3) | 2 | 64 | 4 |
| 4 | MBConv (e=4, k=3, SE=0.25) | 2 | 128 | 6 |
| 5 | MBConv (e=6, k=3, SE=0.25) | 1 | 160 | 9 |
| 6 | MBConv (e=6, k=3, SE=0.25) | 2 | 256 | 15 |
| 7 | Conv1×1 → Pool → FC | – | 1280 | 1 |
- Each fused-MBConv block replaces the separate expansion and depthwise convolutions with a single regular 3×3 convolution, accelerating early-stage computation.
- The MBConv blocks employ squeeze-and-excitation (SE) attention with a typical reduction ratio of 4 and utilize a linear bottleneck structure with SiLU/Swish activation.
With depth and width multipliers set to 1.0, EfficientNetV2-S contains approximately 22–24 million parameters, depending on the framework and any additional classification head, and requires 8.8 GFLOPs for a single ImageNet inference (Tan et al., 2021).
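The stride column of the stage table above fixes each stage's spatial resolution. A minimal sketch that walks the strides (assuming a 224×224 input and "same" padding; the stride list is read directly from the table, with the head keeping stride 1):

```python
# Walk the per-stage strides from the EfficientNetV2-S stage table to derive
# the spatial resolution of each stage's output feature map.
def stage_resolutions(input_size=224):
    # Strides for stages 0-7 as listed in the table (head treated as stride 1).
    strides = [2, 1, 2, 2, 2, 1, 2, 1]
    sizes = []
    size = input_size
    for s in strides:
        size = -(-size // s)  # ceiling division, matching 'same' padding
        sizes.append(size)
    return sizes

print(stage_resolutions(224))
# -> [112, 112, 56, 28, 14, 14, 7, 7]: the final 7x7 map is globally pooled.
```

The 7×7 output of stage 6 is what the 1×1 convolution and global average pooling in stage 7 reduce to the 1280-dimensional feature vector.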
2. Compound Scaling and Progressive Training
EfficientNetV2-S adopts compound scaling to systematically balance model depth, width, and input image resolution, governed by separate scaling coefficients for each of the three dimensions. For V2-S, the base training resolution is 300×300 pixels (224×224 in some implementations), and the architecture omits additional scaling (depth and width multipliers of 1.0). Later EfficientNetV2 variants (M/L) are derived by increasing these coefficients and scaling channels, layers, and image size accordingly.
Progressive learning is a critical training paradigm for EfficientNetV2. The training process is partitioned into multiple stages in which the image size S_i and the regularization strength Φ_i increase linearly, interpolated between initial and final values. For V2-S, image size progresses from 128 to 300, dropout rate from 0.10 to 0.30, and RandAugment magnitude from 5 to 15 over four stages. This method improves both convergence and generalization, offsetting the accuracy drop associated with rapid image-size ramp-up (Tan et al., 2021).
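The linear ramp described above can be sketched as a pure function of the stage index (a minimal illustration; the four-stage split and the endpoint values follow the V2-S figures quoted in the text):

```python
# Linearly interpolate image size, dropout rate, and RandAugment magnitude
# across the four progressive-training stages used for EfficientNetV2-S.
def progressive_schedule(stage, n_stages=4,
                         size=(128, 300), dropout=(0.10, 0.30), ra_mag=(5, 15)):
    t = stage / (n_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    lerp = lambda lo, hi: lo + (hi - lo) * t
    return (round(lerp(*size)), round(lerp(*dropout), 2), round(lerp(*ra_mag)))

for s in range(4):
    print(s, progressive_schedule(s))
```

Each stage trains for an equal share of the epoch budget at its interpolated settings, so early epochs see small, lightly regularized images and later epochs see full-size, heavily regularized ones.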
3. Training Hyperparameters and Optimization
EfficientNetV2-S is typically trained with the following configuration:
- Optimizer: RMSProp (momentum 0.9, decay 0.9), batch size 4096 (distributed over 32 TPUv3 cores), weight decay 1e-5, batch norm momentum 0.99, and EMA decay 0.9999.
- Learning Rate: Warm-up from 0 to 0.256 within the initial epochs, followed by multiplicative decay lr(t) = 0.256 × 0.97^(t/2.4), i.e., decay by a factor of 0.97 every 2.4 epochs, where t is the epoch index.
- Regularization: Dropout (progressively ramped), RandAugment, Mixup, and stochastic depth (survival probability 0.8). No separate warm-up is required under cosine annealing in some transfer learning uses (Farabi et al., 3 Oct 2025).
- Epochs: Up to 350 for ImageNet-scale pretraining. For downstream transfer learning (e.g., CIFAR, FER2013), epochs may be reduced to 50–100 with early stopping (Shahriar, 6 May 2025, Farabi et al., 3 Oct 2025).
Data augmentation includes random cropping/resizing, horizontal flip, rotation, color jitter, and normalization by ImageNet mean/std. For resource-constrained tasks, AutoAugment or lightweight augmentations are generally sufficient.
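The warm-up plus multiplicative decay above can be expressed as a pure function of the epoch index (a minimal sketch; the 5-epoch warm-up length is an assumption for illustration, while the 0.256 peak and the 0.97-per-2.4-epochs decay follow the cited configuration):

```python
def learning_rate(epoch, peak=0.256, warmup_epochs=5,
                  decay_rate=0.97, decay_epochs=2.4):
    """Linear warm-up to `peak`, then multiplicative decay:
    lr(t) = peak * decay_rate ** (t / decay_epochs) after warm-up."""
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs  # linear ramp from 0
    t = epoch - warmup_epochs
    return peak * decay_rate ** (t / decay_epochs)
```

Such a schedule is typically evaluated once per epoch (or per step, with a fractional epoch index) and fed to the RMSProp optimizer.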
4. Empirical Performance and Transferability
EfficientNetV2-S achieves state-of-the-art efficiency and accuracy across diverse benchmarks and transfer learning tasks.
- ImageNet (no extra pretraining):
- Top-1 Accuracy: 83.9%
- Params: 22M
- FLOPs: 8.8B
- Training time: 7.1 hours (32 TPUv3)
- Inference latency: 24 ms (V100 GPU) (Tan et al., 2021).
- Downstream tasks (pretraining on ImageNet followed by fine-tuning):
| Dataset | Accuracy (%) | Macro-F1 | Inference (s/img) | Model Size (MB) |
|---|---|---|---|---|
| CIFAR-10 | 96.53 | – | 0.000123 | 76.97 |
| CIFAR-100 | 90.82 | – | 0.000157 | 77.46 |
| Tiny ImageNet | 76.87 | – | 0.000109 | 79.80 |
| FER2013 (InsideOut) | 62.8 | 0.590 | – | – (≈24M params) |
Transfer learning leads to accelerated convergence and improved accuracy—for instance, pretrained EfficientNetV2-S exceeds scratch-trained variants by 3.4% absolute on CIFAR-10, and reaches stable performance in ~30 versus ~45 epochs (Shahriar, 6 May 2025). For FER2013, a task characterized by pronounced class imbalance and fine-grained differences, EfficientNetV2-S in the InsideOut framework attains 62.8% accuracy and macro-F1 of 0.590, outperforming deeper vanilla ResNet/VGG baselines of comparable or higher size (Farabi et al., 3 Oct 2025).
5. Adaptation for Imbalanced and Resource-Constrained Tasks
The architecture is frequently adapted for skewed class distributions and computational limits. In the context of InsideOut for FER2013, EfficientNetV2-S is extended with a custom classification head (dropout, fully connected layer, softmax), and trained using a class-weighted cross-entropy loss:
L_WCE = −Σ_{c=1}^{C} w_c · y_c · log(ŷ_c), where w_c ∝ 1/n_c (with n_c the sample count of class c, normalized so that Σ_c w_c = C), y_c is the one-hot true label, and ŷ_c the predicted probability for class c.
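A minimal dependency-free sketch of this class-weighted loss (the inverse-frequency weights and their normalization follow the description above; the class counts are illustrative, not the real FER2013 distribution):

```python
import math

def class_weights(counts):
    """w_c proportional to 1/n_c, normalized so the C weights sum to C."""
    inv = [1.0 / n for n in counts]
    scale = len(counts) / sum(inv)
    return [w * scale for w in inv]

def weighted_cross_entropy(probs, label, weights):
    """-sum_c w_c * y_c * log(p_c), with the one-hot label given as an index."""
    return -weights[label] * math.log(probs[label])

# Illustrative imbalanced counts: rarer classes receive larger weights.
w = class_weights([8000, 4000, 1000, 500])
loss = weighted_cross_entropy([0.7, 0.1, 0.1, 0.1], label=3, weights=w)
```

Because the weights average to 1, the weighted loss stays on the same scale as unweighted cross-entropy while up-weighting minority-class errors.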
For low-memory or bandwidth-limited deployment, recommendations include model pruning (removing up to 20% of channels in late stages at <1% accuracy loss), post-training quantization (e.g., to 8-bit integer precision), and freezing early layers during quantization. MobileNetV3-S may be considered as a lower-memory alternative when accuracy constraints are relaxed (Shahriar, 6 May 2025).
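The post-training quantization mentioned above can be illustrated with a simple affine (scale/zero-point) uint8 scheme (a generic sketch of the technique, not an EfficientNetV2-specific or framework-specific implementation):

```python
def quantize(values, num_bits=8):
    """Affine post-training quantization of a flat float tensor to
    unsigned integers: q = round(x / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(x / scale) + zero_point)) for x in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.42, 0.0, 0.13, 0.5, -0.08]
q, s, zp = quantize(weights)
recovered = dequantize(q, s, zp)  # within half a quantization step of input
```

Per-tensor (or per-channel) scales like this are what 8-bit post-training quantization stores alongside the integer weights, cutting the fp32 footprint roughly fourfold.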
6. Comparative Analysis and Deployment
EfficientNetV2-S demonstrates favorable trade-offs compared to other lightweight models:
| Model | CIFAR-10 Acc. (%) | GFLOPs | Size (MB) | Tiny IN Acc. (%) |
|---|---|---|---|---|
| EffNetV2-S | 96.53 | 0.01 | 76.97 | 76.87 |
| MobileNetV3-S | 95.49 | 0.01 | 5.89 | 72.54 |
| ResNet18 | 96.05 | 0.04 | 42.65 | 67.67 |
| SqueezeNet | 84.48 | 0.00 | 2.78 | 20.50 |
| ShuffleNetV2 | 95.83 | 0.00 | 4.82 | 65.23 |
EfficientNetV2-S consistently achieves the highest accuracy at moderate computational cost. However, its memory footprint is the largest among lightweight contenders, suggesting that quantization or pruning may be necessary for microcontroller or edge deployments. For accuracy-critical applications (e.g., medical imaging), pretrained EfficientNetV2-S is recommended; for strict memory constraints, a quantized or pruned version is advocated (Shahriar, 6 May 2025).
7. Application-Specific Optimizations: Case Study—InsideOut FER Framework
When applied to facial emotion recognition on FER2013 in the InsideOut framework, EfficientNetV2-S is integrated as a backbone with a domain-specific classification head and robust data augmentation pipeline. Training employs transfer learning from ImageNet, progressive unfreezing of backbone layers, Adam optimizer (β₁=0.9, β₂=0.999), and a cosine-annealed learning rate schedule:
η_t = η_min + ½(η_max − η_min)(1 + cos(πt/T)),
where t is the current epoch and T is the total epoch budget. Data augmentation includes random resized cropping, horizontal flipping, rotation, color jitter, and normalization. This pipeline enables the model to handle inter-class imbalance and subtle expression variability, yielding competitive performance relative to heavier CNNs with substantially larger parameter counts (Farabi et al., 3 Oct 2025).
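The cosine-annealed schedule can be sketched as follows (the η_max and η_min defaults here are illustrative placeholders, not values from the cited work):

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-6):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

The rate starts at η_max, decays slowly at first, fastest at mid-training, and flattens out at η_min by the final epoch, which is why a separate warm-up phase is often omitted in transfer-learning runs.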
A plausible implication is that the EfficientNetV2-S design—balancing fused-MBConv acceleration, compound scaling, and tailored training strategies—positions it as a general-purpose backbone for both standard vision benchmarks and practical, real-world deployments in constrained settings.