EfficientNet Architecture Family
- EfficientNet is a family of CNN architectures designed for high classification accuracy and parameter efficiency through systematic scaling and neural architecture search.
- EfficientNet-V1 introduced compound scaling with a unified baseline, while EfficientNet-V2 refined this approach with training-aware search and fused block designs.
- Both generations achieve state-of-the-art ImageNet accuracy and excellent transfer learning performance while reducing computational cost and model size.
EfficientNet is a family of convolutional neural network (CNN) architectures optimized for classification accuracy, parameter efficiency, and training/inference speed, based on systematic network scaling and neural architecture search (NAS). Originating from work by Tan and Le at Google Research, EfficientNet has two primary generations: EfficientNetV1, which introduced compound scaling and a new baseline discovered via NAS, and EfficientNetV2, which advanced the methodology with training-aware architecture search, new fused block designs, and progressive training methods. Both generations have become standard references for high-performance, resource-efficient ConvNet design in computer vision applications (Tan et al., 2019, Tan et al., 2021).
1. Neural Architecture Search and EfficientNet-B0 Baseline
The foundation of the EfficientNet family is the EfficientNet-B0 architecture, obtained through neural architecture search on a variant of the MnasNet search space. The search space includes:
- Operator choices: mobile inverted bottleneck convolutions (MBConv), with varying kernel sizes (3×3, 5×5), expansion ratios (1 or 6), optional squeeze-and-excitation (SE) modules, and residual skips.
- Stage-wise variation: each of the seven MBConv stages (between the initial 3×3 conv stem and the final head) can select its own operator configuration and number of block repeats.
The NAS process optimizes a multi-objective reward, ACC(m) × (FLOPS(m)/T)^w, where T = 400M is the FLOP target and w = −0.07, enforcing a trade-off between accuracy and computational cost. The resulting architecture (EfficientNet-B0) forms the template for subsequent scaling (Tan et al., 2019).
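The multi-objective reward can be sketched in a few lines of Python; the target T = 400M FLOPs and exponent w = −0.07 follow the values reported for the B0 search, and the function names are illustrative:

```python
def nas_reward(accuracy: float, flops: float,
               target_flops: float = 400e6, w: float = -0.07) -> float:
    """MnasNet-style multi-objective reward: ACC(m) * (FLOPS(m) / T)^w.

    T = 400M FLOPs and w = -0.07 follow the EfficientNet-B0 search
    (Tan et al., 2019). With w < 0, models over the FLOP budget are
    penalized and models under it are mildly rewarded.
    """
    return accuracy * (flops / target_flops) ** w

# A candidate at 2x the budget scores below its raw accuracy;
# one at half the budget scores slightly above it.
r_over = nas_reward(0.77, 800e6)
r_under = nas_reward(0.77, 200e6)
```

Because the penalty is a smooth power law rather than a hard constraint, the search can still keep a slightly over-budget model if its accuracy gain outweighs the penalty.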
The structure of EfficientNet-B0 is shown below:
| Stage | Operator | Resolution | Channels | Repeats |
|---|---|---|---|---|
| 1 | Conv3×3 | 224×224 | 32 | 1 |
| 2 | MBConv1 k3 | 112×112 | 16 | 1 |
| 3 | MBConv6 k3 | 112×112 | 24 | 2 |
| 4 | MBConv6 k5 | 56×56 | 40 | 2 |
| 5 | MBConv6 k3 | 28×28 | 80 | 3 |
| 6 | MBConv6 k5 | 14×14 | 112 | 3 |
| 7 | MBConv6 k5 | 14×14 | 192 | 4 |
| 8 | MBConv6 k3 | 7×7 | 320 | 1 |
| 9 | Conv1×1 → Pool → FC | 7×7 | 1280 | 1 |
Key operator details:
- Each MBConv block: a 1×1 expansion conv (BatchNorm + SiLU), a k×k depthwise conv (BatchNorm + SiLU), a 1×1 projection conv (BatchNorm, no activation), an SE module between the depthwise and projection steps, and a residual skip when input/output shapes match.
- All activations use SiLU/Swish: SiLU(x) = x · σ(x) = x / (1 + e^(−x)).
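The activation is simple enough to state directly; a minimal scalar implementation:

```python
import math

def silu(x: float) -> float:
    """SiLU/Swish activation: silu(x) = x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

# SiLU is smooth and non-monotonic near zero, and approaches the
# identity for large positive inputs (sigmoid(x) -> 1).
```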
2. Compound Model Scaling
EfficientNet introduces "compound scaling," a principled method to simultaneously scale the three architectural dimensions, network depth d, width w, and input resolution r, controlled by a single user-specified coefficient φ:

depth d = α^φ, width w = β^φ, resolution r = γ^φ, with α, β, γ ≥ 1.

The base coefficients are chosen such that α · β² · γ² ≈ 2 (approximately doubling FLOPs for each unit increase in φ), reflecting the cost structure of conv layers: FLOPs grow linearly with depth but quadratically with width and resolution. For EfficientNet, a small grid search yields α = 1.2, β = 1.1, γ = 1.15. Integral rounding is applied to block counts and resolution.
This compound method outperforms traditional strategies that scale only a single axis, avoiding diminishing returns caused by excessive depth, width, or resolution alone. Empirically, uniform compound scaling delivers substantial accuracy improvements for fixed computation budgets (Tan et al., 2019).
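The scaling rule reduces to a few lines of arithmetic; a sketch using the grid-searched base coefficients from the paper:

```python
# Grid-searched base coefficients for depth, width, resolution
# (Tan et al., 2019).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: int):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# alpha * beta^2 * gamma^2 ~= 2, so each +1 in phi roughly doubles FLOPs:
# conv FLOPs grow linearly in depth and quadratically in width/resolution.
flops_factor_per_step = ALPHA * BETA ** 2 * GAMMA ** 2  # ~1.92
```

In practice the resulting depth multiplier is rounded to integer block counts and the resolution to a convenient input size, which is why the B1–B7 models in the table below deviate slightly from the exact powers.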
3. EfficientNet-V1 and V2 Families: Model Variants and Scaling Progression
The original EfficientNet-V1 family comprises B0 to B7, with each successive model scaled according to the compound coefficient φ:
| Model | φ | Input | Params | FLOPs | Top-1 Acc |
|---|---|---|---|---|---|
| B0 | 0 | 224×224 | 5.3M | 0.39B | 77.1% |
| B1 | 1 | 240×240 | 7.8M | 0.70B | 79.1% |
| B2 | 2 | 260×260 | 9.2M | 1.0B | 80.1% |
| B3 | 3 | 300×300 | 12M | 1.8B | 81.6% |
| B4 | 4 | 380×380 | 19M | 4.2B | 82.9% |
| B5 | 5 | 456×456 | 30M | 9.9B | 83.6% |
| B6 | 6 | 528×528 | 43M | 19B | 84.0% |
| B7 | 7 | 600×600 | 66M | 37B | 84.3% |
Regularization (dropout, stochastic depth, AutoAugment) increases with model scale. At ImageNet scale, B7 achieves 84.3% top-1 accuracy, matching far larger models (e.g., the 557M-parameter GPipe-trained AmoebaNet) with approximately an 8.4× reduction in size and a 6.1× increase in inference speed (Tan et al., 2019).
EfficientNetV2 refines model scaling by capping the maximum image resolution at 480 and heuristically adding more blocks to later stages. This reduces over-parameterization and memory cost at high input resolutions while improving parameter efficiency by adding layers where they are most useful (Tan et al., 2021).
4. Architectural Innovations: Fused-MBConv and Operator Search
EfficientNetV2 introduces the Fused-MBConv module. In standard MBConv, a 1×1 expansion convolution precedes a k×k depthwise convolution; Fused-MBConv merges these into a single regular k×k convolution, followed by the optional SE module and 1×1 projection. With input channels C_in, expansion ratio e (expanded width C_mid = e·C_in), kernel size k, and output channels C_out, the approximate weight counts are k²·C_in·C_mid + C_mid·C_out for Fused-MBConv, versus C_in·C_mid (expand) + k²·C_mid (depthwise) + C_mid·C_out (project) for MBConv.
Fused-MBConv reduces memory-access overhead in early stages, where spatial dimensions are large and channel counts are small. EfficientNetV2's architecture search operates in a space containing both MBConv and Fused-MBConv blocks, expansion ratios {1, 4, 6}, kernel sizes {3×3, 5×5}, and stage-wise combinations of these. This approach systematically selects block types and placements for training and inference efficiency on modern accelerators (Tan et al., 2021).
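The parameter trade-off between the two block types can be checked with a short calculator. This is a rough sketch that ignores SE modules, BatchNorm, and biases; the counting follows the per-operator breakdown above:

```python
def mbconv_params(c_in: int, c_out: int, k: int = 3, e: int = 4) -> int:
    """Approximate weight count for MBConv:
    1x1 expand + kxk depthwise + 1x1 project (SE/BatchNorm ignored)."""
    c_mid = e * c_in
    return c_in * c_mid + k * k * c_mid + c_mid * c_out

def fused_mbconv_params(c_in: int, c_out: int, k: int = 3, e: int = 4) -> int:
    """Approximate weight count for Fused-MBConv:
    one kxk regular conv (expand + spatial mixing) + 1x1 project."""
    c_mid = e * c_in
    return k * k * c_in * c_mid + c_mid * c_out
```

The fused block always carries more parameters (the k²·C_in·C_mid term dominates as channels grow), but it replaces a memory-bound depthwise convolution with a dense convolution that runs efficiently on accelerators, which is why the V2 search places Fused-MBConv only in the early, narrow stages.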
5. Progressive Learning and Adaptive Regularization
EfficientNetV2 introduces progressive training, in which the input image size increases over training stages from a small initial size S_0 to the final target size S_e. Training at smaller sizes early accelerates convergence and reduces computation, while larger images in later stages deliver high final accuracy. To counteract the accuracy drop typical of progressive resizing, EfficientNetV2 linearly interpolates regularization strength (dropout rate, RandAugment magnitude, mixup α) in lockstep with image size: at stage i of M, the image size is S_0 + (i/(M−1)) · (S_e − S_0), and each regularization value is interpolated the same way between its initial and final strengths.
This "adaptive regularization" significantly improves accuracy during progressive learning, mitigating capacity overfitting in later stages and stabilizing training dynamics (Tan et al., 2021).
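A minimal sketch of such a schedule, using linear interpolation as described above; the size and regularization ranges here are illustrative placeholders, not the paper's exact values:

```python
def progressive_schedule(n_stages: int,
                         size_range=(128, 300),
                         reg_range=(0.1, 0.3)):
    """Linearly interpolate image size and one regularization strength
    (e.g. dropout rate) across training stages, in the spirit of
    EfficientNetV2's progressive learning."""
    (s0, s1), (r0, r1) = size_range, reg_range
    sched = []
    for i in range(n_stages):
        t = i / (n_stages - 1) if n_stages > 1 else 1.0
        sched.append((round(s0 + t * (s1 - s0)), r0 + t * (r1 - r0)))
    return sched

# Small images with weak regularization early; large images with
# strong regularization late.
stages = progressive_schedule(4)
```

Each tuple would parameterize one training stage; in a real pipeline the same interpolation would drive the data-loader resize and every regularizer's strength together.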
6. Performance, Transfer, and Empirical Findings
EfficientNet architectures show strong performance on both standard and transfer learning tasks:
- On ImageNet, B7 (V1) reaches 84.3% top-1 accuracy at 66M parameters and 37B FLOPs. EfficientNetV2-L achieves 85.7% at 120M parameters and 53B FLOPs.
- EfficientNetV2-M matches B7's accuracy (84.7%) with much faster training (13h vs 139h on TPUv3) and a 2–3× inference speedup at similar accuracy.
- Transfer learning results are consistently state-of-the-art: on CIFAR-100, EfficientNetV1-B7 and V2-L achieve 91.7% and 92.3%, respectively; Flowers-102 achieves 98.8% with V1-B7, matched by V2-L (Tan et al., 2019, Tan et al., 2021).
- With ImageNet21k pretraining, EfficientNetV2-L achieves 86.8% top-1, outperforming ViT-L/16 (85.3%) while requiring fewer parameters (120M vs 304M) and less compute (53B vs 192B FLOPs).
7. Impact and Directions
EfficientNet's compound scaling and NAS framework have set new standards for scaling ConvNet architectures and resource-efficient model design for both academia and production. The Fused-MBConv and adaptive progressive training of EfficientNetV2 demonstrate the impact of coupling architecture search with training dynamics and hardware-aware optimization.
A plausible implication is that the EfficientNet family’s principles—systematic multidimensional scaling, block-level heterogeneity, and training-aware search—represent an enduring blueprint for the design of high-accuracy, efficient backbones in computer vision, including beyond classification to detection and segmentation (Tan et al., 2019, Tan et al., 2021).