
EfficientNet Architecture Family

Updated 22 February 2026
  • EfficientNet is a family of CNN architectures designed for high classification accuracy and parameter efficiency through systematic scaling and neural architecture search.
  • EfficientNet-V1 introduced compound scaling with a unified baseline, while EfficientNet-V2 refined this approach with training-aware search and fused block designs.
  • Both generations achieve state-of-the-art ImageNet accuracy and excellent transfer learning performance while reducing computational cost and model size.

EfficientNet is a family of convolutional neural network (CNN) architectures optimized for classification accuracy, parameter efficiency, and training/inference speed, based on systematic network scaling and neural architecture search (NAS). Originating from work by Tan and Le at Google Research, EfficientNet has two primary generations: EfficientNetV1, which introduced compound scaling and a new baseline discovered via NAS, and EfficientNetV2, which advanced the methodology with training-aware architecture search, new fused block designs, and progressive training methods. Both generations have become standard references for high-performance, resource-efficient ConvNet design in computer vision applications (Tan et al., 2019, Tan et al., 2021).

1. Neural Architecture Search and EfficientNet-B0 Baseline

The foundation of the EfficientNet family is the EfficientNet-B0 architecture, obtained through neural architecture search on a variant of the MnasNet search space. The search space includes:

  • Operator choices: mobile inverted bottleneck convolutions (MBConv), with varying kernel sizes (3×3, 5×5), expansion ratios (1 or 6), optional squeeze-and-excitation (SE) modules, and residual skips.
  • Stage-wise variation: each of the seven MBConv stages (between the initial 3×3 Conv stem and the final head) can select its operator configuration and number of block repeats.

The NAS process optimizes a multi-objective reward ACC(m)·[FLOPs(m)/T]^w, where T = 400M is the FLOPs target and w = −0.07, enforcing a trade-off between accuracy and computational cost. The resulting architecture (EfficientNet-B0) forms the template for subsequent scaling (Tan et al., 2019).
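The reward trade-off can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not from the NAS implementation):

```python
def nas_reward(accuracy: float, flops: float,
               target_flops: float = 400e6, w: float = -0.07) -> float:
    """Multi-objective NAS reward: ACC(m) * [FLOPs(m)/T]^w.

    With w < 0, models over the FLOPs target T are penalized and
    models under it are mildly rewarded.
    """
    return accuracy * (flops / target_flops) ** w

# A model exactly at the 400M-FLOP target keeps its raw accuracy:
print(nas_reward(0.77, 400e6))  # 0.77

# Doubling FLOPs at equal accuracy shrinks the reward by 2^(-0.07), about 4.7%:
print(nas_reward(0.77, 800e6) / nas_reward(0.77, 400e6))
```

The soft exponent (rather than a hard FLOPs constraint) lets the search trade a little extra compute for accuracy when it pays off.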

The structure of EfficientNet-B0 is shown below:

Stage  Operator             Resolution  Channels  Repeats
1      Conv3×3              224×224     32        1
2      MBConv1 k3           112×112     16        1
3      MBConv6 k3           112×112     24        2
4      MBConv6 k5           56×56       40        2
5      MBConv6 k3           28×28       80        3
6      MBConv6 k5           14×14       112       3
7      MBConv6 k5           14×14       192       4
8      MBConv6 k3           7×7         320       1
9      Conv1×1 → Pool → FC  7×7         1280      1
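Encoded as data, the table above admits simple sanity checks (a sketch; the variable name B0_STAGES is ours):

```python
# EfficientNet-B0 stage configuration from the table above:
# (operator, kernel size, input resolution, output channels, repeats).
B0_STAGES = [
    ("Conv3x3", 3, 224,   32, 1),
    ("MBConv1", 3, 112,   16, 1),
    ("MBConv6", 3, 112,   24, 2),
    ("MBConv6", 5,  56,   40, 2),
    ("MBConv6", 3,  28,   80, 3),
    ("MBConv6", 5,  14,  112, 3),
    ("MBConv6", 5,  14,  192, 4),
    ("MBConv6", 3,   7,  320, 1),
    ("Conv1x1", 1,   7, 1280, 1),
]

# B0 contains 16 MBConv blocks in total across stages 2-8:
n_mbconv = sum(rep for op, _, _, _, rep in B0_STAGES if op.startswith("MBConv"))
print(n_mbconv)  # 16
```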

Key operator details:

  • Each MBConv block: 1×1 expansion (BatchNorm + SiLU), k×k depthwise convolution (BatchNorm + SiLU), 1×1 projection (BatchNorm), an SE module between the depthwise and projection steps, and a residual skip if input and output shapes match.
  • All activations use SiLU/Swish: f(x) = x·σ(x).
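SiLU follows directly from its definition (pure-Python sketch):

```python
import math

def silu(x: float) -> float:
    """SiLU/Swish activation: f(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

print(silu(0.0))  # 0.0
print(round(silu(1.0), 4))  # 0.7311, i.e. 1 * sigmoid(1)
```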

2. Compound Model Scaling

EfficientNet introduces "compound scaling," a principled method to simultaneously scale the three architectural dimensions (network depth d, width w, and input resolution r), controlled by a global user-specified coefficient ϕ:

d = α^ϕ, w = β^ϕ, r = γ^ϕ

The coefficients α, β, γ ≥ 1 are chosen such that α·β²·γ² ≈ 2 (roughly doubling FLOPs with each unit increase in ϕ), reflecting the cost structure of Conv layers. For EfficientNet, grid search yields α = 1.2, β = 1.1, γ = 1.15. Block counts and resolutions are rounded to integers.
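A small sketch of the scaling rule (constant and function names are ours); note that released models use hand-tuned resolutions rather than the raw formula value (e.g. B1 uses 240, not 224·γ ≈ 258):

```python
# Compound-scaling bases found by grid search (depth, width, resolution).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs grow roughly as d * w^2 * r^2, so alpha * beta^2 * gamma^2 ≈ 2
# means each unit of phi about doubles the cost:
print(round(ALPHA * BETA ** 2 * GAMMA ** 2, 2))  # 1.92

d, w, r = compound_scale(1)
print(round(224 * r))  # 258 (the raw phi=1 resolution before hand-tuning)
```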

This compound method outperforms traditional strategies that scale only a single axis, avoiding diminishing returns caused by excessive depth, width, or resolution alone. Empirically, uniform compound scaling delivers substantial accuracy improvements for fixed computation budgets (Tan et al., 2019).

3. EfficientNet-V1 and V2 Families: Model Variants and Scaling Progression

The original EfficientNet-V1 family comprises B0 through B7, with each successive model scaled according to the compound coefficient ϕ:

Model  ϕ  Input    Params  FLOPs  Top-1 Acc
B0     0  224×224  5.3M    0.39B  77.1%
B1     1  240×240  7.8M    0.70B  79.1%
B2     2  260×260  9.2M    1.0B   80.1%
B3     3  300×300  12M     1.8B   81.6%
B4     4  380×380  19M     4.2B   82.9%
B5     5  456×456  30M     9.9B   83.6%
B6     6  528×528  43M     19B    84.0%
B7     7  600×600  66M     37B    84.3%

Regularization (dropout, stochastic depth, AutoAugment) increases with model scale. On ImageNet, B7 achieves 84.3% top-1 accuracy, matching far larger models (e.g., 556M-param GPipe) with an approximately 8× reduction in size and 6× increase in inference speed (Tan et al., 2019).
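As one concrete example of scale-dependent regularization, the dropout rate grows from 0.2 for B0 to 0.5 for B7; treating the intermediate values as a linear interpolation is our assumption here:

```python
def dropout_rate(phi: int, r0: float = 0.2, r7: float = 0.5) -> float:
    """Dropout rate for EfficientNet-B{phi}: 0.2 at B0 rising to 0.5 at B7.

    Linear interpolation between the endpoints is an illustrative
    assumption, not a quote of the paper's exact per-model values.
    """
    return r0 + (r7 - r0) * phi / 7

for phi in (0, 3, 7):
    print(f"B{phi}: dropout {round(dropout_rate(phi), 3)}")
```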

EfficientNetV2 refines model scaling by capping the maximum resolution at 480×480 and heuristically increasing block counts in later stages. This reduces over-parameterization at high input resolutions and improves parameter efficiency by adding layers where they are most useful (Tan et al., 2021).

4. Fused-MBConv and the EfficientNetV2 Search Space

EfficientNetV2 introduces the Fused-MBConv module. In standard MBConv, a 1×1 expansion convolution precedes a k×k depthwise convolution; Fused-MBConv merges these two into a single k×k regular convolution, followed by the optional SE module and the 1×1 projection. With input channels M_in, output channels M_out, and expansion ratio r, the weight counts are: for MBConv, (M_in·r)·M_in [expand 1×1] + (M_in·r)·k² [depthwise] + (M_in·r)·M_out [project 1×1]; for Fused-MBConv, (M_in·r)·M_in·k² [fused k×k conv] + (M_in·r)·M_out [project 1×1].
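One consistent way to compare the two blocks' weight counts, treating the fused convolution as performing the expansion (SE and BatchNorm parameters omitted; function names are ours):

```python
def mbconv_params(m_in: int, m_out: int, k: int = 3, r: int = 6) -> int:
    """Weights in MBConv: 1x1 expand + kxk depthwise + 1x1 project."""
    mid = m_in * r
    return m_in * mid + mid * k * k + mid * m_out

def fused_mbconv_params(m_in: int, m_out: int, k: int = 3, r: int = 6) -> int:
    """Weights in Fused-MBConv: kxk regular conv (doing the expansion)
    + 1x1 project."""
    mid = m_in * r
    return m_in * mid * k * k + mid * m_out

# Fused-MBConv carries more weights, but replaces the memory-bound
# depthwise conv with a dense conv that runs well on accelerators:
print(mbconv_params(24, 24))        # 8208
print(fused_mbconv_params(24, 24))  # 34560
```

This is why V2's search places Fused-MBConv only in early stages, where the extra weights are cheap and the accelerator-utilization gain is largest.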

Fused-MBConv reduces memory-access overhead in early stages (where spatial dimensions are large and M_in, M_out are small). EfficientNetV2's architecture search operates over a space containing both MBConv and Fused-MBConv blocks, expansion ratios {1, 4, 6}, kernel sizes {3, 5}, and stage-wise combinations. This approach systematically selects block types and placements for training and inference efficiency on modern accelerators (Tan et al., 2021).

5. Progressive Learning and Adaptive Regularization

EfficientNetV2 introduces progressive training, in which the input image size increases over M training stages: S_i = S_0 + (S_e − S_0)·i/(M−1). Training at smaller sizes early accelerates convergence and reduces computation, while larger images in later stages deliver high final accuracy. To counteract the accuracy drop typical of progressive resizing, EfficientNetV2 linearly interpolates regularization strength (dropout, RandAugment magnitude, mixup α) in lockstep with the image size:

ϕ_i^k = ϕ_0^k + (ϕ_e^k − ϕ_0^k)·i/(M−1)

This "adaptive regularization" significantly improves accuracy during progressive learning, mitigating capacity overfitting in later stages and stabilizing training dynamics (Tan et al., 2021).
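A toy version of the coupled schedule (the endpoint values are illustrative, not the paper's exact settings):

```python
def progressive_schedule(i: int, n_stages: int = 4,
                         s0: int = 128, se: int = 300,
                         p0: float = 0.1, pe: float = 0.3):
    """Image size and regularization strength for training stage i,
    both linearly interpolated between start and end values.
    Endpoint values here are illustrative placeholders."""
    t = i / (n_stages - 1)
    size = round(s0 + (se - s0) * t)
    reg = p0 + (pe - p0) * t
    return size, reg

# Image size and regularization strength rise together across stages:
for i in range(4):
    size, reg = progressive_schedule(i)
    print(i, size, round(reg, 3))
```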

6. Performance, Transfer, and Empirical Findings

EfficientNet architectures show strong performance on both standard and transfer learning tasks:

  • On ImageNet, B7 (V1) reaches 84.3% top-1 accuracy at 66M parameters and 37B FLOPs. EfficientNetV2-L achieves 85.7% at 120M parameters and 53B FLOPs.
  • EfficientNetV2-M matches B7's accuracy (84.7%) with much faster training (13h vs 139h on TPUv3) and a 2–3× inference speedup at similar accuracy.
  • Transfer learning results are consistently state-of-the-art: on CIFAR-100, EfficientNet-B7 (V1) and V2-L reach 91.7% and 92.3%, respectively; on Flowers-102, B7 reaches 98.8%, matched by V2-L (Tan et al., 2019, Tan et al., 2021).
  • With ImageNet21k pretraining, EfficientNetV2-L achieves 86.8% top-1, outperforming ViT-L/16 (85.3%) while requiring fewer parameters (120M vs 304M) and less compute (53B vs 192B FLOPs).

7. Impact and Directions

EfficientNet's compound scaling and NAS framework have set new standards for scaling ConvNet architectures and resource-efficient model design for both academia and production. The Fused-MBConv and adaptive progressive training of EfficientNetV2 demonstrate the impact of coupling architecture search with training dynamics and hardware-aware optimization.

A plausible implication is that the EfficientNet family’s principles—systematic multidimensional scaling, block-level heterogeneity, and training-aware search—represent an enduring blueprint for the design of high-accuracy, efficient backbones in computer vision, including beyond classification to detection and segmentation (Tan et al., 2019, Tan et al., 2021).

References

  • Tan, M., Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019.
  • Tan, M., Le, Q. V. (2021). EfficientNetV2: Smaller Models and Faster Training. ICML 2021.
