EfficientNet-B1 CNN Architecture
- EfficientNet-B1 is a CNN architecture derived from EfficientNet-B0 that applies compound scaling to balance depth, width, and resolution for improved performance.
- It combines the compound-scaling coefficients ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) with MBConv blocks and squeeze-and-excitation modules to deliver high accuracy at comparatively low computational cost.
- The design offers a practical tradeoff between resource efficiency and performance, making it a benchmark in modern image recognition applications.
EfficientNet-B1 is a convolutional neural network (CNN) architecture characterized by compound scaling of depth, width, and input resolution, introduced within the EfficientNet model family. It is derived from systematic application of a compound coefficient to a highly optimized baseline (EfficientNet-B0) and demonstrates state-of-the-art parameter and computational efficiency across multiple image recognition benchmarks (Tan et al., 2019).
1. Compound Scaling Principle
The EfficientNet family is founded on the observation that careful, coordinated scaling of network depth ($d$), width ($w$), and input resolution ($r$) can produce superior accuracy and resource efficiency compared to conventional approaches that modify these dimensions in isolation. The compound scaling method controls the growth of each dimension through a global compound coefficient $\phi$:

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},$$

subject to the resource constraint $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ with $\alpha \ge 1$, $\beta \ge 1$, $\gamma \ge 1$. Empirically, $\alpha = 1.2$, $\beta = 1.1$, and $\gamma = 1.15$ are selected via grid search. For EfficientNet-B1, $\phi$ is set to 1, yielding $d = 1.2$, $w = 1.1$, and $r = 1.15$.
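The scaling rule above is simple enough to compute directly. The following sketch (names and constants taken from the paper's coefficients, not from any library) derives the theoretical multipliers for a given $\phi$ and checks the roughly-doubling FLOPS constraint:

```python
# Sketch of the paper's compound scaling rule: given the searched base
# coefficients alpha, beta, gamma and a compound coefficient phi, derive
# the depth, width, and resolution multipliers.

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-searched for EfficientNet-B0

def compound_scale(phi: float) -> tuple[float, float, float]:
    """Return (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

phi = 1  # EfficientNet-B1
d, w, r = compound_scale(phi)
print(f"depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
# FLOPS grow roughly by (alpha * beta^2 * gamma^2)^phi, constrained to ~2^phi:
print(f"approx. FLOPS multiplier: {(ALPHA * BETA**2 * GAMMA**2) ** phi:.2f}")
```

Each increment of $\phi$ therefore roughly doubles FLOPS while spreading capacity across all three dimensions at once.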
2. Baseline EfficientNet-B0 Architecture
EfficientNet-B0 acts as the foundation for all derived variants, identified via neural architecture search. It is composed of a series of stages using standard convolutional and MBConv blocks. Each MBConv block is a mobile inverted bottleneck convolutional module with squeeze-and-excitation and Swish (SiLU) activation functions. The architectural design is compact with a strong accuracy-to-FLOPS ratio.
Summary of B0 stagewise structure:
| Stage | Operator | k×k | Input Res | Out Ch | Repeats | Stride |
|---|---|---|---|---|---|---|
| 1 | Conv3×3 | 3×3 | 224×224 | 32 | 1 | 2 |
| 2 | MBConv1 | 3×3 | 112×112 | 16 | 1 | 1 |
| 3 | MBConv6 | 3×3 | 112×112 | 24 | 2 | 2 |
| 4 | MBConv6 | 5×5 | 56×56 | 40 | 2 | 2 |
| 5 | MBConv6 | 3×3 | 28×28 | 80 | 3 | 2 |
| 6 | MBConv6 | 5×5 | 14×14 | 112 | 3 | 1 |
| 7 | MBConv6 | 5×5 | 14×14 | 192 | 4 | 2 |
| 8 | MBConv6 | 3×3 | 7×7 | 320 | 1 | 1 |
| 9 | Conv1×1→Pool→FC | — | 7×7 | 1280 | 1 | 1 |
All convolutions utilize batch normalization and Swish activation. Squeeze-and-excitation is applied within MBConv blocks. Dropout (rate 0.2) precedes the final fully-connected layer.
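The MBConv structure described above (pointwise expansion, depthwise convolution, squeeze-and-excitation, linear projection, residual connection) can be sketched in PyTorch. This is an illustrative minimal implementation, not the reference code; layer names and the SE reduction convention are assumptions:

```python
# Illustrative PyTorch sketch of an MBConv block: expand -> depthwise
# conv -> squeeze-and-excitation -> project, with a residual connection
# when stride is 1 and channel counts match.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, reduced, 1), nn.SiLU(),
            nn.Conv2d(reduced, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))  # channelwise recalibration

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio, kernel, stride,
                 se_ratio=0.25):
        super().__init__()
        mid = in_ch * expand_ratio
        layers = []
        if expand_ratio != 1:  # pointwise expansion
            layers += [nn.Conv2d(in_ch, mid, 1, bias=False),
                       nn.BatchNorm2d(mid), nn.SiLU()]
        layers += [
            # depthwise convolution (groups == channels)
            nn.Conv2d(mid, mid, kernel, stride, kernel // 2,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            # SE width is conventionally tied to the block's input channels
            SqueezeExcite(mid, max(1, int(in_ch * se_ratio))),
            nn.Conv2d(mid, out_ch, 1, bias=False),  # linear projection
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 16, 56, 56)
print(MBConv(16, 24, expand_ratio=6, kernel=3, stride=2)(x).shape)
# torch.Size([1, 24, 28, 28])
```

Note the projection convolution is linear (no activation), matching the inverted-bottleneck design of MobileNetV2 from which MBConv derives.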
3. EfficientNet-B1 Derivation
EfficientNet-B1 is constructed by applying the compound scaling rules with $\phi = 1$ to EfficientNet-B0. The released implementation realizes this with a width coefficient of 1.0, a depth coefficient of 1.1, and a 240×240 input (Tan et al., 2019). Concretely:
- Resolution scaling: the input grows from 224×224 to 240×240; the released implementation uses 240 rather than the nominal $224 \times 1.15 \approx 258$.
- Depth scaling: each stage's repeat count is multiplied by the depth coefficient and rounded up, so no stage loses blocks; B0's repeats $(1, 2, 2, 3, 3, 4, 1)$ become $(2, 3, 3, 4, 4, 5, 2)$.
- Width scaling: channel counts are multiplied by the width coefficient and rounded to the nearest multiple of 8 (never shrinking by more than 10%); with B1's released coefficient of 1.0, all channel counts are inherited unchanged from B0.
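The rounding conventions above can be sketched as two small helpers. Their names and logic mirror the public EfficientNet reference implementation, but this is a reconstruction, not that code:

```python
# Sketch of EfficientNet's rounding rules: channel counts snap to a
# multiple of the divisor (8) without shrinking by more than 10%, and
# repeat counts are rounded up so no stage loses blocks.
import math

def round_filters(filters: int, width_mult: float, divisor: int = 8) -> int:
    scaled = filters * width_mult
    new = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if new < 0.9 * scaled:  # never round down by more than 10%
        new += divisor
    return new

def round_repeats(repeats: int, depth_mult: float) -> int:
    return math.ceil(depth_mult * repeats)

# B1 depth scaling (depth coefficient 1.1 in the released implementation):
print([round_repeats(r, 1.1) for r in (1, 2, 2, 3, 3, 4, 1)])
# [2, 3, 3, 4, 4, 5, 2]
# B1's released width coefficient is 1.0, so channels are unchanged; wider
# variants exercise the rule, e.g. a 1.1 multiplier maps 40 -> 48, 112 -> 120:
print(round_filters(40, 1.1), round_filters(112, 1.1))
# 48 120
```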
4. EfficientNet-B1 Stagewise Specification
The finalized stage specification for EfficientNet-B1, after compound scaling and rounding, is summarized below.
| Stage | Operator | k×k | Input Res | Out Ch | Repeats | Stride |
|---|---|---|---|---|---|---|
| 1 | Conv3×3 | 3×3 | 240×240 | 32 | 1 | 2 |
| 2 | MBConv1 | 3×3 | 120×120 | 16 | 2 | 1 |
| 3 | MBConv6 | 3×3 | 120×120 | 24 | 3 | 2 |
| 4 | MBConv6 | 5×5 | 60×60 | 40 | 3 | 2 |
| 5 | MBConv6 | 3×3 | 30×30 | 80 | 4 | 2 |
| 6 | MBConv6 | 5×5 | 15×15 | 112 | 4 | 1 |
| 7 | MBConv6 | 5×5 | 15×15 | 192 | 5 | 2 |
| 8 | MBConv6 | 3×3 | 8×8 | 320 | 2 | 1 |
| 9 | Conv1×1→Pool→FC | — | 8×8 | 1280 | 1 | 1 |
After the final convolution, global average pooling is applied, followed by dropout and a $C$-dimensional fully-connected layer, where $C$ is the number of output classes (1000 for ImageNet).
5. Key Mathematical Formalisms
The two principal mathematical formulations defining EfficientNet's compound scaling and resource-constrained optimization are:

$$\max_{d,\, w,\, r} \; \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \quad \text{s.t.} \quad \mathrm{FLOPS}\big(\mathcal{N}(d, w, r)\big) \le \text{target FLOPS}, \;\; \mathrm{Memory}\big(\mathcal{N}(d, w, r)\big) \le \text{target memory}$$

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}, \qquad \text{s.t.} \;\; \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \;\; \alpha, \beta, \gamma \ge 1$$

with the network $\mathcal{N} = \bigodot_{i=1}^{s} \mathcal{F}_i^{d \cdot L_i}\big(X_{\langle r \cdot H_i,\; r \cdot W_i,\; w \cdot C_i \rangle}\big)$ defined by repeated application of MBConv and other modules $\mathcal{F}_i$ to scaled-resolution, scaled-width tensors.
6. Implementation Details and Architectural Characteristics
All convolutions incorporate batch normalization and use Swish (SiLU) activation. Each MBConv block includes a squeeze-and-excitation module for channelwise feature recalibration. Dropout is applied with probability 0.2 before the final fully-connected classifier. Downsampling is performed by the stride of the first block in each stage that reduces spatial resolution; all other repeats operate with stride 1.
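The stride-placement convention above can be made concrete with a small sketch. The helper name and dict layout are illustrative, not from any library:

```python
# Minimal sketch of stage expansion: only the first block in a stage
# carries the stage stride and the channel change; every repeat after it
# uses stride 1 with the stage's output channels as its input.
def expand_stage(in_ch: int, out_ch: int, repeats: int, stride: int):
    blocks = []
    for i in range(repeats):
        blocks.append({
            "in": in_ch if i == 0 else out_ch,
            "out": out_ch,
            "stride": stride if i == 0 else 1,
        })
    return blocks

# Example: a 5-repeat, stride-2 stage
for b in expand_stage(112, 192, repeats=5, stride=2):
    print(b)
```

Only one block per stage can therefore change resolution or channel width; the remaining repeats refine features at fixed shape.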
7. Significance and Performance Context
The compound scaling approach, when systematically applied to an optimized baseline, yields a set of networks (EfficientNet-B1 through B7) that, according to empirical evaluations, achieve superior accuracy per parameter and per FLOP compared to prior architectures on ImageNet and other classification datasets. EfficientNet-B1 serves as the canonical instance of compound scaling at $\phi = 1$, embodying the design philosophy and engineering tradeoffs central to this model family (Tan et al., 2019).