
VGGNet Architecture and ResSquVGG16 Variant

Updated 25 November 2025
  • The VGGNet architecture is a deep convolutional network built from uniform stacks of small 3×3 filters and max-pooling layers, designed for robust visual recognition.
  • The ResSquVGG16 variant replaces standard convolutions with SqueezeNet Fire modules and adds residual connections to reduce model size and training time.
  • Both architectures highlight trade-offs between depth and parameter efficiency, providing practical insights for improving large-scale scene recognition.

VGGNet is a deep convolutional neural network architecture originally developed for visual recognition tasks. VGG-16, a widely adopted instantiation, comprises 13 convolutional layers and 3 fully connected layers, employing a uniform structure of small 3×3 convolution filters and 2×2 max-pooling throughout. Recent advances have sought to address the architecture's parameter redundancy and training efficiency, leading to variants such as Residual-Squeeze-VGG16 (ResSquVGG16), which integrates parameter-efficient Fire modules adapted from SqueezeNet and residual skip connections to mitigate degradation effects. The ResSquVGG16 model achieves VGG-comparable accuracy on large-scale scene recognition with substantial reductions in model size and training time (Qassim et al., 2017).

1. Architectural Specification of VGG-16

VGG-16, introduced by Simonyan and Zisserman, is defined by a deep design with a fixed scheme of stacking multiple convolutional layers using 3×3 filters (each stride 1, pad 1) and periodic 2×2 max-pooling layers with stride 2. The sequence for a 224×224 RGB input image is as follows:

  • Block 1:
    • Conv1_1: 64 filters, 3×3
    • Conv1_2: 64 filters, 3×3
    • MaxPool1: 2×2, stride 2
  • Block 2:
    • Conv2_1: 128 filters, 3×3
    • Conv2_2: 128 filters, 3×3
    • MaxPool2
  • Block 3:
    • Conv3_1, Conv3_2, Conv3_3: 256 filters, 3×3
    • MaxPool3
  • Block 4:
    • Conv4_1, Conv4_2, Conv4_3: 512 filters, 3×3
    • MaxPool4
  • Block 5:
    • Conv5_1, Conv5_2, Conv5_3: 512 filters, 3×3
    • MaxPool5
  • Fully Connected:
    • FC6: 4096 units
    • FC7: 4096 units
    • FC8: 1000 units (originally for ImageNet)

The model exhibits a uniform architectural principle, leading to a total parameter count approaching 138 million ((Qassim et al., 2017), Table 1).
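The ~138 million figure follows directly from the layer specification above. A minimal sketch in Python (assuming biases on every layer and the standard 7×7×512 feature map feeding FC6):

```python
# Parameter count for VGG-16 as specified above (weights + biases per layer).
convs = [(3, 64), (64, 64),                     # block 1
         (64, 128), (128, 128),                 # block 2
         (128, 256), (256, 256), (256, 256),    # block 3
         (256, 512), (512, 512), (512, 512),    # block 4
         (512, 512), (512, 512), (512, 512)]    # block 5
fcs = [(512 * 7 * 7, 4096),                     # FC6 (after 5 poolings: 224 -> 7)
       (4096, 4096),                            # FC7
       (4096, 1000)]                            # FC8

total = sum(3 * 3 * cin * cout + cout for cin, cout in convs)   # 3x3 kernels + bias
total += sum(cin * cout + cout for cin, cout in fcs)            # dense weights + bias
print(f"{total:,}")  # 138,357,544
```

The fully connected layers dominate: FC6 alone contributes roughly 103 million of the total, which is what the Fire-module replacement in Section 2 targets.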

2. Integration of SqueezeNet Fire Modules

Parameter compression in ResSquVGG16 is achieved by replacing VGG-16’s convolutional (and fully connected) blocks with SqueezeNet Fire modules. The original first convolution is retained but modified to 3×3, stride 2, 64 filters. All subsequent conv layers (Conv1_2 through Conv5_3) and FC layers (FC6–FC8) are replaced by 12 Fire modules and three 1×1 conv layers:

  • Fire Module (per Iandola et al. [10]):
    • Squeeze: 1×1 conv with $s_{1\times1}$ filters
    • Expand: parallel 1×1 conv ($e_{1\times1}$ filters) and 3×3 conv ($e_{3\times3}$ filters, pad 1), whose outputs are concatenated channel-wise.

The parameter economy is:

$$P_{\rm fire} = C\,s_{1\times 1} + s_{1\times 1}\,e_{1\times 1} + 9\,s_{1\times 1}\,e_{3\times 3}$$

Compared to a single 3×3 conv layer with $9\,C\,M$ parameters, where $C$ and $M$ are the input and output channel counts, the Fire module provides a substantial reduction. Writing $s = s_{1\times1}$, $e_1 = e_{1\times1}$, $e_3 = e_{3\times3}$, the ratio $r$ is:

$$r = \frac{C\,s + s\,e_1 + 9\,s\,e_3}{9\,C\,M}$$

Pooling is inserted after each group of 1–3 Fire modules to mirror VGG’s max-pooling schedule. The three 1×1 conv layers at the tail end substitute for the fully connected layers, tailored here for 365 scene classes (Qassim et al., 2017).
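The parameter economy can be made concrete with a short calculation. The pairing below (C=64 input channels, M=128 output channels, Fire sizes s=16, e1=e3=64, so that e1+e3 matches M) is an illustrative choice based on the Fire1 configuration listed in Section 4, not a figure from the paper:

```python
# Fire-module parameter count vs. a plain 3x3 conv layer (biases ignored),
# using P_fire = C*s + s*e1 + 9*s*e3 and P_conv = 9*C*M from the text.
def fire_params(C, s, e1, e3):
    return C * s + s * e1 + 9 * s * e3

def conv3x3_params(C, M):
    return 9 * C * M

p_fire = fire_params(64, 16, 64, 64)   # squeeze + both expand branches
p_conv = conv3x3_params(64, 128)       # equivalent 3x3 conv, same in/out channels
print(p_fire, p_conv, p_fire / p_conv)  # ratio r ~ 0.15
```

Under these assumptions the Fire module uses roughly 15% of the parameters of the 3×3 conv it replaces, which is the mechanism behind the overall model-size reduction reported in Section 5.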

3. Residual (Shortcut) Connections

To suppress degradation with increased depth, ResSquVGG16 supplements its compressed architecture with shortcut (residual) connections. These are inserted after sequences of two or more consecutive Fire modules without intervening pooling. The residual mapping is:

$$y_\ell = F\bigl(x_\ell, \{W_\ell\}\bigr) + x_\ell, \qquad F(x_\ell, \{W_\ell\}) = \text{(Fire module stack)}$$

If channel dimensions are mismatched, a projection via $W_s$ is performed:

$$y_\ell = F(x_\ell) + W_s\,x_\ell$$

with $\mathrm{shape}(W_s x_\ell) = \mathrm{shape}(F(x_\ell))$.

Concretely, four skip connections are instantiated:

| Skip | Source       | Destination | Channels | Projection |
|------|--------------|-------------|----------|------------|
| 1    | Pool1 output | Fire3 out   | 64       | No         |
| 2    | Pool2 output | Fire6 out   | 128      | No         |
| 3    | Pool3 output | Fire9 out   | 256      | No         |
| 4    | Pool4 output | Fire12 out  | 512      | No         |

All are element-wise additions (Qassim et al., 2017).
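The shortcut mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration (tensor sizes are arbitrary, and the projection path is included only to show the $W_s$ case, which the four skips above do not need):

```python
import numpy as np

def shortcut_add(fx, x, w_proj=None):
    """y = F(x) + x, with an optional 1x1 projection W_s when channels differ.
    Feature maps are (C, H, W); a 1x1 conv is just a (C_out, C_in) matmul
    applied at every spatial position."""
    if w_proj is not None:
        x = np.einsum('oc,chw->ohw', w_proj, x)
    assert fx.shape == x.shape, "identity shortcut requires matching shapes"
    return fx + x

# Identity case, as in all four ResSquVGG16 skips (e.g. Pool1 -> Fire3, 64 ch.)
x = np.ones((64, 4, 4))        # stand-in for the Pool1 output
fx = 2 * np.ones((64, 4, 4))   # stand-in for the Fire-module stack output F(x)
y = shortcut_add(fx, x)
print(y[0, 0, 0])  # 3.0
```

Because all four skips connect points with equal channel counts, each is a plain element-wise addition with no extra parameters.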

4. Complete Layer-Wise Layout: ResSquVGG16

The composition of ResSquVGG16 is as follows:

  1. Conv1: 3×3, stride 2, 64 filters → ReLU
  2. Fire1: s=16, e1=64, e3=64 → Scale → ReLU
  3. Pool1: 3×3, stride 2
  4. Fire2, Fire3: s=16, e1=64, e3=64 → ReLU (each)
  5. Pool2: 3×3, stride 2
  6. Fire4, Fire5, Fire6: s=32, e1=128, e3=128 → ReLU (each)
  7. Pool3: 3×3, stride 2
  8. Fire7, Fire8, Fire9: s=48, e1=192, e3=192 → ReLU (each)
  9. Pool4: 3×3, stride 2
  10. Fire10, Fire11, Fire12: s=64, e1=256, e3=256 → ReLU (each)
  11. Pool5: 3×3, stride 2
  12. Conv6: 1×1, 4096 filters → ReLU; Conv7: 1×1, 4096 filters → ReLU; Conv8: 1×1, 365 filters → Softmax

Scale layers (Caffe-type BatchNorm replacements) and ReLU are applied after each Fire or conv module (Qassim et al., 2017).
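The Fire-module data flow used throughout this layout can be sketched in NumPy. This is a minimal illustration, not the authors' Caffe implementation: the Scale/BatchNorm step is omitted, biases are dropped, the naive convolution loop is for clarity only, and the input size is arbitrary:

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Naive 2-D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            y[:, i, j] = np.tensordot(w, patch, axes=3)
    return y

def fire(x, w_squeeze, w_expand1, w_expand3):
    """Fire module: 1x1 squeeze, then parallel 1x1 / 3x3 expand, channel concat."""
    s = np.maximum(conv2d(x, w_squeeze), 0)     # squeeze + ReLU
    e1 = conv2d(s, w_expand1)                   # 1x1 expand branch
    e3 = conv2d(s, w_expand3, pad=1)            # 3x3 expand branch, pad 1
    return np.maximum(np.concatenate([e1, e3], axis=0), 0)

# Fire1 sizes from the layout: 64 channels in, s=16, e1=e3=64 -> 128 channels out
rng = rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))
y = fire(x,
         rng.standard_normal((16, 64, 1, 1)) * 0.1,
         rng.standard_normal((64, 16, 1, 1)) * 0.1,
         rng.standard_normal((64, 16, 3, 3)) * 0.1)
print(y.shape)  # (128, 8, 8)
```

The concatenation of the two expand branches is what sets each module's output width (e1 + e3 channels), so the s/e sizes listed above fully determine the channel schedule of the network.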

5. Performance and Empirical Comparison

ResSquVGG16 was trained on MIT Places365-Standard (1.8M images, 365 classes) from scratch, using 4 GTX Titan X GPUs (Caffe + DIGITS), over 50 epochs. Metrics:

| Metric         | VGG16 (fine-tuned) | ResSquVGG16 |
|----------------|--------------------|-------------|
| Training time  | 3 d 16 h           | 2 d 19 h    |
| Model size     | 10.6 GB            | 1.23 GB     |
| Top-1 accuracy | 54.00%             | 51.68%      |
| Top-5 accuracy | 84.30%             | 82.04%      |

ResSquVGG16 matches VGG16 to within ~2.3 percentage points in both Top-1 and Top-5 validation accuracy, while reducing training time by approximately 23.9% and storage by 88.4% ((Qassim et al., 2017), Table 2).
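The reduction percentages follow from the table entries; a quick arithmetic check (converting the training times to hours):

```python
# Sanity-check the reported reductions from Table 2 (Qassim et al., 2017).
vgg_hours, res_hours = 3 * 24 + 16, 2 * 24 + 19   # 88 h vs. 67 h
vgg_size, res_size = 10.6, 1.23                   # model sizes in GB

time_cut = (vgg_hours - res_hours) / vgg_hours
size_cut = (vgg_size - res_size) / vgg_size
print(f"{time_cut:.1%}, {size_cut:.1%}")  # 23.9%, 88.4%
```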

6. Implications and Contributions

The integration of Fire modules with residual connections in ResSquVGG16 demonstrates that compression strategies can replace large-weight networks such as VGG-16 with minimal performance loss. This suggests that future deep network design may emphasize parameter efficiency and residual learning, both when training from scratch and in transfer scenarios. The architectural modifications retain the macro-structure of VGG-16, preserving its depth, while exploiting parameter-efficient sub-layers and shortcut propagation for practical gains in computational and memory efficiency (Qassim et al., 2017).
