
VGGNet Architecture and ResSquVGG16 Variant

Updated 25 November 2025
  • VGGNet is a deep convolutional network built from uniform stacks of 3×3 filters and max-pooling layers, designed for robust visual recognition.
  • The ResSquVGG16 variant replaces standard convolutions with SqueezeNet Fire modules and adds residual connections to reduce model size and training time.
  • Both architectures highlight trade-offs between depth and parameter efficiency, providing practical insights for improving large-scale scene recognition.

The VGGNet architecture is a deep convolutional neural network originally developed for visual recognition tasks. VGG-16, a widely adopted instantiation, comprises 13 convolutional layers and 3 fully connected layers, using a uniform structure of small 3×3 convolution filters and 2×2 max-pooling throughout. Later work has sought to address the architecture’s parameter redundancy and training cost, leading to variants such as Residual-Squeeze-VGG16 (ResSquVGG16), which combines parameter-efficient Fire modules adapted from SqueezeNet with residual skip connections that mitigate degradation effects. On large-scale scene recognition, ResSquVGG16 reaches accuracy within roughly 2.3 percentage points of VGG-16 with substantial reductions in model size and training time (Qassim et al., 2017).

1. Architectural Specification of VGG-16

VGG-16, introduced by Simonyan and Zisserman, is defined by a deep design with a fixed scheme of stacking multiple convolutional layers using 3×3 filters (each with stride 1 and padding 1) and periodic 2×2 max-pooling layers with stride 2. The sequence for a 224×224 RGB input image is as follows:

  • Block 1:
    • Conv1_1: 64 filters, 3×3
    • Conv1_2: 64 filters, 3×3
    • MaxPool1: 2×2, stride 2
  • Block 2:
    • Conv2_1: 128 filters, 3×3
    • Conv2_2: 128 filters, 3×3
    • MaxPool2
  • Block 3:
    • Conv3_1, Conv3_2, Conv3_3: 256 filters, 3×3
    • MaxPool3
  • Block 4:
    • Conv4_1, Conv4_2, Conv4_3: 512 filters, 3×3
    • MaxPool4
  • Block 5:
    • Conv5_1, Conv5_2, Conv5_3: 512 filters, 3×3
    • MaxPool5
  • Fully Connected:
    • FC6: 4096 units
    • FC7: 4096 units
    • FC8: 1000 units (originally for ImageNet)

The model follows a uniform architectural principle, and its total parameter count is approximately 138 million (Qassim et al., 2017, Table 1).
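The parameter total can be checked directly from the layer list. The following is a minimal sketch, assuming standard convolution and fully connected layers with biases; the names `VGG16_BLOCKS`, `FC_UNITS`, and `vgg16_summary` are illustrative and do not come from the source.

```python
# (output channels, number of 3x3 conv layers) for the five VGG-16 blocks.
VGG16_BLOCKS = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
FC_UNITS = [4096, 4096, 1000]  # FC6, FC7, FC8 (ImageNet head)


def vgg16_summary(input_size=224, in_channels=3):
    """Return (final feature-map side, conv params, fc params), biases included."""
    size, channels = input_size, in_channels
    conv_params = 0
    for out_channels, n_convs in VGG16_BLOCKS:
        for _ in range(n_convs):
            # A 3x3 conv with stride 1 and pad 1 keeps the spatial size unchanged.
            conv_params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # each 2x2 max-pool with stride 2 halves the side length

    fc_params = 0
    features = size * size * channels  # flattened input to FC6
    for units in FC_UNITS:
        fc_params += features * units + units
        features = units
    return size, conv_params, fc_params


size, conv_params, fc_params = vgg16_summary()
print(size)                     # 7
print(conv_params)              # 14714688
print(fc_params)                # 123642856
print(conv_params + fc_params)  # 138357544
```

Most of the weights sit in the fully connected layers (FC6 alone holds over 100 million), which is why replacing the FC head is an obvious target for compression.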

2. Integration of SqueezeNet Fire Modules

Parameter compression in ResSquVGG16 is achieved by replacing VGG-16’s convolutional (and fully connected) blocks with SqueezeNet Fire modules. The original first convolution is retained but modified to a […] kernel with stride 2 and 64 filters. All subsequent conv layers (Conv1_2 through Conv5_3) and the FC layers (FC6–FC8) are replaced by 12 Fire modules and three […] conv layers:

  • Fire Module (per Iandola et al. [10]):
    • Squeeze: 1×1 conv, s_{1×1} filters
    • Expand: parallel 1×1 conv (e_{1×1} filters) and 3×3 conv (e_{3×3} filters, pad 1).

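The parameter economy, ignoring biases and writing $C_{\text{in}}$ for the number of input channels, $s_{1\times1}$ for the squeeze filters, and $e_{1\times1}$, $e_{3\times3}$ for the two expand branches, is:

$$P_{\text{Fire}} = C_{\text{in}}\, s_{1\times1} + s_{1\times1}\, e_{1\times1} + 9\, s_{1\times1}\, e_{3\times3}$$

Compared to a single 3×3 conv layer with $P_{\text{conv}} = 9\, C_{\text{in}}\, C_{\text{out}}$ parameters, the Fire module provides a substantial reduction, with the ratio $\rho$:

$$\rho = \frac{P_{\text{Fire}}}{P_{\text{conv}}} = \frac{C_{\text{in}}\, s_{1\times1} + s_{1\times1}\, e_{1\times1} + 9\, s_{1\times1}\, e_{3\times3}}{9\, C_{\text{in}}\, C_{\text{out}}}$$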

Pooling is inserted after each group of 1–3 Fire modules to mirror VGG’s max-pooling schedule. The three […] conv layers at the tail end substitute for the fully connected layers, tailored here for 365 scene classes (Qassim et al., 2017).
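To make the saving concrete, the sketch below counts the weights of one Fire module and of the plain 3×3 convolution it would replace. The channel sizes (128 in, 16 squeeze, 64 + 64 expand) are hypothetical values chosen for illustration, not the settings reported for ResSquVGG16, and the function names are likewise assumptions.

```python
def fire_params(c_in, squeeze, expand1x1, expand3x3):
    """Weight count of one Fire module (biases ignored)."""
    squeeze_w = c_in * squeeze                # 1x1 squeeze conv
    expand1_w = squeeze * expand1x1           # 1x1 expand branch
    expand3_w = 3 * 3 * squeeze * expand3x3   # 3x3 expand branch
    return squeeze_w + expand1_w + expand3_w


def conv3x3_params(c_in, c_out):
    """Weight count of a plain 3x3 convolution (biases ignored)."""
    return 3 * 3 * c_in * c_out


# Hypothetical sizes, for illustration only.
c_in, squeeze, e1, e3 = 128, 16, 64, 64
c_out = e1 + e3  # the two expand branches are concatenated along channels

fire = fire_params(c_in, squeeze, e1, e3)
conv = conv3x3_params(c_in, c_out)
print(fire)          # 12288
print(conv)          # 147456
print(conv // fire)  # 12
```

With these assumed sizes the Fire module needs one twelfth of the weights of the 3×3 convolution with the same input and output widths. The saving comes from the narrow squeeze layer, which limits how many channels the costly 3×3 branch has to read.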

3. Residual (Shortcut) Connections

To suppress degradation with increased depth, ResSquVGG16 supplements its compressed architecture with shortcut (residual) connections. These are inserted after sequences of two or more consecutive Fire modules without intervening pooling. The residual mapping is:

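$$y = \mathcal{F}(x, \{W_i\}) + x$$

If channel dimensions are mismatched, a projection via $W_s$ is performed:

$$y = \mathcal{F}(x, \{W_i\}) + W_s\, x$$

with $W_s$ a learned linear projection (typically realized as a 1×1 convolution) that maps the input to the output channel dimension.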

Concretely, four skip connections are instantiated:

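| Skip | Source       | Destination   | Channels | Projection |
|------|--------------|---------------|----------|------------|
| 1    | Pool1 output | Fire3 output  | 64       | No         |
| 2    | Pool2 output | Fire6 output  | 128      | No         |
| 3    | Pool3 output | Fire9 output  | 256      | No         |
| 4    | Pool4 output | Fire12 output | 512      | No         |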

All are element-wise additions (Qassim et al., 2017).
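The shortcut itself is only an element-wise sum, optionally preceded by a 1×1 projection when the channel counts differ. A minimal pure-Python sketch follows, assuming feature maps stored as nested lists indexed `[channel][row][column]`; the helpers `project_1x1` and `residual_add` are hypothetical names, and a real implementation would use a tensor library instead.

```python
def project_1x1(x, weights):
    """1x1 convolution without bias: a per-pixel linear map over channels.

    x has shape [C_in][H][W]; weights has shape [C_out][C_in].
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w_row[c] * x[c][i][j] for c in range(len(x))) for j in range(width)]
            for i in range(height)
        ]
        for w_row in weights
    ]


def residual_add(x, fx, weights=None):
    """Return F(x) + x, or F(x) + W_s x when projection weights are given."""
    shortcut = x if weights is None else project_1x1(x, weights)
    if len(shortcut) != len(fx):
        raise ValueError("channel mismatch: a projection is required")
    return [
        [[a + b for a, b in zip(row_s, row_f)] for row_s, row_f in zip(ch_s, ch_f)]
        for ch_s, ch_f in zip(shortcut, fx)
    ]


x = [[[1, 2], [3, 4]]]          # one channel, 2x2
fx = [[[10, 20], [30, 40]]]     # stand-in for the Fire-module output F(x)
print(residual_add(x, fx))      # [[[11, 22], [33, 44]]]
```

Since all four skips in the table are marked as needing no projection, only the identity branch (`weights=None`) would be exercised in that configuration.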

4. Complete Layer-Wise Layout: ResSquVGG16

The composition of ResSquVGG16 is as follows; kernel sizes and filter counts that could not be recovered are marked […]:

  1. Conv1: […], stride 2, 64 filters → ReLU
  2. Fire1: […], […], […] → Scale → ReLU
  3. Pool1: […], stride 2
  4. Fire2: […], […], […] → ReLU
     Fire3: […], […], […] → ReLU
  5. Pool2: […], stride 2
  6. Fire4: […], […], […] → ReLU
     Fire5: […], […], […] → ReLU
     Fire6: […], […], […] → ReLU
  7. Pool3: […], stride 2
  8. Fire7: […], […], […] → ReLU
     Fire8: […], […], […] → ReLU
     Fire9: […], […], […] → ReLU
  9. Pool4: […], stride 2
  10. Fire10: […], […], […] → ReLU
      Fire11: […], […], […] → ReLU
      Fire12: […], […], […] → ReLU
  11. Pool5: […], stride 2
  12. Conv6: […], […] units → ReLU
      Conv7: […], […] units → ReLU
      Conv8: […], 365 units → Softmax

Scale layers (Caffe-type BatchNorm replacements) and ReLU are applied after each Fire or conv module (Qassim et al., 2017).

5. Performance and Empirical Comparison

ResSquVGG16 was trained on MIT Places365-Standard (1.8M images, 365 classes) from scratch, using 4 GTX Titan X GPUs (Caffe + DIGITS), over 50 epochs. Metrics:

| Metric         | VGG16 (fine-tuned) | ResSquVGG16 |
|----------------|--------------------|-------------|
| Training time  | 3d 16h             | 2d 19h      |
| Model size     | 10.6 GB            | 1.23 GB     |
| Top-1 accuracy | 54.00%             | 51.68%      |
| Top-5 accuracy | 84.30%             | 82.04%      |

ResSquVGG16 comes within about 2.3 pp of VGG16 in both Top-1 and Top-5 validation accuracy, while reducing training time by approximately 24% (88 h to 67 h) and storage by approximately 88% (10.6 GB to 1.23 GB) (Qassim et al., 2017, Table 2).
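The relative savings follow from the table by simple arithmetic. The short sketch below recomputes them; the variable names are illustrative and the inputs are the table values only.

```python
def to_hours(days, hours):
    """Convert a 'Nd Mh' duration to hours."""
    return 24 * days + hours


vgg_hours = to_hours(3, 16)   # VGG16 fine-tuned: 3d 16h
res_hours = to_hours(2, 19)   # ResSquVGG16: 2d 19h

time_saving = 100 * (vgg_hours - res_hours) / vgg_hours   # percent
size_saving = 100 * (10.6 - 1.23) / 10.6                  # percent, sizes in GB
top1_gap = 54.00 - 51.68                                  # percentage points
top5_gap = 84.30 - 82.04                                  # percentage points

print(vgg_hours, res_hours)     # 88 67
print(f"{time_saving:.2f}")     # 23.86
print(f"{size_saving:.1f}")     # 88.4
print(f"{top1_gap:.2f}")        # 2.32
print(f"{top5_gap:.2f}")        # 2.26
```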

6. Implications and Contributions

The integration of Fire modules with residual connections in ResSquVGG16 demonstrates that a compressed design can stand in for a heavyweight network such as VGG-16 at a modest accuracy cost. This suggests that further work on deep network design may profitably emphasize parameter efficiency and residual learning, both when training from scratch and in transfer scenarios. The architectural modifications retain the macro-structure of VGG-16, preserving its depth, while using squeeze-and-expand factorization within each layer and shortcut propagation for practical gains in computational and memory efficiency (Qassim et al., 2017).
