VGGNet Architecture and ResSquVGG16 Variant
- The VGGNet architecture is a deep convolutional network built from uniform stacks of small 3×3 filters and max-pooling layers, designed for robust visual recognition.
- The ResSquVGG16 variant replaces standard convolutions with SqueezeNet Fire modules and adds residual connections to reduce model size and training time.
- Both architectures highlight trade-offs between depth and parameter efficiency, providing practical insights for improving large-scale scene recognition.
The VGGNet architecture is a deep convolutional neural network architecture originally developed for visual recognition tasks. VGG-16, a widely adopted instantiation, comprises 13 convolutional layers and 3 fully connected layers, employing a uniform structure of small 3×3 convolution filters and 2×2 max-pooling throughout. Recent advances have sought to address the architecture’s parameter redundancy and training efficiency, leading to variants such as Residual-Squeeze-VGG16 (ResSquVGG16), which integrates parameter-efficient Fire modules adapted from SqueezeNet and residual skip connections to mitigate degradation effects. The ResSquVGG16 model achieves VGG-comparable accuracy on large-scale scene recognition with substantial reductions in model size and training time (Qassim et al., 2017).
1. Architectural Specification of VGG-16
VGG-16, introduced by Simonyan and Zisserman, is defined by a deep design with a fixed scheme of stacking multiple convolutional layers using 3×3 filters—each stride 1, pad 1—and periodic 2×2 max-pooling layers, stride 2. The sequence for a 224×224 RGB input image is as follows:
- Block 1:
  - Conv1_1: 64 filters, 3×3
  - Conv1_2: 64 filters, 3×3
  - MaxPool1: 2×2, stride 2
- Block 2:
  - Conv2_1: 128 filters, 3×3
  - Conv2_2: 128 filters, 3×3
  - MaxPool2: 2×2, stride 2
- Block 3:
  - Conv3_1, Conv3_2, Conv3_3: 256 filters, 3×3
  - MaxPool3: 2×2, stride 2
- Block 4:
  - Conv4_1, Conv4_2, Conv4_3: 512 filters, 3×3
  - MaxPool4: 2×2, stride 2
- Block 5:
  - Conv5_1, Conv5_2, Conv5_3: 512 filters, 3×3
  - MaxPool5: 2×2, stride 2
- Fully Connected:
  - FC6: 4096 units
  - FC7: 4096 units
  - FC8: 1000 units (originally for ImageNet)
The model exhibits a uniform architectural principle, leading to a total parameter count approaching $138$ million ((Qassim et al., 2017), Table 1).
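The $138$ million figure can be reproduced directly from the layer specification above. The following sketch (pure Python; the layer list is the standard VGG-16 configuration, including biases) tallies the parameters:

```python
# Conv layers as (in_channels, out_channels); all kernels are 3x3.
convs = [
    (3, 64), (64, 64),                    # Block 1
    (64, 128), (128, 128),                # Block 2
    (128, 256), (256, 256), (256, 256),   # Block 3
    (256, 512), (512, 512), (512, 512),   # Block 4
    (512, 512), (512, 512), (512, 512),   # Block 5
]
# FC layers as (in_features, out_features); FC6 sees a 7x7x512 feature map.
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# Weights (3*3*cin*cout) plus one bias per output filter/unit.
conv_params = sum(3 * 3 * cin + 1 for cin, cout in convs for _ in range(cout))
fc_params = sum(fin * fout + fout for fin, fout in fcs)
total = conv_params + fc_params

print(f"{total:,}")  # 138,357,544 -- the ~138M quoted above
```

The fully connected layers dominate: FC6 alone contributes over 100M of the 138M parameters, which is the redundancy that the ResSquVGG16 variant targets.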
2. Integration of SqueezeNet Fire Modules
Parameter compression in ResSquVGG16 is achieved by replacing VGG-16’s convolutional (and fully connected) blocks with SqueezeNet Fire modules. The original first convolution is retained but modified to stride 2 with 64 filters. All subsequent conv layers (Conv1_2 through Conv5_3) and FC layers (FC6–FC8) are replaced by 12 Fire modules and 3 conv layers:
- Fire Module (per Iandola et al. [10]):
  - Squeeze: $1 \times 1$ conv, $s_{1\times1}$ filters
  - Expand: parallel $1 \times 1$ conv ($e_{1\times1}$ filters) and $3 \times 3$ conv ($e_{3\times3}$ filters, pad 1), with outputs concatenated along the channel dimension.
The parameter economy is as follows. For an input with $C$ channels, a Fire module uses (ignoring biases)
$$P_{\text{Fire}} = C\, s_{1\times1} + s_{1\times1}\, e_{1\times1} + 9\, s_{1\times1}\, e_{3\times3}$$
weights. Compared to a single $3 \times 3$ conv layer with $9\,C\,N$ parameters ($N$ output filters), the Fire module provides a substantial reduction, with the ratio
$$\frac{P_{\text{Fire}}}{9\,C\,N} = \frac{s_{1\times1}\left(C + e_{1\times1} + 9\, e_{3\times3}\right)}{9\,C\,N} \ll 1$$
whenever the squeeze width satisfies $s_{1\times1} \ll C,\, N$.
Pooling is inserted after each group of 1–3 Fire modules to mirror VGG’s max-pooling schedule. The three conv layers at the tail end substitute for the fully connected layers, tailored here for 365 scene classes (Qassim et al., 2017).
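As a concrete check of the parameter economy, the sketch below compares a Fire module against a plain 3×3 convolution of the same input/output width. The filter counts are illustrative, borrowed from SqueezeNet's fire2-style configuration, not the exact values used in ResSquVGG16:

```python
def fire_params(c_in, s1, e1, e3):
    """Weights in a Fire module: 1x1 squeeze, then parallel 1x1/3x3 expand."""
    squeeze = c_in * s1                # 1x1 conv: c_in -> s1 channels
    expand = s1 * e1 + 9 * s1 * e3     # 1x1 and 3x3 expand branches
    return squeeze + expand            # biases omitted

def conv3x3_params(c_in, n_out):
    """Weights in a single plain 3x3 convolution."""
    return 9 * c_in * n_out

# Illustrative widths: 64 input channels expanded to 128 output channels.
fire = fire_params(64, s1=16, e1=64, e3=64)   # 11,264 weights
plain = conv3x3_params(64, 128)               # 73,728 weights
print(fire, plain, round(fire / plain, 3))    # ratio ~ 0.153
```

With these widths the Fire module needs roughly 15% of the weights of the equivalent plain layer, which is where the overall model-size reduction reported below comes from.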
3. Residual (Shortcut) Connections
To suppress degradation with increased depth, ResSquVGG16 supplements its compressed architecture with shortcut (residual) connections. These are inserted after sequences of two or more consecutive Fire modules without intervening pooling. The residual mapping is:
$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$
If channel dimensions are mismatched, a projection via $W_s$ is performed:
$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + W_s\, \mathbf{x}$$
with $W_s$ implemented as a $1 \times 1$ convolution.
Concretely, four skip connections are instantiated:
| Skip | Source | Destination | Channels | Projection |
|---|---|---|---|---|
| 1 | Pool1 Output | Fire3 Out | 64 | No |
| 2 | Pool2 Output | Fire6 Out | 128 | No |
| 3 | Pool3 Output | Fire9 Out | 256 | No |
| 4 | Pool4 Output | Fire12 Out | 512 | No |
All are element-wise additions (Qassim et al., 2017).
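The shortcut logic can be sketched as follows (pure Python over per-position channel vectors; `block` is a hypothetical placeholder standing in for the Fire-module sequence $\mathcal{F}$):

```python
def shortcut_add(x, block, w_proj=None):
    """y = F(x) + x, or y = F(x) + W_s x when channel counts differ.

    x: list of channel activations at one spatial position.
    block: callable implementing the residual branch F.
    w_proj: optional 1x1-projection matrix, one row per output channel.
    """
    fx = block(x)
    identity = x if w_proj is None else [
        sum(w * xi for w, xi in zip(row, x)) for row in w_proj
    ]
    if len(fx) != len(identity):
        raise ValueError("channel mismatch: projection W_s required")
    return [a + b for a, b in zip(fx, identity)]

# Matching channels: plain element-wise addition, as in skips 1-4 above.
print(shortcut_add([1.0, 2.0], block=lambda v: [10.0, 20.0]))  # [11.0, 22.0]

# Mismatched channels: a 1x1 projection maps 2 -> 3 channels first.
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = shortcut_add([1.0, 2.0], block=lambda v: [0.0, 0.0, 0.0], w_proj=w)
print(y)  # [1.0, 2.0, 3.0]
```

Since all four skips in the table above connect equal channel counts, the identity (no-projection) path is the only one exercised in ResSquVGG16.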
4. Complete Layer-Wise Layout: ResSquVGG16
The composition of ResSquVGG16 is as follows:
- Conv1: stride 2, $64$ filters → ReLU
- Fire1: Fire module → Scale → ReLU
- Pool1: $2 \times 2$, stride 2
- Fire2: Fire module → ReLU
- Fire3: Fire module → ReLU
- Pool2: $2 \times 2$, stride 2
- Fire4: Fire module → ReLU
- Fire5: Fire module → ReLU
- Fire6: Fire module → ReLU
- Pool3: $2 \times 2$, stride 2
- Fire7: Fire module → ReLU
- Fire8: Fire module → ReLU
- Fire9: Fire module → ReLU
- Pool4: $2 \times 2$, stride 2
- Fire10: Fire module → ReLU
- Fire11: Fire module → ReLU
- Fire12: Fire module → ReLU
- Pool5: $2 \times 2$, stride 2
- Conv6: $4096$ filters → ReLU
- Conv7: $4096$ filters → ReLU
- Conv8: $365$ filters → Softmax
Scale layers (Caffe-type BatchNorm replacements) and ReLU are applied after each Fire or conv module (Qassim et al., 2017).
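The Scale layer mentioned above applies a learnable per-channel multiplier and bias, which Caffe uses to complete its BatchNorm (whose own transform is affine-free). A minimal sketch, with hypothetical values:

```python
def scale_layer(feature_maps, gamma, beta):
    """Caffe-style Scale: y_c = gamma_c * x_c + beta_c, per channel c.

    feature_maps: list (one entry per channel) of flat activation lists.
    gamma, beta: learned per-channel multiplier and bias.
    """
    return [
        [g * v + b for v in channel]
        for channel, g, b in zip(feature_maps, gamma, beta)
    ]

# Two channels, three spatial positions each (hypothetical activations).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = scale_layer(x, gamma=[2.0, 0.5], beta=[0.0, 1.0])
print(out)  # [[2.0, 4.0, 6.0], [3.0, 3.5, 4.0]]
```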
5. Performance and Empirical Comparison
ResSquVGG16 was trained from scratch on MIT Places365-Standard (1.8M images, 365 classes) using 4 GTX Titan X GPUs (Caffe + DIGITS) for 50 epochs. Metrics:
| Metric | VGG16 Fine-tuned | ResSquVGG16 |
|---|---|---|
| Training Time | 3d 16h | 2d 19h |
| Model Size | 10.6 GB | 1.23 GB |
| Top-1 Accuracy | 54.00% | 51.68% |
| Top-5 Accuracy | 84.30% | 82.04% |
ResSquVGG16 matches VGG16 within approximately 2.3 percentage points in both Top-1 and Top-5 validation accuracy, while reducing training time by approximately 24% and model size by approximately 88% ((Qassim et al., 2017), Table 2).
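The reduction figures follow directly from the table above; a quick check, with training times converted to hours:

```python
# Training time: 3d 16h vs 2d 19h; model sizes in GB (table values above).
vgg_hours = 3 * 24 + 16          # 88 h
res_hours = 2 * 24 + 19          # 67 h
time_reduction = 1 - res_hours / vgg_hours
size_reduction = 1 - 1.23 / 10.6

print(f"training time: -{time_reduction:.1%}")  # -23.9%
print(f"model size:    -{size_reduction:.1%}")  # -88.4%
```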
6. Implications and Contributions
The integration of Fire modules with residual connections in ResSquVGG16 demonstrates that compression strategies can replace large-weight networks such as VGG-16 while incurring minimal performance loss. This suggests further exploration in deep network design may emphasize parameter efficiency and residual learning, both in supervised training from scratch and transfer scenarios. The architectural modifications retain the internal macro-structure of VGG-16, preserving its depth, while exploiting sub-layer parameter sharing and shortcut propagation for practical gains in computational and memory efficiency (Qassim et al., 2017).