VGGNet Architecture and ResSquVGG16 Variant
- The VGGNet architecture is a deep convolutional network built from uniform stacks of small 3×3 filters and max-pooling layers, designed for robust visual recognition.
- The ResSquVGG16 variant replaces standard convolutions with SqueezeNet Fire modules and adds residual connections to reduce model size and training time.
- Both architectures highlight trade-offs between depth and parameter efficiency, providing practical insights for improving large-scale scene recognition.
The VGGNet architecture is a deep convolutional neural network architecture originally developed for visual recognition tasks. VGG-16, a widely adopted instantiation, comprises 13 convolutional layers and 3 fully connected layers, employing a uniform structure of small 3×3 convolution filters and 2×2 max-pooling throughout. Recent advances have sought to address the architecture’s parameter redundancy and training efficiency, leading to variants such as Residual-Squeeze-VGG16 (ResSquVGG16), which integrates parameter-efficient Fire modules adapted from SqueezeNet and residual skip connections to mitigate degradation effects. The ResSquVGG16 model achieves VGG-comparable accuracy on large-scale scene recognition with substantial reductions in model size and training time (Qassim et al., 2017).
1. Architectural Specification of VGG-16
VGG-16, introduced by Simonyan and Zisserman, is defined by a deep design with a fixed scheme of stacking multiple convolutional layers using 3×3 filters—each stride 1, pad 1—and periodic 2×2 max-pooling layers, stride 2. The sequence for a 224×224 RGB input image is as follows:
- Block 1:
  - Conv1_1: 64 filters, 3×3
  - Conv1_2: 64 filters, 3×3
  - MaxPool1: 2×2, stride 2
- Block 2:
  - Conv2_1: 128 filters, 3×3
  - Conv2_2: 128 filters, 3×3
  - MaxPool2: 2×2, stride 2
- Block 3:
  - Conv3_1, Conv3_2, Conv3_3: 256 filters, 3×3
  - MaxPool3: 2×2, stride 2
- Block 4:
  - Conv4_1, Conv4_2, Conv4_3: 512 filters, 3×3
  - MaxPool4: 2×2, stride 2
- Block 5:
  - Conv5_1, Conv5_2, Conv5_3: 512 filters, 3×3
  - MaxPool5: 2×2, stride 2
- Fully Connected:
  - FC6: 4096 units
  - FC7: 4096 units
  - FC8: 1000 units (originally for ImageNet)
The model exhibits a uniform architectural principle, leading to a total parameter count approaching $138$ million ((Qassim et al., 2017), Table 1).
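The $138$ million figure can be reproduced directly from the layer specification above. The following sketch (pure Python; the layer list is the standard VGG-16 configuration, including biases) tallies the parameters:

```python
# Conv layers as (in_channels, out_channels); all kernels are 3x3.
convs = [
    (3, 64), (64, 64),                    # Block 1
    (64, 128), (128, 128),                # Block 2
    (128, 256), (256, 256), (256, 256),   # Block 3
    (256, 512), (512, 512), (512, 512),   # Block 4
    (512, 512), (512, 512), (512, 512),   # Block 5
]
# FC layers as (in_features, out_features); FC6 sees a 7x7x512 feature map.
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# Weights (3*3*cin*cout) plus one bias per output filter/unit.
conv_params = sum(3 * 3 * cin + 1 for cin, cout in convs for _ in range(cout))
fc_params = sum(fin * fout + fout for fin, fout in fcs)
total = conv_params + fc_params

print(f"{total:,}")  # 138,357,544 -- the ~138M quoted above
```

The fully connected layers dominate: FC6 alone contributes over 100M of the 138M parameters, which is the redundancy that the ResSquVGG16 variant targets.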
2. Integration of SqueezeNet Fire Modules
Parameter compression in ResSquVGG16 is achieved by replacing VGG-16’s convolutional (and fully connected) blocks with SqueezeNet Fire modules. The original first convolution is retained but modified to stride 2 with 64 filters. All subsequent conv layers (Conv1_2 through Conv5_3) and FC layers (FC6–FC8) are replaced by 12 Fire modules and 3 conv layers:
- Fire Module (per Iandola et al. [10]):
  - Squeeze: $1 \times 1$ conv, $s_{1\times1}$ filters
  - Expand: parallel $1 \times 1$ conv ($e_{1\times1}$ filters) and $3 \times 3$ conv ($e_{3\times3}$ filters, pad 1), with outputs concatenated along the channel dimension.
The parameter economy is as follows. For an input with $C$ channels, a Fire module uses (ignoring biases)
$$P_{\text{Fire}} = C\, s_{1\times1} + s_{1\times1}\, e_{1\times1} + 9\, s_{1\times1}\, e_{3\times3}$$
weights. Compared to a single $3 \times 3$ conv layer with $9\,C\,N$ parameters ($N$ output filters), the Fire module provides a substantial reduction, with the ratio
$$\frac{P_{\text{Fire}}}{9\,C\,N} = \frac{s_{1\times1}\left(C + e_{1\times1} + 9\, e_{3\times3}\right)}{9\,C\,N} \ll 1$$
whenever the squeeze width satisfies $s_{1\times1} \ll C,\, N$.
Pooling is inserted after each group of 1–3 Fire modules to mirror VGG’s max-pooling schedule. The three conv layers at the tail end substitute for the fully connected layers, tailored here for 365 scene classes (Qassim et al., 2017).
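As a concrete check of the parameter economy, the sketch below compares a Fire module against a plain 3×3 convolution of the same input/output width. The filter counts are illustrative, borrowed from SqueezeNet's fire2-style configuration, not the exact values used in ResSquVGG16:

```python
def fire_params(c_in, s1, e1, e3):
    """Weights in a Fire module: 1x1 squeeze, then parallel 1x1/3x3 expand."""
    squeeze = c_in * s1                # 1x1 conv: c_in -> s1 channels
    expand = s1 * e1 + 9 * s1 * e3     # 1x1 and 3x3 expand branches
    return squeeze + expand            # biases omitted

def conv3x3_params(c_in, n_out):
    """Weights in a single plain 3x3 convolution."""
    return 9 * c_in * n_out

# Illustrative widths: 64 input channels expanded to 128 output channels.
fire = fire_params(64, s1=16, e1=64, e3=64)   # 11,264 weights
plain = conv3x3_params(64, 128)               # 73,728 weights
print(fire, plain, round(fire / plain, 3))    # ratio ~ 0.153
```

With these widths the Fire module needs roughly 15% of the weights of the equivalent plain layer, which is where the overall model-size reduction reported below comes from.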
3. Residual (Shortcut) Connections
To suppress degradation with increased depth, ResSquVGG16 supplements its compressed architecture with shortcut (residual) connections. These are inserted after sequences of two or more consecutive Fire modules without intervening pooling. The residual mapping is:
$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$
If channel dimensions are mismatched, a projection via $W_s$ is performed:
$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + W_s\, \mathbf{x}$$
with $W_s$ implemented as a $1 \times 1$ convolution.
Concretely, four skip connections are instantiated:
| Skip | Source | Destination | Channels | Projection |
|---|---|---|---|---|
| 1 | Pool1 Output | Fire3 Out | 64 | No |
| 2 | Pool2 Output | Fire6 Out | 128 | No |
| 3 | Pool3 Output | Fire9 Out | 256 | No |
| 4 | Pool4 Output | Fire12 Out | 512 | No |
All are element-wise additions (Qassim et al., 2017).
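The shortcut logic can be sketched as follows (pure Python over per-position channel vectors; `block` is a hypothetical placeholder standing in for the Fire-module sequence $\mathcal{F}$):

```python
def shortcut_add(x, block, w_proj=None):
    """y = F(x) + x, or y = F(x) + W_s x when channel counts differ.

    x: list of channel activations at one spatial position.
    block: callable implementing the residual branch F.
    w_proj: optional 1x1-projection matrix, one row per output channel.
    """
    fx = block(x)
    identity = x if w_proj is None else [
        sum(w * xi for w, xi in zip(row, x)) for row in w_proj
    ]
    if len(fx) != len(identity):
        raise ValueError("channel mismatch: projection W_s required")
    return [a + b for a, b in zip(fx, identity)]

# Matching channels: plain element-wise addition, as in skips 1-4 above.
print(shortcut_add([1.0, 2.0], block=lambda v: [10.0, 20.0]))  # [11.0, 22.0]

# Mismatched channels: a 1x1 projection maps 2 -> 3 channels first.
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = shortcut_add([1.0, 2.0], block=lambda v: [0.0, 0.0, 0.0], w_proj=w)
print(y)  # [1.0, 2.0, 3.0]
```

Since all four skips in the table above connect equal channel counts, the identity (no-projection) path is the only one exercised in ResSquVGG16.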
4. Complete Layer-Wise Layout: ResSquVGG16
The composition of ResSquVGG16 is as follows:
- Conv1: stride 2, $64$ filters → ReLU
- Fire1: Fire module → Scale → ReLU
- Pool1: $2 \times 2$, stride 2
- Fire2: Fire module → ReLU
- Fire3: Fire module → ReLU
- Pool2: $2 \times 2$, stride 2
- Fire4: Fire module → ReLU
- Fire5: Fire module → ReLU
- Fire6: Fire module → ReLU
- Pool3: $2 \times 2$, stride 2
- Fire7: Fire module → ReLU
- Fire8: Fire module → ReLU
- Fire9: Fire module → ReLU
- Pool4: $2 \times 2$, stride 2
- Fire10: Fire module → ReLU
- Fire11: Fire module → ReLU
- Fire12: Fire module → ReLU
- Pool5: $2 \times 2$, stride 2
- Conv6: $4096$ filters → ReLU
- Conv7: $4096$ filters → ReLU
- Conv8: $365$ filters → Softmax
Scale layers (Caffe-type BatchNorm replacements) and ReLU are applied after each Fire or conv module (Qassim et al., 2017).
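The Scale layer mentioned above applies a learnable per-channel multiplier and bias, which Caffe uses to complete its BatchNorm (whose own transform is affine-free). A minimal sketch, with hypothetical values:

```python
def scale_layer(feature_maps, gamma, beta):
    """Caffe-style Scale: y_c = gamma_c * x_c + beta_c, per channel c.

    feature_maps: list (one entry per channel) of flat activation lists.
    gamma, beta: learned per-channel multiplier and bias.
    """
    return [
        [g * v + b for v in channel]
        for channel, g, b in zip(feature_maps, gamma, beta)
    ]

# Two channels, three spatial positions each (hypothetical activations).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = scale_layer(x, gamma=[2.0, 0.5], beta=[0.0, 1.0])
print(out)  # [[2.0, 4.0, 6.0], [3.0, 3.5, 4.0]]
```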
5. Performance and Empirical Comparison
ResSquVGG16 was trained from scratch on MIT Places365-Standard (1.8M images, 365 classes) using 4 GTX Titan X GPUs (Caffe + DIGITS) for 50 epochs. Metrics:
| Metric | VGG16 Fine-tuned | ResSquVGG16 |
|---|---|---|
| Training Time | 3d 16h | 2d 19h |
| Model Size | 10.6 GB | 1.23 GB |
| Top-1 Accuracy | 54.00% | 51.68% |
| Top-5 Accuracy | 84.30% | 82.04% |
ResSquVGG16 matches VGG16 within approximately 2.3 percentage points in both Top-1 and Top-5 validation accuracy, while reducing training time by approximately 24% and model size by approximately 88% ((Qassim et al., 2017), Table 2).
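The reduction figures follow directly from the table above; a quick check, with training times converted to hours:

```python
# Training time: 3d 16h vs 2d 19h; model sizes in GB (table values above).
vgg_hours = 3 * 24 + 16          # 88 h
res_hours = 2 * 24 + 19          # 67 h
time_reduction = 1 - res_hours / vgg_hours
size_reduction = 1 - 1.23 / 10.6

print(f"training time: -{time_reduction:.1%}")  # -23.9%
print(f"model size:    -{size_reduction:.1%}")  # -88.4%
```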
6. Implications and Contributions
The integration of Fire modules with residual connections in ResSquVGG16 demonstrates that compression strategies can replace large-weight networks such as VGG-16 while incurring minimal performance loss. This suggests further exploration in deep network design may emphasize parameter efficiency and residual learning, both in supervised training from scratch and transfer scenarios. The architectural modifications retain the internal macro-structure of VGG-16, preserving its depth, while exploiting sub-layer parameter sharing and shortcut propagation for practical gains in computational and memory efficiency (Qassim et al., 2017).