VGG-16 Net: Deep Convolutional Architecture
- VGG-16 is a deep convolutional neural network featuring 13 convolutional and 3 fully connected layers, designed for hierarchical feature extraction in image recognition.
- Its architecture relies on homogeneous 3x3 filters and max-pooling to systematically expand the receptive field while preserving spatial resolution.
- Adaptations like converting fully connected layers to convolutional ones and implementing local attention regularization enhance its performance for segmentation and scene recognition tasks.
VGG-16 Net is a deep convolutional neural network architecture originally proposed for large-scale image recognition tasks and subsequently adapted for scene classification and specialized segmentation applications. Characterized by its homogeneous stacking of small ($3 \times 3$) convolutional filters, consistent use of max-pooling for spatial downsampling, and a deeply layered structure, VGG-16 demonstrates high capacity for hierarchical feature extraction. Its baseline configuration includes thirteen convolutional layers followed by three fully connected layers, totaling approximately 138 million parameters. VGG-16 has served as a backbone for numerous image recognition pipelines and has been modified for tasks requiring fine-grained spatial inference, such as lesion boundary segmentation and domain-specific scene recognition (Wang et al., 2015; Wen et al., 2018).
1. Layerwise Architecture and Parameterization
The canonical VGG-16 framework comprises a fixed, blockwise sequence of convolutional and max-pooling operations:
- Input: $224 \times 224$ RGB image (after cropping and mean subtraction).
- Block 1: Two $3 \times 3$ convolutions with 64 filters each, stride 1, padding 1, followed by $2 \times 2$ max-pooling (output: $112 \times 112 \times 64$).
- Block 2: Two $3 \times 3$ convolutions, 128 filters each, followed by pooling ($56 \times 56 \times 128$).
- Block 3: Three $3 \times 3$ convolutions, 256 filters each, followed by pooling ($28 \times 28 \times 256$).
- Block 4: Three $3 \times 3$ convolutions, 512 filters each, followed by pooling ($14 \times 14 \times 512$).
- Block 5: Three $3 \times 3$ convolutions, 512 filters each, followed by pooling ($7 \times 7 \times 512$).
- Classifier Head: Three fully connected layers—fc6 (4096), fc7 (4096), and fc8 (label count, e.g., 1000 for ImageNet, 205 for Places205)—with ReLU activations following fc6 and fc7; dropout (0.5) applied to fc6 and fc7.
Total parameter count is approximately 138 million, dominated by the fully connected layers. All convolutions employ stride-one and padding-one to maintain spatial dimensionality, and ReLU activations are used throughout (Wang et al., 2015).
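As a consistency check on the figures above, the parameter count can be reproduced layer by layer. The sketch below is a minimal pure-Python tally; `vgg16_param_count` is an illustrative helper, not a library function, with the layer configuration taken from the block list:

```python
# Layer-by-layer parameter count for canonical VGG-16 (3x3 convs, biases included).
CONV_BLOCKS = [  # (number of conv layers, output channels) per block
    (2, 64), (2, 128), (3, 256), (3, 512), (3, 512),
]

def vgg16_param_count(num_classes=1000):
    total, in_ch = 0, 3  # RGB input has 3 channels
    for n_layers, out_ch in CONV_BLOCKS:
        for _ in range(n_layers):
            total += 3 * 3 * in_ch * out_ch + out_ch  # 3x3 kernel weights + bias
            in_ch = out_ch
    # After five 2x2 poolings, a 224x224 input yields a 7x7x512 feature map,
    # which is flattened and fed to fc6 -> fc7 -> fc8.
    fc_dims = [7 * 7 * 512, 4096, 4096, num_classes]
    for d_in, d_out in zip(fc_dims, fc_dims[1:]):
        total += d_in * d_out + d_out  # weights + bias
    return total

print(vgg16_param_count())  # 138357544 parameters for 1000 classes
```

The fully connected layers indeed dominate: fc6 alone contributes roughly 103 million of the ~138 million parameters.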
2. Convolutional Formulation and Receptive Field Analysis
The standard VGG convolutional operation for output feature map channel $c$ at position $(i, j)$ is:

$$y_c(i, j) = \mathrm{ReLU}\!\left( b_c + \sum_{k} \sum_{m=-1}^{1} \sum_{n=-1}^{1} w_{c,k}(m, n)\, x_k(i + m,\, j + n) \right)$$

where $w_{c,k}$ denotes the kernel weights, $b_c$ the bias for channel $c$, and $x_k$ the input activation patch over input channels $k$. Stacking multiple $3 \times 3$ convolutions facilitates gradual growth of the receptive field while preserving spatial resolution before pooling. Following five blocks, each output feature unit corresponds to a contextual region of $212 \times 212$ pixels, nearly spanning the entire $224 \times 224$ input, ensuring that the deep network encodes global context critical for complex visual tasks (Wang et al., 2015).
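Receptive-field growth follows a simple recurrence over kernel sizes and strides: $r_{\text{out}} = r_{\text{in}} + (k - 1)\,j$, where $j$ is the cumulative stride. A pure-Python sketch (names illustrative) reproduces the pool5 figure:

```python
# (kernel, stride) per layer: 3x3/1 convs and 2x2/2 pools, per the block list above.
LAYERS = (
    [(3, 1)] * 2 + [(2, 2)] +   # block 1
    [(3, 1)] * 2 + [(2, 2)] +   # block 2
    [(3, 1)] * 3 + [(2, 2)] +   # block 3
    [(3, 1)] * 3 + [(2, 2)] +   # block 4
    [(3, 1)] * 3 + [(2, 2)]     # block 5
)

def receptive_field(layers):
    r, j = 1, 1  # a single input pixel sees itself; cumulative stride 1
    for k, s in layers:
        r += (k - 1) * j  # each layer widens the field by (k-1) input-space steps
        j *= s            # strided layers multiply the step size
    return r

print(receptive_field(LAYERS))  # 212: each pool5 unit sees a 212x212 input region
```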
3. Training Regimen and Data Augmentation Protocols
VGG-16 models are trained with mini-batch stochastic gradient descent (SGD) with momentum ($0.9$), weight decay ($0.0005$), and dropout ($0.5$) on fully connected layers fc6 and fc7. Input images are resized to $256 \times 256$, then randomly cropped to various square sizes ($256$, $224$, $198$, $168$) and subsequently rescaled to $224 \times 224$ as input. Corner cropping (center and four corners plus their horizontal flips, totaling 10 views) is employed for test-time augmentation. Training exploits multi-GPU distribution (batch size 256, 64 per GPU); learning rates are initialized at $0.01$ and reduced by a factor of $10$ on a fixed iteration schedule, converging in approximately two weeks for large datasets such as Places205 (Wang et al., 2015).
For segmentation applications, such as lesion boundary detection, further augmentations include horizontal/vertical flips, random brightness and contrast adjustment, additive Gaussian noise, and resizing/cropping. Input normalization consists of per-pixel RGB mean subtraction (ImageNet mean) (Wen et al., 2018).
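The ten-view corner-cropping scheme above can be sketched as coordinate generation alone; `ten_crop_boxes` is a hypothetical helper (a real pipeline would hand these boxes and flip flags to an image library):

```python
# Ten-view test-time augmentation: center crop, four corner crops, and the
# horizontal flip of each. Returns (left, top, right, bottom) boxes plus a
# flip flag; image and crop sizes are parameters, not fixed here.
def ten_crop_boxes(img_w, img_h, crop):
    corners = [
        (0, 0), (img_w - crop, 0),                    # top-left, top-right
        (0, img_h - crop), (img_w - crop, img_h - crop),  # bottom corners
        ((img_w - crop) // 2, (img_h - crop) // 2),   # center
    ]
    views = []
    for left, top in corners:
        box = (left, top, left + crop, top + crop)
        views.append((box, False))  # original orientation
        views.append((box, True))   # horizontally flipped
    return views

views = ten_crop_boxes(256, 256, 224)
print(len(views))  # 10 views per test image
```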
4. Architectural Modifications for Segmentation
For dense prediction tasks, VGG-16 undergoes structural adaptation:
- The three fully connected layers are removed, converting the network to a fully convolutional pipeline.
- After pool5, additional convolutional layers are appended (number and channel count not always explicitly specified), each with ReLU activation.
- The terminal convolution produces a heatmap with two output channels (e.g., lesion/background in medical segmentation).
- A bilinear upsampling operation restores the resolution to match the input grid.
- If $F$ denotes the pool5 feature map, the segmentation head can be abstracted as $\hat{Y} = \mathrm{Upsample}\big(\mathrm{Conv}(F)\big)$, where $\mathrm{Conv}$ is the appended convolutional stack and $\mathrm{Upsample}$ the bilinear interpolation to input resolution.
Pretraining is typically performed on ImageNet, with domain-specific fine-tuning for segmentation heads and box-based regularization modules initialized at random (Wen et al., 2018).
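The bilinear upsampling step can be illustrated in isolation. The sketch below operates on a single-channel score map for brevity and assumes the align-corners interpolation convention, which the source does not specify; real pipelines would use a library routine:

```python
# Minimal bilinear upsampling of a 2-D score map (align_corners-style:
# the corner values of input and output coincide).
def bilinear_upsample(grid, out_h, out_w):
    in_h, in_w = len(grid), len(grid[0])
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        y0 = min(int(y), in_h - 2)  # top row of the interpolation cell
        wy = y - y0                 # vertical blend weight
        row = []
        for j in range(out_w):
            x = j * sx
            x0 = min(int(x), in_w - 2)  # left column of the cell
            wx = x - x0                 # horizontal blend weight
            row.append((1 - wy) * (1 - wx) * grid[y0][x0]
                       + (1 - wy) * wx * grid[y0][x0 + 1]
                       + wy * (1 - wx) * grid[y0 + 1][x0]
                       + wy * wx * grid[y0 + 1][x0 + 1])
        out.append(row)
    return out

up = bilinear_upsample([[0.0, 1.0], [1.0, 0.0]], 4, 4)
```

In the actual pipeline this operation would be applied per channel to the two-channel heatmap, scaling the $7 \times 7$-grid-level predictions back to input resolution.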
5. Local Attention Regularization via Box Sampling
For enhanced robustness in segmentation, local attention is enforced through box-based regularization:
- For each training image, axis-aligned boxes are sampled at random scales and locations.
- For each box, compute the proportion $p$ of ground-truth lesion pixels it contains.
- Boxes with $p$ above an upper threshold are labeled positive; boxes with $p$ below a lower threshold are background; all others are discarded.
- For each retained box, crop the corresponding feature map region, apply a small convolutional head that outputs a two-channel score map (segmentation probabilities), and upsample to the original grid resolution.
- Define local cross-entropy losses $\mathcal{L}_{+}$ and $\mathcal{L}_{-}$ over the pixels of positive and background boxes, respectively.
- The global pixelwise cross-entropy loss over the full image is $\mathcal{L}_{\mathrm{global}} = -\sum_{i} \sum_{c} y_{i,c} \log \hat{y}_{i,c}$, where $y_{i,c}$ is the ground-truth label and $\hat{y}_{i,c}$ the predicted probability for pixel $i$ and class $c$.
- The final loss is a weighted sum: $\mathcal{L} = \mathcal{L}_{\mathrm{global}} + \lambda\,(\mathcal{L}_{+} + \mathcal{L}_{-})$, with $\lambda$ typically set to $0.1$.
This regularization increases training data diversity and model robustness, focusing learning on strongly discriminative regions (Wen et al., 2018).
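The box-labeling rule and weighted loss combination can be sketched as follows; the threshold values `tau_pos` and `tau_neg` are illustrative assumptions, since the source does not fix them numerically:

```python
def lesion_proportion(mask, box):
    """Proportion of ground-truth lesion pixels inside an axis-aligned box."""
    left, top, right, bottom = box
    pixels = [mask[y][x] for y in range(top, bottom) for x in range(left, right)]
    return sum(pixels) / len(pixels)

def label_boxes(mask, boxes, tau_pos=0.8, tau_neg=0.2):
    """Split sampled boxes into positive / background; discard ambiguous ones.
    tau_pos and tau_neg are assumed values, not taken from the source."""
    positive, background = [], []
    for box in boxes:
        p = lesion_proportion(mask, box)
        if p >= tau_pos:
            positive.append(box)    # strongly lesion-covered
        elif p <= tau_neg:
            background.append(box)  # essentially lesion-free
        # boxes with tau_neg < p < tau_pos are discarded
    return positive, background

def total_loss(global_ce, pos_ce, neg_ce, lam=0.1):
    # L = L_global + lambda * (L_pos + L_neg), lambda typically 0.1
    return global_ce + lam * (pos_ce + neg_ce)
```

With a mask whose lesion occupies the top-left quadrant, a box inside that quadrant is labeled positive, a box in the opposite quadrant background, and a box spanning the whole image ($p = 0.25$) is discarded.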
6. Empirical Performance and Domain Transfer
When retrained for scene recognition on the Places205 dataset ($2.5$M images), VGG-16 achieves strong top-1 and top-5 accuracy. When transferred to the MIT67 and SUN397 scene benchmarks using L2-normalized fc6 features and a linear SVM, it likewise attains competitive accuracy on both.
These results substantially outperform ImageNet-only VGG-16 models, confirming that scene-centric pretraining on Places205 adapts both early and deep features to the layout and texture cues critical in scene analysis. On segmentation tasks (e.g., ISIC 2018 Task 1), quantitative metrics for the VGG-16 branch alone are not reported, but qualitative overlays and model robustness are documented, and joint models integrating VGG-16 with complementary branches yield stronger results (Wang et al., 2015; Wen et al., 2018).
7. Significance and Variants
VGG-16’s deep, homogeneous architecture and large parameter count facilitate high-capacity feature representations. Scene-adapted training and fully convolutional conversion extend its applicability beyond object classification, and the introduction of local attention regularization (box-based sampling) further demonstrates its adaptability to fine-grained spatial inference tasks. A plausible implication is that, for novel domains (medical, remote sensing, dense prediction), removal of the classifier head and augmentation with segmentation-specific regularization is essential for competitive performance. Results show that architectural depth yields only marginal improvement without extensive domain adaptation and augmentation (Wang et al., 2015; Wen et al., 2018).