
Stylized-ImageNet: Reducing Texture Bias

Updated 11 December 2025
  • The paper introduces Stylized-ImageNet (SIN), a large-scale dataset that replaces natural textures with artistic styles to force CNNs to rely on global shape cues.
  • SIN employs AdaIN-based style transfer to modify each image, boosting ResNet-50's shape bias from 22% to 81% and improving resistance to domain shifts.
  • While SIN enhances robustness and transferability, it also raises concerns about reduced image fidelity and limited style diversity compared to natural images.

Stylized-ImageNet (SIN) is a large-scale, stylized variant of the canonical ImageNet dataset constructed with the explicit goal of reducing the texture bias prevalent in convolutional neural networks (CNNs) trained on natural images. Developed originally in the context of probing and remediating the tendency of standard architectures such as ResNet-50 to recognize objects based primarily on local texture rather than global shape, SIN replaces the texture of every ImageNet image with the style statistics of an artistic painting but preserves the global shape content. This construction compels CNNs to rely on shape cues for classification, promoting human-like invariances and yielding improved robustness to domain shift and image corruptions (Geirhos et al., 2018, Chun et al., 2021). SIN has catalyzed a new line of research into architectural and data-driven de-biasing strategies.

1. The Texture Bias Problem in ImageNet-Trained CNNs

ImageNet-trained CNNs, despite their high performance, are empirically "texture-biased": they learn to recognize fine-grained local image statistics over more global, shape-related descriptors. This contrasts sharply with human vision, which exhibits a shape bias in object categorization. Geirhos et al. quantitatively demonstrated this phenomenon by presenting both humans and CNNs with cue-conflict images in which shape and texture imply mutually exclusive categories. Standard ResNet-50 models trained on ImageNet exhibit a mere 22% shape bias versus 77.9% texture bias, whereas humans attain approximately 95.9% shape bias. This “shortcut learning” by neural networks undermines robustness, transfer performance, and domain adaptation capabilities.

2. Construction of Stylized-ImageNet

The central insight underpinning SIN is that removing reliable texture cues during supervised training will force modern CNN architectures to rely on global shape information. SIN is generated by applying AdaIN (Adaptive Instance Normalization)—a fast style-transfer algorithm—to every image in the ImageNet dataset. The stylization process proceeds as follows:

  • For each ImageNet content image $I_c$, a random painting $I_s$ from the "Painter by Numbers" Kaggle dataset (≈79,434 images) is sampled as the style reference.
  • Both images are encoded using a fixed VGG-19 encoder $E$ to obtain feature maps.
  • Channel-wise means and standard deviations are aligned via the AdaIN operator:

$$T^l = \mathrm{AdaIN}(F_c^l, F_s^l) = \sigma(F_s^l)\,\frac{F_c^l - \mu(F_c^l)}{\sigma(F_c^l)} + \mu(F_s^l),$$

where $F_c^l$ and $F_s^l$ denote the layer-$l$ features of the content and style images, and $\mu(\cdot)$ and $\sigma(\cdot)$ are computed per channel over spatial locations.

  • The result is decoded by a learned decoder $D$, trained to reconstruct images that preserve feature-space content but adopt the style statistics of the painting.
  • A mixing coefficient $\alpha = 1.0$ is used in blending, yielding full adoption of the painting's style statistics.
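The AdaIN step above can be sketched with NumPy, treating a feature map as a (channels, height, width) array. This is an illustrative re-implementation of the operator as described, not the authors' code; the `eps` term is an assumed numerical-stability detail.

```python
import numpy as np

def adain(f_c, f_s, alpha=1.0, eps=1e-5):
    """Align the channel-wise mean/std of content features f_c to those of
    style features f_s, then blend with the original via alpha.

    f_c, f_s: arrays of shape (C, H, W); spatial sizes may differ.
    alpha: mixing coefficient (1.0 = fully adopt style statistics).
    """
    mu_c = f_c.mean(axis=(1, 2), keepdims=True)
    sigma_c = f_c.std(axis=(1, 2), keepdims=True)
    mu_s = f_s.mean(axis=(1, 2), keepdims=True)
    sigma_s = f_s.std(axis=(1, 2), keepdims=True)
    # Normalize content statistics per channel, then re-scale/shift to style.
    t = sigma_s * (f_c - mu_c) / (sigma_c + eps) + mu_s
    return alpha * t + (1.0 - alpha) * f_c
```

With $\alpha = 1.0$ (the SIN setting), each channel of the output adopts the style image's mean and standard deviation exactly, while the spatial arrangement (shape content) of $F_c^l$ is preserved.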

Every image in the SIN dataset is stylized exactly once, leading to ≈1.28 million stylized training images and 50,000 validation images, distributed over the standard ImageNet-1k classes (Geirhos et al., 2018).

3. Training Protocols and Evaluation

The canonical protocol for SIN involves training a standard ImageNet architecture (e.g., ResNet-50) from an ImageNet-pretrained checkpoint using the SIN dataset. Key hyperparameters are as follows:

  • Optimizer: SGD with momentum 0.9, weight decay $1 \times 10^{-4}$.
  • Initial learning rate: 0.1, decaying at epochs 20, 40 (for 60 total epochs).
  • Batch size: 256.
  • "SIN + IN" protocol: Combine SIN and original ImageNet data (~2.56M images), train for 45 epochs, then optionally fine-tune on IN.

Performance is evaluated on both stylized and original images. The principal metric for bias is the shape-bias score:

$$\mathrm{ShapeBias} = \frac{N_\text{shape}}{N_\text{shape} + N_\text{texture}},$$

where $N_\text{shape}$ counts samples classified according to shape and $N_\text{texture}$ according to texture, when shape–texture cues are in conflict (Geirhos et al., 2018).
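On cue-conflict images the score can be computed directly from per-sample predictions. A minimal sketch, assuming each sample records its shape label, its texture label, and the model's prediction; predictions matching neither cue are excluded from the ratio, consistent with the definition above.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape-consistent decisions among all decisions that
    matched either the shape or the texture cue (other errors ignored)."""
    n_shape = n_texture = 0
    for pred, s, t in zip(predictions, shape_labels, texture_labels):
        if pred == s:
            n_shape += 1
        elif pred == t:
            n_texture += 1
    if n_shape + n_texture == 0:
        return 0.0
    return n_shape / (n_shape + n_texture)
```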

4. Empirical Findings: Impact of SIN on CNN Representations

Training on SIN substantially alters the representational bias of CNNs:

  • Shape bias of ResNet-50 after SIN training increases from 22% to 81%, approaching human-level values.
  • Classic ResNet-50 trained solely on ImageNet achieves 76.13% top-1 and 92.86% top-5 accuracy; retraining solely on SIN yields lower accuracy (60.18% / 82.62%), reflecting the added difficulty of the task. However, “SIN + IN” and fine-tuned “Shape-ResNet” recover and slightly surpass vanilla performance (top-1 76.72%).
  • Transfer benefits include Pascal VOC 2007 object detection mAP@50 of 75.1% (Shape-ResNet) vs. 70.7% (vanilla), and MS COCO mAP@50 of 55.2% vs. 52.3%.
  • Significant robustness gains are documented on corrupted, occluded, and adversarial samples. For instance, the SIN-trained ResNet-50 maintains higher accuracy under additive uniform noise and other image distortions, and mean Corruption Error on ImageNet-C improves from 76.7 to 69.3.

5. Limitations: Fidelity and Diversity Constraints

Despite the clear shift in representational geometry and robustness, SIN construction exhibits two key deficiencies (Chun et al., 2021):

  • Low Fidelity: Artistic paintings typically lie outside the natural image manifold. The stylized images generated via AdaIN may lose fine shape details and introduce artifacts characteristic of paintings (e.g., brushstrokes), which can degrade in-distribution classification accuracy (ResNet-50 clean top-1 drops from 76.1% to 60.2% when trained only on SIN).
  • Limited Diversity: Each SIN content image is paired with a single, randomly selected painting. As a result, across all training epochs, a given image is never exposed to the full distribution of possible style variations, and the “style distribution” per content is narrowly defined. This may limit the efficacy of the intended de-biasing on complex data distributions.

6. Benchmarking SIN Against Contemporary Debiasing Methods

The practical implications of SIN are best understood in comparative benchmarks. On ImageNet-9, the following top-1 accuracy scores are observed (Chun et al., 2021):

Method                     Clean  Unbiased  Corruption  Adversarial  Occlusion
ResNet-18 (vanilla)         90.8      88.8        54.2         24.9       71.3
Stylized-ImageNet (SIN)     88.4      86.6        61.1         24.6       64.4

Training on SIN robustifies models against synthetic corruptions (54.2→61.1), but this comes with a drop in performance on clean and occluded samples (90.8→88.4, 71.3→64.4 respectively).

StyleAugment, a subsequent method that dynamically resamples style from within the mini-batch using only natural images, achieves higher clean accuracy, unbiased score, and greater robustness than static SIN stylizations. StyleAugment’s strategy addresses both fidelity (by eschewing paintings in favor of natural images for style reference) and diversity (by generating a new stylization per image per epoch). A plausible implication is that fully on-the-fly and in-domain style distribution is critical for practical de-biasing (Chun et al., 2021).
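The in-batch resampling idea can be sketched as follows: each content image borrows style statistics from a different image in the same mini-batch, with pairings redrawn on every step. This is an illustrative sketch of the strategy described above, not the StyleAugment implementation.

```python
import numpy as np

def in_batch_style_pairs(batch_size, rng):
    """Sample, for each content index i, a style index perm[i] != i drawn
    from the same mini-batch; called anew each step for fresh stylizations."""
    perm = rng.permutation(batch_size)
    # Fix any self-pairing by swapping with the next position; this keeps
    # the array a valid permutation and never reintroduces a fixed point.
    for i in range(batch_size):
        if perm[i] == i:
            j = (i + 1) % batch_size
            perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Because a new permutation is drawn each step, a given content image sees a different natural-image style every epoch, addressing the diversity limitation of SIN's one-time pairing.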

7. Broader Implications and Future Directions

Stylized-ImageNet has established itself as a canonical benchmark for studying and mitigating texture bias in CNNs. The shift towards shape-based representations induced by SIN improves generalization to unseen textures and distortions, adversarial robustness, and downstream detection transfer, aligning these networks more closely with human perceptual strategies.

However, the limitations in image realism and texture–shape diversity motivate further research. Approaches such as StyleAugment, which leverage on-the-fly style transfers drawn from the natural image manifold, demonstrate superior performance and practicality. The SIN paradigm thus frames a continuing research program: understanding the interaction of data augmentations, inductive biases, and adversarial robustness in deep vision models.

The broader implication is that data-driven manipulations of input statistics—especially those that disrupt “shortcut” features—can serve as a simple, architecture-agnostic means for aligning neural networks with desired invariances, domain robustness, and cognitive features observed in biological vision (Geirhos et al., 2018, Chun et al., 2021).
