
Diffused Redundancy in Pre-trained Representations

Published 31 May 2023 in cs.LG and cs.AI (arXiv:2306.00183v3)

Abstract: Representations learned by pre-training a neural network on a large dataset are increasingly used successfully to perform a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and is able to perform similarly as the whole layer on a variety of downstream tasks. For example, a linear probe trained on $20\%$ of randomly picked neurons from the penultimate layer of a ResNet50 pre-trained on ImageNet1k achieves an accuracy within $5\%$ of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss and dataset used during pre-training largely govern the degree of diffuse redundancy and the "critical mass" of neurons needed often depends on the downstream task, suggesting that there is a task-inherent redundancy-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization for downstream tasks and also draw caution to certain possible unintended consequences. Our code is available at \url{https://github.com/nvedant07/diffused-redundancy}.


Summary

  • The paper demonstrates that random subsets of neurons can retain nearly full-layer performance, with 20% of neurons achieving within 5% accuracy on tasks like CIFAR10.
  • It analyzes the role of architecture, dataset size, and adversarial training in enhancing redundancy, highlighting that wider layers and larger datasets increase this effect.
  • The study underscores the potential for efficient transfer learning and computational savings, while also addressing fairness concerns in task performance.


Introduction

The paper "Diffused Redundancy in Pre-trained Representations" (2306.00183) investigates how features are encoded in pre-trained neural network representations. These networks, particularly CNNs and Transformers trained on datasets like ImageNet, exhibit a property the authors term diffuse redundancy: random subsets of neurons from a layer can perform nearly as well as the entire layer on various downstream tasks, suggesting that information is redundantly spread across the learned representations.

Methodology and Results

Diffused Redundancy Hypothesis

The authors propose the hypothesis that within a layer of a pre-trained neural network, there exists a substantial overlap of information among neurons such that many random subsets can suffice for downstream task performance:

  • Example Demonstration: With 20% of neurons randomly selected from the penultimate layer of a ResNet50 pre-trained on ImageNet1K, classification accuracy on CIFAR10 remains within 5% of using the full layer.
  • Empirical Testing: Various architectures and pre-training setups (including adversarial training) were evaluated against standard datasets such as CIFAR10, CIFAR100, Oxford-IIIT-Pets, and Flowers.

Figure 1: Downstream accuracy of random neuron subsets compared to full-layer usage.
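The probing setup behind these results can be sketched as follows: sample a random fraction of neurons from a cached feature layer and fit a linear probe on just those coordinates. This is a minimal illustration on synthetic data; the feature matrix, labels, and least-squares probe are stand-ins, not the authors' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for penultimate-layer features of a pre-trained
# network: 200 examples x 64 neurons, with synthetic binary labels.
features = rng.normal(size=(200, 64))
labels = (features[:, :8].sum(axis=1) > 0).astype(float)

def probe_accuracy(feats, labels):
    """Fit a least-squares linear probe and report training accuracy."""
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, 2 * labels - 1, rcond=None)  # targets in {-1, +1}
    preds = (X @ w > 0).astype(float)
    return (preds == labels).mean()

full_acc = probe_accuracy(features, labels)

# Randomly keep 20% of neurons, mirroring the paper's ResNet50/CIFAR10 example.
keep = rng.choice(64, size=int(0.2 * 64), replace=False)
subset_acc = probe_accuracy(features[:, keep], labels)

print(full_acc, subset_acc)
```

On real pre-trained features the paper reports that `subset_acc` lands within a few points of `full_acc` once the subset exceeds a threshold size; the synthetic data here only illustrates the mechanics.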

Factors Influencing Redundancy

Several factors were analyzed for their influence on the degree of diffused redundancy:

Architecture

The architectures tested (ResNet variants, VGG, ViT) displayed varying degrees of diffuse redundancy. Notably, architectures with wider layers exhibited more redundancy.

Pre-training Dataset and Loss Functions

The redundancy was more apparent when networks were trained on larger datasets like ImageNet-21k compared to ImageNet-1k. Moreover, adversarially trained models demonstrated a significant increase in redundant features compared to standard trained models.

Downstream Tasks

The degree of redundancy is task-specific, indicating a redundancy-performance Pareto frontier: the minimal "critical mass" of neurons needed to match full-layer performance depends on the downstream task. Notably, the impact of discarding neurons varied with the complexity of the downstream task.

Figure 2: CKA measurements showing similarity between random neuron subsets and the whole layer.
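Similarity between a random neuron subset and the full layer can be quantified with linear CKA. A minimal sketch on synthetic features (the feature matrix and subset size here are illustrative assumptions, not the paper's data):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices (examples x neurons)."""
    X = X - X.mean(axis=0)  # center each neuron
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, 'fro') ** 2
    denom = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / denom

rng = np.random.default_rng(0)
full = rng.normal(size=(500, 128))            # hypothetical full-layer features
idx = rng.choice(128, size=32, replace=False)  # random 25% neuron subset
subset = full[:, idx]

print(linear_cka(full, subset))
```

Linear CKA is 1 for identical representations and bounded in (0, 1]; the paper's observation is that on real pre-trained layers, subset-vs-full CKA is high even for small subsets.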

Theoretical Insights and Comparisons

Comparison with PCA

The paper compares random neuron selections with projections onto a reduced number of principal components (PCA). The performance of random neuron subsets quickly approaches that of the top PCA components, reinforcing the hypothesis that the redundantly encoded features capture the dominant directions of variation.
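The comparison can be made concrete by measuring the fraction of total variance retained by the top-$k$ principal components versus a random $k$-neuron subset. This is a synthetic sketch (the correlated feature matrix is a stand-in); by construction the PCA fraction upper-bounds any coordinate subset of the same size:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic features, 300 examples x 64 neurons.
feats = rng.normal(size=(300, 64)) @ rng.normal(size=(64, 64))

k = 16
centered = feats - feats.mean(axis=0)

# Fraction of variance captured by the top-k principal components (via SVD).
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pca_var = (S[:k] ** 2).sum() / (S ** 2).sum()

# Fraction of variance retained by a random subset of k neurons.
idx = rng.choice(64, size=k, replace=False)
subset_var = centered[:, idx].var(axis=0).sum() / centered.var(axis=0).sum()

print(pca_var, subset_var)
```

The paper's point is not that random subsets beat PCA (they cannot, variance-wise) but that their downstream probe performance closes the gap quickly as $k$ grows.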

Structured vs. Natural Redundancy

The structured redundancy intentionally induced by techniques like Matryoshka Representation Learning (MRL) contrasts with the naturally occurring diffuse redundancy observed here. Whereas MRL explicitly optimizes nested sub-representations, diffuse redundancy arises without purposeful restructuring, yet still maintains high task efficacy.

Figure 3: Effects of dropout regularization level on redundancy phenomena.
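The contrast can be made concrete: MRL-style structure reads off nested *prefixes* of the embedding, while diffuse redundancy concerns arbitrary *random* subsets. A toy sketch of the two views (the feature matrix and dimension schedule are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 64))  # stand-in embedding, 100 examples x 64 dims
dims = (8, 16, 32, 64)

# MRL-style structured redundancy: nested prefixes of the embedding,
# each trained to be usable on its own.
mrl_views = [feats[:, :d] for d in dims]

# Diffused redundancy: random neuron subsets of the same sizes,
# usable without any special training objective.
rand_views = [feats[:, rng.choice(64, size=d, replace=False)] for d in dims]

print([v.shape for v in mrl_views], [v.shape for v in rand_views])
```

MRL guarantees the prefix views work by construction; the paper's finding is that for standard pre-trained networks the random views work nearly as well anyway.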

Implications and Future Directions

Efficient Transfer Learning

The identified redundancy suggests potential for more computationally efficient transfer learning, wherein substantial portions of a layer's neurons can be discarded without significant accuracy loss. This could lead to reduced computation and faster adaptation in resource-constrained environments.
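The savings are easy to quantify: keeping 20% of a 2048-wide penultimate layer shrinks both the cached features and the linear probe by 5x. A back-of-the-envelope sketch (the layer width matches ResNet50; the cache size and 10-way probe are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, keep_frac, n_classes = 2048, 0.2, 10

# Randomly fix a neuron subset once, then only cache those coordinates.
keep = rng.choice(n_neurons, size=int(keep_frac * n_neurons), replace=False)
feats = rng.normal(size=(1000, n_neurons))   # hypothetical cached features
small = feats[:, keep]

# Parameter counts (weights only) for a linear probe on each representation.
full_params = n_neurons * n_classes
small_params = small.shape[1] * n_classes
print(full_params, small_params)  # probe is ~5x smaller on the subset
```

Storage for the feature cache shrinks by the same factor, which is where most of the practical savings in repeated downstream experiments would come from.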

Fairness Considerations

Exploiting this redundancy might result in uneven performance drops across classes: certain classes may disproportionately suffer when only neuron subsets are used, necessitating careful consideration in practical applications of such optimization techniques.
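Auditing for this effect means looking past aggregate accuracy at per-class accuracy before and after subsetting. A minimal sketch of the measurement (the toy predictions are fabricated to show two classifiers with equal overall accuracy but different per-class behaviour):

```python
import numpy as np

def per_class_accuracy(preds, labels, n_classes):
    """Accuracy computed separately for each class."""
    return np.array([(preds[labels == c] == c).mean() for c in range(n_classes)])

labels  = np.array([0, 0, 0, 1, 1, 1])
preds_a = np.array([0, 0, 0, 1, 1, 0])  # e.g. full-layer probe: misses one class-1 example
preds_b = np.array([0, 0, 1, 1, 1, 1])  # e.g. subset probe: misses one class-0 example

acc_a = per_class_accuracy(preds_a, labels, 2)
acc_b = per_class_accuracy(preds_b, labels, 2)
print(acc_a, acc_b)  # same overall accuracy, different classes bear the cost
```

Comparing such per-class vectors between the full-layer and subset probes is one way to surface the uneven drops the paper cautions about.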

Conclusion

This study uncovers a critical aspect of pre-trained neural networks, highlighting diffused redundancy as a distinct characteristic of wide-layered models. While promising in terms of improved computational efficiency, careful implementation must consider potential variations in fairness across task categories. Future research might focus on better understanding the optimal bounds and methods to harness redundancy without unintended side effects.
