Improved Feature Distillation via Projector Ensemble

Published 27 Oct 2022 in cs.CV and cs.AI | (2210.15274v2)

Abstract: In knowledge distillation, previous feature distillation methods mainly focus on the design of loss functions and the selection of the distilled layers, while the effect of the feature projector between the student and the teacher remains under-explored. In this paper, we first discuss a plausible mechanism of the projector with empirical evidence and then propose a new feature distillation method based on a projector ensemble for further performance improvement. We observe that the student network benefits from a projector even if the feature dimensions of the student and the teacher are the same. Training a student backbone without a projector can be considered as a multi-task learning process, namely achieving discriminative feature extraction for classification and feature matching between the student and the teacher for distillation at the same time. We hypothesize and empirically verify that without a projector, the student network tends to overfit the teacher's feature distributions despite having different architecture and weights initialization. This leads to degradation on the quality of the student's deep features that are eventually used in classification. Adding a projector, on the other hand, disentangles the two learning tasks and helps the student network to focus better on the main feature extraction task while still being able to utilize teacher features as a guidance through the projector. Motivated by the positive effect of the projector in feature distillation, we propose an ensemble of projectors to further improve the quality of student features. Experimental results on different datasets with a series of teacher-student pairs illustrate the effectiveness of the proposed method.

Summary

The paper introduces a novel projector ensemble technique to decouple feature alignment and classification, enhancing knowledge distillation.
It demonstrates up to 1% improvement in top-1 accuracy on CIFAR-100 and strong performance on ImageNet compared to conventional methods.
The ensemble of diverse projector initializations reduces overfitting and improves the student network's generalization in CNN compression.

Improved Feature Distillation via Projector Ensemble

Knowledge distillation compresses large CNN architectures by transferring knowledge from a high-capacity "teacher" model to a more efficient "student" network. The paper "Improved Feature Distillation via Projector Ensemble" (2210.15274) examines the role that feature projectors play in this process and proposes a distillation method built on an ensemble of projectors.

Projector Mechanisms in Feature Distillation

The study identifies a gap in existing feature distillation techniques: the projector, usually treated only as a way to map student features to the teacher's feature dimension, has received little analysis of its own. The authors observe that even when student and teacher share identical feature dimensions, adding a projector improves performance, because it keeps the student from overfitting the teacher's feature distribution and preserves the discriminability of the student's features.

Multi-task Learning Perspective

The authors argue that without a projector, the student backbone is effectively doing multi-task learning: the same features must be discriminative for classification and must also match the teacher's features for distillation. Forcing both objectives onto one set of features degrades the features that are eventually used for classification. A projector adds an intermediate layer that separates the two tasks, so the backbone can concentrate on feature extraction while the teacher's features still guide it through the projector.
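The two training setups can be written side by side. This is a sketch in notation chosen here, not taken from the paper: s_i and t_i are the student and teacher features of sample i, y_i is its label, c is the student's classifier head, g is the projector, L_feat is any feature-matching loss, and alpha is an assumed weighting factor.

```latex
% Without a projector: the backbone feature s_i feeds both terms directly.
\mathcal{L}_{\text{no-proj}} = \mathcal{L}_{\text{CE}}\big(c(s_i),\, y_i\big) + \alpha\, \mathcal{L}_{\text{feat}}\big(s_i,\, t_i\big)

% With a projector g: only the projected feature g(s_i) is matched to the teacher.
\mathcal{L}_{\text{proj}} = \mathcal{L}_{\text{CE}}\big(c(s_i),\, y_i\big) + \alpha\, \mathcal{L}_{\text{feat}}\big(g(s_i),\, t_i\big)
```

In the second form the classifier still sees s_i itself, so the matching constraint acts on the backbone only through g.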

Ensemble of Projectors

Building on the positive effect of a single projector, the paper investigates whether an ensemble of projectors improves performance further. The projectors start from different initializations, so they produce different transformations of the same student features. As in ensemble learning generally, combining diverse members is intended to improve the student's generalization.
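A minimal NumPy sketch of the ensemble idea follows. It assumes each projector is a single linear layer followed by ReLU and that the ensemble output is the plain average of the projector outputs; the function names, the He-style initialization scale, and the choice of three projectors are illustrative assumptions rather than details confirmed by this summary.

```python
import numpy as np

def init_projectors(d_student, d_teacher, q, seed=0):
    """Create q independently initialized linear projectors (weights only)."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(2.0 / d_student)  # He-style scale; an assumption of this sketch
    return [rng.normal(0.0, scale, size=(d_student, d_teacher)) for _ in range(q)]

def ensemble_project(student_feats, projectors):
    """Average the ReLU outputs of all projectors.

    student_feats: (batch, d_student) -> returns (batch, d_teacher).
    """
    outputs = [np.maximum(student_feats @ W, 0.0) for W in projectors]
    return np.mean(outputs, axis=0)

# Toy usage with random "student features".
rng = np.random.default_rng(1)
student_feats = rng.normal(size=(4, 8))
projectors = init_projectors(d_student=8, d_teacher=16, q=3)
projected = ensemble_project(student_feats, projectors)
print(projected.shape)  # (4, 16)
```

The averaged output is what would be compared with the teacher's features. In a real training loop the projector weights would be learned jointly with the student, which this forward-only sketch does not show.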

Experimental Validation

The method shows consistent improvements across datasets and model configurations, notably on CIFAR-100 and ImageNet. Using teacher-student pairs that include ResNet and VGG architectures, the paper reports not only higher accuracy than existing distillation methods but also robustness across teacher-student pairs and numbers of training epochs.

Figure 1: Illustration of (a) feature distillation without a projector when the feature dimensions of the student and the teacher are the same, (b) the general feature-based distillation with a single projector.

Evaluation and Comparison

The experiments reported in the paper show that the proposed framework outperforms conventional distillation methods including KD, CRD, and SRRL, with gains of up to 1% in top-1 accuracy on CIFAR-100. Results on ImageNet further indicate that a projector ensemble helps narrow the gap when distilling knowledge from high-performance teachers, such as DenseNet201, to compact students such as MobileNet.

Figure 2: The left figure displays the direction alignment loss between teacher and student features with and without a projector. The right figure displays the average between-class cosine similarities in students' feature spaces.
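The between-class similarity in the right panel can be illustrated with a short NumPy sketch. It assumes the quantity is the mean cosine similarity over all pairs of samples that carry different labels, which may differ from the exact protocol used in the paper; `between_class_cosine` is a hypothetical helper name. Lower values mean features of different classes point in more distinct directions.

```python
import numpy as np

def between_class_cosine(feats, labels, eps=1e-12):
    """Mean cosine similarity over all pairs of samples with different labels."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    sims = unit @ unit.T                             # pairwise cosine similarities
    different = labels[:, None] != labels[None, :]   # mask of cross-class pairs
    return float(sims[different].mean())

# Two samples of class 0 along the x-axis, one sample of class 1 along the y-axis.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1])
print(between_class_cosine(feats, labels))  # 0.0
```

If every sample has the same label there are no cross-class pairs and the mean is undefined, so the helper is only meaningful for batches with at least two classes.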

Performance Metrics

The study relies on a direction alignment loss, which measures how well the directions of the student's features agree with those of the teacher's features, and this keeps the distillation objective simple. Comparisons of top-1 and top-5 accuracy position the projector ensemble method as state of the art among the methods compared.
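One common way to express a direction-only feature loss is one minus the mean cosine similarity between projected student features and teacher features. The NumPy sketch below uses that form. Treating it as equivalent to the paper's direction alignment loss is an assumption of this example, and the function name and `eps` parameter are illustrative.

```python
import numpy as np

def direction_alignment_loss(projected, teacher_feats, eps=1e-12):
    """One minus the mean cosine similarity between matching rows.

    Both inputs have shape (batch, dim). Only directions matter:
    rescaling either input leaves the loss unchanged (up to eps).
    """
    p = projected / (np.linalg.norm(projected, axis=1, keepdims=True) + eps)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(p * t, axis=1)))

teacher = np.array([[1.0, 0.0], [0.0, 2.0]])
print(round(direction_alignment_loss(3.0 * teacher, teacher), 6))  # 0.0 (same directions)
print(round(direction_alignment_loss(-teacher, teacher), 6))       # 2.0 (opposite directions)
```

Because the features are normalized before comparison, the loss ignores feature magnitude and ranges from 0 (perfectly aligned) to 2 (exactly opposite).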

Figure 3: Top-1 accuracy of different methods on ImageNet with different numbers of epochs and different teacher-student pairs.

Projector Diversity

The paper also observes that the diversity among projectors converges as training proceeds. The results suggest that diverse projectors can be obtained through varied initialization schemes and that regularization techniques could refine this diversity further.

Conclusion

The investigation into the role of projectors in knowledge distillation shows that the projector itself is a useful lever for improving model compression. Projector ensembles offer a practical way to raise the accuracy of distilled student networks. While the study concentrates on image classification, extending the approach to other tasks such as object detection remains a direction for future work.