
A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation

Published 23 Mar 2023 in cs.CV | arXiv:2303.13212v2

Abstract: Knowledge distillation is a popular technique for transferring knowledge from a large teacher model to a smaller student model by mimicking. However, distillation by directly aligning the feature maps of teacher and student may enforce overly strict constraints on the student and thus degrade its performance. To alleviate this feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student with pixel-wise transformations. In this paper, we find that aligning the feature maps between teacher and student along the channel dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model. Based on it, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task-specific loss. Extensive experimental results show that our method achieves significant performance improvements in various computer vision tasks, including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster R-CNN on MS COCO), instance segmentation (+2.8% mask mAP for ResNet50-based Mask R-CNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), demonstrating the effectiveness and versatility of the proposed method. The code will be made publicly available.


Summary

  • The paper introduces a novel, MLP-based channel-wise transformation that aligns teacher and student features for more effective knowledge distillation.
  • It integrates the transformation with an L2 loss, achieving over 3% Top-1 accuracy improvements in image classification and notable gains in object detection and segmentation.
  • The approach simplifies the distillation process by reducing reliance on complex spatial transformations while requiring minimal hyper-parameter tuning.


Introduction

Knowledge distillation is a methodology that transfers knowledge from a larger model (teacher) to a smaller one (student). The conventional approach focuses on aligning feature maps spatially, which can produce overly strict constraints on the student. This paper introduces channel-wise transformations to align features along the channel dimension, thereby addressing misalignments effectively and proposing a more flexible distillation framework.

Methodology

Channel-wise Transformation

The framework leverages a learnable nonlinear channel-wise transformation to align the features of teacher and student. The transformation is an MLP with one hidden layer, implemented with 1×1 convolutions, that resolves channel-wise discrepancies. It is defined as:

$\operatorname{MLP}(F) = W_{2}\left(\sigma\left(W_{1}(F)\right)\right)$

where $W_1$ and $W_2$ are 1×1 convolution layers and $\sigma$ denotes the ReLU activation function. The transformation is applied only to the student features, allowing the student to adapt to the teacher while retaining its own learned knowledge.

Figure 1: The Non-Local Block used in FKD (Zhang et al., 2020) compared with our channel-wise transformation.
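The channel-wise MLP can be sketched in plain NumPy: a 1×1 convolution is just a linear map over channels applied independently at every spatial position. The shapes, hidden width, and function name below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def channel_mlp(F_s, W1, b1, W2, b2):
    """MLP(F) = W2(ReLU(W1(F))) applied along the channel axis.

    F_s: student features, shape (C, H, W).
    W1: (C_hidden, C) and W2: (C_teacher, C_hidden) -- each acts like a
    1x1 convolution, i.e. a per-pixel linear map over channels.
    """
    C, H, W = F_s.shape
    x = F_s.reshape(C, -1)                       # (C, H*W): channel vector per pixel
    h = np.maximum(W1 @ x + b1[:, None], 0.0)    # first 1x1 conv + ReLU
    y = W2 @ h + b2[:, None]                     # second 1x1 conv
    return y.reshape(-1, H, W)                   # (C_teacher, H, W)

# Toy example: map a 4-channel student feature to an 8-channel teacher space.
rng = np.random.default_rng(0)
F_s = rng.standard_normal((4, 3, 3))
W1, b1 = rng.standard_normal((16, 4)), np.zeros(16)
W2, b2 = rng.standard_normal((8, 16)), np.zeros(8)
out = channel_mlp(F_s, W1, b1, W2, b2)
print(out.shape)  # (8, 3, 3): spatial size preserved, channels remapped
```

Note that the spatial resolution is untouched; only the channel dimension is transformed, which is what distinguishes this from the pixel-wise spatial transformations used in prior work.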

Integration into Distillation Framework

The proposed method integrates this transformation into a generic feature distillation framework, which uses a simple $L_2$ distance between the transformed student features and the teacher features:

$L_{feat} = \sum_{i}^{N} \left(\operatorname{MLP}(\boldsymbol{F}_s) - \boldsymbol{F}_t\right)^2$

The overall loss combines the task-specific loss with the distillation loss, balanced by a single hyper-parameter $\alpha$, which makes the framework easy to adapt across tasks:

$L_{total} = L_{task} + \alpha L_{feat}$
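The two losses above reduce to a few lines of NumPy. This is a minimal sketch: the function names and the value of `alpha` are illustrative, and in practice the squared distance would be backpropagated through the MLP's parameters:

```python
import numpy as np

def feat_distill_loss(F_s_transformed, F_t):
    # L_feat: squared L2 distance between transformed student
    # features and teacher features, summed over all entries.
    return np.sum((F_s_transformed - F_t) ** 2)

def total_loss(task_loss, F_s_transformed, F_t, alpha=1.0):
    # L_total = L_task + alpha * L_feat; alpha is the framework's
    # single hyper-parameter.
    return task_loss + alpha * feat_distill_loss(F_s_transformed, F_t)

# Toy example with 8x3x3 feature maps (72 entries, each differing by 1).
F_t = np.ones((8, 3, 3))
F_s_transformed = np.zeros((8, 3, 3))
print(total_loss(2.0, F_s_transformed, F_t, alpha=0.5))  # 2.0 + 0.5 * 72 = 38.0
```

Because the distillation term is a plain squared distance, the only knob to tune per task is $\alpha$, which is what the paper means by "simple and generic".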

Experiments and Results

Image Classification

The method achieves notable improvements in image classification tasks, as evidenced by increases in Top-1 accuracy for models like ResNet34 distilled into ResNet18. Significant gains of over 3% Top-1 accuracy were observed, outperforming prior logits-based and feature-based distillation methods.

Object Detection and Instance Segmentation

In object detection, the method demonstrates average improvements of +3.5% mAP across various detectors, including two-stage and single-stage architectures. For instance segmentation, bounding box AP rose by +3.2% under the proposed framework.

Figure 2: Our proposed method uses a learnable channel-wise transformation only for the student model, avoiding complex spatial transformations.

Semantic Segmentation

Extensive testing on the Cityscapes dataset shows mIoU improvements of +4% in the homogeneous teacher-student setting and significant gains in heterogeneous settings. These results underscore the flexibility and broad applicability of the channel-wise transformation across dense prediction tasks.

Ablation Studies

Through empirical analysis, the study highlights the significance of channel-wise transformations. Specifically, replacing sophisticated spatial transformations with a simple MLP yields lower $L_2$ distances during training while achieving better task performance.

Conclusion

This study underscores the feasibility of channel-wise feature alignment as an effective strategy in knowledge distillation frameworks. The proposed MLP-based transformation simplifies the distillation process and requires minimal hyper-parameter tuning, making it adaptable for various computer vision tasks. Future work may explore further optimization in balancing distillation and task-specific losses, as well as extending the framework's application to other modalities beyond vision.
