Function-Consistent Feature Distillation

Published 24 Apr 2023 in cs.CV | (2304.11832v1)

Abstract: Feature distillation makes the student mimic the intermediate features of the teacher. Nearly all existing feature-distillation methods use L2 distance or its slight variants as the distance metric between teacher and student features. However, while L2 distance is isotropic w.r.t. all dimensions, the neural network's operation on different dimensions is usually anisotropic, i.e., perturbations with the same 2-norm but in different dimensions of intermediate features lead to changes in the final output with largely different magnitude. Considering this, we argue that the similarity between teacher and student features should not be measured merely based on their appearance (i.e., L2 distance), but should, more importantly, be measured by their difference in function, namely how later layers of the network will read, decode, and process them. Therefore, we propose Function-Consistent Feature Distillation (FCFD), which explicitly optimizes the functional similarity between teacher and student features. The core idea of FCFD is to make teacher and student features not only numerically similar, but more importantly produce similar outputs when fed to the later part of the same network. With FCFD, the student mimics the teacher more faithfully and learns more from the teacher. Extensive experiments on image classification and object detection demonstrate the superiority of FCFD to existing methods. Furthermore, we can combine FCFD with many existing methods to obtain even higher accuracy. Our codes are available at https://github.com/LiuDongyang6/FCFD.


Summary

- The paper introduces a function-based loss that goes beyond L2 appearance metrics to align the roles that teacher and student features play in later layers.
- It combines appearance and function perspectives through dedicated loss components, and uses path sampling to keep training cost manageable.
- Experiments on CIFAR-100, ImageNet, and MS-COCO show consistent gains over existing knowledge distillation techniques.


In this essay, I present a detailed overview and analysis of the paper titled "Function-Consistent Feature Distillation" (2304.11832). This work introduces a novel approach to feature distillation in neural networks, focusing on optimizing the functional similarity between teacher and student features instead of relying solely on appearance-based metrics like L2 distance.

Introduction

The deployment of deep neural networks (DNNs) on edge devices is often constrained by their large storage and computational requirements. Knowledge Distillation (KD) is a popular technique for transferring knowledge from a large, well-trained teacher model to a smaller student model in order to improve the student's performance. Traditional feature-distillation methods primarily use L2 distance to measure the similarity between teacher and student features. However, L2 distance is isotropic: it weights every feature dimension equally. The network's later layers are usually anisotropic, so perturbations with the same 2-norm but in different dimensions of an intermediate feature can change the final output by very different amounts.
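The anisotropy argument can be made concrete with a toy example. The sketch below is illustrative only: `later_layers` and its weights are hypothetical stand-ins for the part of a network that consumes an intermediate feature, not anything taken from the paper. Two student features sit at the same L2 distance from the teacher feature, yet the later layers react to them very differently.

```python
import math


def later_layers(feature):
    """Toy stand-in for the layers that consume the feature.

    Hypothetical fixed linear map: dimension 0 has large weights and
    dimension 1 has small ones, so the map is anisotropic.
    """
    weights = [[10.0, 0.1],
               [5.0, 0.2]]
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


teacher_feature = [1.0, 1.0]
student_a = [1.5, 1.0]  # off by 0.5 along dimension 0
student_b = [1.0, 1.5]  # off by 0.5 along dimension 1

# Appearance view: both students are equally far from the teacher.
app_a = l2_distance(teacher_feature, student_a)
app_b = l2_distance(teacher_feature, student_b)

# Function view: compare what the later layers produce from each feature.
teacher_out = later_layers(teacher_feature)
func_a = l2_distance(teacher_out, later_layers(student_a))
func_b = l2_distance(teacher_out, later_layers(student_b))

print(f"appearance distance: a={app_a:.3f}, b={app_b:.3f}")
print(f"function distance:   a={func_a:.3f}, b={func_b:.3f}")
```

In this toy setup both appearance distances are 0.5, while the function distance of `student_a` is roughly 50 times that of `student_b`. An L2 loss on the features alone would treat the two students as equally good.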

Motivation

The paper argues for a shift from appearance-based metrics to a function-based framework for measuring feature similarity. The proposed Function-Consistent Feature Distillation (FCFD) aims to ensure that teacher and student features not only look similar numerically but also play similar roles in subsequent network layers, so that they produce similar outputs. The authors argue, with theoretical and empirical support, that function-based feature similarity is a more meaningful and task-relevant basis for distillation.

Methodology

Function-Consistent Feature Distillation (FCFD)

Figure 1: An overview of FCFD. Top: illustration of the traditional KD loss ($\mathcal{L}_{kd}$) and the appearance-based feature matching loss ($\mathcal{L}_{app}$). Bottom: illustration of the proposed function matching losses $\mathcal{L}_{func}$ and $\mathcal{L}_{func'}$.

FCFD introduces a novel loss function designed to optimize functional consistency:

- Appearance Perspective ($\mathcal{L}_{app}$): Keeps the traditional L2 distance between teacher and student features to ensure numerical similarity.
- Function Perspective:
  - $\mathcal{L}_{func}$: Feeds features into the teacher's later layers and compares the resulting outputs, measuring functional similarity as the teacher reads it.
  - $\mathcal{L}_{func'}$: Does the same with the student's later layers, aligning function from the student's viewpoint.

The complete FCFD loss is a weighted sum of these components. To keep the computational cost manageable, the function terms are evaluated on paths sampled during training rather than on every possible path.
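The loss components above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fcfd_loss`, `w_app`, and `w_func` are hypothetical, teacher and student features are assumed to share one dimensionality (so no adapter between them is modeled), mean squared error stands in for whatever output divergence the paper uses, and "path sampling" is simplified to picking one of the two function paths per call.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fcfd_loss(f_student, f_teacher, teacher_rear, student_rear,
              w_app=1.0, w_func=1.0, rng=None):
    """Weighted sum of appearance and function terms (illustrative only).

    teacher_rear / student_rear: callables standing in for each network's
    later layers. If rng is given, only one function path is evaluated.
    """
    # Appearance term: plain distance between the two features.
    loss = w_app * mse(f_student, f_teacher)

    # Function terms: feed both features through the SAME later layers
    # and compare the outputs. Lambdas delay evaluation until chosen.
    paths = {
        "func": lambda: mse(teacher_rear(f_student), teacher_rear(f_teacher)),
        "func_prime": lambda: mse(student_rear(f_teacher), student_rear(f_student)),
    }
    chosen = list(paths) if rng is None else [rng.choice(list(paths))]
    for name in chosen:
        loss += w_func * paths[name]()
    return loss


def teacher_rear(feature):
    # Hypothetical teacher head: two outputs.
    return linear([[2.0, 0.0], [0.0, 1.0]], feature)


def student_rear(feature):
    # Hypothetical student head: one output.
    return linear([[1.0, 1.0]], feature)


f_teacher = [1.0, 0.0]
f_student = [0.0, 0.0]

full = fcfd_loss(f_student, f_teacher, teacher_rear, student_rear)
sampled = fcfd_loss(f_student, f_teacher, teacher_rear, student_rear,
                    rng=random.Random(0))
print("all paths:", full)
print("one sampled path:", sampled)
```

Evaluating only the sampled path is what saves computation: each function term requires an extra forward pass through one network's later layers, so skipping a path skips that pass. In a real training loop the teacher would typically be frozen and these terms would be added to the task loss and $\mathcal{L}_{kd}$.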

Experiments

The authors run experiments on image classification (CIFAR-100, ImageNet) and object detection (MS-COCO). FCFD consistently outperforms state-of-the-art KD methods in these experiments, particularly when the teacher and student models have different architectures.

Performance Highlights:

- CIFAR-100: FCFD improves student performance by an average of 6.78% over non-distilled baselines for teacher-student pairs with different architectures.
- ImageNet: FCFD improves top-1 accuracy by 0.7% over the best competing method for the ResNet50-MobileNet pair.
- MS-COCO: FCFD outperforms existing KD techniques, with average improvements exceeding 4% in mean Average Precision (mAP).

Ablation Studies

The paper's ablation studies support combining the appearance and function perspectives in FCFD. Both $\mathcal{L}_{func}$ and $\mathcal{L}_{func'}$ contribute performance gains on their own, which indicates that the two terms are complementary.

Conclusions and Implications

The FCFD approach underscores the importance of function-consistent feature matching in knowledge distillation. By aligning the functional roles of features in addition to their appearance, FCFD offers a more robust framework for enhancing model compression techniques. The paper's findings suggest potential for broader applications across different network architectures and tasks, warranting further exploration into function-based metrics in other areas of deep learning.

In summary, this work provides a significant contribution to the domain of model compression and KD, offering a methodological advance that effectively addresses the limitations of traditional feature-distillation methods. Future research might explore integrating FCFD with orthogonal techniques or investigating its applicability to other complex tasks beyond classification and detection.