Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation

Published 11 Dec 2024 in cs.CV, cs.AI, and cs.LG | (2412.08139v1)

Abstract: Since pioneering work of Hinton et al., knowledge distillation based on Kullback-Leibler Divergence (KL-Div) has been predominant, and recently its variants have achieved compelling performance. However, KL-Div only compares probabilities of the corresponding category between the teacher and student while lacking a mechanism for cross-category comparison. Besides, KL-Div is problematic when applied to intermediate layers, as it cannot handle non-overlapping distributions and is unaware of geometry of the underlying manifold. To address these downsides, we propose a methodology of Wasserstein Distance (WD) based knowledge distillation. Specifically, we propose a logit distillation method called WKD-L based on discrete WD, which performs cross-category comparison of probabilities and thus can explicitly leverage rich interrelations among categories. Moreover, we introduce a feature distillation method called WKD-F, which uses a parametric method for modeling feature distributions and adopts continuous WD for transferring knowledge from intermediate layers. Comprehensive evaluations on image classification and object detection have shown (1) for logit distillation WKD-L outperforms very strong KL-Div variants; (2) for feature distillation WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors. The source code is available at https://peihuali.org/WKD


Summary

The paper introduces a novel KD approach using discrete and continuous Wasserstein Distance to evaluate logit and feature distributions.
It demonstrates significant improvements in image classification on ImageNet and object detection on MS-COCO relative to traditional KL divergence methods.
The approach increases computational cost during training, but it accounts for the geometry of the underlying feature distributions and addresses limitations of KL-Div-based KD.


Abstract

The paper proposes Knowledge Distillation (KD) based on the Wasserstein Distance (WD) as an alternative to the Kullback-Leibler Divergence (KL-Div), addressing several limitations of the traditional KD paradigm. Evaluations on image classification and object detection benchmarks, including ImageNet, CIFAR-100, and MS-COCO, indicate that WD can account for cross-category relations and for the geometry of feature distributions, which improves distillation.

Introduction

Knowledge Distillation is a technique for transferring knowledge from a high-performance teacher model to a smaller, more efficient student model. Conventionally, KL-Div has been the predominant loss for KD. It matches the probability the student assigns to each category only against the probability the teacher assigns to that same category. The paper highlights two major issues with KL-Div: it has no mechanism for comparing probabilities across different categories, and it is problematic when applied to intermediate layers, because it cannot handle non-overlapping distributions and is unaware of the geometry of the underlying manifold.
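The first limitation can be illustrated with a small Python sketch. The three categories and all probability values below are made up for illustration and do not come from the paper. Two hypothetical students misplace the same amount of probability mass, one on a semantically close category and one on an unrelated category, and KL-Div cannot tell them apart.

```python
import math

def kl_div(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i).

    Each term pairs p_i with q_i of the SAME category i; no term
    relates category i of the teacher to category j of the student.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 3-class problem with categories (cat, dog, truck).
teacher   = [0.8, 0.1, 0.1]
student_a = [0.6, 0.3, 0.1]  # extra mass placed on "dog" (close to "cat")
student_b = [0.6, 0.1, 0.3]  # extra mass placed on "truck" (far from "cat")

kl_a = kl_div(teacher, student_a)
kl_b = kl_div(teacher, student_b)
print(kl_a, kl_b)  # the two values coincide
```

Both students receive the same loss because the sum contains the same per-category terms in a different order, even though confusing a cat with a dog is arguably a milder error than confusing it with a truck.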

Proposed Method

Discrete WD for Logit Distillation

Discrete WD is introduced for logit distillation (WKD-L), allowing cross-category comparison of probabilities. The method explicitly leverages the rich interrelations among categories, using measures such as Centered Kernel Alignment (CKA) to quantify the similarity between features of different categories.

Figure 1: Real-world categories exhibit rich interrelations (IRs) in feature space.
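As a rough illustration of cross-category comparison, the sketch below approximates discrete WD between two probability vectors with entropy-regularised Sinkhorn iterations. This is a common generic solver chosen here for brevity, not necessarily the procedure used in the paper. The category names, the probabilities, and the cost matrix (taken as one minus a hypothetical category similarity) are all assumptions for illustration.

```python
import math

def sinkhorn_distance(p, q, cost, reg=0.1, n_iters=200):
    """Approximate discrete WD between histograms p and q.

    cost[i][j] is the cost of moving probability mass from category i
    to category j. reg and n_iters are illustrative defaults.
    """
    n, m = len(p), len(q)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan T_ij = u_i * K_ij * v_j; return its total cost.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

# Hypothetical categories (cat, dog, truck): cat and dog are similar,
# truck is dissimilar to both. Cost = 1 - assumed similarity.
cost = [[0.0, 0.2, 0.9],
        [0.2, 0.0, 0.9],
        [0.9, 0.9, 0.0]]

teacher   = [0.8, 0.1, 0.1]
student_a = [0.6, 0.3, 0.1]  # extra mass on "dog"
student_b = [0.6, 0.1, 0.3]  # extra mass on "truck"

wd_a = sinkhorn_distance(teacher, student_a, cost)
wd_b = sinkhorn_distance(teacher, student_b, cost)
print(wd_a, wd_b)  # wd_a is smaller than wd_b
```

Without regularisation, the optimal plans in this toy case move 0.2 of mass from cat to dog (cost 0.2 × 0.2 = 0.04) and from cat to truck (cost 0.2 × 0.9 = 0.18) respectively; the Sinkhorn values approximate these. The student that confuses cat with truck is penalised more, which is the kind of cross-category signal the cost matrix is meant to carry.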

Continuous WD for Feature Distillation

The paper also presents continuous WD for distilling knowledge from intermediate layers (WKD-F). Feature distributions within a layer are modeled parametrically as Gaussians, and WD measures the dissimilarity between the teacher's and the student's distributions. Unlike KL-Div, WD remains well defined when the two distributions do not overlap, and it reflects the geometry of the underlying manifold.

Figure 2: Features are projected to 2D space using tSNE. Different categories are indicated by different colors.
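For Gaussians, the 2-Wasserstein distance has a closed form. The sketch below assumes diagonal covariances, in which case the squared distance reduces to the squared difference of the means plus the squared difference of the per-dimension standard deviations. With full covariance matrices the second term becomes a trace expression involving matrix square roots, which is omitted here. The helper names and the toy feature values are hypothetical, and this is not presented as the paper's implementation.

```python
import math

def gaussian_stats(features):
    """Per-dimension mean and standard deviation of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in features) / n)
           for d in range(dim)]
    return mean, std

def w2_squared_diag(mean_t, std_t, mean_s, std_s):
    """Squared 2-Wasserstein distance between two Gaussians with
    diagonal covariance: ||mu_t - mu_s||^2 + ||sigma_t - sigma_s||^2."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_t, mean_s))
    std_term = sum((a - b) ** 2 for a, b in zip(std_t, std_s))
    return mean_term + std_term

# Toy 2-D features; the student's are the teacher's shifted by 3 along
# the first axis, so the two point sets do not overlap.
teacher_feats = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
student_feats = [[3.0, 0.0], [5.0, 0.0], [3.0, 2.0], [5.0, 2.0]]

mu_t, sd_t = gaussian_stats(teacher_feats)
mu_s, sd_s = gaussian_stats(student_feats)
dist = w2_squared_diag(mu_t, sd_t, mu_s, sd_s)
print(dist)  # 9.0: the squared mean shift, since the spreads are identical
```

The distance stays finite and grows smoothly with the shift between the two point sets, which is the property that makes WD usable on non-overlapping feature distributions.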

Experimental Results

Experiments on image classification and object detection compare WD against KL-Div. For logit distillation, WKD-L outperforms very strong KL-Div variants. For feature distillation, WKD-F is superior to its KL-Div counterparts and to state-of-the-art competitors.

Figure 3: Visualization for WKD-L.

Figure 4: Visualization of CAM.

Performance Analysis

Image Classification: On ImageNet, the proposed methods showed substantial gains, outperforming KL-Div-based methods like DKD, NKD, and OFA.
Object Detection: WD techniques improved upon standard KD methods when applied to detection tasks on MS-COCO, showcasing better utilization of spatial feature maps for object localization.
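Computational Efficiency: WD costs more to compute than KL-Div during training, a trade-off that the performance gains are argued to justify. Because the distillation loss is used only during training, the deployed student incurs no extra inference cost, which matters in resource-limited settings.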

Challenges and Trade-offs

While WD provides several advantages over KL-Div, it comes with increased computational demands, especially when computing discrete WD. These costs are offset, rather than removed, by the potential improvements in model accuracy. In addition, the Gaussian assumption for feature distributions may limit applicability in scenarios where the actual feature distributions deviate significantly from this model.

Conclusion

The introduction of Wasserstein Distance as a rival to KL-Div in knowledge distillation is a promising advancement. It addresses previous limitations by leveraging cross-category interrelations, and it improves performance across the evaluated benchmarks. Future research may focus on reducing the computational cost and on exploring statistical models beyond the Gaussian assumption for richer feature representation.

In summary, the paper paves the way for more robust and theoretically sound approaches to KD, opening avenues for further exploration into advanced metrics like WD in model training and deployment in real-world applications.