Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures

Published 10 Feb 2025 in cs.CV | (2502.06189v1)

Abstract: Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher's output, limiting their performance. To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Finegrained Relation Alignment (DFRA) at both the logit and feature levels to balance the trade-off between distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, a Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multiscale features at different stages in the student model, further improving the performance of our method at the feature level. We verify our method on four architectures (CNNs, Transformers, MLPs and Mambas) and two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, our MLDR-KD improves student model performance with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet datasets respectively, showing robustness and generality in heterogeneous distillation. Code will be released soon.


Summary

The paper introduces MLDR-KD, which decouples class-wise and sample-wise relations to retain dark knowledge and bolster classification confidence.
It employs DFRA and MSDF modules to align both logits and feature maps, enabling effective knowledge transfer across heterogeneous architectures.
Experiments on CIFAR-100 and Tiny-ImageNet show gains of up to 4.86% and 2.78%, respectively, over the best available method across CNN, Transformer, MLP, and Mamba architectures.


The paper "Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures" (2502.06189) addresses knowledge distillation (KD) between teachers and students whose architectures differ. The proposed Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) framework is designed to balance two goals that tend to conflict in this setting: transferring the teacher's dark knowledge and preserving confidence in the correct category.

Introduction to Heterogeneous Knowledge Distillation

Traditional knowledge distillation focuses on homogeneous architectures, where the teacher and student models have similar structures. This paper instead targets KD across heterogeneous architectures, proposing multi-level alignment between teacher and student through the Decoupled Finegrained Relation Alignment (DFRA) and Multi-Scale Dynamic Fusion (MSDF) modules.

Figure 1: Conceptual comparisons of different knowledge distillation methods.

Many traditional KD methods fail with heterogeneous architectures because of differences in model structure and the potential loss of "dark knowledge," the information a teacher's output carries about the non-target classes. The paper argues that previous methods either fail to retain this dark knowledge or reduce confidence in the correct category, and that both shortcomings hurt the student model's performance.
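The tension between dark knowledge and confidence in the correct category can be illustrated with a temperature-scaled softmax, the standard softening used in classical KD. This is a minimal sketch, not code from the paper: the logits and temperatures are hypothetical values chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one sample; class 0 is the correct class.
teacher_logits = [8.0, 3.0, 1.0, -2.0]

sharp = softmax(teacher_logits, temperature=1.0)  # confident, little dark knowledge
soft = softmax(teacher_logits, temperature=4.0)   # more dark knowledge, less confident

for name, probs in (("T=1", sharp), ("T=4", soft)):
    print(name, [round(p, 3) for p in probs])
```

Raising the temperature moves probability mass from the correct class onto the other classes, which exposes more inter-class structure to the student but lowers the confidence assigned to the correct category.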

MLDR-KD Framework

The MLDR-KD framework consists of two key components designed to operate at both logit and feature levels: DFRA and MSDF.

Decoupled Finegrained Relation Alignment (DFRA)

DFRA tackles the key challenge of aligning relationships within the student and teacher logits. It decouples these relationships into Class-Wise Relation and Sample-Wise Relation for improved alignment.

Class-Wise Relation Decoupling: It compares similarities across different categories, helping the student capture nuanced inter-category knowledge from the teacher.
Sample-Wise Relation Decoupling: It focuses on intra-category information, allowing the student to infer subtle differences among samples of the same category.

These decoupled relationships are aligned via Kullback-Leibler divergence, enhancing both the confidence in correct classifications and the retention of dark knowledge.
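One plausible reading of the decoupling can be sketched as follows, assuming the batch logits form a samples-by-classes matrix, that the class-wise relation normalizes each row across classes, and that the sample-wise relation normalizes each column across the samples in the batch. These axis choices, the function names, and the toy logits are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math

def softmax(values, temperature=1.0):
    scaled = [v / temperature for v in values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def relation_loss(teacher, student, temperature=1.0):
    """Mean KL between the row-wise softmax distributions of two matrices."""
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher, student)
    ]
    return sum(losses) / len(losses)

def decoupled_relation_loss(teacher_logits, student_logits, temperature=1.0):
    # Class-wise (assumed): each row, i.e. one sample, normalized across classes.
    class_wise = relation_loss(teacher_logits, student_logits, temperature)
    # Sample-wise (assumed): each column, i.e. one class, normalized across the batch.
    sample_wise = relation_loss(
        transpose(teacher_logits), transpose(student_logits), temperature
    )
    return class_wise, sample_wise

# Hypothetical logits: 2 samples, 3 classes.
teacher = [[4.0, 1.0, 0.5], [0.2, 3.5, 1.0]]
student = [[2.0, 1.5, 0.3], [0.5, 2.0, 1.2]]

cw, sw = decoupled_relation_loss(teacher, student)
print(f"class-wise KL: {cw:.4f}, sample-wise KL: {sw:.4f}")
```

Both terms are zero when the student's logits match the teacher's, and in a training loop they would typically be combined with the ordinary classification loss using weighting coefficients that this sketch leaves out.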

Multi-Scale Dynamic Fusion (MSDF)

MSDF further refines the student's understanding by dynamically fusing feature maps across different stages of the student model. This process involves:

Splitting the student into multiple stages and balancing each stage's learning via dynamically computed weights.
Projecting the feature maps of each stage to a latent logit space and fusing them through a gating mechanism that uses class tokens to modulate each stage's influence.

Figure 2: Overview of the proposed MLDR-KD framework.
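The fusion step can be sketched roughly as below, assuming each stage's pooled feature vector is mapped to class logits by a linear projection and that the per-stage weights come from a softmax over scores computed from a class-token-like vector. The linear gate, the names project and fuse_stage_logits, and every numeric value are hypothetical; the summary does not specify the actual module design.

```python
import math

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def project(vector, weight):
    """Linear map: weight has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def fuse_stage_logits(stage_logits, gate_scores):
    """Weighted sum of per-stage logits; weights are a softmax over gate scores."""
    weights = softmax(gate_scores)
    num_classes = len(stage_logits[0])
    fused = [
        sum(w * logits[c] for w, logits in zip(weights, stage_logits))
        for c in range(num_classes)
    ]
    return fused, weights

# Hypothetical pooled features from two student stages (different widths).
stage_features = [[1.0, 2.0], [0.5, -1.0, 1.5]]
# Hypothetical projections to a shared 3-class logit space.
projections = [
    [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]],
    [[0.2, 0.0, 0.1], [-0.3, 0.4, 0.2], [0.1, 0.1, 0.1]],
]
# Hypothetical class-token vector and gate matrix (one row per stage).
class_token = [0.5, -0.2]
gate_matrix = [[1.0, 0.5], [-0.5, 1.0]]

stage_logits = [project(f, w) for f, w in zip(stage_features, projections)]
gate_scores = project(class_token, gate_matrix)
fused, weights = fuse_stage_logits(stage_logits, gate_scores)
print("stage weights:", [round(w, 3) for w in weights])
print("fused logits:", [round(z, 3) for z in fused])
```

Because the weights sum to one, the fused logits are a convex combination of the per-stage logits, so a stage with a low gate score contributes little to what is aligned with the teacher.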

Experimental Results

Experiments on the CIFAR-100 and Tiny-ImageNet datasets show that MLDR-KD improves student accuracy across CNN, Transformer, MLP, and Mamba architectures. Key findings include:

On CIFAR-100, MLDR-KD outperformed baseline methods, with improvements up to 4.86% on Transformers.
On Tiny-ImageNet, significant accuracy improvements were observed, especially in larger student models, with increases up to 2.78%.

The ablation studies confirmed that both the DFRA and MSDF components contribute to the performance gains. The method also proved robust when transferring knowledge between heterogeneous architectures, and its feature visualizations show a stronger focus on target information.

Figure 3: Comparisons of feature visualizations between OFA-KD and our MLDR-KD.

Conclusion

MLDR-KD advances heterogeneous distillation by combining fine-grained relation alignment with dynamic fusion, strengthening both dark knowledge transfer and classification confidence. The framework demonstrates generality and robustness across a range of architectures and datasets, a step toward applying knowledge distillation broadly across diverse model types.

Future research directions include optimizing feature-level knowledge transfer to further improve distillation across heterogeneous architectures.
