Why logit distillation works: A novel knowledge distillation technique by deriving target augmentation and logits distortion

Md Imtiaz Hossain, Sharmen Akhter, Nosin Ibna Mahbub, Choong Seon Hong, Eui Nam Huh

Research output: Contribution to journal › Article › peer-review

Abstract

Although logit distillation aims to transfer knowledge from a large teacher network to a smaller student, the underlying mechanisms that make it effective remain unclear. This article investigates why knowledge distillation (KD) works and, based on these observations, proposes a novel distillation technique, TALD-KD, that combines Target Augmentation with a dynamic Logits Distortion technique. TALD-KD unravels the intricate relationships among dark-knowledge semantics, randomness, flexibility, and augmentation in logits-level KD through three investigations, hypotheses, and observations. It improves student generalization by linearly combining the teacher logits with random noise. Among the three versions assessed (TALD-A, TALD-B, and TALD-C), TALD-B improved the performance of KD on the large-scale ImageNet-1K dataset from 68.87% to 69.58% in top-1 accuracy and from 88.76% to 90.13% in top-5 accuracy. Similarly, applied to the state-of-the-art DKD approach, TALD-B improved top-1 accuracy from 72.05% to 72.81% and top-5 accuracy from 91.05% to 92.04%. The other versions shed light on why logits-level KD works. Extensive ablation studies confirmed the superiority of the proposed approach over existing state-of-the-art approaches in diverse scenarios.
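The abstract describes the core operation only at a high level. As a rough illustration, and not the authors' published implementation, the sketch below shows one way a logits-level KD loss could incorporate a linear combination of the teacher logits with random noise. The function name distorted_kd_loss and the hyperparameters T, alpha, and sigma are assumptions introduced for illustration only.

    import torch
    import torch.nn.functional as F

    def distorted_kd_loss(student_logits, teacher_logits, T=4.0, alpha=0.9, sigma=1.0):
        # Hypothetical sketch: alpha (mixing weight), sigma (noise scale), and
        # T (temperature) are illustrative values, not taken from the paper.
        noise = torch.randn_like(teacher_logits) * sigma
        # Linear combination of teacher logits and random noise ("logits distortion").
        distorted = alpha * teacher_logits + (1.0 - alpha) * noise
        # Standard temperature-scaled soft targets and student log-probabilities.
        p_teacher = F.softmax(distorted / T, dim=1)
        log_p_student = F.log_softmax(student_logits / T, dim=1)
        # KL divergence between student predictions and distorted teacher targets,
        # scaled by T^2 as in conventional logits-level KD.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

In a typical training loop, such a term would be added to the usual cross-entropy loss on the ground-truth labels; how TALD-KD actually weights or schedules the distortion is detailed in the article itself.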

Original language: English
Article number: 104056
Journal: Information Processing and Management
Volume: 62
Issue number: 3
DOIs
Publication status: Published - May 2025

Bibliographical note

Publisher Copyright:
© 2025 Elsevier Ltd

Keywords

  • Dynamic logits distortion
  • Knowledge distillation
  • Logits distillation
  • TALD-KD
  • Target augmentation
