TY - JOUR
T1 - Why logit distillation works
T2 - A novel knowledge distillation technique by deriving target augmentation and logits distortion
AU - Hossain, Md Imtiaz
AU - Akhter, Sharmen
AU - Mahbub, Nosin Ibna
AU - Hong, Choong Seon
AU - Huh, Eui Nam
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2025/5
Y1 - 2025/5
N2 - Although logit distillation aims to transfer knowledge from a large teacher network to a student, the underlying mechanisms and the reasons for its effectiveness are unclear. This article explains why knowledge distillation (KD) is effective and, based on these observations, proposes a novel distillation technique, TALD-KD, which combines target augmentation with a novel dynamic logits-distortion technique. TALD-KD unravels the intricate relationships among dark-knowledge semantics, randomness, flexibility, and augmentation in logit-level KD through three investigations, hypotheses, and observations, and improves student generalization via a linear combination of the teacher logits and random noise. Among the three versions assessed (TALD-A, TALD-B, and TALD-C), TALD-B improved the performance of KD on the large-scale ImageNet-1K dataset from 68.87% to 69.58% in top-1 accuracy and from 88.76% to 90.13% in top-5 accuracy. Similarly, when applied to the state-of-the-art approach DKD, TALD-B improved top-1 accuracy from 72.05% to 72.81% and top-5 accuracy from 91.05% to 92.04%. The other versions shed light on the inner workings of logit-level KD. Extensive ablation studies confirmed the superiority of the proposed approach over existing state-of-the-art approaches in diverse scenarios.
AB - Although logit distillation aims to transfer knowledge from a large teacher network to a student, the underlying mechanisms and the reasons for its effectiveness are unclear. This article explains why knowledge distillation (KD) is effective and, based on these observations, proposes a novel distillation technique, TALD-KD, which combines target augmentation with a novel dynamic logits-distortion technique. TALD-KD unravels the intricate relationships among dark-knowledge semantics, randomness, flexibility, and augmentation in logit-level KD through three investigations, hypotheses, and observations, and improves student generalization via a linear combination of the teacher logits and random noise. Among the three versions assessed (TALD-A, TALD-B, and TALD-C), TALD-B improved the performance of KD on the large-scale ImageNet-1K dataset from 68.87% to 69.58% in top-1 accuracy and from 88.76% to 90.13% in top-5 accuracy. Similarly, when applied to the state-of-the-art approach DKD, TALD-B improved top-1 accuracy from 72.05% to 72.81% and top-5 accuracy from 91.05% to 92.04%. The other versions shed light on the inner workings of logit-level KD. Extensive ablation studies confirmed the superiority of the proposed approach over existing state-of-the-art approaches in diverse scenarios.
KW - Dynamic logits distortion
KW - Knowledge distillation
KW - Logits distillation
KW - TALD-KD
KW - Target augmentation
UR - http://www.scopus.com/inward/record.url?scp=85214666660&partnerID=8YFLogxK
U2 - 10.1016/j.ipm.2024.104056
DO - 10.1016/j.ipm.2024.104056
M3 - Article
AN - SCOPUS:85214666660
SN - 0306-4573
VL - 62
JO - Information Processing and Management
JF - Information Processing and Management
IS - 3
M1 - 104056
ER -