Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse KL (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this article, we show that the aforementioned KL, RKL, and JS divergences, respectively, suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which deteriorate logits-based KD for diverse natural language processing (NLP) tasks. We propose the Sinkhorn KD (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between the distributions of teacher and student models. Moreover, thanks to the properties of the Sinkhorn metric, we get rid of sample-wise KD, which restricts the perception of divergences to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture the geometric intricacies of distributions across samples in the high-dimensional space. A comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art (SOTA) methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures. Codes and models are available at https://github.com/2018cx/SinKD.
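To illustrate the underlying metric, the entropy-regularized optimal-transport (Sinkhorn) distance between a teacher and a student output distribution can be sketched with standard Sinkhorn iterations. This is a minimal NumPy sketch, not the authors' released implementation; the toy three-class distributions, the 0/1 ground cost, and the regularization strength are illustrative assumptions:

```python
import numpy as np

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized OT (Sinkhorn) distance between two
    discrete distributions p and q under a given cost matrix."""
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):                 # Sinkhorn fixed-point iterations
        v = q / (K.T @ u)                    # scale columns to match q
        u = p / (K @ v)                      # scale rows to match p
    transport = u[:, None] * K * v[None, :]  # approximate optimal transport plan
    return float(np.sum(transport * cost))   # transport cost under the plan

# teacher vs. student softmax outputs over a toy 3-class vocabulary
teacher = np.array([0.7, 0.2, 0.1])
student = np.array([0.3, 0.4, 0.3])
cost = 1.0 - np.eye(3)                       # 0/1 ground cost between classes
d = sinkhorn_distance(teacher, student, cost)
```

Unlike KL or RKL, this distance stays finite and informative even when the two distributions place mass on disjoint classes, which is the property the abstract appeals to.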


Source: http://dx.doi.org/10.1109/TNNLS.2024.3501335

Publication Analysis

Top Keywords: sinkhorn distance (8); knowledge distillation (8); teacher student (8); sinkd sinkhorn (4); distance minimization (4); minimization knowledge (4); distillation knowledge (4); distillation adopted (4); adopted compress (4); compress large (4)

Similar Publications


Imputation of Missing Data in Materials Science through Nearest Neighbors and Iterative Predictions.

J Chem Theory Comput

January 2025

Department of Polymer Materials and Engineering, College of Materials and Metallurgy, Guizhou University, Guiyang 550025, P. R. China.

Missing data in tabular data sets is ubiquitous in statistical analysis, big data analysis, and machine learning studies. Many strategies have been proposed to impute missing data, but their reliability has not been stringently assessed in materials science. Here, we carried out a benchmark test for six imputation strategies: Mean, MissForest, HyperImpute, Gain, Sinkhorn, and a newly proposed MatImpute on seven representative data sets in materials science.
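The simplest baseline in such benchmarks, column-mean imputation, can be sketched in a few lines. This is a toy illustration of the baseline only; MatImpute, MissForest, HyperImpute, Gain, and Sinkhorn are separate methods not shown here, and the example matrix is an assumption:

```python
import numpy as np

def mean_impute(X):
    """Column-mean imputation: replace each NaN with its column's mean
    computed over the observed entries of that column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)     # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))    # locations of missing entries
    X[rows, cols] = col_means[cols]       # fill each gap with its column mean
    return X

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
filled = mean_impute(X)  # NaNs become 2.0 (col 0) and 3.0 (col 1)
```

More sophisticated strategies such as nearest-neighbor or iterative imputation replace the column mean with predictions conditioned on the observed features of each row.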


This article addresses the challenge of scale variations in crowd-counting problems from a multidimensional measure-theoretic perspective. We start by formulating crowd counting as a measure-matching problem, based on the assumption that discrete measures can express both the scattered ground truth and the predicted density map. In this context, we introduce the Sinkhorn counting loss and extend it to a semi-balanced form, which alleviates problems including entropic bias, distance destruction, and amount constraints.


Single-cell technologies allow us to gain insights into cellular processes at unprecedented resolution. In stem cell and developmental biology, snapshot data allow us to characterize how the transcriptional states of cells change between successive cell types. Here, we show how approximate Bayesian computation (ABC) can be employed to calibrate mathematical models against single-cell data.
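Rejection sampling, the simplest variant of ABC, can be sketched as follows. This is a toy Gaussian model with an assumed summary statistic (the sample mean) and tolerance; it is not one of the models calibrated in the article:

```python
import numpy as np

rng = np.random.default_rng(0)
observed_mean = 1.0                       # summary statistic of the "data"

def simulate(theta, n=100):
    """Simulate the model at parameter theta and return its summary statistic."""
    return rng.normal(theta, 1.0, n).mean()

# rejection ABC: draw from the prior, keep parameters whose simulated
# summary lands within a tolerance of the observed summary
prior_draws = rng.uniform(-5.0, 5.0, 20000)
accepted = [t for t in prior_draws if abs(simulate(t) - observed_mean) < 0.1]
posterior_mean = float(np.mean(accepted))
```

Shrinking the tolerance tightens the approximation to the true posterior at the cost of accepting fewer draws, which is the central trade-off in ABC calibration.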


Similarity measure method of near-infrared spectrum combined with multi-attribute information.

Spectrochim Acta A Mol Biomol Spectrosc

December 2024

R&D Center, China Tobacco Yunnan Industrial Co., Ltd, No. 367 Hongjin Road, Kunming 650231, China.

The high dimensionality, redundancy, and non-linearity of near-infrared (NIR) spectral data, together with sample attributes such as producing area and grade, all affect the similarity measure between samples. This paper proposes a t-distributed stochastic neighbor embedding algorithm based on the Sinkhorn distance (St-SNE), combined with multi-attribute data information. First, the Sinkhorn distance is introduced, which addresses problems such as the asymmetry of KL divergence and sparse data distributions in high-dimensional space, thereby constructing probability distributions under which the low-dimensional space resembles the high-dimensional one.

