To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones, improving gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to the information lost in compression. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo maintains a moving average of historical compensation errors to stably estimate the current compression error, and uses this estimate to compensate the current gradient before compression, yielding a nearly lossless compression. This mechanism makes it compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on non-convex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models like LLaMAs and MoEs.
IEEE Trans Pattern Anal Mach Intell, February 2025. DOI: http://dx.doi.org/10.1109/TPAMI.2025.3544764
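The core mechanism of the abstract above — compensating each gradient with a running estimate of past compression errors before quantizing it — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the moving-average factor, and the simple symmetric low-bit quantizer are all assumptions standing in for whatever compressor is actually used.

```python
import torch


class LoCoCompressor:
    """Illustrative error-feedback sketch in the spirit of LoCo (not the paper's code)."""

    def __init__(self, beta: float = 0.9, bits: int = 8):
        self.beta = beta      # moving-average factor for historical compensation errors (assumed value)
        self.bits = bits
        self.error_avg = {}   # per-tensor moving average of compression errors

    def _quantize(self, x: torch.Tensor) -> torch.Tensor:
        # Simple symmetric uniform quantizer standing in for any low-bit compressor.
        qmax = 2 ** (self.bits - 1) - 1
        scale = x.abs().max().clamp_min(1e-12) / qmax
        return torch.round(x / scale).clamp_(-qmax, qmax) * scale

    def compress(self, name: str, grad: torch.Tensor) -> torch.Tensor:
        # 1) Compensate the current gradient with the estimated compression error.
        err = self.error_avg.get(name, torch.zeros_like(grad))
        compensated = grad + err
        # 2) Compress the compensated gradient; the low-bit tensor is what would be communicated.
        compressed = self._quantize(compensated)
        # 3) Update the moving average of compensation errors for the next step.
        self.error_avg[name] = self.beta * err + (1.0 - self.beta) * (compensated - compressed)
        return compressed


# Hypothetical per-rank usage before gradient synchronization:
# compressor = LoCoCompressor()
# for name, p in model.named_parameters():
#     p.grad = compressor.compress(name, p.grad)
```

In a data-parallel setup, such a compressor would be applied on each rank just before the gradient all-reduce or all-gather, so that only low-bit tensors cross the interconnect while the accumulated error stays local.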
IEEE Trans Neural Netw Learn Syst, December 2024
Graph neural networks (GNNs) ushered in a new era of machine learning with interconnected datasets. While traditional neural networks can only be trained on independent samples, GNNs allow intersample interactions to be included in the training process. This gain, however, incurs additional memory cost, rendering most GNNs unscalable for real-world applications involving vast and complicated networks with tens of millions of nodes.
J Chem Phys, February 2025
Dipartimento di Chimica, Università di Torino, via Giuria 5, 10125 Torino, Italy.
We discuss the implementation strategy, numerical accuracy, and computational performance of the acceleration of linear algebra operations through graphics processing units (GPUs) for the self-consistent field driver of the Crystal electronic structure package for solid state density functional theory simulations. Accelerated tasks include matrix multiplication, diagonalization, and inversion, as well as Cholesky decomposition. The scaling of the implemented strategy over multiple accelerating devices is assessed in the range of 1-8 GPUs per node and found to be remarkably regular.
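For readers unfamiliar with what such GPU offloading of dense linear algebra looks like in practice, the sketch below uses PyTorch purely as an illustrative stand-in; the Crystal package has its own implementation, and none of this is its actual code. It dispatches the four kernels named above — matrix multiplication, diagonalization, inversion, and Cholesky decomposition — to a GPU when one is available.

```python
import torch

# Illustrative stand-in for GPU-accelerated dense linear algebra (not Crystal's code).
device = "cuda" if torch.cuda.is_available() else "cpu"

n = 2048
a = torch.randn(n, n, dtype=torch.float64, device=device)
s = a @ a.T + n * torch.eye(n, dtype=torch.float64, device=device)  # symmetric positive definite

prod = a @ a                             # matrix multiplication
eigvals, eigvecs = torch.linalg.eigh(s)  # diagonalization of a symmetric matrix
s_inv = torch.linalg.inv(s)              # matrix inversion
chol = torch.linalg.cholesky(s)          # Cholesky decomposition
```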
Sci Rep, February 2025
Research Institute of Electrical Communication, Tohoku University, Sendai, 980-8577, Japan.
Probabilistic computing using probabilistic bits (p-bits) presents an efficient alternative to traditional CMOS logic for complex problem-solving, including simulated annealing and machine learning. Realizing p-bits with emerging devices such as magnetic tunnel junctions introduces device variability, which was expected to negatively impact computational performance. However, this study reveals an unexpected finding: device variability can not only degrade but also enhance algorithm performance, particularly by leveraging timing variability.
Nat Comput Sci, November 2024
Université de Lille, CNRS, UMR 8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, France.
Massive sampling in AlphaFold enables access to increased structural diversity. In combination with its efficient confidence ranking, this unlocks elevated modeling capabilities for monomeric structures and foremost for protein assemblies. However, the approach struggles with GPU cost and data storage.