To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones, improving gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to the information lost in compression. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo maintains a moving average of historical compensation errors to stably estimate the current compression error, and uses this estimate to compensate the current gradient before compression, yielding a nearly lossless compression. This mechanism makes it compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on non-convex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models like LLaMAs and MoEs.
IEEE Trans Pattern Anal Mach Intell, February 2025. DOI: http://dx.doi.org/10.1109/TPAMI.2025.3544764
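The core mechanism of the abstract above — compensating each gradient with a running estimate of past compression errors before quantizing it — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the moving-average factor, and the simple symmetric low-bit quantizer are all assumptions standing in for whatever compressor is actually used.

```python
import torch


class LoCoCompressor:
    """Illustrative error-feedback sketch in the spirit of LoCo (not the paper's code)."""

    def __init__(self, beta: float = 0.9, bits: int = 8):
        self.beta = beta      # moving-average factor for historical compensation errors (assumed value)
        self.bits = bits
        self.error_avg = {}   # per-tensor moving average of compression errors

    def _quantize(self, x: torch.Tensor) -> torch.Tensor:
        # Simple symmetric uniform quantizer standing in for any low-bit compressor.
        qmax = 2 ** (self.bits - 1) - 1
        scale = x.abs().max().clamp_min(1e-12) / qmax
        return torch.round(x / scale).clamp_(-qmax, qmax) * scale

    def compress(self, name: str, grad: torch.Tensor) -> torch.Tensor:
        # 1) Compensate the current gradient with the estimated compression error.
        err = self.error_avg.get(name, torch.zeros_like(grad))
        compensated = grad + err
        # 2) Compress the compensated gradient; the low-bit tensor is what would be communicated.
        compressed = self._quantize(compensated)
        # 3) Update the moving average of compensation errors for the next step.
        self.error_avg[name] = self.beta * err + (1.0 - self.beta) * (compensated - compressed)
        return compressed


# Hypothetical per-rank usage before gradient synchronization:
# compressor = LoCoCompressor()
# for name, p in model.named_parameters():
#     p.grad = compressor.compress(name, p.grad)
```

In a data-parallel setup, such a compressor would be applied on each rank just before the gradient all-reduce or all-gather, so that only low-bit tensors cross the interconnect while the accumulated error stays local.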
IEEE Trans Neural Netw Learn Syst, December 2024
Graph neural networks (GNNs) ushered in a new era of machine learning with interconnected datasets. While traditional neural networks can only be trained on independent samples, GNNs allow intersample interactions to be included in the training process. This gain, however, incurs additional memory cost, rendering most GNNs unscalable for real-world applications involving vast and complicated networks with tens of millions of nodes.
J Chem Phys, February 2025
Dipartimento di Chimica, Università di Torino, via Giuria 5, 10125 Torino, Italy.
We discuss the implementation strategy, numerical accuracy, and computational performance of the acceleration of linear algebra operations through graphics processing units (GPUs) for the self-consistent field driver of the Crystal electronic structure package for solid state density functional theory simulations. Accelerated tasks include matrix multiplication, diagonalization, and inversion, as well as Cholesky decomposition. The scaling of the implemented strategy over multiple accelerating devices is assessed in the range of 1-8 GPUs per node and found to be remarkably regular.
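For readers unfamiliar with what such GPU offloading of dense linear algebra looks like in practice, the sketch below uses PyTorch purely as an illustrative stand-in; the Crystal package has its own implementation, and none of this is its actual code. It dispatches the four kernels named above — matrix multiplication, diagonalization, inversion, and Cholesky decomposition — to a GPU when one is available.

```python
import torch

# Illustrative stand-in for GPU-accelerated dense linear algebra (not Crystal's code).
device = "cuda" if torch.cuda.is_available() else "cpu"

n = 2048
a = torch.randn(n, n, dtype=torch.float64, device=device)
s = a @ a.T + n * torch.eye(n, dtype=torch.float64, device=device)  # symmetric positive definite

prod = a @ a                             # matrix multiplication
eigvals, eigvecs = torch.linalg.eigh(s)  # diagonalization of a symmetric matrix
s_inv = torch.linalg.inv(s)              # matrix inversion
chol = torch.linalg.cholesky(s)          # Cholesky decomposition
```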
Sci Rep, February 2025
Research Institute of Electrical Communication, Tohoku University, Sendai, 980-8577, Japan.
Probabilistic computing using probabilistic bits (p-bits) presents an efficient alternative to traditional CMOS logic for complex problem-solving, including simulated annealing and machine learning. Realizing p-bits with emerging devices such as magnetic tunnel junctions introduces device variability, which was expected to negatively impact computational performance. However, this study reveals an unexpected finding: device variability can not only degrade but also enhance algorithm performance, particularly by leveraging timing variability.
Nat Comput Sci, November 2024
Université de Lille, CNRS, UMR 8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, France.
Massive sampling in AlphaFold enables access to increased structural diversity. In combination with its efficient confidence ranking, this unlocks elevated modeling capabilities for monomeric structures and foremost for protein assemblies. However, the approach struggles with GPU cost and data storage.