Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training
Distributed deep learning (DL) training faces instability from GPU and node failures in multi-GPU clusters, necessitating robust fault recovery from model checkpoints. However, we find that existing works only consider node failures and fail to handle partial GPU unavailability, and they suffer from inefficient model checkpoint saving and loading, particularly when GPU availability changes.
This work presents Elastor, a fault-tolerant distributed DL training system featuring elastic and efficient model checkpointing. First, to accommodate partial GPU unavailability, we support heterogeneous model parallel partitioning so that training can elastically resume with any number of GPUs. Second, we devise a partition-agnostic and efficient model checkpointing method based on fine-grained tensor splits, which achieves seamless transitions across arbitrary partitionings. In addition, Elastor is equipped with a strategy search algorithm that automatically discovers the optimal model partitioning upon recovery, as well as a meticulous overlapping design that minimizes the overhead of periodic model checkpointing and data preprocessing. Experimental results show that Elastor enables quick model checkpointing and failure recovery while maintaining consistent training efficiency across varying GPU availability. Source code is available at https://github.com/PKU-DAIR/Hetu.
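To make the checkpointing idea concrete, below is a minimal sketch of partition-agnostic checkpointing via fine-grained tensor splits, assuming a PyTorch-style setting. All names here (save_splits, load_shard, NUM_SPLITS) are hypothetical illustrations of the general technique, not Elastor's actual API: each parameter is saved as many small chunks at a granularity finer than any parallel degree, so a checkpoint written under one partitioning can be loaded under another by concatenating only the chunks each new rank owns.

```python
# Hypothetical sketch of partition-agnostic checkpointing via
# fine-grained tensor splits; names and layout are illustrative only.
import torch

NUM_SPLITS = 16  # finest granularity; every candidate parallel degree should divide it

def save_splits(name: str, tensor: torch.Tensor, store: dict) -> None:
    """Save a parameter as NUM_SPLITS equal chunks along dim 0, so the
    checkpoint layout is independent of the partitioning used at save time."""
    for i, chunk in enumerate(torch.chunk(tensor, NUM_SPLITS, dim=0)):
        store[f"{name}/{i}"] = chunk.clone()

def load_shard(name: str, rank: int, world_size: int, store: dict) -> torch.Tensor:
    """Rebuild one rank's shard under a *new* partitioning by concatenating
    only the fine-grained chunks that rank owns; no full-tensor restore."""
    per_rank = NUM_SPLITS // world_size
    chunks = [store[f"{name}/{rank * per_rank + i}"] for i in range(per_rank)]
    return torch.cat(chunks, dim=0)

# Round-trip check: save once, then resume with a different GPU count (here 2).
store: dict = {}
weight = torch.randn(64, 32)
save_splits("layer0.weight", weight, store)
resumed = torch.cat(
    [load_shard("layer0.weight", r, world_size=2, store=store) for r in range(2)],
    dim=0,
)
assert torch.equal(resumed, weight)
```

Choosing a split granularity divisible by every feasible parallel degree is what would allow resuming on an arbitrary number of GPUs without first materializing the full tensor on a single device.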
Session schedule: Tue 3 Feb, 14:10 - 15:30 (Hobart time zone)
14:10, 20 min Talk (Main Conference): COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training. Xingchen Liu (University of Chinese Academy of Sciences), Haoran Kong (Chinese University of Hong Kong, Shenzhen), Hairui Zhao (Jilin University), Shengkai Lyu (University of Chinese Academy of Sciences), Zheng Wei (University of Chinese Academy of Sciences), Man Liu (University of Chinese Academy of Sciences), Xingjian Tian (University of Chinese Academy of Sciences), Liyang Zhao (University of Chinese Academy of Sciences), Zhuohan Chen (University of Chinese Academy of Sciences), Fakang Wang (Ant Group), Zizhong Chen (Chinese University of Hong Kong, Shenzhen), Zhan Wang (University of Chinese Academy of Sciences), Guangming Tan (University of Chinese Academy of Sciences), Dingwen Tao (Institute of Computing Technology, Chinese Academy of Sciences)

14:30, 20 min Talk (Main Conference): Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training. Xuanyu Wang (Peking University), Fangcheng Fu (Shanghai Jiao Tong University), Haoyang Li (Peking University), Hao Ge (Peking University), Sheng Lin (Peking University), Jiawen Niu (Peking University), Bin Cui (Peking University)

14:50, 20 min Talk (Main Conference): HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism. Geng Zhang (National University of Singapore), Shenggan Cheng (National University of Singapore), Xuanlei Zhao (National University of Singapore), Ziming Liu, Yang You (National University of Singapore)

15:10, 20 min Talk (Main Conference, Best Paper Nominee): CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training. Yida Gu (University of Chinese Academy of Sciences), Fakang Wang (Ant Group), Jianhao Fu (Ant Group), Zhenhang Sun (Ant Group), Qianyu Zhang (Ant Group), Hairui Zhao (Jilin University), Xingchen Liu (University of Chinese Academy of Sciences), Yang Tian (Ant Group), Wenjing Huang (University of Chinese Academy of Sciences), Zedong Liu (University of Chinese Academy of Sciences), Yifan Chen (Ant Group), Jinwu Yang (University of Chinese Academy of Sciences), Yueyuan Zhou (University of Chinese Academy of Sciences), Qian Zhao (Ant Group), Haoxu Li (University of Chinese Academy of Sciences), Tao Wang (Ant Group), Feng Yu (Ant Group), Zhan Wang (University of Chinese Academy of Sciences), Guangming Tan (University of Chinese Academy of Sciences), Dingwen Tao (Institute of Computing Technology, Chinese Academy of Sciences)