PPoPP 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

Distributed deep learning (DL) training faces instability from GPU/node failures in multi-GPU clusters, necessitating robust fault recovery from model checkpoints. However, we find that existing works only consider node failures and fail to handle partial GPU unavailability, and that they suffer from inefficient model checkpoint saving and loading, particularly when GPU availability changes. This work presents Elastor, a fault-tolerant distributed DL training system featuring elastic and efficient model checkpointing. First, to accommodate partial GPU unavailability, we support heterogeneous model parallel partitioning so that training can elastically resume with any number of GPUs. Second, we devise an efficient, partition-agnostic model checkpointing method based on fine-grained tensor splits, enabling seamless transitions across arbitrary partitionings. In addition, Elastor is equipped with a strategy search algorithm that automatically discovers the optimal model partitioning upon recovery, as well as a careful overlapping design that minimizes the overhead of periodic model checkpointing and data preprocessing. Experimental results show that Elastor enables quick model checkpointing and failure recovery while maintaining consistent training efficiency across varying GPU availability.
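
To illustrate the general idea behind partition-agnostic checkpointing via fine-grained tensor splits, the following is a minimal sketch (not Elastor's actual implementation): parameters are saved as fixed-size 1-D slices keyed by name and offset, so a checkpoint written under one parallel partitioning can be reassembled and re-sharded for a different number of GPUs on recovery. The function names, slice granularity, and even-split resharding are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of fine-grained tensor-split checkpointing (assumed design,
# not Elastor's code): save uniform 1-D slices of every parameter, then
# reassemble and re-shard them for an arbitrary new world size on recovery.
import torch

SPLIT_NUMEL = 1 << 20  # assumed slice granularity: 1M elements


def save_split_checkpoint(state_dict, path):
    """Flatten each tensor and store it as uniform fine-grained slices."""
    slices = {}
    for name, tensor in state_dict.items():
        flat = tensor.detach().cpu().reshape(-1)
        for offset in range(0, flat.numel(), SPLIT_NUMEL):
            slices[(name, offset)] = flat[offset:offset + SPLIT_NUMEL].clone()
    meta = {name: (tuple(t.shape), t.dtype) for name, t in state_dict.items()}
    torch.save({"meta": meta, "slices": slices}, path)


def load_shard_for_rank(path, rank, world_size):
    """Reassemble full tensors from slices, then take this rank's shard
    under the new (possibly different) partitioning."""
    ckpt = torch.load(path)
    shard = {}
    for name, (shape, dtype) in ckpt["meta"].items():
        numel = 1
        for dim in shape:
            numel *= dim
        flat = torch.empty(numel, dtype=dtype)
        for offset in range(0, numel, SPLIT_NUMEL):
            piece = ckpt["slices"][(name, offset)]
            flat[offset:offset + piece.numel()] = piece
        # Simple even re-split over the new world size (padding/uneven
        # cases are ignored in this sketch).
        shard[name] = flat.chunk(world_size)[rank]
    return shard
```

Because the slices are independent of how tensors were partitioned when the checkpoint was written, the same checkpoint can be loaded whether recovery proceeds with the original GPU count or with fewer GPUs after a partial failure.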