PPoPP 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026
Tue 3 Feb 2026 15:10 - 15:30 at Balmoral - Distributed Training Chair(s): Bo Fang

As training scales grow, collective communication libraries (CCLs) increasingly encounter anomalies that arise from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow or hung communication, the most frequent and most time-consuming category to diagnose. Traditional diagnostic methods, however, remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics, using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster for one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
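
The abstract does not describe CCL-D's algorithm, so the following is only an illustrative sketch of the general idea of locating a faulty rank from per-rank probe data; it is not the authors' method. It assumes each rank reports the index of the latest collective it has entered and the time it entered it. The names `RankSample` and `diagnose`, and the thresholds `hang_timeout_s` and `slow_lag_s`, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RankSample:
    """One hypothetical probe report from a single GPU rank."""
    rank: int
    op_seq: int       # index of the latest collective this rank has entered
    enter_ts: float   # timestamp (seconds) at which it entered that collective


def diagnose(samples: List[RankSample], now: float,
             hang_timeout_s: float = 60.0,
             slow_lag_s: float = 1.0) -> Tuple[str, Optional[int]]:
    """Return ("hang" | "slow" | "ok", suspect rank or None).

    Assumed heuristic (not taken from the paper):
    - hang: some ranks never entered the newest collective while the
      others have been blocked in it for longer than hang_timeout_s;
      the suspect is the rank that is furthest behind.
    - slow: every rank entered the same collective, but the last
      arrival lags the first by more than slow_lag_s.
    """
    newest = max(s.op_seq for s in samples)
    waiting = [s for s in samples if s.op_seq == newest]
    behind = [s for s in samples if s.op_seq < newest]

    if behind:
        longest_wait = now - min(s.enter_ts for s in waiting)
        if longest_wait > hang_timeout_s:
            suspect = min(behind, key=lambda s: s.op_seq)
            return ("hang", suspect.rank)
        return ("ok", None)

    first_ts = min(s.enter_ts for s in samples)
    last = max(samples, key=lambda s: s.enter_ts)
    if last.enter_ts - first_ts > slow_lag_s:
        return ("slow", last.rank)
    return ("ok", None)


# Usage: rank 2 never entered collective 5 while ranks 0 and 1 wait in it.
hang_case = [RankSample(0, 5, 100.0), RankSample(1, 5, 100.1),
             RankSample(2, 4, 90.0)]
print(diagnose(hang_case, now=200.0))
```

A production system would need many more signals than this (the abstract mentions cross-layer metrics), but the sketch shows why rank-level data makes the faulty rank identifiable: the blocked ranks all agree on which collective they are waiting in, and the outlier is the one that has not arrived.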

Tue 3 Feb


14:10 - 15:30
Distributed Training (Main Conference) at Balmoral
Chair(s): Bo Fang University of Texas at Arlington
14:10
20m
Talk
COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training
Main Conference
Xingchen Liu University of Chinese Academy of Sciences, Haoran Kong Chinese University of Hong Kong, Shenzhen, Hairui Zhao Jilin University, Shengkai Lyu University of Chinese Academy of Sciences, Zheng Wei University of Chinese Academy of Sciences, Man Liu University of Chinese Academy of Sciences, Xingjian Tian University of Chinese Academy of Sciences, Liyang Zhao University of Chinese Academy of Sciences, Zhuohan Chen University of Chinese Academy of Sciences, Fakang Wang Ant Group, Zizhong Chen Chinese University of Hong Kong, Shenzhen, Zhan Wang University of Chinese Academy of Sciences, Guangming Tan University of Chinese Academy of Sciences, Dingwen Tao Institute of Computing Technology, Chinese Academy of Sciences
14:30
20m
Talk
Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training
Main Conference
Xuanyu Wang Peking University, Fangcheng Fu Shanghai Jiao Tong University, Haoyang Li Peking University, Hao Ge Peking University, Sheng Lin Peking University, Jiawen Niu Peking University, Bin Cui Peking University
14:50
20m
Talk
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
Main Conference
Geng Zhang National University of Singapore, Shenggan Cheng National University of Singapore, Xuanlei Zhao National University of Singapore, Ziming Liu, Yang You National University of Singapore
15:10
20m
Talk
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training (Best Paper Nominee)
Main Conference
Yida Gu University of Chinese Academy of Sciences, Fakang Wang Ant Group, Jianhao Fu Ant Group, Zhenhang Sun Ant Group, Qianyu Zhang Ant Group, Hairui Zhao Jilin University, Xingchen Liu University of Chinese Academy of Sciences, Yang Tian Ant Group, Wenjing Huang University of Chinese Academy of Sciences, Zedong Liu University of Chinese Academy of Sciences, Yifan Chen Ant Group, Jinwu Yang University of Chinese Academy of Sciences, Yueyuan Zhou University of Chinese Academy of Sciences, Qian Zhao Ant Group, Haoxu Li University of Chinese Academy of Sciences, Tao Wang Ant Group, Feng Yu Ant Group, Zhan Wang University of Chinese Academy of Sciences, Guangming Tan University of Chinese Academy of Sciences, Dingwen Tao Institute of Computing Technology, Chinese Academy of Sciences