COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training
Collective communication is critical to scaling large language model (LLM) training across various parallelism strategies, including data, tensor, and pipeline parallelism, on GPU clusters. However, as model sizes and training scales increase, communication overhead is emerging as a major performance bottleneck. While compression is a promising mitigation strategy, existing solutions are often not transparent to users, hinder deployment and extensibility, and are not co-designed with communication algorithms. To address these limitations, we present COCCL, a high-performance collective communication library built on top of NCCL. COCCL introduces a novel programming model that easily integrates compression into communication workflows with flexible configurability. It features a suite of compression-aware collective algorithms and runtime overlap mechanisms that mitigate error propagation and reduce computational overhead. We integrate well-established compression techniques into COCCL and tune the compression configurations during 3D-parallel training of GPT and Qwen models with up to 7 billion parameters. Using the optimal configuration (COCCL-3D), we achieve a 1.24$\times$ throughput improvement while maintaining training accuracy.
Authors: Xingchen Liu (University of Chinese Academy of Sciences), Haoran Kong (Chinese University of Hong Kong, Shenzhen), Hairui Zhao (Jilin University), Shengkai Lyu (University of Chinese Academy of Sciences), Zheng Wei (University of Chinese Academy of Sciences), Man Liu (University of Chinese Academy of Sciences), Xingjian Tian (University of Chinese Academy of Sciences), Liyang Zhao (University of Chinese Academy of Sciences), Zhuohan Chen (University of Chinese Academy of Sciences), Fakang Wang (Ant Group), Zizhong Chen (Chinese University of Hong Kong, Shenzhen), Zhan Wang (University of Chinese Academy of Sciences), Guangming Tan (University of Chinese Academy of Sciences), Dingwen Tao (Institute of Computing Technology, Chinese Academy of Sciences)
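The abstract above describes a programming model for plugging customized compression into collective communication, but COCCL's actual API is not shown there. The following is only a minimal, self-contained sketch of what such a model could look like: a user-supplied compressor interface attached to a collective call, with the compress/decompress steps placed by the library inside the collective rather than by the training code. The `Compressor` interface, the `CastDown16` example compressor, and the `coccl_allreduce_sum` function are hypothetical names invented for illustration, and the communication is simulated on host memory instead of calling NCCL.

```cpp
// Illustrative sketch only; not COCCL's real API. Shows a pluggable compressor
// applied around a (simulated) allreduce, with each buffer compressed once before
// "transmission" and decompressed once before reduction.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Hypothetical pluggable-compressor interface: the user supplies compress/decompress,
// the library decides where to invoke them inside the collective algorithm.
struct Compressor {
    virtual std::vector<uint8_t> compress(const std::vector<float>& in) const = 0;
    virtual std::vector<float> decompress(const std::vector<uint8_t>& in, size_t n) const = 0;
    virtual ~Compressor() = default;
};

// Example compressor: keep only the upper 16 bits of each FP32 value
// (a bfloat16-style cast-down), halving the transmitted volume.
struct CastDown16 : Compressor {
    std::vector<uint8_t> compress(const std::vector<float>& in) const override {
        std::vector<uint8_t> out(in.size() * 2);
        for (size_t i = 0; i < in.size(); ++i) {
            uint32_t bits;
            std::memcpy(&bits, &in[i], 4);
            uint16_t hi = static_cast<uint16_t>(bits >> 16);
            std::memcpy(&out[i * 2], &hi, 2);
        }
        return out;
    }
    std::vector<float> decompress(const std::vector<uint8_t>& in, size_t n) const override {
        std::vector<float> out(n);
        for (size_t i = 0; i < n; ++i) {
            uint16_t hi;
            std::memcpy(&hi, &in[i * 2], 2);
            uint32_t bits = static_cast<uint32_t>(hi) << 16;
            std::memcpy(&out[i], &bits, 4);
        }
        return out;
    }
};

// Hypothetical compression-aware allreduce: each rank's contribution is compressed
// and decompressed exactly once, so quantization error is not re-applied at every
// hop (one way a library might limit error propagation inside the collective).
std::vector<float> coccl_allreduce_sum(const std::vector<std::vector<float>>& rank_buffers,
                                       const Compressor& comp) {
    const size_t n = rank_buffers[0].size();
    std::vector<float> result(n, 0.0f);
    for (const auto& buf : rank_buffers) {
        auto wire = comp.compress(buf);             // payload actually sent over the network
        auto recovered = comp.decompress(wire, n);  // decompressed at the receiver
        for (size_t i = 0; i < n; ++i) result[i] += recovered[i];
    }
    return result;
}

int main() {
    // Two simulated ranks contributing gradient buffers.
    std::vector<std::vector<float>> ranks = {{0.1f, 1.5f, -2.25f}, {0.3f, -0.5f, 4.0f}};
    CastDown16 comp;
    auto sum = coccl_allreduce_sum(ranks, comp);
    for (float v : sum) std::cout << v << " ";  // approximately 0.4 1 1.75
    std::cout << "\n";
    return 0;
}
```

In a real deployment the compressed buffers, rather than the full-precision originals, would traverse the interconnect, which is where both the bandwidth savings and the need for compression-aware collective algorithms (deciding where in a ring or tree the compress/decompress kernels run, and how to overlap them with communication) come from.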