ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing (PPoPP 2026 - Main Conference)

Who

Shuhong Huang, Shizhi Tang, Yuan Wen, Huanqi Cao, Ruibai Tang, Yidong Chen, Jiping Yu, Yang Li, Chao Jiang, Limin Xiao, Jidong Zhai

Track

PPoPP 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 3 Feb 2026 14:30 - 14:50 at Pyrmont - Parallel Algorithms Chair(s): Kenjiro Taura

Abstract

Automatic Differentiation (AD) is a technique that computes the derivatives of numerical programs by systematically applying the chain rule, and it plays a critical role in domains such as machine learning, simulation, and control systems. However, parallelizing differentiated programs remains a significant challenge because of the conflict between tapes (a data structure for intermediate variable storage) and summations. The differentiation process inherently introduces inter-thread summation patterns, which require prohibitively expensive atomic operations, and traditional tape designs tightly couple data retrieval with the program’s control flow, preventing the code restructuring needed to eliminate these costly dependencies.
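The summation pattern described above can be illustrated with a small sketch. It assumes a hypothetical forward computation `y[i] = x[idx[i]] * w[i]`, in which several outputs may read the same input. The function names and the input-owned restructuring are illustrative assumptions chosen for this example; they are not taken from ParDiff and are not necessarily the transformation it applies.

```python
# Hypothetical forward pass: several outputs i may read the same input x[j].
def forward(x, w, idx):
    return [x[j] * w[i] for i, j in enumerate(idx)]


def backward_scatter(w, idx, dy, n):
    """Directly reversed loop: iterate over outputs and accumulate into dx.

    If the iterations over i were split across threads, two threads could
    update the same dx[j], so each `+=` would have to be atomic.
    """
    dx = [0.0] * n
    for i, j in enumerate(idx):
        dx[j] += dy[i] * w[i]
    return dx


def backward_gather(w, idx, dy, n):
    """Restructured loop: iterate over inputs, each of which owns its dx[j].

    Every dx[j] is written by exactly one iteration, so the outer loop over j
    could be parallelized without atomics.
    """
    readers = [[] for _ in range(n)]  # readers[j] = outputs that read x[j]
    for i, j in enumerate(idx):
        readers[j].append(i)
    return [sum(dy[i] * w[i] for i in readers[j]) for j in range(n)]
```

Both backward functions compute the same gradient; they differ only in which loop owns the accumulation, and therefore in whether parallel threads would contend on `dx`.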

To address these challenges, we present ParDiff, a novel AD system with a direct-indexed tape design, which enables summation-aware loop transformations and various parallel schemes for differentiated programs. This results in a higher degree of parallelization, less synchronization, and reduced inter-thread data movement. We conduct comprehensive experiments on both multi-core CPUs and GPUs. Results show that ParDiff delivers up to 483.21× (geometric mean: 30.88×) speedup over the state-of-the-art fully-AD system, Enzyme. It also achieves speedups of 2.05× and 2.06× over PyTorch on CPU and GPU, respectively. The source code is publicly available at https://github.com/roastduck/FreeTensor.
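As a rough illustration of why tape layout matters, the sketch below contrasts a stack-style tape, which must be read in exactly the reverse of the forward order, with a tape addressed by loop index, which can be read in any order. This is one interpretation of "direct indexing" made for the sake of the example. The functions, the choice of `y = sin(x)^2`, and the `order` parameter are all hypothetical and do not reflect ParDiff's actual implementation.

```python
import math


def forward_stack(x):
    """Forward pass of y = sin(x)^2 that pushes intermediates on a LIFO tape."""
    tape, y = [], []
    for v in x:
        t = math.sin(v)  # intermediate value needed by the backward pass
        tape.append(t)
        y.append(t * t)
    return y, tape


def backward_stack(x, tape, dy):
    """Backward pass that must pop in exact reverse of the forward order."""
    dx = [0.0] * len(x)
    for i in reversed(range(len(x))):
        t = tape.pop()  # retrieval is tied to the loop order
        dx[i] = dy[i] * 2 * t * math.cos(x[i])
    return dx


def forward_indexed(x):
    """Same forward pass, but intermediates are stored at tape[i]."""
    tape = [math.sin(v) for v in x]
    y = [t * t for t in tape]
    return y, tape


def backward_indexed(x, tape, dy, order):
    """Backward pass that may visit iterations in any order.

    Because tape[i] is addressed by the loop index, the loop can be
    reordered or split across threads without changing what is read.
    """
    dx = [0.0] * len(x)
    for i in order:
        dx[i] = dy[i] * 2 * tape[i] * math.cos(x[i])
    return dx
```

For example, calling `backward_indexed(x, tape, dy, [2, 0, 1])` yields the same gradient as `backward_stack`, even though the iterations are visited in a different order.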

DOI

https://doi.org/10.1145/3774934.3786418

Shuhong Huang

Tsinghua University

Shizhi Tang

Qingcheng.AI

Yuan Wen

University of Aberdeen

United Kingdom

Huanqi Cao

Tsinghua University

China

Ruibai Tang

Tsinghua University

Yidong Chen

Jiping Yu

Tsinghua University

Yang Li

Lenovo Research

Chao Jiang

Lenovo Research

Limin Xiao

Lenovo Research

Jidong Zhai

Tsinghua University

Session Program

Tue 3 Feb
Displayed time zone: Hobart

14:10 - 15:30	Parallel Algorithms (Main Conference) at Pyrmont. Chair(s): Kenjiro Taura (The University of Tokyo)

14:10 20m Talk		Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols. Main Conference. Zhiyuan Zhang (Shandong University), Yanxin Cai (Shandong University), Wenhao Yin (Shandong University), Xueyu Wu (The University of Hong Kong), Yi Wang (Shenzhen University), Lei Ju (Shandong University), Zhuoran Ji (Shandong University)
14:30 20m Talk		ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing. Main Conference. Shuhong Huang (Tsinghua University), Shizhi Tang (Qingcheng.AI), Yuan Wen (University of Aberdeen), Huanqi Cao (Tsinghua University), Ruibai Tang (Tsinghua University), Yidong Chen, Jiping Yu (Tsinghua University), Yang Li (Lenovo Research), Chao Jiang (Lenovo Research), Limin Xiao (Lenovo Research), Jidong Zhai (Tsinghua University)
14:50 20m Talk		Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs. Main Conference. Zhonghai Zhang (Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences), Yewen Li (The Hong Kong University of Science and Technology), Ke Meng (Chinese Academy of Sciences), Chunming Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Guangming Tan (University of Chinese Academy of Sciences)
15:10 20m Talk		PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory. Main Conference. Yiwei Zhao (Carnegie Mellon University), Hongbo Kang (Tsinghua University), Ziyang Men (University of California, Riverside), Yan Gu (University of California, Riverside), Guy E. Blelloch (Carnegie Mellon University), Laxman Dhulipala (University of Maryland, College Park), Charles McGuffey (Reed College), Phil Gibbons (Carnegie Mellon University)