ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing
Automatic Differentiation (AD) is a technique that computes the derivatives of numerical programs by systematically applying the chain rule, and it plays a critical role in domains such as machine learning, simulation, and control systems. However, parallelizing differentiated programs remains a significant challenge due to the \textbf{conflict between tapes (a data structure for intermediate variable storage) and summations}. The differentiation process inherently introduces inter-thread summation patterns, which require prohibitively expensive atomic operations, while traditional tape designs tightly couple data retrieval with the program’s control flow, preventing the code restructuring needed to eliminate these costly dependencies.
To address these challenges, we present ParDiff, a novel AD system with a direct-indexed tape design, which enables summation-aware loop transformations and various parallel schemes for differentiated programs. This results in a higher degree of parallelization, less synchronization, and reduced inter-thread data movement. We conduct comprehensive experiments on both multi-core CPUs and GPUs. Results show that ParDiff delivers up to $483.21\times$ (geometric mean: $30.88\times$) speedup over the state-of-the-art fully-AD system, Enzyme. It also achieves speedups of $2.05\times$ and $2.06\times$ over PyTorch on CPU and GPU, respectively. The source code is publicly available at \url{https://github.com/roastduck/FreeTensor}.
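The tape/summation conflict described in the abstract can be illustrated with a small, self-contained Python sketch. This is not ParDiff's implementation or API: the function names (`forward`, `backward_stack`, `backward_indexed`) and the example computation `y[i] = sum_j tanh(a[i] * b[j])` are assumptions chosen only for illustration. The sketch contrasts a stack-style tape, which must be popped in exactly the reverse of the forward order, with a tape addressed directly by loop indices, which lets the backward pass be restructured so that each gradient entry is accumulated by a single loop iteration.

```python
import math

def forward(a, b):
    """Compute y[i] = sum_j tanh(a[i] * b[j]) and record tanh values on two tapes."""
    stack_tape = []                                  # order-dependent tape (push/pop)
    indexed_tape = [[0.0] * len(b) for _ in a]       # tape addressed by (i, j)
    y = [0.0] * len(a)
    for i in range(len(a)):
        for j in range(len(b)):
            t = math.tanh(a[i] * b[j])
            stack_tape.append(t)
            indexed_tape[i][j] = t
            y[i] += t
    return y, stack_tape, indexed_tape

def backward_stack(a, b, dy, stack_tape):
    """Backward pass tied to the reversed forward order by the stack tape."""
    tape = list(stack_tape)
    da = [0.0] * len(a)
    db = [0.0] * len(b)
    for i in reversed(range(len(a))):
        for j in reversed(range(len(b))):
            t = tape.pop()                           # only valid in this exact order
            g = (1.0 - t * t) * dy[i]
            da[i] += g * b[j]
            # Every i-iteration writes db[j]: running the i loop in parallel
            # would turn this line into an inter-thread (atomic) summation.
            db[j] += g * a[i]
    return da, db

def backward_indexed(a, b, dy, tape):
    """Backward pass split into two loop nests, possible because tape[i][j]
    can be read in any order. Each output entry has a single writer."""
    da = [0.0] * len(a)
    db = [0.0] * len(b)
    for i in range(len(a)):                          # iterations over i are independent
        for j in range(len(b)):
            da[i] += (1.0 - tape[i][j] ** 2) * dy[i] * b[j]
    for j in range(len(b)):                          # iterations over j are independent
        for i in range(len(a)):
            db[j] += (1.0 - tape[i][j] ** 2) * dy[i] * a[i]
    return da, db
```

Both backward functions compute the same gradients. In the indexed variant, the outer loop of each nest could be distributed across threads without atomics, at the cost of reading the tape twice; whether that trade-off pays off in practice depends on the program and hardware, and the sketch makes no claim about how ParDiff chooses its transformations.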
Tue 3 Feb (displayed time zone: Hobart)
14:10 - 15:30
14:10 | 20m | Talk | Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols | Main Conference | Zhiyuan Zhang (Shandong University), Yanxin Cai (Shandong University), Wenhao Yin (Shandong University), Xueyu Wu (The University of Hong Kong), Yi Wang (Shenzhen University), Lei Ju (Shandong University), Zhuoran Ji (Shandong University)
14:30 | 20m | Talk | ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing | Main Conference | Shuhong Huang (Tsinghua University), Shizhi Tang (Qingcheng.AI), Yuan Wen (University of Aberdeen), Huanqi Cao (Tsinghua University), Ruibai Tang (Tsinghua University), yidong chen, Jiping Yu (Tsinghua University), Yang Li (Lenovo Research), Chao Jiang (Lenovo Research), Limin Xiao (Lenovo Research), Jidong Zhai (Tsinghua University)
14:50 | 20m | Talk | Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs | Main Conference | Zhonghai Zhang (Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences), Yewen Li (The Hong Kong University of Science and Technology), Ke Meng (Chinese Academy of Sciences), Chunming Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Guangming Tan (University of Chinese Academy of Sciences)
15:10 | 20m | Talk | PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory | Main Conference | Yiwei Zhao (Carnegie Mellon University), Hongbo Kang (Tsinghua University), Ziyang Men (University of California, Riverside), Yan Gu (University of California, Riverside), Guy E. Blelloch (Carnegie Mellon University), Laxman Dhulipala (University of Maryland, College Park), Charles McGuffey (Reed College), Phil Gibbons (Carnegie Mellon University)