Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters (Best Paper Nominee)
Sparse direct solvers are critical building blocks in a range of scientific applications on heterogeneous supercomputers. However, existing sparse direct solvers have not been able to fully leverage the high bandwidth and floating-point performance of modern GPUs. The primary challenges are twofold: (1) the absence of a mechanism for aggregating small tasks to saturate the GPU, and (2) the lack of a mechanism for executing a diverse set of small tasks in batch mode on a single GPU.
In this paper, we propose a strategy called Trojan Horse, which significantly enhances the execution efficiency of sparse direct solvers on GPU clusters. This mechanism divides each process's work into two stages: Aggregate (with two modules, Prioritizer and Container) and Batch (with two modules, Collector and Executor). In the Aggregate stage, a process first assesses the urgency of incoming tasks through the Prioritizer module and, based on their priority, sends them to the Collector module or the Container module. In the Batch stage, the Collector module receives high-priority heterogeneous tasks from the Prioritizer module and retrieves enough additional tasks from the Container module to send them to the Executor module for batch execution on the GPU. In addition, our strategy is independent of solver libraries and has been integrated into SuperLU_DIST and PanguLU.
In the scale-up evaluation on a single NVIDIA A100 GPU, the Trojan Horse strategy delivers speedups of up to 418.79x (5.47x on average) for SuperLU_DIST and up to 5.59x (2.84x on average) for PanguLU. In the scale-out evaluation on two 16-GPU clusters from NVIDIA and AMD, respectively, Trojan Horse continues to deliver strong performance gains for both SuperLU_DIST and PanguLU across different GPU counts.
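The Aggregate-and-Batch flow described in the abstract can be pictured as a host-side scheduling loop. The sketch below is our own illustration, not the paper's implementation: the module names (Prioritizer, Container, Collector, Executor) follow the abstract, while the class interface, the urgency threshold, and the batch size are hypothetical.

```python
import heapq

class TrojanHorsePipeline:
    """Illustrative sketch of the Aggregate-and-Batch stages (not the paper's code)."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size   # hypothetical batch capacity
        self.container = []            # Container: low-priority tasks, kept as a min-heap
        self.collector = []            # Collector: high-priority tasks awaiting launch
        self.executed_batches = []     # record of what the Executor would launch

    def prioritize(self, task, urgency):
        # Prioritizer: urgent tasks go straight to the Collector;
        # the rest are aggregated in the Container.
        if urgency >= 1:               # hypothetical urgency threshold
            self.collector.append(task)
        else:
            heapq.heappush(self.container, (urgency, task))
        if len(self.collector) >= self.batch_size:
            self._launch()

    def _launch(self):
        # The Collector tops up a partially filled batch from the Container,
        # then the Executor would launch the whole batch as one GPU kernel.
        batch = self.collector[:self.batch_size]
        self.collector = self.collector[self.batch_size:]
        while len(batch) < self.batch_size and self.container:
            batch.append(heapq.heappop(self.container)[1])
        self.executed_batches.append(batch)

    def flush(self):
        # Drain any remaining tasks at the end of a factorization phase.
        while self.collector or self.container:
            self._launch()
```

The point of the top-up step is that a high-priority task never waits for stragglers: the batch is padded with deferred Container tasks so each GPU launch stays full.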
Tue 3 Feb
11:30 - 12:50 | Cluster and Cloud Computing (Main Conference at Pyrmont). Chair(s): Ruslan Nikolaev, Pennsylvania State University
11:30 (20m, Talk) | Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds. Xiaokang Hu (Alibaba Cloud Computing), Yuchao Cao (Alibaba Cloud Computing), Naixuan Guan (Alibaba Cloud Computing), Yifan Wu (Alibaba Cloud Computing), Xishi Qiu (Alibaba Cloud Computing), Shengdong Dai (Alibaba Cloud Computing), Ben Luo (Alibaba Cloud Computing), Sanchuan Cheng (Alibaba Cloud Computing), Fudong Qiu (Alibaba Cloud Computing), Yibin Shen (Alibaba Cloud), Jiesheng Wu (Alibaba Cloud Computing)
11:50 (20m, Talk) | zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection. Xiangyu Liu (Xiamen University), Huiba Li (Alibaba), Shun Gai (Alibaba), Youmin Chen (Shanghai Jiao Tong University), Yiming Zhang (Xiamen University)
12:10 (20m, Talk) | Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters
12:30 (20m, Talk) | Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters (Best Paper Nominee). Yida Li (China University of Petroleum-Beijing), Siwei Zhang (China University of Petroleum-Beijing), Yiduo Niu (China University of Petroleum-Beijing), Yang Du (China University of Petroleum-Beijing), Qingxiao Sun (China University of Petroleum-Beijing), Zhou Jin (China University of Petroleum-Beijing), Weifeng Liu (China University of Petroleum-Beijing)