ElasGNN: An Elastic Training Framework for Distributed GNN Training
Graph Neural Networks (GNNs) have emerged as powerful machine learning models for numerous graph-based applications. However, existing GNN training frameworks cannot scale the training process elastically, resulting in poor training throughput and low cluster utilization. Although elastic training has been proposed for Deep Neural Networks (DNNs), it cannot be directly applied to GNNs due to the prohibitive scaling cost and inefficient scheduling. In this paper, we present ElasGNN, an elastic GNN training framework that achieves efficient dynamic resource allocation for GNN jobs. ElasGNN proposes an efficient elastic training engine to achieve high-performance GNN job scaling and introduces novel graph repartitioning algorithms for both scale-in and scale-out processes to further minimize the scaling cost. Moreover, ElasGNN designs an efficient elastic scheduler that uses a scaling-cost-aware scheduling policy to improve GPU utilization and system throughput. Experimental results show that ElasGNN achieves shorter job completion time and makespan for training jobs of diverse GNN models.
Tue 3 Feb (displayed time zone: Hobart)
15:50 - 17:10 | Graphs and Graph Neural Networks | Main Conference at Pyrmont | Chair(s): Ali Jannesari (Iowa State University)
15:50 | 20m | Talk | ElasGNN: An Elastic Training Framework for Distributed GNN Training | Main Conference | Siqi Wang (Beihang University), Hailong Yang (Beihang University), Pengbo Wang (Beihang University), Hongliang Cao (Beihang University), Yufan Xu (Independent Researcher), Xuezhu Wang (Beihang University), Zhongzhi Luan (Beihang University), Yi Liu (Beihang University), Depei Qian (Beihang University)
16:10 | 20m | Talk | APERTURE: Algorithm-System Co-optimization for Temporal Graph Network Inference | Main Conference | Yiqing Wang (Beihang University), Hailong Yang (Beihang University), Enze Yu (Beihang University), Qingxiao Sun (Beihang University), Kejie Ma (Beihang University), Kaige Zhang (Beihang University), chenhao xie (Beihang University), Depei Qian (Beihang University)
16:30 | 20m | Talk | TAC: Cache-Based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform | Main Conference | Zhiqiang Liang, Hongyu Gao, Fang Liu (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Jue Wang (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Xingguo Shi (University of Chinese Academy of Sciences), Juyu Gu (University of Chinese Academy of Sciences), Peng Di (Ant Group & UNSW), San Li (University of Chinese Academy of Sciences), Lei Tang (University of Chinese Academy of Sciences), Chunbao Zhou (University of Chinese Academy of Sciences), Lian Zhao (University of Chinese Academy of Sciences), yangang wang (University of Chinese Academy of Sciences), Xuebin Chi (University of Chinese Academy of Sciences)
16:50 | 20m | Talk | DTMiner: A Data-Centric System for Efficient Temporal Motif Mining | Main Conference | hou yinbo (Huazhong University of Science and Technology), Hao Qi (Huazhong University of Science and Technology), Ligang He (University of Warwick), Jin Zhao (Huazhong University of Science and Technology), Yu Zhang (School of Computer Science and Technology, Huazhong University of Science and Technology), Hui Yu (Hong Kong University of Science and Technology), Longlong Lin (Southwest University), Lin Gu (Huazhong University of Science and Technology), Wenbin Jiang (Huazhong University of Science and Technology), XIAOFEI LIAO (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology)