Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores
Sparse matrix-vector multiplication (SpMV) is a fundamental operation in scientific computing, machine learning, and graph analytics, demanding efficient execution on modern hardware. Recent hardware accelerators such as Tensor Cores have significantly improved the performance of many compute-intensive workloads. However, effectively utilizing Tensor Cores for SpMV remains challenging due to irregular sparsity patterns and the mismatch between SpMV's computational characteristics and the constrained Tensor Core architecture, leading to suboptimal performance and underutilization of Tensor Cores. In this paper, we systematically analyze state-of-the-art SpMV optimizations on Tensor Cores, identify key performance bottlenecks, and propose Drawloom, a Tensor-Core-aware SpMV framework with efficient Tensor Core mapping and optimized pipelined execution. Drawloom leverages a redesigned Tensor Core mapping strategy built on a zig-zag chained sparse storage format, together with a multi-stage register pipeline that better exploits hardware parallelism. Our evaluation on the SuiteSparse dataset demonstrates that Drawloom outperforms cuSPARSE by 2.71$\times$/1.90$\times$ (FP16), 2.95$\times$/2.39$\times$ (FP32), and 2.47$\times$/1.54$\times$ (FP64) on A100 and H100 GPUs, respectively. Compared to state-of-the-art SpMV implementations, Drawloom achieves speedups of 1.26$\times$/1.18$\times$ (FP16) and 1.49$\times$/1.56$\times$ (FP64) on A100 and H100 GPUs, respectively.
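To make the irregularity the abstract refers to concrete, the sketch below shows a plain CSR (compressed sparse row) SpMV. This is a generic baseline, not Drawloom's zig-zag chained format: the variable-length rows and the indirect, data-dependent accesses `x[col_idx[k]]` are exactly the access pattern that maps poorly onto the fixed-shape dense tiles Tensor Cores consume.

```python
import numpy as np

def csr_spmv(row_ptr, col_idx, vals, x):
    """Baseline CSR SpMV: y = A @ x for a sparse matrix A.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i;
    col_idx holds their column indices, vals their values.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=vals.dtype)
    for i in range(n_rows):
        # Rows have irregular lengths, and x is gathered through
        # col_idx -- the source of load imbalance and poor locality.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] in CSR form:
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.array([1.0, 1.0, 1.0])
y = csr_spmv(row_ptr, col_idx, vals, x)  # [3.0, 3.0, 9.0]
```

A Tensor-Core-oriented format instead has to pack these scattered nonzeros into dense fragments of a fixed MMA shape; the paper's contribution is a mapping and pipeline that do this efficiently.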
Tue 3 Feb (time zone: Hobart)
09:50 - 11:10 | Stencil and Sparse Matrix Computation (Main Conference, Pyrmont). Chair(s): Shoaib Kamil (Adobe Research)

09:50 (20m, Talk) | SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping. Qiqi Gu (Shanghai Jiao Tong University), Chenpeng Wu (Shanghai Jiao Tong University), Heng Shi, Jianguo Yao (Shanghai Jiao Tong University; Shanghai Enflame Technology)

10:10 (20m, Talk) | ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration. Jiazhi Jiang (Sun Yat-sen University), Xijia Yao (Sun Yat-sen University), Jiayu Chen (Sun Yat-sen University), Jinhui Wei (Sun Yat-sen University), Dan Huang, Yutong Lu (Sun Yat-sen University)

10:30 (20m, Talk) | Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores. Kaige Zhang (Beihang University), Hailong Yang (Beihang University), Xin You (Beihang University), Tianyu Feng (Beihang University), Yufan Xu (Independent Researcher), Zhongzhi Luan (Beihang University), Yi Liu (Beihang University), Depei Qian (Beihang University)

10:50 (20m, Talk) | VDHA: Vector-Driven Hash Aggregation for Sparse Matrix-Sparse Vector Multiplication on GPUs. Yuchen Li (Tsinghua University), Zhe Pan (Tsinghua University), Peng Qu (Tsinghua University), Youhui Zhang (Tsinghua University)