A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability (PPoPP 2026 - Main Conference)

Sat 31 January - Wed 4 February 2026, Sydney, Australia

co-located with HPCA/CGO/PPoPP/CC 2026

Who

Yuchen Ma, Bin Ren, Andreas Stathopoulos

Track

PPoPP 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.


The GMT offsets shown reflect the offsets at the moment of the conference.


When

Wed 4 Feb 2026, 10:30 - 10:50, at Pyrmont. Session: Matrix and Linear Algebra Algorithms. Chair(s): William S. Moses

Abstract

The distributed matrix-block-vector multiplication (Matvec) algorithm is a critical component of many applications, but it can be computationally challenging for dense matrices of dimension O(10^6–10^7) and blocks of O(10–100) vectors. We present the performance analysis, implementation, and optimization of our SMatVec library for Matvec under the effect of system variability. Our modeling shows that 1D pipelined Matvec is as efficient as 2D algorithms on small to medium clusters, which are sufficient for these problem sizes. We develop a performance tracing framework and a simulator that reveal pipeline bubbles caused by modest (~5%) system variability. To tolerate such variability, SMatVec, which combines on-the-fly kernel matrix generation with Matvec, integrates four optimizations: inter-process data preloading, unconventional static thread scheduling, cache-aware tiling, and multi-version unrolling. In our benchmarks on O(10^5) Matvec problems, SMatVec achieves up to 1.85× speedup over COSMA and 17× over ScaLAPACK. For O(10^6) problems, where COSMA and ScaLAPACK exceed memory capacity, SMatVec maintains linear strong scaling and achieves a peak performance of 75% FMA Flop/s. Its static scheduling policy achieves a 2.27× speedup over the conventional work-stealing dynamic scheduler, and simulations with exponentially distributed variability predict that it can withstand up to 108% performance variability.
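The abstract describes SMatVec only at a high level. To make two of its ideas concrete, here is a serial Python sketch. The first idea is a 1D ring pipeline in which each rank regenerates its row block of the kernel matrix on the fly and rotates blocks of X. The second is the pipeline bubbles that per-step timing variability can cause in such a ring. The Gaussian kernel, the left-shift ring direction, the exponential noise model parameterized by its mean, and the names `kernel`, `ring_matvec`, and `pipeline_makespan` are all assumptions made for illustration. This is not the SMatVec implementation or the authors' simulator. The sketch omits preloading, static thread scheduling, tiling, and unrolling, and its timing model deliberately leaves out preloading, so a rank that finishes early idles until its neighbor's block arrives.

```python
import math
import random


def kernel(x, y, h=1.0):
    """Hypothetical Gaussian kernel entry K(x, y); the abstract does not name SMatVec's kernel."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))


def ring_matvec(points, X, P):
    """Compute Y = K @ X, with K[i][j] = kernel(points[i], points[j]), on P simulated ranks.

    Rank p owns row block p of K, which it regenerates on the fly and never stores,
    and starts out holding block p of X. After every step, each rank sends its X
    block to rank p - 1 and receives the block from rank p + 1 (a 1D ring), so after
    P steps every rank has multiplied its row block by all of X.
    """
    n, b = len(X), len(X[0])
    bounds = [p * n // P for p in range(P + 1)]  # row/column block boundaries
    Y = [[0.0] * b for _ in range(n)]
    held = list(range(P))  # held[p]: index of the X block currently at rank p
    for _ in range(P):
        for p in range(P):
            q = held[p]
            for i in range(bounds[p], bounds[p + 1]):
                for j in range(bounds[q], bounds[q + 1]):
                    kij = kernel(points[i], points[j])  # generated, not loaded
                    for c in range(b):
                        Y[i][c] += kij * X[j][c]
        held = held[1:] + held[:1]  # ring shift: rank p now holds rank p+1's block
    return Y


def pipeline_makespan(P, base, variability, seed=0):
    """Finish time of the P-step ring above when each local step takes
    base * (1 + e), with e drawn from an exponential distribution of mean
    `variability` (0 means no noise). Communication cost is ignored.

    Without preloading, rank p can start a step only after it has finished its
    previous step and rank p + 1 has finished that step and forwarded its block.
    """
    rng = random.Random(seed)
    finish = [0.0] * P
    for _ in range(P):
        prev = finish[:]
        for p in range(P):
            ready = max(prev[p], prev[(p + 1) % P])  # own work and sender's block
            noise = rng.expovariate(1.0 / variability) if variability > 0 else 0.0
            finish[p] = ready + base * (1.0 + noise)
    return max(finish)


if __name__ == "__main__":
    pts = [0.1 * i for i in range(10)]
    X = [[float(i), 1.0] for i in range(10)]
    Y = ring_matvec(pts, X, P=3)
    print(pipeline_makespan(8, 1.0, 0.0))   # 8.0
    print(pipeline_makespan(8, 1.0, 0.05))  # at least 8.0
```

With `variability=0.0`, the makespan is exactly `P * base`. With positive variability, every step lasts at least `base`, so the makespan can only grow. Part of that growth is idle time spent waiting on a slower neighbor rather than extra compute, which is the kind of bubble this toy model exposes.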

DOI

https://doi.org/10.1145/3774934.3786453

Yuchen Ma

College of William & Mary

United States

Bin Ren

College of William & Mary

United States

Andreas Stathopoulos

College of William & Mary

United States


Session Program

Wed 4 Feb
Displayed time zone: Hobart

09:50 - 11:10	Matrix and Linear Algebra Algorithms (Main Conference) at Pyrmont. Chair(s): William S. Moses, University of Illinois Urbana-Champaign

09:50 (20m) Talk: Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures. Main Conference. Lu Shi (University of Electronic Science and Technology of China), WeiWei Xu (Nanjing University of Information Science and Technology), Shaoshuai Zhang (University of Electronic Science and Technology of China)
10:10 (20m) Talk: A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers. Main Conference. Xiaojian Yang (National University of Defense Technology), Yuhui Ni (National University of Defense Technology), Fan Yuan (Xiangtan University), Shengguo Li (National University of Defense Technology), Dezun Dong (NUDT), xuchuanfu (National University of Defense Technology), Haipeng Jia Jia, Jie Liu (National University of Defense Technology)
10:30 (20m) Talk: A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability. Main Conference. Yuchen Ma (College of William & Mary), Bin Ren (College of William & Mary), Andreas Stathopoulos (College of William & Mary)
10:50 (20m) Talk: Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing. Main Conference. Yuechen Lu (China University of Petroleum-Beijing), Hongwei Zeng, Marc Casas (Barcelona Supercomputing Center), Weifeng Liu (China University of Petroleum-Beijing)