Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing
Matrix multiplication units (MMUs) in modern parallel processors enable efficient execution of tiled matrix multiplications at varying precisions. While their effectiveness in AI workloads has been well demonstrated, their utility in scientific computing has not been systematically analyzed. In this work, we characterize MMUs across a broad range of scientific computing patterns by evaluating performance, power consumption, numerical precision, and memory access behavior. To support this analysis, we develop Cubie, a comprehensive benchmark suite comprising ten MMU-optimized kernels that cover key parallel patterns. We also categorize MMU utilization patterns into four quadrants and identify the limitations MMUs face in scientific computing. Through detailed comparisons with vector units, we provide nine key observations on the behavior and implications of MMUs in general scientific workloads, offering insights for architecture, algorithm, and application researchers.
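The abstract's starting point is that an MMU executes a small, fixed-size matrix multiply-accumulate (D = A·B + C) per instruction, and larger products are assembled by tiling. The following is a rough emulation of that idea in plain Python, not the Cubie code and not any vendor API. It assumes a hypothetical 4x4 tile, FP16 inputs and FP32 accumulation; `TILE`, `mma_tile`, `tiled_gemm`, and `reference_gemm` are illustrative names, and real MMUs use other tile shapes and their own rounding behavior. Comparing against an FP64 reference shows the kind of numerical-precision effect the characterization measures.

```python
import random
import struct

TILE = 4  # hypothetical MMU tile edge; real units use other shapes (e.g. 16x16x16)


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_tile(a, b, c):
    """Emulate one assumed MMU instruction: D = A*B + C on TILE x TILE tiles,
    rounding inputs to FP16 and the running sum to FP32 after each step."""
    d = [row[:] for row in c]
    for i in range(TILE):
        for j in range(TILE):
            acc = d[i][j]
            for k in range(TILE):
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d


def tiled_gemm(a, b, n):
    """Compute C = A*B for n x n matrices (n a multiple of TILE) by looping
    over output tiles and accumulating one K-tile per emulated MMA call."""
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for bk in range(0, n, TILE):
                a_t = [row[bk:bk + TILE] for row in a[bi:bi + TILE]]
                b_t = [row[bj:bj + TILE] for row in b[bk:bk + TILE]]
                acc = mma_tile(a_t, b_t, acc)
            for i in range(TILE):
                c[bi + i][bj:bj + TILE] = acc[i]
    return c


def reference_gemm(a, b, n):
    """Plain FP64 (Python float) matrix product used as the accuracy baseline."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    random.seed(0)
    n = 8
    a = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    b = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    c = tiled_gemm(a, b, n)
    ref = reference_gemm(a, b, n)
    err = max(abs(c[i][j] - ref[i][j]) for i in range(n) for j in range(n))
    print(f"max abs error vs FP64 reference: {err:.2e}")
```

Small integers are exactly representable in FP16 and FP32, so for such inputs the emulated result matches the reference exactly; for general real-valued inputs the FP16 rounding of A and B introduces an error that grows with the reduction length.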
Wed 4 Feb (displayed time zone: Hobart)
09:50 - 11:10 | Matrix and Linear Algebra Algorithms | Main Conference at Pyrmont | Chair(s): William S. Moses (University of Illinois Urbana-Champaign)
09:50 | 20m Talk | Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures | Main Conference | Lu Shi (University of Electronic Science and Technology of China), WeiWei Xu (Nanjing University of Information Science and Technology), Shaoshuai Zhang (University of Electronic Science and Technology of China)
10:10 | 20m Talk | A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers | Main Conference | Xiaojian Yang (National University of Defense Technology), Yuhui Ni (National University of Defense Technology), Fan Yuan (Xiangtan University), Shengguo Li (National University of Defense Technology), Dezun Dong (NUDT), xuchuanfu (National University of Defense Technology), Haipeng Jia, Jie Liu (National University of Defense Technology)
10:30 | 20m Talk | A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability | Main Conference | Yuchen Ma (College of William & Mary), Bin Ren (College of William & Mary), Andreas Stathopoulos (College of William & Mary)
10:50 | 20m Talk | Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing | Main Conference | Yuechen Lu (China University of Petroleum-Beijing), Hongwei Zeng, Marc Casas (Barcelona Supercomputing Center), Weifeng Liu (China University of Petroleum-Beijing)