MetaAttention: A Unified and Performant Attention Framework Across Hardware Backends
Attention is the computational backbone of transformer-based models such as large language models. However, the growing diversity of attention algorithms makes it increasingly difficult to fully exploit hardware performance. State-of-the-art implementations such as FlashAttention target a specific attention algorithm or hardware platform, and fail to generalize to other algorithms and platforms.
We present MetaAttention, a framework that automatically derives the optimal implementation of an attention algorithm given a hardware platform.
Our key insight is that attention variants can be abstracted into two operations,
relevance scoring and aggregation, complemented by customizable functions and configurations such as the input shape.
Based on this abstraction, we design a cross-backend attention runtime around these two operations that generalizes to attention variants through customizable operators.
To fully exploit hardware performance, we further propose an IntermediateTensor-based search method that finds the optimal tiling strategy and parallelism scheme for a given attention customization and hardware's features.
MetaAttention delivers up to a 10.4$\times$ speedup for configurations previously unsupported by state-of-the-art systems.
Additionally, MetaAttention achieves performance comparable to manually-optimized libraries such as FlashMLA while significantly reducing the amount of code required.
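To make the two-operation abstraction concrete, here is a minimal sketch in NumPy of attention decomposed into relevance scoring and aggregation, with both operations customizable. This is an illustrative sketch only, not MetaAttention's actual API; the names `score_fn` and `aggregate_fn` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, score_fn=None, aggregate_fn=None):
    """Attention as two customizable operations:
    relevance scoring (q, k -> weights) and aggregation (weights, v -> output)."""
    if score_fn is None:
        # Default: scaled-dot-product scoring with softmax normalization.
        score_fn = lambda q, k: softmax(q @ k.T / np.sqrt(q.shape[-1]))
    if aggregate_fn is None:
        # Default: weighted sum over the value vectors.
        aggregate_fn = lambda w, v: w @ v
    weights = score_fn(q, k)          # relevance scoring
    return aggregate_fn(weights, v)   # aggregation

# Standard softmax attention: 4 queries, 6 keys/values, head dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out = attention(q, k, v)
```

Swapping `score_fn` (e.g. for a sparse or linear scoring rule) while keeping `aggregate_fn` fixed is the kind of customization the abstraction is meant to capture.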
Tue 3 Feb (Hobart time zone)

17:15 - 18:15 | Optimizing Transformers (Main Conference, at Pyrmont). Chair: Shaoshuai Zhang (University of Electronic Science and Technology of China)

- 17:15 (20 min, Talk): FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism. Jianxing Xu (University of Science and Technology of China), Yuanbo Wen, Jun Bi (Chinese Academy of Sciences), Ruibai Xu (University of Science and Technology of China), Guanglin Xu (Chinese Academy of Sciences), Rui Zhang (Chinese Academy of Sciences), Wei Li (Chinese Academy of Sciences), Ling Li (Institute of Software, Chinese Academy of Sciences), Tianshi Chen (Cambricon Technologies), Qi Guo (Chinese Academy of Sciences), Yunji Chen (Chinese Academy of Sciences)
- 17:35 (20 min, Talk): Accelerating Sparse Transformer Inference on GPU. Wenhao Dai (China University of Petroleum-Beijing), Haodong Deng (China University of Petroleum), Mengfei Rong (China University of Petroleum), Xinyu Yang (Beihang University), Hongyu Liu (Baidu Inc.), Fangxin Liu (Shanghai Jiao Tong University), Hailong Yang (Beihang University), Qianwen Cao (China University of Petroleum), Qingxiao Sun (Beihang University)
- 17:55 (20 min, Talk): MetaAttention: A Unified and Performant Attention Framework Across Hardware Backends. Feiyang Chen (Shanghai Jiao Tong University), Yu Cheng (Peking University), Lei Wang (Peking University), Yuqing Xia (Microsoft Research), Ziming Miao (Microsoft Research), Lingxiao Ma (Microsoft Research), Fan Yang (Microsoft Research Asia), Jilong Xue (Microsoft Research), Zhi Yang (Peking University), Mao Yang (Microsoft Research), Xingda Wei (Shanghai Jiao Tong University), Haibo Chen (Shanghai Jiao Tong University)