High-Throughput Non-Uniformly Quantized 3-bit LLM Inference
While Large Language Models (LLMs) are widely adopted, their massive parameter counts constrain practical deployment. A common solution is clustering-based non-uniform quantization, which compresses models to as few as 3 bits per weight while preserving high accuracy. However, instead of accelerating memory-bound LLM inference, this memory reduction often paradoxically causes a significant slowdown due to dequantization overhead and GPU underutilization.
To address this issue, we propose Quantix, a framework designed to convert memory savings into inference speedups. Quantix applies two key optimizations: (1) a hardware-aligned bit shuffling scheme for efficient data access, and (2) a fused dequantization-multiplication pipeline that effectively maps workloads onto both CUDA and Tensor Cores. Quantix enables high-throughput batched inference, delivering average kernel-level speedups of 4.82$\times$ over FP16 cuBLAS and end-to-end speedups of up to 11.46$\times$ over state-of-the-art quantization methods on NVIDIA L40 GPUs.
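To make the clustering-based non-uniform quantization mentioned above concrete, here is a minimal, hedged sketch (not Quantix's actual implementation): weights are clustered into 2^3 = 8 centroids via a simple k-means loop, only the 3-bit indices plus the small codebook are stored, and dequantization reduces to a table lookup. The function names and the quantile-based initialization are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of clustering-based non-uniform 3-bit quantization.
# NOT the paper's implementation; function names and the quantile-based
# centroid initialization are assumptions for demonstration only.
import numpy as np

def quantize_3bit(weights: np.ndarray, iters: int = 20):
    """K-means over weight values -> (3-bit indices, 8-entry codebook)."""
    flat = weights.ravel()
    # Initialize 8 centroids from value quantiles (a common heuristic).
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, 8))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for c in range(8):
            if np.any(idx == c):
                codebook[c] = flat[idx == c].mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8).reshape(weights.shape), codebook

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    # Pure table lookup: this is the per-element step that a fused
    # dequantization-multiplication kernel must hide behind the matmul.
    return codebook[indices]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
idx, cb = quantize_3bit(w)        # idx holds values 0..7 (3 bits each)
w_hat = dequantize(idx, cb)       # approximate reconstruction of w
```

In a real deployment the 3-bit indices would additionally be bit-packed (e.g. ten indices per 32-bit word) and laid out to match the GPU's memory access pattern, which is exactly the data-layout problem the bit shuffling scheme in the abstract targets.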
Tue 3 Feb (displayed time zone: Hobart)
11:30 - 12:50 | Session: Mixed Precision and Quantization (Main Conference, at Balmoral). Chair(s): Dingwen Tao (Institute of Computing Technology, Chinese Academy of Sciences)
11:30 (20m, Talk) | RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization. Qihao Zhang, MingLiang Tang, Mingshu Zhai, Kinman Lei, Jidong Zhai (Tsinghua University)
11:50 (20m, Talk) | High-Throughput Non-Uniformly Quantized 3-bit LLM Inference. YuAng Chen (Chinese University of Hong Kong), Wenqi Zeng (Hong Kong University of Science and Technology), Jeffrey Xu Yu (Chinese University of Hong Kong)
12:10 (20m, Talk) | JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference. Chengyu Sun (Wuhan University), Yaqi Xia (Wuhan University), Hulin Wang, Donglin Yang (Nvidia Corporation), Xiaobo Zhou (University of Macau), Dazhao Cheng (Wuhan University)
12:30 (20m, Talk) | HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff (Best Artifact Award). Zeyu Song, Lin Gan (Tsinghua University), Xiaohui Duan (Shandong University), Jiayu Fu, Zhengrui Li, Yinuo Wang (Tsinghua University), Guangzhao Li (Chinese Academy of Sciences), Guangwen Yang (Tsinghua University)