JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference
Long-context large language models (LLMs) have seen widespread adoption in recent years.
However, during inference, the key-value (KV) cache, which stores the attention keys and values of all previously processed tokens, consumes significant memory, particularly as sequence lengths grow.
Quantization offers a promising path to compressing the KV cache, but existing 2-bit approaches fall short of optimal inference efficiency due to hardware-unfriendly algorithms and system implementations.
We present JanusQuant, a 2-bit KV cache quantization system that achieves both high accuracy and end-to-end efficiency through algorithm–system co-design for long-context generation tasks.
At its core is RtSmooth, a novel runtime smoothing quantization algorithm that mitigates outlier-induced accuracy loss via adaptive transformation.
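The abstract does not spell out RtSmooth's exact transformation, but the general idea behind smoothing-based quantization can be sketched as follows: rescale each channel by its own magnitude so that a few outlier channels no longer dominate the shared quantization scale, then apply symmetric 2-bit absmax quantization per group. All function names, the group size, and the choice of per-channel absmax as the smoothing factor are illustrative assumptions, not JanusQuant's actual algorithm.

```python
import numpy as np

def quantize_2bit(x, group=64):
    """Symmetric group-wise absmax quantization to 2-bit levels {-2,...,1}."""
    qmax = 1                                    # positive range of a 2-bit signed int
    g = x.reshape(x.shape[0], -1, group)
    scale = np.maximum(np.abs(g).max(axis=-1, keepdims=True) / qmax, 1e-8)
    q = np.clip(np.round(g / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_2bit(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

def smooth_quantize_2bit(x, group=64):
    """Divide each channel by its own absmax before quantizing, so the group
    scale is no longer set by one outlier channel; keep the factors to undo
    the transform at dequantization time."""
    s = np.maximum(np.abs(x).max(axis=0), 1e-8)  # per-channel smoothing factors
    q, scale = quantize_2bit(x / s, group)
    return q, scale, s

rng = np.random.default_rng(0)
k = rng.standard_normal((16, 128)).astype(np.float32)
k[:, 7] *= 50.0                                  # inject one outlier channel

# Without smoothing, the outlier inflates its group's scale and the other
# 63 channels in that group collapse to zero after rounding.
q0, sc0 = quantize_2bit(k)
err_naive = np.abs(dequantize_2bit(q0, sc0, k.shape) - k).mean()

q1, sc1, s = smooth_quantize_2bit(k)
err_smooth = np.abs(dequantize_2bit(q1, sc1, k.shape) * s - k).mean()
```

On this synthetic example the smoothed variant yields a lower mean reconstruction error than direct group-wise quantization, because the smoothing factors travel with the scales and are folded back in at dequantization.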
Building on RtSmooth, JanusQuant further enhances quantized inference with a series of optimizations: a fast absmax positioning technique for lightweight quantization, a memory-efficient data structure for managing recent tokens, and a custom mixed-precision attention kernel to accelerate computation.
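The abstract gives no detail on the recent-token data structure, but a common layout for mixed-precision KV caches can illustrate the idea: the newest tokens stay in full precision (quantizing a partially filled group would waste both accuracy and work), and once a group fills up it is flushed to 2-bit storage with its scales. Class and method names, the group size, and the flush policy below are assumptions for the sketch, not JanusQuant's implementation.

```python
import numpy as np

class MixedPrecisionKVCache:
    """Recent tokens in FP16; full groups flushed to 2-bit + per-channel scales."""

    def __init__(self, dim, group=64, qmax=1):
        self.dim, self.group, self.qmax = dim, group, qmax
        self.recent = []                      # FP16 rows awaiting a full group
        self.q_chunks, self.scales = [], []   # flushed 2-bit chunks and scales

    def append(self, token_kv):
        self.recent.append(np.asarray(token_kv, dtype=np.float16))
        if len(self.recent) == self.group:    # group is full: quantize and flush
            block = np.stack(self.recent)                       # [group, dim]
            scale = np.maximum(np.abs(block).max(axis=0) / self.qmax, 1e-8)
            q = np.clip(np.round(block / scale), -self.qmax - 1, self.qmax)
            self.q_chunks.append(q.astype(np.int8))
            self.scales.append(scale.astype(np.float16))
            self.recent = []

    def materialize(self):
        """Dequantized old tokens plus FP16 recent tokens, ready for attention."""
        parts = [q.astype(np.float32) * s
                 for q, s in zip(self.q_chunks, self.scales)]
        if self.recent:
            parts.append(np.stack(self.recent).astype(np.float32))
        return (np.concatenate(parts) if parts
                else np.empty((0, self.dim), dtype=np.float32))

# Usage: 100 tokens with group size 16 leave 6 flushed chunks + 4 recent rows.
cache = MixedPrecisionKVCache(dim=8, group=16)
rng = np.random.default_rng(1)
for _ in range(100):
    cache.append(rng.standard_normal(8))
```

A real kernel would consume the quantized chunks and the FP16 tail directly in mixed precision rather than materializing everything to FP32; `materialize` here only makes the layout easy to inspect.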
Across representative LLMs, JanusQuant preserves 99% of FP16 accuracy, reduces KV cache memory usage by up to 5.3×, and delivers up to 4.45× faster decoding throughput compared to state-of-the-art methods, while scaling efficiently to long-context inference.
Tue 3 Feb (time zone: Hobart)

11:30 - 12:50 | Mixed Precision and Quantization | Main Conference, at Balmoral
Chair(s): Dingwen Tao (Institute of Computing Technology, Chinese Academy of Sciences)

11:30 (20m, Talk) | RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization | Qihao Zhang, MingLiang Tang, Mingshu Zhai, Kinman Lei, Jidong Zhai (Tsinghua University)

11:50 (20m, Talk) | High-Throughput Non-Uniformly Quantized 3-bit LLM Inference | YuAng Chen (Chinese University of Hong Kong), Wenqi Zeng (Hong Kong University of Science and Technology), Jeffrey Xu Yu (Chinese University of Hong Kong)

12:10 (20m, Talk) | JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference | Chengyu Sun (Wuhan University), Yaqi Xia (Wuhan University), Hulin Wang, Donglin Yang (Nvidia Corporation), Xiaobo Zhou (University of Macau), Dazhao Cheng (Wuhan University)

12:30 (20m, Talk) | HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff (Best Artifact Award) | Zeyu Song (Tsinghua University), Lin Gan (Tsinghua University), Xiaohui Duan (Shandong University), Jiayu Fu (Tsinghua University), Zhengrui Li (Tsinghua University), Yinuo Wang (Tsinghua University), Guangzhao Li (Chinese Academy of Sciences), Guangwen Yang (Tsinghua University)