
SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision

SAMP is an adaptive mixed-precision inference toolkit that automatically controls floating-point and integer operations to accelerate model inference while maintaining computational accuracy.

Kuaishou Tech

SAMP is an adaptive mixed-precision inference toolkit built to accelerate model inference in the era of large AI models. It automatically decides where to use floating-point and where to use integer operations, achieving faster inference while preserving computational accuracy.

The research addresses the limitations of existing inference engines that primarily use single-precision computation (either pure floating-point or pure integer), which results in limited acceleration performance. SAMP introduces a novel approach by finding the optimal floating-point and fixed-point mixed-precision combination for large-scale matrix multiplications and Transformer layers, allowing models to achieve both high computational accuracy and efficient inference.
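To make the fixed-point half of this mix concrete, the sketch below shows symmetric per-tensor INT8 quantization applied to a dot product, the building block of the matrix multiplications the paragraph refers to. This is an illustrative minimal example, not code from the SAMP toolkit; the function names and the scale heuristic are assumptions.

```python
# Hedged sketch (not SAMP's actual code): symmetric per-tensor INT8
# quantization of a dot product, i.e. the fixed-point side of the
# floating-point / fixed-point mix described above.

def quantize(values, num_bits=8):
    """Map floats to signed integers sharing one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(a, b):
    """Dot product computed in integer arithmetic, then dequantized."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))     # wide integer accumulator
    return acc * sa * sb                          # rescale back to float

a = [0.12, -0.5, 0.33, 0.9]
b = [1.1, 0.4, -0.7, 0.05]
exact = sum(x * y for x, y in zip(a, b))
print(exact, int8_dot(a, b))  # the INT8 result tracks the FP32 result closely
```

The small gap between the two results is the "precision decay" that a mixed-precision scheme trades off against the speed of integer kernels.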

The toolkit consists of four main modules: Tokenizer, Embedding, Encoder, and Downstream Target. Key innovations include: (1) Adaptive precision control that balances computational accuracy and latency performance, (2) Superior inference efficiency across a wide precision range compared to other inference packages, and (3) Flexibility to support various NLP downstream tasks with user-friendly APIs.
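The four modules form a straight-through pipeline from raw text to a task prediction. The toy composition below mirrors that structure; it is purely illustrative and is not SAMP's real API — every stage here is a stand-in.

```python
# Purely illustrative pipeline mirroring SAMP's four modules
# (Tokenizer -> Embedding -> Encoder -> Downstream Target).
# None of these names come from the toolkit itself.

class Pipeline:
    def __init__(self, tokenizer, embedding, encoder, head):
        self.stages = [tokenizer, embedding, encoder, head]

    def __call__(self, text):
        out = text
        for stage in self.stages:
            out = stage(out)                      # feed each stage forward
        return out

# Toy stand-ins for each module.
tokenize = lambda s: s.lower().split()
embed    = lambda toks: [[float(len(t))] for t in toks]  # 1-d "embeddings"
encode   = lambda vecs: [sum(v) for v in vecs]           # trivial encoder
classify = lambda feats: int(sum(feats) > 10)            # downstream target

pipe = Pipeline(tokenize, embed, encode, classify)
print(pipe("SAMP accelerates transformer inference"))
```

In the real toolkit, the Encoder stage is where the mixed-precision Transformer layers live, and swapping precision modes would not change this outer pipeline shape.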

SAMP offers two mixed-precision inference modes: Fully-Quant mode, which converts all data flow to 8-bit integers for maximum speed, and Quant-FFN-Only mode, which only quantizes the Feed-Forward Network while keeping Multi-Head Attention in floating-point for better accuracy preservation. The toolkit uses a precision decay-aware algorithm to automatically recommend optimal mixed-precision configurations when users don't specify requirements.
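The recommendation step can be thought of as picking the fastest mode whose measured accuracy decay stays within the user's tolerance. The sketch below is a hypothetical stand-in for that logic, not the toolkit's actual algorithm; the mode names echo the two modes above, but the speedup and accuracy-drop numbers are invented for illustration.

```python
# Hedged sketch of decay-aware mode selection. Speedups and accuracy
# drops are illustrative placeholders, not SAMP's measurements.

CANDIDATES = [
    # (mode, relative speedup, accuracy drop on a held-out set)
    ("fp32",           1.00, 0.000),
    ("quant-ffn-only", 1.60, 0.002),   # FFN in INT8, attention in float
    ("fully-quant",    2.10, 0.011),   # entire data flow in INT8
]

def recommend(max_accuracy_drop):
    """Return the fastest mode whose accuracy decay is acceptable."""
    feasible = [c for c in CANDIDATES if c[2] <= max_accuracy_drop]
    return max(feasible, key=lambda c: c[1])[0]

print(recommend(0.005))  # -> quant-ffn-only
print(recommend(0.02))   # -> fully-quant
```

A tight accuracy budget keeps attention in floating point (Quant-FFN-Only), while a looser one unlocks the fully quantized path.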

Experimental results demonstrate that SAMP achieves up to 1.05-1.15x acceleration compared to FasterTransformer on Chinese Language Understanding Evaluation (CLUE) classification tasks, while maintaining better accuracy than full INT8 quantization approaches. The toolkit is implemented in C++ and supports both C++ and Python APIs, requiring only CUDA 11.0 or higher.

Tags: post-training quantization, AI inference, transformer models, mixed-precision computing, NLP acceleration
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
