SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision
SAMP is an adaptive mixed-precision inference toolkit designed to address the challenges of model inference acceleration in the era of large AI models. The toolkit automatically controls floating-point and integer operations to achieve faster inference speeds while maintaining computational accuracy.
The work addresses a limitation of existing inference engines, which mostly run at a single uniform precision (either pure floating point or pure integer) and therefore offer limited acceleration. SAMP instead searches for the optimal floating-point and fixed-point mixed-precision combination for large-scale matrix multiplications and Transformer layers, letting models achieve both high computational accuracy and efficient inference.
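To make the trade-off concrete, here is a minimal, self-contained sketch of symmetric per-tensor INT8 post-training quantization applied to a matrix multiplication. This is an illustration of the general technique, not SAMP's actual kernels: real engines accumulate in INT32 on the GPU, and SAMP applies this selectively per layer.

```python
# Illustrative sketch (not SAMP's kernels): symmetric per-tensor INT8
# post-training quantization of a matmul, showing the small accuracy
# loss that mixed-precision engines trade against speed.

def quantize(matrix, num_bits=8):
    """Map floats to signed integers with one per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for row in matrix for v in row) or 1.0
    scale = max_abs / qmax
    q = [[round(v / scale) for v in row] for row in matrix]
    return q, scale

def matmul(a, b):
    """Plain matrix multiply; entries may be float or integer."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def int8_matmul(a, b):
    """Quantize inputs, multiply in integer arithmetic, dequantize."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = matmul(qa, qb)          # INT32 accumulation in real kernels
    return [[v * sa * sb for v in row] for row in acc]

a = [[0.12, -0.53], [0.98, 0.07]]
b = [[0.33, 0.81], [-0.46, 0.25]]
exact = matmul(a, b)
approx = int8_matmul(a, b)
max_err = max(abs(x - y) for ex, ap in zip(exact, approx)
              for x, y in zip(ex, ap))
print(f"max abs error vs FP32: {max_err:.4f}")   # small but nonzero
```

The integer path replaces floating-point multiplies with integer ones (much cheaper on INT8 tensor cores) at the cost of a bounded rounding error; SAMP's contribution is deciding, per matmul and per Transformer layer, where that error is affordable.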
The toolkit consists of four main modules: Tokenizer, Embedding, Encoder, and Downstream Target. Key innovations include: (1) Adaptive precision control that balances computational accuracy and latency performance, (2) Superior inference efficiency across a wide precision range compared to other inference packages, and (3) Flexibility to support various NLP downstream tasks with user-friendly APIs.
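The four modules above form a sequential pipeline from raw text to a task prediction. The sketch below shows that data flow only; every class and method name here is a hypothetical stand-in, not SAMP's actual C++/Python API.

```python
# Hypothetical sketch of the four-module pipeline (Tokenizer -> Embedding
# -> Encoder -> Downstream Target). Names and internals are illustrative
# toys, not SAMP's real interfaces.

class Tokenizer:
    def __call__(self, text):
        # Toy whitespace tokenizer with a tiny fixed vocabulary.
        vocab = {"samp": 1, "is": 2, "fast": 3}
        return [vocab.get(tok, 0) for tok in text.lower().split()]

class Embedding:
    def __call__(self, ids):
        # Map each token id to a toy 2-dimensional vector.
        return [[float(i), float(i) * 0.5] for i in ids]

class Encoder:
    def __call__(self, vectors):
        # Stand-in for the Transformer encoder: mean-pool the sequence.
        dim, n = len(vectors[0]), len(vectors)
        return [sum(v[d] for v in vectors) / n for d in range(dim)]

class DownstreamTarget:
    def __call__(self, pooled):
        # Stand-in classification head: sign of the pooled sum.
        return "positive" if sum(pooled) > 0 else "negative"

def pipeline(text):
    out = text
    for stage in (Tokenizer(), Embedding(), Encoder(), DownstreamTarget()):
        out = stage(out)
    return out

print(pipeline("SAMP is fast"))  # -> positive
```

In the real toolkit the Encoder is where the mixed-precision decisions apply, while the Downstream Target module is swapped per NLP task.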
SAMP offers two mixed-precision inference modes: Fully-Quant mode, which converts all data flow to 8-bit integers for maximum speed, and Quant-FFN-Only mode, which only quantizes the Feed-Forward Network while keeping Multi-Head Attention in floating-point for better accuracy preservation. The toolkit uses a precision decay-aware algorithm to automatically recommend optimal mixed-precision configurations when users don't specify requirements.
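The mode choice can be framed as a constrained selection problem: among the available precision configurations, pick the fastest one whose estimated accuracy loss stays within the user's tolerance. The sketch below is a plausible heuristic in that spirit; the numbers, mode names as dictionary keys, and the selection rule are all illustrative assumptions, not SAMP's published precision decay-aware algorithm.

```python
# Hedged sketch of mixed-precision mode recommendation: choose the
# fastest mode whose estimated accuracy drop fits the user's budget.
# Latency/drop figures below are made-up placeholders.

# mode -> (latency in ms per batch, estimated accuracy drop in points)
MODES = {
    "fp16":           (10.0, 0.0),  # no quantization, baseline accuracy
    "quant-ffn-only": (8.0, 0.2),   # FFN in INT8, attention in float
    "fully-quant":    (6.5, 0.9),   # all matmul data flow in INT8
}

def recommend_mode(max_accuracy_drop, modes=MODES):
    """Return the fastest mode whose accuracy drop is acceptable."""
    feasible = [(lat, name) for name, (lat, drop) in modes.items()
                if drop <= max_accuracy_drop]
    # The unquantized baseline has zero drop, so feasible is never empty.
    return min(feasible)[1]

print(recommend_mode(1.0))   # loose budget  -> fully-quant
print(recommend_mode(0.5))   # medium budget -> quant-ffn-only
print(recommend_mode(0.0))   # no drop allowed -> fp16
```

This mirrors the document's description: Fully-Quant wins when speed dominates, Quant-FFN-Only when accuracy preservation matters, and the recommendation falls back to floating point when no quantized mode meets the budget.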
Experimental results show that SAMP achieves a 1.05-1.15x speedup over FasterTransformer on Chinese Language Understanding Evaluation (CLUE) classification tasks, while preserving accuracy better than full INT8 quantization. The toolkit is implemented in C++, exposes both C++ and Python APIs, and requires only CUDA 11.0 or higher.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.