SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision
SAMP is an adaptive mixed-precision inference toolkit that automatically controls floating-point and integer operations to accelerate model inference while maintaining computational accuracy.
The toolkit targets a central challenge in the era of large AI models: accelerating model inference without giving up the accuracy of the original model.
The research addresses a limitation of existing inference engines: most run the whole model at one numeric precision (either pure floating-point or pure integer), which limits the acceleration they can achieve. SAMP instead searches for the best combination of floating-point and fixed-point precision across large-scale matrix multiplications and Transformer layers, so that a model keeps high computational accuracy while still running efficiently.
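To make the floating-point versus integer trade-off concrete, here is a minimal sketch of one matrix multiplication computed both ways. It assumes symmetric per-tensor INT8 quantization, a common post-training scheme; the summary does not say which scheme SAMP uses, and the helper names (quantize, matmul, matmul_int8) are hypothetical and are not part of the SAMP API.

```python
def quantize(matrix, num_bits=8):
    """Symmetric per-tensor quantization (an assumed scheme, not SAMP's own)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed range.
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def matmul(a, b):
    """Plain matrix product; works for both float and integer entries."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def matmul_int8(a, b):
    """Quantize both operands, accumulate in integers, then rescale to float."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = matmul(qa, qb)  # integer accumulation
    return [[v * scale_a * scale_b for v in row] for row in acc]


# Small illustrative inputs, chosen by hand.
a = [[0.5, -1.0], [0.25, 0.75]]
b = [[1.0, 0.5], [-0.5, 0.25]]

reference = matmul(a, b)    # floating-point result
approx = matmul_int8(a, b)  # integer path, rescaled back to float
max_error = max(
    abs(r - q)
    for ref_row, approx_row in zip(reference, approx)
    for r, q in zip(ref_row, approx_row)
)
print("max abs error introduced by INT8:", max_error)
```

The integer path is what buys the speed on real hardware, and max_error is the accuracy cost. A mixed-precision engine decides, operation by operation, where that cost is acceptable.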
The toolkit consists of four main modules: Tokenizer, Embedding, Encoder, and Downstream Target. Key innovations include: (1) Adaptive precision control that balances computational accuracy and latency performance, (2) Superior inference efficiency across a wide precision range compared to other inference packages, and (3) Flexibility to support various NLP downstream tasks with user-friendly APIs.
SAMP offers two mixed-precision inference modes: Fully-Quant mode, which converts all data flow to 8-bit integers for maximum speed, and Quant-FFN-Only mode, which only quantizes the Feed-Forward Network while keeping Multi-Head Attention in floating-point for better accuracy preservation. The toolkit uses a precision decay-aware algorithm to automatically recommend optimal mixed-precision configurations when users don't specify requirements.
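The two modes and the automatic recommendation step can be sketched as follows. This is an illustration under stated assumptions, not SAMP's actual algorithm or API: the Mode enum, the Candidate record, and the recommend function are hypothetical names, the selection rule (fastest candidate whose accuracy drop stays within a tolerance) is a simple stand-in for the precision decay-aware algorithm, and the accuracy and latency figures are placeholders rather than measurements.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FLOAT = "float"                    # no quantization (baseline)
    QUANT_FFN_ONLY = "quant_ffn_only"  # FFN in INT8, attention in floating point
    FULLY_QUANT = "fully_quant"        # all data flow in INT8


def precision_plan(mode):
    """Which precision each Transformer sub-block uses in a given mode."""
    return {
        Mode.FLOAT: {"attention": "float", "ffn": "float"},
        Mode.QUANT_FFN_ONLY: {"attention": "float", "ffn": "int8"},
        Mode.FULLY_QUANT: {"attention": "int8", "ffn": "int8"},
    }[mode]


@dataclass(frozen=True)
class Candidate:
    mode: Mode
    accuracy: float    # task accuracy measured on a validation set
    latency_ms: float  # measured inference latency


def recommend(candidates, baseline_accuracy, max_accuracy_drop):
    """Pick the fastest candidate whose accuracy drop is within tolerance.

    A hypothetical stand-in for an accuracy-aware selection rule.
    """
    acceptable = [
        c for c in candidates
        if baseline_accuracy - c.accuracy <= max_accuracy_drop
    ]
    if not acceptable:
        raise ValueError("no configuration meets the accuracy requirement")
    return min(acceptable, key=lambda c: c.latency_ms)


# Placeholder numbers for illustration only; these are not SAMP results.
candidates = [
    Candidate(Mode.FLOAT, accuracy=0.900, latency_ms=10.0),
    Candidate(Mode.QUANT_FFN_ONLY, accuracy=0.895, latency_ms=8.0),
    Candidate(Mode.FULLY_QUANT, accuracy=0.870, latency_ms=7.0),
]

strict = recommend(candidates, baseline_accuracy=0.900, max_accuracy_drop=0.015)
loose = recommend(candidates, baseline_accuracy=0.900, max_accuracy_drop=0.050)
print(strict.mode.value, loose.mode.value)
```

With a tight accuracy budget the sketch falls back to quantizing only the FFN; with a looser budget it accepts the fully quantized configuration for its lower latency.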
Experimental results show that SAMP achieves a 1.05-1.15x speedup over FasterTransformer on Chinese Language Understanding Evaluation (CLUE) classification tasks, while maintaining better accuracy than full INT8 quantization approaches. The toolkit is implemented in C++, exposes both C++ and Python APIs, and requires only CUDA 11.0 or higher.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
