Optimizing Pretrained Language Model Inference: Lessons from the NLPCC Small Model Competition and Deployment at Xiaomi

This article shares the Xiaomi AI Lab NLP team's experience in the NLPCC lightweight language model competition, discusses efficiency challenges of large pretrained models like BERT, and details practical inference optimizations—including model distillation, batching, FP16 quantization, and FasterTransformer integration—that dramatically reduce latency and hardware costs in production.

DataFunTalk

The Xiaomi AI Lab NLP team, represented by Zhao Qun, presents their experience in the NLPCC lightweight language model competition and the subsequent deployment of pretrained model inference optimizations at Xiaomi.

Background : With the rise of BERT and its variants, large pretrained models dominate state‑of‑the‑art results on many NLP tasks. Pretraining requires massive unsupervised data (often >100 GB) and extensive GPU resources, while inference of these models incurs high latency and resource consumption, especially in high‑concurrency online services.

Efficiency Challenges : Model size has grown by roughly an order of magnitude per generation (e.g., BERT‑base ≈ 110 M parameters, GPT‑2 ≈ 1.5 B, Turing‑NLG ≈ 17 B). Larger models increase both training cost and inference time, making it difficult to meet service‑level objectives such as 99th‑percentile latency (P99) under heavy traffic.

Solution Directions :

Use smaller models or knowledge distillation to reduce parameter count (e.g., sub‑12 M models for the competition).

Optimize large models to increase per‑GPU throughput, suitable for lower‑concurrency scenarios.

NLPCC Small‑Model Competition :

The competition required models <12 M parameters (≈1/9 of BERT‑base) and evaluated them on four downstream tasks: CLUE‑WSC2020 (coreference), CSL (paper‑keyword classification), CLUENER2020 (NER), and CMRC2018 (reading comprehension). The team collected 160 GB raw Chinese text, filtered it to ~35 GB of clean data, and trained a 6‑layer “high‑slim” model with hidden and vocab sizes chosen to stay under the parameter budget.
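To make the parameter budget concrete, here is a minimal sketch of how the size of a BERT‑style encoder can be estimated from its vocabulary size, hidden size, and layer count. The formula follows the standard BERT layout (token, position, and segment embeddings, self‑attention, feed‑forward, pooler); the small configuration passed at the end is a hypothetical example, not the team's actual hidden or vocab size, which the article does not state.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate=None,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (no task head)."""
    if intermediate is None:
        intermediate = 4 * hidden  # the usual BERT feed-forward ratio
    # Embeddings: token + position + segment tables, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Feed-forward: two dense layers (weights + biases), plus LayerNorm.
    ffn = ((hidden * intermediate + intermediate)
           + (intermediate * hidden + hidden)
           + 2 * hidden)
    # Pooler: one dense layer on top of the first token.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + ffn) + pooler


def fits_budget(params, budget=12_000_000):
    """True if the model is under the competition's 12 M parameter limit."""
    return params < budget


# BERT-base (English, uncased) as a reference point.
bert_base = bert_param_count(vocab_size=30522, hidden=768, layers=12)
print(bert_base)  # 109482240

# Hypothetical 6-layer candidate; vocab and hidden sizes are illustrative only.
candidate = bert_param_count(vocab_size=8000, hidden=256, layers=6)
print(candidate, fits_budget(candidate))
```

The embedding table scales with vocab size × hidden size, so for small models it takes a large share of the budget. That is why vocabulary size and hidden size have to be chosen together.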

Training details: 8 × V100 GPUs, mixed‑precision, LAMB optimizer, batch size ≈ 14400, gradient accumulation every ~10 steps, wwm‑MLM pretraining task (NSP omitted due to slow convergence). The model achieved MLM loss 1.5‑1.8; loss ≈ 1.8 yielded good classification/NER performance, while further training improved reading‑comprehension results.
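The batch figures above can be related with a little arithmetic. The sketch below assumes that the ≈ 14400 figure is the effective batch size after accumulation and that it is split evenly across the 8 GPUs and 10 accumulation steps; the article does not confirm either point. It also shows, on a toy one‑parameter regression, why averaging gradients over equal‑sized micro‑batches matches a single large‑batch gradient.

```python
def per_device_micro_batch(global_batch, num_devices, accumulation_steps):
    """Samples each device processes per forward/backward pass."""
    return global_batch // (num_devices * accumulation_steps)


def mse_grad(w, batch):
    """Gradient of mean squared error for the toy model y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accumulation_steps):
    """Average the gradients of equal-sized micro-batches.

    Assumes len(batch) is divisible by accumulation_steps.
    """
    size = len(batch) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        micro = batch[i * size:(i + 1) * size]
        # Scale each micro-batch gradient so the sum is an average.
        total += mse_grad(w, micro) / accumulation_steps
    return total


print(per_device_micro_batch(14400, num_devices=8, accumulation_steps=10))  # 180

# Toy data: y = 3x, evaluated at a deliberately wrong weight.
data = [(float(x), 3.0 * x) for x in range(1, 21)]
full = mse_grad(0.5, data)
accum = accumulated_grad(0.5, data, accumulation_steps=10)
print(abs(full - accum) < 1e-9)  # True
```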

Encountered Issues :

Gradient explosion when loss approached 1.5 – resolved by lowering the initial learning rate.

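Limited effectiveness of the “BERT‑of‑Theseus” distillation method: because it swaps teacher modules for student modules, the student must keep the teacher's hidden size. The model could therefore only be shrunk by removing layers, and the teacher had to be a small‑hidden‑size model with sub‑optimal performance.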

Post‑Competition Improvements : Additional data augmentation and a simple distillation pipeline later surpassed BERT‑base on WSC and CLUENER, achieving top ranking among small models.
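The article does not describe the distillation pipeline in detail. As an illustration only, the sketch below shows the common soft‑target formulation, where the student is trained to match the teacher's temperature‑softened output distribution. The function names and the temperature value are assumptions, and this is not necessarily the loss the team used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))          # 0.0: identical outputs
print(distillation_loss([0.0, 0.0, 0.0], teacher))  # positive: student disagrees
```

In practice this term is usually combined with the ordinary supervised loss on the task labels.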

Inference Optimization at Xiaomi :

The team evaluated inference metrics (QPS and P99 latency) on a T4 GPU. A vanilla BERT‑base (seq‑len 16) required ~20 GPUs to serve 2000 QPS with P99 ≈ 90 ms. After optimization, a single GPU could handle 3000 QPS with P99 < 40 ms.
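For readers less familiar with the two metrics, here is a minimal sketch of how QPS and P99 latency can be computed from a load‑test log. It uses the nearest‑rank percentile definition, which is one common convention; the article does not say which definition the team's tooling used, and the sample latencies are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def qps(num_requests, elapsed_seconds):
    """Average queries per second over a measurement window."""
    return num_requests / elapsed_seconds


# Made-up latencies in milliseconds: 99 fast requests and one slow outlier.
latencies_ms = [10] * 99 + [500]
print(percentile(latencies_ms, 50))   # 10
print(percentile(latencies_ms, 99))   # 10
print(percentile(latencies_ms, 100))  # 500
print(qps(num_requests=6000, elapsed_seconds=3))  # 2000.0
```

P99 is preferred over the mean for online services because it reflects what the slowest 1% of requests experience, which an average would hide.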

Optimization techniques:

TensorFlow Serving Batching – aggregates multiple requests into a batch, leveraging GPU parallelism.

FP16 quantization using Tensor Cores – converts FP32 ops to FP16, roughly doubling throughput with negligible accuracy loss.

Integration of NVIDIA’s FasterTransformer library into TensorFlow Serving, supporting variable‑length inputs.
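The batching idea in the first item can be sketched in a few lines of plain Python. This is not TensorFlow Serving's implementation; it only illustrates the policy of waiting briefly for more requests and then sending whatever has arrived as one batch. The parameter names loosely mirror TensorFlow Serving's `max_batch_size` and `batch_timeout_micros` settings, and the function name is hypothetical.

```python
import queue
import time


def collect_batch(requests, max_batch_size, batch_timeout_s):
    """Block for the first request, then wait up to batch_timeout_s for more."""
    batch = [requests.get()]  # blocks until at least one request arrives
    deadline = time.monotonic() + batch_timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch_size=4, batch_timeout_s=0.01))  # [0, 1, 2, 3]
print(collect_batch(pending, max_batch_size=4, batch_timeout_s=0.01))  # [4]
```

The timeout is the key trade‑off: a longer wait yields fuller batches and higher throughput, but every request in the batch pays that wait as extra latency.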

These methods reduced the required GPU count for BERT‑base from 20 to 1, a 20× reduction that cuts GPU cost by about 95%.
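The "negligible accuracy loss" attributed to FP16 above can be sanity‑checked at the level of individual values. The sketch below uses Python's `struct` module, whose `e` format is IEEE 754 half precision, to round numbers to FP16 and measure the relative error. The sample values are arbitrary stand‑ins for weights or activations; it says nothing about end‑to‑end task accuracy, which has to be measured on the model itself.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(values):
    """Largest relative rounding error over the non-zero values."""
    return max(abs(to_fp16(v) - v) / abs(v) for v in values if v != 0)


# Arbitrary sample values, not taken from a real model.
samples = [0.1, -0.37, 1.5, 3.14159, 100.0]
print(to_fp16(0.1))  # 0.0999755859375
print(max_relative_error(samples) <= 2 ** -11)  # True
```

FP16 keeps 10 explicit mantissa bits, so values in its normal range round with a relative error of at most 2^-11 (about 0.05%). Its largest finite value is 65504, so activations that can exceed that range need care when converting.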

Online Deployment Example : In Xiaomi’s “Xiao Ai” chatbot dialogue ranking task (pointwise BERT binary classification, 6‑layer model, seq‑len 32), the three optimizations compressed the deployment from 13 GPUs to 3 GPUs and lowered P99 latency from >200 ms to 35 ms.
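The GPU figures quoted in this section can be compared with a short calculation. The sketch assumes throughput scales linearly with the number of GPUs, which is a simplification, and derives the per‑GPU baseline of 100 QPS from the article's "20 GPUs for 2000 QPS" figure.

```python
import math


def gpus_needed(target_qps, qps_per_gpu):
    """GPUs required for a target load, assuming linear scaling."""
    return math.ceil(target_qps / qps_per_gpu)


def reduction(gpus_before, gpus_after):
    """Return (reduction factor, fraction of GPUs saved)."""
    return gpus_before / gpus_after, 1 - gpus_after / gpus_before


# Benchmark: 2000 QPS at about 100 QPS per GPU before, 3000 QPS per GPU after.
print(gpus_needed(2000, 100))   # 20
print(gpus_needed(2000, 3000))  # 1

for label, before, after in [("BERT-base benchmark", 20, 1),
                             ("dialogue ranking", 13, 3)]:
    factor, saved = reduction(before, after)
    print(f"{label}: {factor:.1f}x fewer GPUs, {saved:.0%} saved")
```

The production gain (13 to 3 GPUs, roughly 4.3×) is smaller than the benchmark gain (20×). The two setups differ in model depth and sequence length, so the numbers are not directly comparable.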

Conclusion : The presentation covered three parts – competition experience, pretrained model inference optimization at Xiaomi, and future directions – demonstrating that systematic model compression and inference engineering can make large language models practical for high‑traffic production services.
