How Alibaba Halved BERT Latency for Real‑Time Search
This article details Alibaba's technical challenges with BERT's high resource consumption in online search, analyzes memory and compute bottlenecks using TensorFlow profiling, and presents both TensorFlow‑based tweaks and a custom CUDA implementation that together double throughput and cut latency by about 50%.
Background
BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018 and achieved state‑of‑the‑art results on many NLP tasks by pre‑training with Masked LM and Next Sentence Prediction on massive corpora.
Performance Pain Points
In Alibaba's search pipeline, each ranking request processes 10–20 documents, yielding an average batch size of 20 for BERT inference. The service requires sub‑20 ms latency, which forces the use of GPUs and leads to prohibitive hardware costs; even a 12‑layer model trimmed to 3 layers fell short of latency targets.
Performance Analysis
TensorFlow profiling revealed two main cost sources:
Transferring model parameters from host memory to the GPU.
Transformer computation itself.
The first is a memory‑management cost; the second is a compute‑efficiency cost.
Optimization Approaches
4.1 TensorFlow‑Based Tweaks
By explicitly placing all ops on the GPU device, parameter transfer overhead was reduced, lowering latency by roughly one‑third, though throughput remained unchanged. TensorFlow’s built‑in GPU memory manager still caused latency jitter under varying batch sizes.
Transformer computation incurred many CUDA kernel launches and data exchanges, limiting GPU utilization.
4.2 Re‑implementing BERT Prediction
Leveraging the open‑source cuBERT project, the team rewrote the inference logic, eliminating TensorFlow’s heavy graph overhead. Key steps included:
Parsing TensorFlow checkpoint parameters into cuBERT.
Implementing the downstream MLP.
Choosing appropriate cuBLAS/cuDNN kernels based on matrix sizes.
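The article does not say how kernels were matched to matrix sizes. One common approach, sketched below as an assumption rather than a description of cuBERT, is to time each candidate once per shape and cache the winner. Pure Python functions stand in for the real cuBLAS/cuDNN calls, and `matmul_rowwise`, `matmul_transposed`, and `KernelSelector` are hypothetical names.

```python
import time
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
Kernel = Callable[[Matrix, Matrix], Matrix]


def matmul_rowwise(a: Matrix, b: Matrix) -> Matrix:
    """Candidate 1: index into b column by column."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def matmul_transposed(a: Matrix, b: Matrix) -> Matrix:
    """Candidate 2: transpose b once so the inner loop walks whole rows."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


class KernelSelector:
    """Times every candidate once per (m, k, n) shape and caches the fastest."""

    def __init__(self, candidates: Dict[str, Kernel], repeats: int = 3) -> None:
        self.candidates = candidates
        self.repeats = repeats
        self.best: Dict[Tuple[int, int, int], str] = {}

    def _time(self, kernel: Kernel, a: Matrix, b: Matrix) -> float:
        start = time.perf_counter()
        for _ in range(self.repeats):
            kernel(a, b)
        return time.perf_counter() - start

    def choose(self, a: Matrix, b: Matrix) -> str:
        shape = (len(a), len(b), len(b[0]))
        if shape not in self.best:  # benchmark only the first time a shape is seen
            timings = {name: self._time(fn, a, b)
                       for name, fn in self.candidates.items()}
            self.best[shape] = min(timings, key=timings.get)
        return self.best[shape]

    def matmul(self, a: Matrix, b: Matrix) -> Matrix:
        return self.candidates[self.choose(a, b)](a, b)


selector = KernelSelector({"rowwise": matmul_rowwise,
                           "transposed": matmul_transposed})
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = selector.matmul(a, b)
```

Because inference shapes are fixed by batch size, sequence length, and hidden size, the benchmarking cost in a scheme like this is paid once at startup rather than on every request.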
This custom implementation loads all parameters into GPU memory upfront and pre‑allocates intermediate buffers, achieving near‑zero allocation cost during inference.
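The pre‑allocation idea can be sketched as follows. This is a minimal host‑side illustration, not the team's code: a `bytearray` stands in for GPU memory, `BufferPool` is a hypothetical name, float32 storage is assumed, and the 12 attention heads and 3072‑wide feed‑forward layer are BERT‑base defaults that the article does not state.

```python
from typing import Dict, Tuple

BYTES_PER_FLOAT32 = 4


class BufferPool:
    """Allocates every intermediate buffer once, then hands out views of it."""

    def __init__(self, shapes: Dict[str, Tuple[int, ...]]) -> None:
        self.offsets: Dict[str, Tuple[int, int]] = {}
        total = 0
        for name, shape in shapes.items():
            size = BYTES_PER_FLOAT32
            for dim in shape:
                size *= dim
            self.offsets[name] = (total, size)
            total += size
        # Single up-front allocation; a GPU version would allocate device memory here.
        self.storage = bytearray(total)
        self.allocations = 1

    def get(self, name: str) -> memoryview:
        """Returns a zero-copy view, so serving a request allocates nothing."""
        start, size = self.offsets[name]
        return memoryview(self.storage)[start:start + size]


# Shapes for batch 20, sequence length 64, hidden size 768.
pool = BufferPool({
    "hidden": (20, 64, 768),
    "attention_scores": (20, 12, 64, 64),  # 12 heads is an assumption
    "ffn": (20, 64, 3072),                 # 4 x hidden is an assumption
})
hidden_view = pool.get("hidden")
hidden_view[0] = 7  # writes land directly in the shared storage
```

Since every request uses the same maximum shapes, the same views can be reused across requests, which is what keeps per‑request allocation cost close to zero.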
Kernel launch overhead dropped to about 20% of the original TensorFlow version's. Combined with half‑precision computation on V100/T4 GPUs, this gave the custom BERT roughly twice the throughput and a 50% latency reduction.
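The trade‑off behind half precision can be seen without a GPU. The sketch below uses Python's standard `struct` module, whose `e` format is IEEE 754 half precision, to show that values take half the bytes of float32 but keep fewer significant digits. It only illustrates the number format, not the CUDA code path, and the sample weights are made up.

```python
import struct
from typing import Sequence


def to_half(value: float) -> float:
    """Rounds a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def packed_size(values: Sequence[float], fmt: str) -> int:
    """Bytes needed to store the values as 'e' (float16) or 'f' (float32)."""
    return len(struct.pack(f"<{len(values)}{fmt}", *values))


weights = [0.1, -0.25, 3.14159, 1024.5]  # illustrative values only
half_bytes = packed_size(weights, "e")
single_bytes = packed_size(weights, "f")
rounded = [to_half(w) for w in weights]
max_error = max(abs(w - r) for w, r in zip(weights, rounded))
```

Half precision has a 10‑bit mantissa, so representable values between 1024 and 2048 are spaced 1 apart and 1024.5 cannot be stored exactly. This is one reason half‑precision inference is usually validated against the float32 model before deployment.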
Performance Comparison
Test environment: Linux kernel 3.10.0, gcc 4.9.2, 3‑layer Transformer, hidden size 768, sequence length 64, batch size 20.
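To give a sense of scale for this configuration, the sketch below estimates the matrix‑multiply work in one forward pass. It is a back‑of‑the‑envelope calculation, not a figure from the article: it assumes the BERT‑base feed‑forward width of 4 × hidden, counts one multiply‑add as 2 FLOPs, and ignores softmax, layer normalization, biases, and embeddings.

```python
def transformer_matmul_flops(layers: int, hidden: int, seq_len: int,
                             batch: int, ffn_mult: int = 4) -> int:
    """Approximate matmul FLOPs for one forward pass of a Transformer encoder."""
    tokens = batch * seq_len
    qkv = 3 * 2 * tokens * hidden * hidden            # Q, K, V projections
    scores = 2 * batch * seq_len * seq_len * hidden   # Q @ K^T over all heads
    context = 2 * batch * seq_len * seq_len * hidden  # attention weights @ V
    out_proj = 2 * tokens * hidden * hidden           # attention output projection
    ffn = 2 * 2 * tokens * hidden * (ffn_mult * hidden)  # two feed-forward matmuls
    return layers * (qkv + scores + context + out_proj + ffn)


three_layer = transformer_matmul_flops(layers=3, hidden=768, seq_len=64, batch=20)
twelve_layer = transformer_matmul_flops(layers=12, hidden=768, seq_len=64, batch=20)
```

Under these assumptions the 3‑layer model needs roughly 55 GFLOPs per batch of 20, a quarter of what the 12‑layer model would need.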
Results showed the custom implementation doubling QPS while halving latency compared to the baseline TensorFlow model.
Future Work
Plans include exploring newer GPUs (e.g., T4) with mixed‑precision and int8 inference, and investigating knowledge distillation to replace the large model with smaller, faster variants.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
