How Alibaba Halved BERT Latency for Real‑Time Search

This article details Alibaba's technical challenges with BERT's high resource consumption in online search, analyzes memory and compute bottlenecks using TensorFlow profiling, and presents both TensorFlow‑based tweaks and a custom CUDA implementation that together double throughput and cut latency by about 50%.

Alibaba Cloud Developer

Background

BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018 and achieved state‑of‑the‑art results on many NLP tasks by pre‑training with Masked LM and Next Sentence Prediction on massive corpora.

Performance Pain Points

In Alibaba's search pipeline, each ranking request scores 10–20 documents, so BERT inference runs at batch sizes of up to 20. The service requires sub‑20 ms latency, which forces the use of GPUs and leads to prohibitive hardware costs; even after trimming the 12‑layer model to 3 layers, it still fell short of the latency target.

Performance Analysis

TensorFlow profiling revealed two main cost sources:

1. Transfer of model parameters from host memory to the GPU.

2. The Transformer computation itself.

The first is a memory‑management inefficiency, the second a compute inefficiency.

Optimization Approaches

4.1 TensorFlow‑Based Tweaks

By explicitly placing all ops on the GPU device, parameter transfer overhead was reduced, lowering latency by roughly one‑third, though throughput remained unchanged. TensorFlow’s built‑in GPU memory manager still caused latency jitter under varying batch sizes.

Transformer computation incurred many CUDA kernel launches and data exchanges, limiting GPU utilization.

4.2 Re‑implementing BERT Prediction

Leveraging the open‑source cuBERT project, the team rewrote the inference logic, eliminating TensorFlow’s heavy graph overhead. Key steps included:

1. Parsing TensorFlow checkpoint parameters into cuBERT.

2. Implementing the downstream MLP.

3. Choosing appropriate cuBLAS/cuDNN kernels based on matrix sizes.

This custom implementation loads all parameters into GPU memory upfront and pre‑allocates intermediate buffers, achieving near‑zero allocation cost during inference.
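
A minimal sketch of the pre‑allocation idea, in plain Python rather than CUDA: size every intermediate buffer for the largest expected batch, allocate one arena at startup, and hand out zero‑copy slices during inference. The `Workspace` class and buffer names are hypothetical, and the head count of 12 is an assumption (the BERT‑base default); batch size, sequence length, and hidden size come from the test setup described in this article.

```python
from typing import Dict, Tuple


class Workspace:
    """One upfront allocation carved into named buffers (hypothetical helper)."""

    def __init__(self, sizes_in_bytes: Dict[str, int]) -> None:
        self._offsets: Dict[str, Tuple[int, int]] = {}
        total = 0
        for name, size in sizes_in_bytes.items():
            self._offsets[name] = (total, size)
            total += size
        self._arena = bytearray(total)        # the only allocation
        self._view = memoryview(self._arena)

    def buffer(self, name: str) -> memoryview:
        """Return a zero-copy slice of the arena; nothing new is allocated."""
        start, size = self._offsets[name]
        return self._view[start:start + size]

    @property
    def total_bytes(self) -> int:
        return len(self._arena)


BATCH, SEQ_LEN, HIDDEN = 20, 64, 768   # from the test environment
NUM_HEADS = 12                         # assumption: BERT-base default
FLOAT_BYTES = 4                        # float32

activation_bytes = BATCH * SEQ_LEN * HIDDEN * FLOAT_BYTES
attention_bytes = BATCH * NUM_HEADS * SEQ_LEN * SEQ_LEN * FLOAT_BYTES

workspace = Workspace({
    "hidden_in": activation_bytes,
    "hidden_out": activation_bytes,
    "attention_scores": attention_bytes,
})
```

Each request then calls `workspace.buffer("hidden_in")` and friends instead of allocating, which is what keeps latency stable when batch sizes vary.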

By reducing kernel launch overhead to about 20 % of the original TensorFlow version and enabling half‑precision computation on V100/T4 GPUs, the custom BERT achieved a two‑fold throughput increase and a 50 % latency reduction.
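
To give a rough sense of what half precision buys in memory terms, the sketch below counts the encoder weights of a 3‑layer, hidden‑size‑768 Transformer and compares their footprint in float32 and float16. The feed‑forward size of 3072 is an assumption (the BERT‑base default), embeddings and the downstream MLP are left out, and `encoder_params` and `to_half` are illustrative helpers, not part of cuBERT.

```python
import struct

FP32_BYTES = struct.calcsize("<f")   # 4
FP16_BYTES = struct.calcsize("<e")   # 2


def encoder_params(layers: int, hidden: int, intermediate: int) -> int:
    """Count weights and biases in the Transformer encoder layers only."""
    attention = 4 * (hidden * hidden + hidden)        # Q, K, V, output projections
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden)  # two dense layers
    layer_norms = 2 * (2 * hidden)                     # scale and shift, twice
    return layers * (attention + feed_forward + layer_norms)


def to_half(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Assumption: intermediate size 3072, as in BERT-base.
params = encoder_params(layers=3, hidden=768, intermediate=3072)
fp32_bytes = params * FP32_BYTES
fp16_bytes = params * FP16_BYTES
```

Halving the bytes per weight halves the memory traffic for the same computation; the price is rounding error, which `to_half(0.1)` makes visible because 0.1 is not exactly representable in 16 bits.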

Performance Comparison

Test environment: Linux kernel 3.10.0, gcc 4.9.2, 3‑layer Transformer, hidden size 768, sequence length 64, batch size 20.

Results showed the custom implementation doubling QPS while halving latency compared to the baseline TensorFlow model.
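
The two headline numbers are linked. Under the simplifying assumption that the number of requests in flight per GPU stays constant, Little's law says throughput equals concurrency divided by latency, so halving latency doubles QPS. The figures passed in below are hypothetical placeholders, not measurements from the article.

```python
def qps(in_flight: int, latency_ms: float) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return in_flight * 1000.0 / latency_ms


def docs_per_second(in_flight: int, latency_ms: float, batch_size: int = 20) -> float:
    """Documents scored per second when each request carries one batch."""
    return qps(in_flight, latency_ms) * batch_size


# Hypothetical values: 4 concurrent requests, 20 ms before, 10 ms after.
baseline = qps(in_flight=4, latency_ms=20.0)
optimized = qps(in_flight=4, latency_ms=10.0)
```

The assumption matters: the device‑placement tweak described earlier cut latency without raising throughput, which is the pattern to expect when concurrency or GPU compute, rather than per‑request latency, is the limiting factor.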

Future Work

Plans include exploring newer GPUs (e.g., T4) with mixed‑precision and int8 inference, and investigating knowledge distillation to replace the large model with smaller, faster variants.
