LightSeq: High‑Performance Open‑Source Inference Engine for Transformers, GPT and Other NLP Models
This article introduces LightSeq, an open-source, GPU-accelerated inference engine that speeds up Transformer-based models such as BERT and GPT by up to 14× over TensorFlow, supports multiple decoding strategies, and integrates easily with models trained in major deep-learning frameworks. It also presents performance benchmarks and the technical optimizations behind the speedups.
Since the introduction of the Transformer model in 2017, the size of pretrained language models (e.g., BERT, GPT‑3) has grown exponentially, creating significant challenges for real‑time inference due to long latency and low queries‑per‑second (QPS) on a single GPU.
LightSeq, released by the ByteDance technology team in December 2019, is the first open‑source engine that fully supports high‑speed inference for a variety of models—including Transformer, GPT, BERT, and VAE—while offering features such as beam search, diverse beam search, and sampling.
Key Advantages
High performance: LightSeq can achieve up to 14× acceleration over TensorFlow and up to 1.4× over Faster Transformer on translation tasks.
Broad model support: It handles BERT, GPT, Transformer, VAE and multiple decoding methods.
Easy integration: Models trained in TensorFlow or PyTorch can be exported to LightSeq’s protocol and deployed without writing code, thanks to built‑in support for NVIDIA Triton Inference Server.
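The export step itself is not shown in the article; the sketch below is a minimal, hypothetical illustration of the workflow for a PyTorch checkpoint. The write_lightseq_proto helper and every name in it are placeholders for the conversion tooling and protobuf schema (transformer.proto) shipped with the LightSeq repository, not its actual API.

```python
# Hypothetical sketch of the export step: gather trained weights so they can be
# serialized into LightSeq's protobuf format (transformer.pb). Assumes the
# checkpoint file is a plain PyTorch state_dict.
import torch


def collect_weights(ckpt_path: str) -> dict:
    """Load a checkpoint and return a name -> float32 numpy array mapping."""
    state_dict = torch.load(ckpt_path, map_location="cpu")
    return {name: t.detach().float().numpy() for name, t in state_dict.items()}


def write_lightseq_proto(weights: dict, out_path: str) -> None:
    """Placeholder for the real exporter: map the named weights onto the
    messages defined in LightSeq's transformer.proto and serialize them."""
    raise NotImplementedError("use the exporter provided by the LightSeq repo")


if __name__ == "__main__":
    weights = collect_weights("checkpoint.pt")
    write_lightseq_proto(weights, "model_zoo/model_repo/transformer.pb")
```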
Usage Example
Prepare a model repository with the following structure (the transformer.pb file contains the exported weights and libtransformer.so is the compiled LightSeq library):
- model_zoo/
  - model_repo/
    - config.pbtxt
    - transformer.pb
    - 1/
      - libtransformer.so
Then launch Triton:

trtserver --model-store=${model_zoo}
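The article stops at launching the server; as a purely illustrative follow-up, the sketch below sends a request to it with the current tritonclient Python package (the trtserver binary above predates the Triton rename, so the exact client version may differ). The model name "transformer" and the tensor names source_ids/target_ids are assumptions for this sketch, not values taken from LightSeq's config.pbtxt.

```python
# Illustrative client call against the server launched above. Model and tensor
# names below are assumptions; the real ones are defined in config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One tokenized source sentence (hypothetical token ids).
src = np.array([[12, 845, 903, 7, 2]], dtype=np.int32)

inp = httpclient.InferInput("source_ids", list(src.shape), "INT32")
inp.set_data_from_numpy(src)
out = httpclient.InferRequestedOutput("target_ids")

result = client.infer(model_name="transformer", inputs=[inp], outputs=[out])
print(result.as_numpy("target_ids"))  # generated target token ids
```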
Performance Evaluation

On NVIDIA Tesla P4 and T4 GPUs, LightSeq outperforms TensorFlow and Faster Transformer in both machine-translation (Transformer-base and Transformer-big) and text-generation (top-k/top-p sampling) scenarios, delivering up to 13× speedup for GPT and VAE models and reducing 99th-percentile service latency from ~360 ms to ~80 ms.
Technical Foundations
Operator Fusion: The many fine-grained kernels that general frameworks launch for an operation such as layer normalization are fused into a single custom CUDA kernel, eliminating intermediate reads and writes to GPU memory.
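LightSeq's fused operators are hand-written CUDA kernels; the PyTorch snippet below is only a rough illustration of what fusion buys, contrasting an unfused layer norm that materializes several intermediate tensors with the single fused kernel behind F.layer_norm.

```python
# Rough illustration (not LightSeq's CUDA code) of the benefit of fusion: the
# unfused path launches several kernels and writes intermediates to memory,
# while a fused layer-norm kernel does the same math in one pass.
import torch
import torch.nn.functional as F

x = torch.randn(32, 1024)
gamma, beta, eps = torch.ones(1024), torch.zeros(1024), 1e-5

# Unfused: mean, variance, normalization and the affine transform are separate
# ops, each producing an intermediate tensor.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_unfused = (x - mean) / torch.sqrt(var + eps) * gamma + beta

# Fused: one kernel produces the same result without the intermediates.
y_fused = F.layer_norm(x, (1024,), weight=gamma, bias=beta, eps=eps)

assert torch.allclose(y_unfused, y_fused, atol=1e-4)
```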
Dynamic Memory Reuse: All dynamic shapes are bounded, allowing pre‑allocation of maximum‑size buffers that are shared across tensors, enabling up to eight Transformer‑big models to run concurrently on a single T4.
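The following is a conceptual sketch of the reuse idea rather than LightSeq's actual allocator: one buffer is allocated for the worst-case shape at startup, and requests with smaller dynamic shapes borrow views of it instead of triggering new allocations. The size bounds are made-up values.

```python
# Conceptual sketch (not LightSeq's allocator) of dynamic memory reuse:
# pre-allocate one worst-case buffer and hand out views of it for
# intermediates whose lifetimes do not overlap.
import torch

MAX_TOKENS, HIDDEN = 16384, 1024            # assumed upper bounds
buffer = torch.empty(MAX_TOKENS * HIDDEN)   # allocated once at startup


def borrow(num_tokens: int, hidden: int) -> torch.Tensor:
    """Return a view of the shared buffer shaped for the current request."""
    assert num_tokens * hidden <= buffer.numel(), "exceeds the pre-set bound"
    return buffer[: num_tokens * hidden].view(num_tokens, hidden)


# Two requests with different dynamic shapes reuse the same storage.
a = borrow(8 * 40, HIDDEN)     # batch 8, sequence length 40
b = borrow(16 * 25, HIDDEN)    # batch 16, sequence length 25
assert a.data_ptr() == b.data_ptr()
```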
Hierarchical Decoding: A coarse-to-fine logit selection kernel (R-top-k) first retrieves a small candidate set with a rough top-k pre-selection, so full softmax and sorting run over only a few dozen candidates per step instead of the whole vocabulary.
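As a conceptual sketch of that coarse-to-fine idea (the real implementation is a GPU kernel), the snippet below computes a safe threshold from group maxima and keeps only the logits above it, so the expensive softmax and sort touch a small fraction of the vocabulary. The group count and vocabulary size are illustrative.

```python
# Conceptual sketch (not LightSeq's kernel) of coarse-to-fine logit selection:
# the k-th largest group maximum is a lower bound on the true k-th largest
# logit, so everything below it can be discarded before softmax and sorting.
import torch


def rough_topk_candidates(logits: torch.Tensor, k: int, num_groups: int = 64) -> torch.Tensor:
    """Return indices of the logits that survive the coarse filtering step."""
    pad = (-logits.numel()) % num_groups
    padded = torch.cat([logits, logits.new_full((pad,), float("-inf"))])
    group_max = padded.view(num_groups, -1).max(dim=1).values
    threshold = torch.topk(group_max, k).values[-1]   # safe lower bound
    return (logits >= threshold).nonzero(as_tuple=True)[0]


logits = torch.randn(32000)                  # full-vocabulary logits
cand = rough_topk_candidates(logits, k=4)    # only a handful of logits survive
probs = torch.softmax(logits[cand], dim=0)   # fine step: softmax over survivors
best = cand[probs.argsort(descending=True)[:4]]
```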
GPU profiling shows that after these optimizations, matrix multiplication (cuBLAS) dominates latency (≈ 85 %), while cache refresh and other ops account for the remaining overhead.
Resources
GitHub: https://github.com/bytedance/lightseq
For more details, see the LightSeq performance report, related papers (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020) and the cited open‑source projects (FasterTransformer, TurboTransformers, Triton Inference Server).