TurboTransformers: An Efficient GPU Serving System for Transformer Models
TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.
The paper addresses the difficulty of deploying large Transformer models (e.g., BERT) in online services due to high computational cost and variable‑length inputs, which make traditional GPU inference solutions inefficient.
TurboTransformers proposes three core innovations: (1) a more efficient GPU batch reduction algorithm that accelerates Softmax and LayerNorm operations; (2) a sequence-length-aware memory allocation scheme that allocates memory in large blocks and reuses space based on tensor lifetimes; (3) a dynamic-programming-based batch scheduling algorithm that maximizes throughput for variable-length requests while keeping latency low.
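The lifetime-based reuse idea behind innovation (2) can be sketched as an offline planning pass: tensors whose lifetimes do not overlap may occupy the same region of one large pre-allocated block. The following is an illustrative sketch of that idea, not the TurboTransformers implementation; the greedy placement policy and the `(name, size, first_use, last_use)` representation are assumptions made for the example.

```python
# Sketch of lifetime-based memory reuse: place each tensor at the lowest
# offset in a shared block that does not collide with any already-placed
# tensor whose lifetime overlaps. Tensors with disjoint lifetimes share space.

def plan_offsets(tensors):
    """tensors: list of (name, size_bytes, first_use, last_use).
    Returns ({name: offset}, total_block_size)."""
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    # Placing larger tensors first tends to pack the block more tightly.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only tensors alive at the same time can conflict.
        overlapping = [(o, s) for o, s, f, l in placed
                       if not (l < first or last < f)]
        offset = 0
        for o, s in sorted(overlapping):
            if offset + size <= o:
                break                      # fits in the gap before this tensor
            offset = max(offset, o + s)    # otherwise skip past it
        placed.append((offset, size, first, last))
        offsets[name] = offset
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total
```

Because the plan is recomputed cheaply, the same scheme can adapt when request sequence lengths change, which is the variable-length case the allocator targets.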
The system consists of a runtime library and a service framework, both open-sourced on GitHub (https://github.com/Tencent/TurboTransformers). Detailed algorithmic descriptions include a parallel batch reduction method that reduces synchronization overhead, a warp-level all-reduce routine, and an O(n²) dynamic-programming scheduler that chooses how to partition each list of pending variable-length requests into batches.
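The O(n²) scheduling idea can be illustrated with a small dynamic program: sort pending requests by length, then choose batch boundaries so that total padded compute plus per-batch overhead is minimized (padding every request in a batch to that batch's maximum length is what makes naive batching of variable-length inputs wasteful). This is a hedged sketch in the spirit of the paper's scheduler, not its actual objective function; the fixed `overhead` parameter and the exact cost model are assumptions for illustration.

```python
# Sketch of an O(n^2) DP batch scheduler for variable-length requests.
# Cost model (assumed): a batch of requests j..i-1 (length-sorted) costs
# batch_size * max_length (padded compute) plus a fixed per-batch overhead.

def schedule_batches(lengths, overhead=8):
    """lengths: sequence lengths of pending requests.
    Returns (min_total_cost, batches) with batches as lists of lengths."""
    ls = sorted(lengths)
    n = len(ls)
    INF = float("inf")
    dp = [0.0] + [INF] * n    # dp[i]: best cost covering the first i requests
    cut = [0] * (n + 1)       # backpointer: start index of the last batch
    for i in range(1, n + 1):
        for j in range(i):    # candidate batch covers requests j..i-1
            # every request in the batch is padded to ls[i-1], its maximum
            cost = dp[j] + (i - j) * ls[i - 1] + overhead
            if cost < dp[i]:
                dp[i], cut[i] = cost, j
    batches, i = [], n
    while i > 0:
        batches.append(ls[cut[i]:i])
        i = cut[i]
    return dp[n], batches[::-1]
```

With three short requests and one long one, the DP keeps the long request in its own batch rather than padding the short ones to its length, which is exactly the trade-off the scheduler is meant to make.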
Experimental results show that TurboTransformers outperforms PyTorch, TensorFlow‑XLA, ONNX Runtime, FasterTransformer, and TensorRT on both RTX 2060 and V100 GPUs, achieving up to 2.58× speed‑up for BERT and significant latency reductions across variable‑length and fixed‑length workloads. The new batch scheduling algorithm further improves throughput under high request rates.
The authors conclude that TurboTransformers markedly improves latency and throughput for Transformer inference in GPU data centers and suggest future work combining these engineering optimizations with model‑level techniques such as distillation and quantization.
DataFunTalk