TurboTransformers: An Efficient GPU Serving System for Transformer Models
TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.
The paper addresses the difficulty of deploying large Transformer models (e.g., BERT) in online services due to high computational cost and variable‑length inputs, which make traditional GPU inference solutions inefficient.
TurboTransformers proposes three core innovations: (1) a more efficient GPU batch reduction algorithm that accelerates Softmax and LayerNorm operations; (2) a sequence-length-aware memory allocation scheme that allocates memory in large blocks and reuses space based on tensor lifetimes; (3) a dynamic-programming-based batch scheduling algorithm that maximizes throughput for variable-length requests while keeping latency low.
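The lifetime-based reuse idea behind innovation (2) can be sketched as an offline planning pass: tensors whose lifetimes do not overlap may occupy the same region of one large pre-allocated block. The following is an illustrative sketch of that idea, not the TurboTransformers implementation; the greedy placement policy and the `(name, size, first_use, last_use)` representation are assumptions made for the example.

```python
# Sketch of lifetime-based memory reuse: place each tensor at the lowest
# offset in a shared block that does not collide with any already-placed
# tensor whose lifetime overlaps. Tensors with disjoint lifetimes share space.

def plan_offsets(tensors):
    """tensors: list of (name, size_bytes, first_use, last_use).
    Returns ({name: offset}, total_block_size)."""
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    # Placing larger tensors first tends to pack the block more tightly.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only tensors alive at the same time can conflict.
        overlapping = [(o, s) for o, s, f, l in placed
                       if not (l < first or last < f)]
        offset = 0
        for o, s in sorted(overlapping):
            if offset + size <= o:
                break                      # fits in the gap before this tensor
            offset = max(offset, o + s)    # otherwise skip past it
        placed.append((offset, size, first, last))
        offsets[name] = offset
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total
```

Because the plan is recomputed cheaply, the same scheme can adapt when request sequence lengths change, which is the variable-length case the allocator targets.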
The system consists of a runtime library and a service framework, both open-sourced on GitHub (https://github.com/Tencent/TurboTransformers). Detailed algorithmic descriptions include a parallel batch reduction method that reduces synchronization overhead, a warp-level all-reduce routine, and an O(n²) dynamic-programming scheduler that chooses how to partition each list of pending variable-length requests into batches.
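The O(n²) scheduling idea can be illustrated with a small dynamic program: sort pending requests by length, then choose batch boundaries so that total padded compute plus per-batch overhead is minimized (padding every request in a batch to that batch's maximum length is what makes naive batching of variable-length inputs wasteful). This is a hedged sketch in the spirit of the paper's scheduler, not its actual objective function; the fixed `overhead` parameter and the exact cost model are assumptions for illustration.

```python
# Sketch of an O(n^2) DP batch scheduler for variable-length requests.
# Cost model (assumed): a batch of requests j..i-1 (length-sorted) costs
# batch_size * max_length (padded compute) plus a fixed per-batch overhead.

def schedule_batches(lengths, overhead=8):
    """lengths: sequence lengths of pending requests.
    Returns (min_total_cost, batches) with batches as lists of lengths."""
    ls = sorted(lengths)
    n = len(ls)
    INF = float("inf")
    dp = [0.0] + [INF] * n    # dp[i]: best cost covering the first i requests
    cut = [0] * (n + 1)       # backpointer: start index of the last batch
    for i in range(1, n + 1):
        for j in range(i):    # candidate batch covers requests j..i-1
            # every request in the batch is padded to ls[i-1], its maximum
            cost = dp[j] + (i - j) * ls[i - 1] + overhead
            if cost < dp[i]:
                dp[i], cut[i] = cost, j
    batches, i = [], n
    while i > 0:
        batches.append(ls[cut[i]:i])
        i = cut[i]
    return dp[n], batches[::-1]
```

With three short requests and one long one, the DP keeps the long request in its own batch rather than padding the short ones to its length, which is exactly the trade-off the scheduler is meant to make.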
Experimental results show that TurboTransformers outperforms PyTorch, TensorFlow‑XLA, ONNX Runtime, FasterTransformer, and TensorRT on both RTX 2060 and V100 GPUs, achieving up to 2.58× speed‑up for BERT and significant latency reductions across variable‑length and fixed‑length workloads. The new batch scheduling algorithm further improves throughput under high request rates.
The authors conclude that TurboTransformers markedly improves latency and throughput for Transformer inference in GPU data centers and suggest future work combining these engineering optimizations with model‑level techniques such as distillation and quantization.
DataFunTalk