DataFunTalk
Feb 14, 2021 · Artificial Intelligence
TurboTransformers: An Efficient GPU Serving System for Transformer Models
TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.
BERT · Dynamic batching · GPU inference
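To make the dynamic-programming batch scheduling idea concrete, here is a minimal sketch of how such a scheduler might partition variable-length requests into batches that minimize total padded compute. The function name, the `max_batch` parameter, and the cost model (batch size times the longest sequence in the batch) are illustrative assumptions, not TurboTransformers' actual API.

```python
def schedule_batches(lengths, max_batch=32):
    """Partition sequence lengths into contiguous batches (after sorting)
    so that the total padded cost, sum(batch_size * max_len_in_batch),
    is minimized. Illustrative sketch, not the TurboTransformers code."""
    xs = sorted(lengths)
    n = len(xs)
    INF = float("inf")
    best = [INF] * (n + 1)  # best[i]: min padded cost of the first i requests
    cut = [0] * (n + 1)     # cut[i]: start index of the last batch in best[i]
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_batch), i):
            # Candidate batch covers xs[j:i]; every sequence in it is
            # padded to the longest one, xs[i - 1] (the list is sorted).
            cost = best[j] + (i - j) * xs[i - 1]
            if cost < best[i]:
                best[i], cut[i] = cost, j
    # Recover the batch partition from the stored cut points.
    batches, i = [], n
    while i > 0:
        batches.append(xs[cut[i]:i])
        i = cut[i]
    return best[n], batches[::-1]
```

For example, requests of lengths `[5, 5, 5, 50]` are split into `[5, 5, 5]` and `[50]` (cost 65) rather than padded into one batch of four at length 50 (cost 200), which is the kind of padding waste the paper's scheduler targets.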