DataFunTalk
Feb 14, 2021 · Artificial Intelligence
TurboTransformers: An Efficient GPU Serving System for Transformer Models
TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.
BERT · Dynamic batching · GPU inference
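To make the dynamic-programming batch scheduling idea concrete, here is a minimal sketch of how such a scheduler might partition variable-length requests into batches that minimize total padded compute. The function name, the `max_batch` parameter, and the cost model (batch size times the longest sequence in the batch) are illustrative assumptions, not TurboTransformers' actual API.

```python
def schedule_batches(lengths, max_batch=32):
    """Partition sequence lengths into contiguous batches (after sorting)
    so that the total padded cost, sum(batch_size * max_len_in_batch),
    is minimized. Illustrative sketch, not the TurboTransformers code."""
    xs = sorted(lengths)
    n = len(xs)
    INF = float("inf")
    best = [INF] * (n + 1)  # best[i]: min padded cost of the first i requests
    cut = [0] * (n + 1)     # cut[i]: start index of the last batch in best[i]
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_batch), i):
            # Candidate batch covers xs[j:i]; every sequence in it is
            # padded to the longest one, xs[i - 1] (the list is sorted).
            cost = best[j] + (i - j) * xs[i - 1]
            if cost < best[i]:
                best[i], cut[i] = cost, j
    # Recover the batch partition from the stored cut points.
    batches, i = [], n
    while i > 0:
        batches.append(xs[cut[i]:i])
        i = cut[i]
    return best[n], batches[::-1]
```

For example, requests of lengths `[5, 5, 5, 50]` are split into `[5, 5, 5]` and `[50]` (cost 65) rather than padded into one batch of four at length 50 (cost 200), which is the kind of padding waste the paper's scheduler targets.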