How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency
This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.
Large model inference engines are the core runtime of generative language models: they receive user prompts, schedule GPU forward passes, and stream the generated tokens back to the user as text.
The engine operates in two main phases: the Prefill stage, which processes the initial prompt and builds the contextual memory (the KV cache), and the Decode stage, an autoregressive loop that predicts subsequent tokens one at a time until a stop condition is met.
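To make the two phases concrete, here is a minimal, framework-agnostic sketch of a prefill pass followed by an autoregressive decode loop. The `model`, `tokenizer`, and KV-cache interface are illustrative placeholders, not the actual AIAK or vLLM APIs.

```python
# Minimal sketch of the prefill + decode phases of an inference engine.
# `model`, `tokenizer`, and the KV-cache interface are hypothetical placeholders.

def generate(model, tokenizer, prompt, max_new_tokens=128, eos_id=2):
    # Prefill: run the whole prompt once to build the KV cache (contextual memory).
    input_ids = tokenizer.encode(prompt)
    logits, kv_cache = model.forward(input_ids, kv_cache=None)
    next_id = int(logits[-1].argmax())          # first generated token; TTFT ends here

    output_ids = [next_id]
    # Decode: autoregressive loop, one token per forward pass, until a stop condition.
    for _ in range(max_new_tokens - 1):
        if next_id == eos_id:
            break
        logits, kv_cache = model.forward([next_id], kv_cache=kv_cache)
        next_id = int(logits[-1].argmax())      # the gap between iterations is one TPOT interval
        output_ids.append(next_id)

    return tokenizer.decode(output_ids)
```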
Performance is measured from the user's perspective using Service Level Objectives: TTFT (Time To First Token) captures Prefill latency, while TPOT (Time Per Output Token) measures the interval between successive generated tokens during the Decode stage. Throughput, expressed as TPS (Tokens Per Second), indicates the maximum token generation rate under full load.
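These metrics fall straight out of per-token timestamps collected on the client side of a streaming response. The sketch below is one way to derive them; the `stream` iterable and single-request framing are assumptions for illustration.

```python
import time

# Illustrative sketch: deriving TTFT, TPOT, and TPS from per-token timestamps
# recorded while consuming a streaming response (assumes at least one token arrives).

def measure(stream, request_start):
    token_times = []
    for _token in stream:                      # any iterable of streamed tokens
        token_times.append(time.perf_counter())

    ttft = token_times[0] - request_start      # Time To First Token (prefill latency)
    intervals = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(intervals) / len(intervals) if intervals else 0.0   # mean inter-token gap
    tps = len(token_times) / (token_times[-1] - request_start)     # overall token rate
    return ttft, tpot, tps
```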
Popular engines like vLLM excel at memory management and throughput, but they perform substantial CPU work for tokenization, detokenization, and scheduling on the critical path between forward passes, which lengthens TPOT and leaves the GPU idle between steps.
Baidu Baige AIAK Optimizations
Solution 1: Multi‑process Architecture – Tokenize/detokenize operations are extracted into separate Triton models and overlapped with GPU inference, reducing CPU overhead and yielding roughly a 10% TPOT improvement.
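The sketch below only illustrates the underlying idea of the multi-process split using Python processes connected by queues; AIAK realizes it with separate Triton models, and the worker functions, `tokenizer`, and `engine` objects here are hypothetical placeholders.

```python
import multiprocessing as mp

# Illustrative sketch of Solution 1: tokenize, GPU inference, and detokenize run
# in separate processes connected by queues, so CPU-side text processing overlaps
# with GPU forward passes instead of blocking them.

def tokenize_worker(text_q, ids_q, tokenizer):
    while (prompt := text_q.get()) is not None:
        ids_q.put(tokenizer.encode(prompt))     # CPU-bound, off the GPU critical path
    ids_q.put(None)

def inference_worker(ids_q, out_q, engine):
    while (ids := ids_q.get()) is not None:
        out_q.put(engine.generate(ids))         # GPU-bound forward passes
    out_q.put(None)

def detokenize_worker(out_q, result_q, tokenizer):
    while (ids := out_q.get()) is not None:
        result_q.put(tokenizer.decode(ids))     # CPU-bound, overlapped with inference
    result_q.put(None)

# Wiring (assuming picklable `tokenizer` / `engine` objects):
# text_q, ids_q, out_q, result_q = mp.Queue(), mp.Queue(), mp.Queue(), mp.Queue()
# for target, args in [(tokenize_worker, (text_q, ids_q, tokenizer)),
#                      (inference_worker, (ids_q, out_q, engine)),
#                      (detokenize_worker, (out_q, result_q, tokenizer))]:
#     mp.Process(target=target, args=args, daemon=True).start()
```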
Solution 2: Static Slot Scheme – The scheduling logic is restructured to use fixed slots for each batch, converting global scheduling into incremental local scheduling. Token concatenation and other CPU‑bound tasks are parallelized with CUDA kernels, eliminating host‑to‑device copies and significantly cutting token‑interval time.
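A minimal sketch of the fixed-slot idea, under the assumption that each running sequence owns a slot index for its whole lifetime: per-step scheduling then only touches slots that finished or were newly filled, rather than rebuilding the batch globally. The data structures below are a simplification, not AIAK's actual implementation.

```python
# Illustrative sketch of a fixed-slot batch for incremental local scheduling.
# Waiting requests are assumed to be dicts with at least an "id" field.

class SlotScheduler:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots          # slot index -> running sequence (or None)
        self.free = list(range(num_slots))       # slot indices available for new requests

    def step(self, waiting, finished_ids):
        # Local, incremental update: release slots whose sequences just finished...
        for i, seq in enumerate(self.slots):
            if seq is not None and seq["id"] in finished_ids:
                self.slots[i] = None
                self.free.append(i)
        # ...and admit new requests into the freed slots only.
        while self.free and waiting:
            self.slots[self.free.pop()] = waiting.pop(0)
        return self.slots                        # stable layout fed to the GPU batch
```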
Solution 3: Asynchronous Execution – CPU‑intensive token‑interval processing and GPU‑intensive forward inference run on separate pipelines. A producer‑consumer queue synchronizes the two threads, hiding the CPU work behind GPU execution so that GPU utilization approaches 100% and the exposed token interval approaches zero.
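The pattern can be sketched with two threads and a bounded queue: one thread prepares the inputs for the next step while the other runs the current forward pass. `prepare_step` and `forward` are hypothetical placeholders standing in for the CPU-side token-interval work and the GPU forward pass.

```python
import queue
import threading

# Illustrative sketch of Solution 3: a producer-consumer queue decouples CPU-side
# token-interval work from GPU forward passes, so the GPU rarely waits for the CPU.

def cpu_pipeline(prepare_step, step_q, num_steps):
    for step in range(num_steps):
        step_q.put(prepare_step(step))    # produce inputs for step N+1 while the GPU runs step N
    step_q.put(None)                      # sentinel: no more work

def gpu_pipeline(forward, step_q, results):
    while (inputs := step_q.get()) is not None:
        results.append(forward(inputs))   # consume inputs as soon as they are ready

# Wiring:
# step_q, results = queue.Queue(maxsize=2), []
# threading.Thread(target=cpu_pipeline, args=(prepare_step, step_q, 100), daemon=True).start()
# gpu_pipeline(forward, step_q, results)
```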
These optimizations collectively reduce token interval from 35 ms to 14 ms, raise GPU utilization from 50‑60% to about 75%, and move toward the ultimate goal of 100% utilization with zero token latency, while also supporting broader efforts in quantization, service‑oriented design, and multi‑chip scalability.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
