How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

Baidu Intelligent Cloud Tech Hub

Large model inference engines are the core of generative language model serving: they receive user prompts, schedule GPU forward passes, and stream the generated tokens back to the user as text.

The engine operates in two main phases: the Prefill stage, which processes the initial prompt and builds contextual memory, and the Decoder stage, an autoregressive loop that predicts subsequent tokens until a stop condition is met.
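The two phases above can be sketched as a single loop. This is a minimal illustration only; `model_forward` is a hypothetical stand-in for one GPU forward pass over a KV cache, not any real engine's API.

```python
# Minimal sketch of the Prefill / Decoder split in an inference engine.
def model_forward(tokens, kv_cache):
    # Placeholder: a real engine runs a transformer forward pass here and
    # returns the next-token id plus the updated attention (KV) cache.
    next_token = (sum(tokens) + len(kv_cache)) % 50000  # dummy logic
    return next_token, kv_cache + [next_token]

def generate(prompt_tokens, eos_token, max_new_tokens):
    # Prefill: process the whole prompt once, building the KV cache.
    next_token, kv_cache = model_forward(prompt_tokens, [])
    output = [next_token]
    # Decoder: autoregressive loop, one new token per forward pass,
    # until a stop token appears or the length budget is exhausted.
    while output[-1] != eos_token and len(output) < max_new_tokens:
        next_token, kv_cache = model_forward([output[-1]], kv_cache)
        output.append(next_token)
    return output
```

The key cost structure follows from this shape: Prefill is one large pass over the whole prompt, while every Decoder token pays a full loop iteration, which is why per-token (TPOT) overhead dominates long generations.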

Performance is measured from the user perspective using Service Level Objectives: TTFT (Time To First Token) evaluates Prefill latency, while TPOT (Time Per Output Token) measures the interval between generated tokens during the Decoder stage. Throughput, expressed as TPS (Tokens Per Second), indicates the maximum token generation rate under full load.
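Given per-token arrival timestamps, these three metrics fall out of simple arithmetic. The helper below is an illustrative sketch (names and the sample timestamps are invented, in seconds):

```python
# Compute TTFT, TPOT, and TPS from a request's per-token timestamps.
def compute_metrics(request_start, token_times):
    # TTFT: delay until the first token arrives (Prefill latency).
    ttft = token_times[0] - request_start
    # TPOT: mean interval between consecutive tokens (Decoder latency).
    intervals = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(intervals) / len(intervals)
    # TPS: overall token generation rate for this request.
    tps = len(token_times) / (token_times[-1] - request_start)
    return ttft, tpot, tps
```

For example, a request whose four tokens arrive at 0.2 s, 0.3 s, 0.4 s, and 0.5 s after submission has a TTFT of 200 ms, a TPOT of 100 ms, and a TPS of 8 tokens/s.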

Popular engines like vLLM excel in memory management and throughput but introduce substantial CPU work for tokenization and scheduling, extending TPOT and reducing GPU utilization.

Baidu Baige AIAK Optimizations

Solution 1: Multi‑process Architecture – By extracting tokenize/detokenize operations into separate Triton models and overlapping them with GPU inference, CPU overhead is reduced, yielding roughly a 10% TPOT improvement.
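AIAK extracts tokenize/detokenize into separate Triton model processes; the sketch below illustrates only the overlap pattern itself, using a worker thread and invented names rather than AIAK's actual components. While one token's detokenization runs on the worker, the main loop is free to start the next "GPU" step.

```python
from concurrent.futures import ThreadPoolExecutor

def detokenize(token_id):
    # Placeholder for a real, CPU-heavy detokenization step.
    return f"tok{token_id}"

def decode_overlapped(token_ids):
    texts = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for tid in token_ids:
            # The next forward pass would be launched here, in the main
            # loop, while the previous token's detokenization is still
            # running on the worker.
            if pending is not None:
                texts.append(pending.result())
            pending = pool.submit(detokenize, tid)
        texts.append(pending.result())  # drain the last pending result
    return texts
```

The effect is that detokenization cost is hidden behind GPU time instead of being added to every token interval, which is where the reported ~10% TPOT improvement comes from.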

Solution 2: Static Slot Scheme – The scheduling logic is restructured to use fixed slots for each batch, converting global scheduling into incremental local scheduling. Token concatenation and other CPU‑bound tasks are parallelized with CUDA kernels, eliminating host‑to‑device copies and significantly cutting token‑interval time.
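The static-slot idea can be sketched as a batch of fixed positions that requests claim and release individually, so each step only touches the slots that changed rather than rebuilding the whole batch. Class and method names below are illustrative, not AIAK's API.

```python
# A fixed-slot scheduler: each running request owns one slot index for
# its whole lifetime, turning global rescheduling into local updates.
class SlotScheduler:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # slot index -> request id (or None)

    def admit(self, request_id):
        # Incremental local scheduling: a new request takes any free slot;
        # all other slots (and their GPU-side state) are left untouched.
        for i, occupant in enumerate(self.slots):
            if occupant is None:
                self.slots[i] = request_id
                return i
        return -1  # batch full; request must wait

    def release(self, slot_index):
        # A finished request frees only its own slot.
        self.slots[slot_index] = None
```

Because slot assignments are stable across steps, per-step CPU bookkeeping shrinks to the handful of slots that finished or joined, which is what lets the scheme cut token-interval time.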

Solution 3: Asynchronous Execution – CPU‑intensive token interval processing and GPU‑intensive forward inference run on separate pipelines. A producer‑consumer queue synchronizes the two threads, achieving near‑100% GPU utilization and driving CPU‑side token‑interval overhead toward zero.
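The producer-consumer structure can be sketched with two threads and a bounded queue: one side does the CPU token-interval work, the other consumes prepared batches as fast as the "GPU" can take them. All names here are illustrative placeholders.

```python
import queue
import threading

def run_pipeline(num_steps):
    work_q = queue.Queue(maxsize=2)  # small buffer keeps both sides busy
    results = []

    def cpu_producer():
        # Stand-in for CPU-side token interval work (sampling, scheduling).
        for step in range(num_steps):
            work_q.put({"step": step})
        work_q.put(None)  # sentinel: no more work

    def gpu_consumer():
        # Stand-in for GPU forward passes; ideally never waits on the CPU.
        while True:
            batch = work_q.get()
            if batch is None:
                break
            results.append(batch["step"] * 2)  # dummy "inference" result

    t_cpu = threading.Thread(target=cpu_producer)
    t_gpu = threading.Thread(target=gpu_consumer)
    t_cpu.start(); t_gpu.start()
    t_cpu.join(); t_gpu.join()
    return results
```

As long as the producer stays at least one batch ahead, the consumer never idles between steps, which is the mechanism behind the near-100% GPU utilization claim.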

These optimizations collectively reduce token interval from 35 ms to 14 ms, raise GPU utilization from 50‑60% to about 75%, and move toward the ultimate goal of 100% utilization with zero token latency, while also supporting broader efforts in quantization, service‑oriented design, and multi‑chip scalability.

Tags: large-model inference, vLLM, GPU utilization, LLM performance, AIAK, token latency
Written by

Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
