
How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput

As LLM architectures such as GQA, MLA, and MoE grow more complex, runtime overhead has become a bottleneck. This article analyzes Python performance, communication costs, and synchronous execution in current inference frameworks, introduces the fully asynchronous TAG architecture, and presents benchmarks showing higher throughput and lower latency.

Alibaba Cloud Big Data AI Platform

Sep 16, 2024


As GQA, MLA, MoE and other model structures evolve, large language model (LLM) inference is moving toward high concurrency and high throughput, making runtime overhead a critical concern.

Runtime Overhead Sources

Python performance: Most frameworks are written in Python for usability, with models and operators implemented in C++. Python's Global Interpreter Lock (GIL) keeps threads from executing Python code in parallel, which becomes a bottleneck under high concurrency.

Communication overhead: To bypass the GIL, frameworks adopt multi‑process designs even on a single machine, and distributed inference adds multiple Prefill and Decode instances. Frequent inter‑process communication and message serialization and deserialization then limit performance.

Synchronous execution logic: Popular inference engines manage KV‑Cache blocks synchronously, allocating blocks, computing the model, and updating the cache in strict sequence. As concurrency grows, the accumulated overhead can rival the cost of the model computation itself.

Recent community efforts (e.g., vLLM, SGLang) address these costs with multi‑step scheduling, asynchronous output handling, and separate API server processes.

Typical LLM Inference Engine Components

API Server – receives requests and returns responses.

Scheduler – allocates KV‑Cache blocks and schedules requests.

Model Runner – performs model computation and sampling.

Decoder – converts sampled token IDs into text.
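The strictly sequential interaction between these components can be sketched as a toy loop. This is a minimal illustration, not code from any real engine: `Request`, `toy_forward`, and `run_sync_engine` are hypothetical names, and the "model" simply emits the current sequence length as the next token ID.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_ids: list
    max_new_tokens: int
    output_ids: list = field(default_factory=list)


def toy_forward(batch):
    # Stand-in for the Model Runner: "samples" one token ID per request.
    return [len(r.prompt_ids) + len(r.output_ids) for r in batch]


def run_sync_engine(requests):
    """Run a toy engine in which every stage waits for the previous one."""
    trace = []
    running = list(requests)
    while running:
        trace.append("schedule")  # Scheduler: pick the batch, allocate KV-Cache
        batch = list(running)
        trace.append("forward")   # Model Runner: compute and sample
        new_ids = toy_forward(batch)
        trace.append("update")    # Update state and check termination
        for req, tok in zip(batch, new_ids):
            req.output_ids.append(tok)
        running = [r for r in running if len(r.output_ids) < r.max_new_tokens]
    return trace


reqs = [Request([1, 2, 3], 2), Request([4, 5], 3)]
trace = run_sync_engine(reqs)
```

The trace always repeats schedule, forward, update in that order, so any time spent in the first and third stages is time during which the model is not computing.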

[Figure: timeline of a fully synchronous inference engine]

Community Optimizations

Enabling MultiStepScheduling and asynchronous output handling (as in vLLM 0.6.0) allows the Model Runner to perform several forward steps without synchronizing with the Scheduler after each step, increasing GPU utilization.

However, this approach still has drawbacks:

After a batch of N steps, the Model Runner must wait for the Scheduler to dispatch the next batch, so the Scheduler’s overhead is only amortized, not eliminated.

Tokens generated during the N steps are not returned to the user until the batch completes, increasing time‑to‑first‑token (TTFT) and overall latency.
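Both drawbacks can be put into rough numbers with a back-of-the-envelope model. The sketch below assumes the GPU sits idle while the Scheduler runs and that every forward step takes the same time; the function names and the 10 ms / 5 ms figures are illustrative assumptions, not measurements from the article.

```python
def gpu_busy_fraction(step_ms, sched_ms, n_steps):
    """Share of wall time spent computing when the Scheduler runs once
    per n_steps forward steps and the GPU idles while it runs."""
    compute = n_steps * step_ms
    return compute / (compute + sched_ms)


def first_token_hold_ms(step_ms, n_steps):
    """Extra wait for the first token of a multi-step batch, which is held
    back until the remaining n_steps - 1 steps have finished."""
    return (n_steps - 1) * step_ms


# Hypothetical costs: 10 ms per forward step, 5 ms per scheduling round.
busy_1 = gpu_busy_fraction(step_ms=10.0, sched_ms=5.0, n_steps=1)
busy_8 = gpu_busy_fraction(step_ms=10.0, sched_ms=5.0, n_steps=8)
hold_8 = first_token_hold_ms(step_ms=10.0, n_steps=8)
```

Under these assumed costs, going from 1 to 8 steps per round raises the busy fraction from about 67% to about 94%, but it never reaches 100%, and the first token of each round is held for an extra 70 ms.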

Fully Asynchronous TAG Engine

BladeLLM designed TAG (Totally Asynchronous Generator), a pure‑Python asynchronous LLM inference architecture that removes the synchronization point between Scheduler and Model Runner, enabling completely asynchronous execution.

Asynchronous Scheduler

The Scheduler only needs the length of each request’s token sequence, not the actual token IDs. The maximum number of tokens the Model Runner can generate per step is also known in advance: 1 for normal steps, and up to γ + 1 for speculative sampling, where γ is the number of draft tokens. By reserving enough KV‑Cache space for these upper bounds, the Scheduler can dispatch work without waiting for the Model Runner’s results and update its state asynchronously once the actual counts arrive.
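A minimal sketch of this bookkeeping follows. It is an illustration under assumptions, not BladeLLM's implementation: the class `LengthOnlyScheduler`, its methods, and the block-based accounting are hypothetical, and the example uses `max_tokens_per_step=3`, which corresponds to γ = 2.

```python
import math


class LengthOnlyScheduler:
    """Tracks only sequence lengths and reserves KV-Cache by upper bound."""

    def __init__(self, num_blocks, block_size, max_tokens_per_step=1):
        self.free_blocks = num_blocks
        self.block_size = block_size
        self.max_tokens_per_step = max_tokens_per_step  # 1, or gamma + 1
        self.confirmed_len = {}  # request id -> tokens confirmed by the runner
        self.in_flight = {}      # request id -> dispatched steps not yet reported
        self.blocks = {}         # request id -> blocks currently reserved

    def add(self, rid, prompt_len):
        self.confirmed_len[rid] = prompt_len
        self.in_flight[rid] = 0
        self.blocks[rid] = 0

    def try_dispatch(self, rid):
        """Reserve space for one more step without waiting for results."""
        pending = self.in_flight[rid] + 1
        upper = self.confirmed_len[rid] + pending * self.max_tokens_per_step
        extra = max(math.ceil(upper / self.block_size) - self.blocks[rid], 0)
        if extra > self.free_blocks:
            return False  # not enough KV-Cache for the worst case
        self.free_blocks -= extra
        self.blocks[rid] += extra
        self.in_flight[rid] += 1
        return True

    def on_result(self, rid, num_generated):
        """Asynchronous update: swap one step's bound for the actual count."""
        assert 1 <= num_generated <= self.max_tokens_per_step
        self.confirmed_len[rid] += num_generated
        self.in_flight[rid] -= 1


sched = LengthOnlyScheduler(num_blocks=4, block_size=4, max_tokens_per_step=3)
sched.add("a", prompt_len=5)
dispatched = [sched.try_dispatch("a") for _ in range(4)]
```

In this run the Scheduler dispatches three steps back to back before it runs out of blocks. Once the runner reports that a step produced fewer tokens than the bound, the slack becomes available and a further dispatch can succeed.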

TAG Architecture

All component interactions are asynchronous. The workflow is:

WebServer receives user requests, tokenizes them, and enqueues them to the Scheduler; simultaneously, it detokenizes responses and returns them to users.

Update Loop processes Model Runner messages, updates the Scheduler with the actual number of tokens generated, detokenizes outputs, checks for termination, and releases a semaphore for the next scheduling round.

The Scheduler and Update Loop run in parallel using Python coroutines, with a semaphore limiting the maximum scheduling steps.

Model Runner continuously consumes requests, runs the model, and returns results without idle time.
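The workflow above can be sketched with `asyncio`. This is a simplified, single-process illustration under assumptions: in TAG the Model Runner is a separate process, whereas here it is a coroutine; the queue names, `MAX_INFLIGHT_STEPS`, and the one-token-per-step result are all hypothetical.

```python
import asyncio

MAX_INFLIGHT_STEPS = 2  # hypothetical bound on steps scheduled ahead


async def scheduler(to_runner, sem, total_steps):
    for step in range(total_steps):
        await sem.acquire()        # blocks only if too many steps are in flight
        await to_runner.put(step)  # dispatch without waiting for results
    await to_runner.put(None)      # sentinel: no more work


async def model_runner(to_runner, to_update):
    while (step := await to_runner.get()) is not None:
        await asyncio.sleep(0)          # stand-in for a forward pass
        await to_update.put((step, 1))  # (step id, tokens actually generated)
    await to_update.put(None)


async def update_loop(to_update, sem, generated):
    while (msg := await to_update.get()) is not None:
        _step, num_tokens = msg
        generated.append(num_tokens)  # record the actual token count
        sem.release()                 # allow another scheduling round


async def main(total_steps=5):
    to_runner, to_update = asyncio.Queue(), asyncio.Queue()
    sem = asyncio.Semaphore(MAX_INFLIGHT_STEPS)
    generated = []
    await asyncio.gather(
        scheduler(to_runner, sem, total_steps),
        model_runner(to_runner, to_update),
        update_loop(to_update, sem, generated),
    )
    return generated


result = asyncio.run(main())
```

The semaphore is the only coupling between the Scheduler and the Update Loop: it caps how far scheduling can run ahead of confirmed results, while the runner's input queue stays non-empty as long as there is work.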

[Figure: TAG timeline, showing fully asynchronous message passing between components]

Cross‑Process Communication

The Scheduler and Model Runner are separated into CPU and GPU processes, requiring inter‑process communication. Options evaluated include RPC (e.g., gRPC), message queues, sockets, and shared memory. gRPC offers broad protocol support but incurs noticeable overhead for small messages; message queues degrade with larger payloads; shared memory provides the highest throughput for large data.

Consequently, the engine adopts Unix Domain Sockets combined with shared memory, achieving sub‑millisecond latency for 50–100 concurrent requests.
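One way to combine the two mechanisms is to place bulk data in shared memory and send only a small control message over the socket. The sketch below is an assumption about how such a split might look, not the engine's actual protocol: the `(offset, length)` header and the helper names are hypothetical, and both endpoints live in one process so the example stays self-contained.

```python
import socket
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<II")  # (offset, length) of the payload in shared memory


def send_payload(sock, shm, payload, offset=0):
    shm.buf[offset:offset + len(payload)] = payload  # bulk data via shared memory
    sock.sendall(HEADER.pack(offset, len(payload)))  # small control message


def recv_payload(sock, shm):
    header = sock.recv(HEADER.size, socket.MSG_WAITALL)
    offset, length = HEADER.unpack(header)
    return bytes(shm.buf[offset:offset + length])    # copy out of shared memory


# In a real engine each end would live in its own process and the receiver
# would attach to the segment by name; here both ends share one process.
sched_sock, runner_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    send_payload(sched_sock, shm, b"token ids: 1 2 3")
    received = recv_payload(runner_sock, shm)
finally:
    sched_sock.close()
    runner_sock.close()
    shm.close()
    shm.unlink()
```

The socket message stays a fixed 8 bytes regardless of payload size, which is the property that keeps the control path cheap when payloads grow.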

Performance Evaluation

Benchmarks compare TAG with vLLM 0.6.0 and SGLang 0.3.0 at 64, 256, and 512 concurrent requests, measuring QPS and time between tokens (TBT). The test uses a fixed input length of 10 tokens and output lengths sampled from a normal distribution over the 20‑500 token range, with ignore_eos=True.
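A workload of this shape could be generated as follows. The article does not state the mean or standard deviation of the distribution, so the values below (midpoint of the range, one sixth of its width) are assumptions, as are the function names and the request dictionary layout; out-of-range samples are redrawn.

```python
import random


def sample_output_lengths(n, low=20, high=500, seed=0):
    """Draw n output lengths from a normal distribution, redrawing any
    sample that falls outside [low, high]."""
    rng = random.Random(seed)
    mean = (low + high) / 2  # assumption: centered on the range
    std = (high - low) / 6   # assumption: range spans about six sigma
    lengths = []
    while len(lengths) < n:
        x = round(rng.gauss(mean, std))
        if low <= x <= high:
            lengths.append(x)
    return lengths


def make_requests(concurrency, input_len=10):
    """Build request specs with a fixed prompt length and sampled outputs."""
    return [
        {"prompt_len": input_len, "max_tokens": m, "ignore_eos": True}
        for m in sample_output_lengths(concurrency)
    ]


workload = make_requests(64)
```

Setting `ignore_eos` forces every request to run to its sampled length, so the total number of generated tokens is fixed by the workload and comparable across engines.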

Results show that TAG completely masks worker‑process overhead, delivering lower latency, higher throughput, and supporting higher concurrency than the community solutions.

Conclusion and Outlook

We have built TAG, a fully asynchronous LLM inference engine that eliminates synchronous dependencies among components, maximizes GPU utilization, and improves service throughput while reducing request latency. Internal tests demonstrate support for over a thousand concurrent decode instances without runtime overhead becoming a bottleneck. Future work will focus on further reducing Model Runner costs to push performance higher.
