Boosting LLM Inference: How NanoFlow Doubles Throughput

This article introduces NanoFlow, a novel serving framework that leverages intra‑device parallelism, operation‑based pipelining, and async scheduling to significantly improve large language model serving throughput, achieving up to 1.91× higher performance than state‑of‑the‑art systems, and discusses its relevance to Alibaba Cloud PAI's BladeLLM.

Alibaba Cloud Big Data AI Platform

Overview

Recent rapid advances in AI have driven widespread adoption of large language models (LLMs), creating urgent demand for efficient serving. The paper “NanoFlow: Towards Optimal Large Language Model Serving Throughput” proposes a novel framework that significantly improves inference throughput.

The Alibaba Cloud PAI team’s BladeLLM aims to deliver high‑performance, stable, enterprise‑grade LLM inference. NanoFlow’s optimization strategies align with this research direction and offer insights into more efficient model serving.

Key Ideas

Traditional CPU execution can waste cycles when a single execution stream blocks on I/O. Techniques such as hyper‑threading, out‑of‑order execution, and multiple pipelines keep CPUs busy. GPUs face similar under‑utilization; NanoFlow addresses this at the software level.

Prior approaches used data, tensor, and pipeline parallelism across devices but failed to fully exploit intra‑device resources. NanoFlow introduces a new serving framework that exploits internal parallelism via “NanoBatch”, breaking the sequential dependencies in inference and overlapping the usage of different resources. Its main innovations are operation‑based pipelining and a scheduling scheme that partitions functional units for concurrent execution. Evaluations show up to 1.91× higher throughput than state‑of‑the‑art systems, reaching 59%–72% of optimal throughput with good portability across models.
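To make the idea concrete, here is a minimal, hypothetical sketch (not NanoFlow's actual implementation) of how one batch can be split into nano‑batches and software‑pipelined, so that while one nano‑batch runs a compute‑bound operation, the next can already occupy a different resource. The function names and the three example operations are illustrative assumptions.

```python
# Hypothetical sketch: split one inference batch into nano-batches so that
# operations bound by different resources (compute, memory bandwidth, network)
# can be interleaved instead of running strictly one after another.

def split_into_nanobatches(requests, nano_size):
    """Partition a batch of requests into fixed-size nano-batches."""
    return [requests[i:i + nano_size] for i in range(0, len(requests), nano_size)]

def interleaved_schedule(nanobatches, ops):
    """Build a software-pipelined schedule: at each step, nano-batch i runs
    op (step - i), so different nano-batches occupy different resources."""
    schedule = []
    for step in range(len(nanobatches) + len(ops) - 1):
        stage = [(i, ops[step - i]) for i in range(len(nanobatches))
                 if 0 <= step - i < len(ops)]
        schedule.append(stage)
    return schedule

batches = split_into_nanobatches(list(range(8)), nano_size=4)   # 2 nano-batches
steps = interleaved_schedule(batches, ["GEMM", "attention", "all-reduce"])
# In the middle steps, two nano-batches run different operations concurrently.
```

In a real system the "steps" would be launched on separate GPU execution streams; the sketch only illustrates the dependency-breaking that NanoBatch enables.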

GPU Implementation

Similar to CPU hyper‑threading, NanoFlow schedules multiple independent execution streams on a GPU, allowing operations without data dependencies to run concurrently and maximize resource overlap. However, indiscriminate scheduling can cause contention, so careful balancing is required.

For a given model, NanoFlow determines the NanoBatch size for each operation and its resource allocation using offline profiling combined with a greedy search.
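A simplified, hypothetical sketch of this profiling‑plus‑greedy‑search step follows. The profile numbers are made up for illustration, and the search here greedily picks each operation's best size independently, a deliberate simplification of a joint search over sizes and resource allocations.

```python
# Hypothetical sketch of offline profiling + greedy search: given measured
# per-operation latencies for candidate nano-batch sizes (numbers invented),
# greedily pick the lowest-latency size per op and report the bottleneck.

def greedy_nanobatch_sizes(profiles):
    """profiles: {op: {nano_batch_size: measured_latency_ms}}.
    Returns the chosen size per op and the resulting bottleneck latency."""
    choice = {op: min(table, key=table.get) for op, table in profiles.items()}
    bottleneck = max(profiles[op][size] for op, size in choice.items())
    return choice, bottleneck

profiles = {
    "GEMM":       {256: 1.8, 512: 1.2, 1024: 1.5},
    "attention":  {256: 0.9, 512: 1.1, 1024: 1.6},
    "all-reduce": {256: 0.7, 512: 0.8, 1024: 1.3},
}
choice, bottleneck = greedy_nanobatch_sizes(profiles)
# choice selects 512 for GEMM and 256 for the other ops; bottleneck is 1.2 ms.
```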

The figure illustrates tensor‑parallel group partitioning and the ideal execution flow achieving optimal resource overlap.

CPU Implementation

Even for CPU tasks, NanoFlow strives to keep the GPU busy. It employs an async scheduler that assembles the next batch and allocates KV‑cache space on the CPU while the current iteration runs on the GPU. After the iteration finishes, the prepared batch is immediately dispatched.
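The overlap described above can be sketched as follows. This is an illustrative toy, not NanoFlow's scheduler: plain Python threads stand in for the GPU iteration and the CPU-side batch preparation, and the function names are assumptions.

```python
# Hypothetical sketch of the async scheduler: while the "GPU" executes the
# current iteration, a CPU thread prepares the next batch (batching plus
# KV-cache allocation), so the prepared batch can be dispatched immediately.
import threading

def run_iteration(batch, results):
    results.append(f"ran:{batch}")        # stands in for one GPU iteration

def prepare_next_batch(queue, prepared):
    prepared.append(queue.pop(0))         # stands in for batching + KV alloc

queue, results = ["batch1", "batch2"], []
current = queue.pop(0)
while True:
    prepared = []
    # Overlap: next-batch preparation runs while the current batch executes.
    prep = threading.Thread(target=prepare_next_batch,
                            args=(queue, prepared)) if queue else None
    if prep:
        prep.start()
    run_iteration(current, results)
    if prep:
        prep.join()
    if not prepared:
        break
    current = prepared[0]                 # dispatch immediately, no CPU gap
```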

It also supports async KV‑cache offload: the KV caches of completed requests are saved to SSD, with an LRU policy deciding which caches to evict first. Offload and reload operations are overlapped with GPU execution, and cache pages are aggregated into contiguous memory before transfer.
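A minimal sketch of the eviction side of this mechanism, under the stated assumptions (LRU ordering, page aggregation before transfer); the class and its capacity accounting are invented for illustration, and a dict stands in for the SSD.

```python
# Hypothetical sketch of LRU-based KV-cache offload: when resident pages
# exceed capacity, the least-recently-used request's cache pages are
# aggregated into one contiguous buffer and moved to SSD-backed storage.
from collections import OrderedDict

class KVCacheOffloader:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = OrderedDict()   # request_id -> list of cache pages
        self.ssd = {}                   # stands in for SSD-backed storage

    def touch(self, req_id, pages):
        self.resident[req_id] = pages
        self.resident.move_to_end(req_id)            # mark most recently used
        while sum(len(p) for p in self.resident.values()) > self.capacity:
            victim, vpages = self.resident.popitem(last=False)  # LRU entry
            self.ssd[victim] = b"".join(vpages)      # aggregate contiguously

cache = KVCacheOffloader(capacity_pages=4)
cache.touch("req-A", [b"\x01", b"\x02"])
cache.touch("req-B", [b"\x03", b"\x04"])
cache.touch("req-C", [b"\x05"])   # exceeds capacity -> req-A offloaded to SSD
```

In the real system the SSD transfer would run asynchronously, overlapped with GPU execution rather than inline as here.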

Integration with PAI

Combined with Alibaba Cloud’s PAI platform, NanoFlow complements BladeLLM’s purely asynchronous inference architecture TAG (Totally Asynchronous Generator), opening up additional room for asynchronous execution. Future work will reproduce and evaluate NanoFlow together with TAG to explore optimization opportunities in fully asynchronous systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: GPU Scheduling, LLM Serving, Alibaba Cloud PAI, NanoFlow, Throughput Optimization
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
