How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

Background: Rapid Evolution of Inference Frameworks

Large language models such as Doubao, Kimi and DeepSeek have advanced dramatically, driving demand for efficient, stable and cost-effective inference services. As these workloads scale, memory bottlenecks, high-concurrency latency and cross-node coordination have become critical challenges.

Core Principles of LLM Inference Frameworks

Open-source frameworks such as vLLM, SGLang, TensorRT-LLM and TGI pursue low latency, high throughput and low cost through system-level optimizations. Their common foundation is the KV cache, which stores the attention keys and values of already-processed tokens so they are not recomputed at every decoding step.
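As a toy illustration of the mechanism (plain NumPy, not any framework's actual kernels; names such as kv_cache and decode_step are invented for the example), the sketch below appends each new token's keys and values to a cache and attends over the accumulated history, so earlier projections are never recomputed:

```python
import numpy as np

def attend(q, K, V):
    # Single-head scaled dot-product attention for one query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
kv_cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}

def decode_step(new_k, new_v, q):
    # With a KV cache, each step only appends the new token's K/V and
    # attends over the cached history; nothing from earlier steps is redone.
    kv_cache["K"] = np.vstack([kv_cache["K"], new_k])
    kv_cache["V"] = np.vstack([kv_cache["V"], new_v])
    return attend(q, kv_cache["K"], kv_cache["V"])

# Without the cache, every step would recompute K/V for all previous tokens,
# so per-step cost would grow with context length instead of staying flat.
for _ in range(4):
    out = decode_step(np.random.randn(d), np.random.randn(d), np.random.randn(d))
```

The flip side is that the cache itself grows with context length and batch size, which is exactly the memory pressure addressed later in this article.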

vLLM: PagedAttention (a toy block-table sketch follows this list), continuous batching, high GPU utilization, OpenAI API compatibility and broad model support.

SGLang: a high-performance runtime with advanced APIs, strong distributed deployment, built-in JSON parsing and excellent high-concurrency performance.

TensorRT-LLM: built on NVIDIA TensorRT; supports INT8/FP8 low-precision inference, In-Flight Batching and Paged KV Caching for superior GPU performance.
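To make the paged KV cache idea concrete, here is a minimal block-table sketch (illustrative Python with an invented PagedKVCache class, not vLLM's or TensorRT-LLM's real data structures): a request's logical token positions map to fixed-size physical blocks, so memory is allocated on demand rather than reserved up front for the maximum context length.

```python
BLOCK_SIZE = 16  # tokens per physical cache block (illustrative value)

class PagedKVCache:
    """Toy block-table allocator in the spirit of paged KV caching."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of cached tokens

    def append_token(self, request_id):
        """Reserve space for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full (or none exists yet)
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        # Physical slot where this token's K/V would be written.
        return table[-1] * BLOCK_SIZE + length % BLOCK_SIZE

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_blocks=4)
slots = [cache.append_token("req-1") for _ in range(20)]  # spans two physical blocks
cache.release("req-1")
```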

Inference Framework Integration

EIC currently integrates with SGLang and vLLM; the following focuses on SGLang.

SGLang Architecture

SGLang is designed for large language and vision-language models; its architecture separates a serving interface, a scheduler and an executor.

Serving interface: the API server, tokenizer and detokenizer handle request reception and lightweight encoding/decoding.

Scheduler: the Scheduler class implements the group-batch policy together with the disaggregation, overlap, Radix Cache and Prefill Adder modules.

Executor: TPWorker is the core entry point; it drives tensor-parallel and pipeline-parallel execution and manages the Model Runner, the memory pool and the NCCL communication group.

Figure: SGLang request flow (serving interface → scheduler → executor).

Inference Scheduling

Two levels of scheduling are discussed: single‑instance task scheduling (Zero‑Overhead Batch, Continuous Batching) and cluster‑wide global scheduling (Dynamo).

Zero-Overhead Batch: overlaps CPU-side scheduling with GPU execution, achieving up to 30% higher throughput.

Continuous Batching: admits new requests into the running batch as soon as any request finishes, reducing GPU idle time (see the sketch below).
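A minimal sketch of that scheduling loop (a hypothetical continuous_batching helper, not SGLang's code): requests leave the running batch the moment they finish, and waiting requests are admitted into the freed slots before the next decode step.

```python
from collections import deque

def continuous_batching(waiting, max_batch_size, run_step):
    """Toy scheduler loop: refill the running batch at every decoding step."""
    running = []
    while waiting or running:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to drain as static batching would.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        finished, running = run_step(running)  # one decode step on the GPU
        yield finished

# Stub run_step: each "request" is just a remaining-token count.
def fake_step(batch):
    batch = [r - 1 for r in batch]
    return [r for r in batch if r == 0], [r for r in batch if r > 0]

for done in continuous_batching(deque([3, 1, 5, 2]), max_batch_size=2, run_step=fake_step):
    pass
```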

Cluster‑wide scheduling must consider PD (Prefill‑Decode) separation and KVCache awareness to balance load across instances.

PD Separation

Prefill is compute-intensive while decode is memory-bandwidth-intensive. SGLang places them in separate server pools, improving resource allocation and overall inference efficiency.
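As a sketch of what this separation implies for routing (hypothetical pools and a route_request helper, not SGLang's scheduler), note that the two pools are naturally balanced on different metrics: prefill by the prompt tokens it must process, decode by the KVCache that stays resident while tokens stream out.

```python
# Separate instance pools for the two phases; the names and sizes are illustrative.
prefill_pool = {"prefill-0": 0, "prefill-1": 0}   # instance -> queued prompt tokens
decode_pool = {"decode-0": 0, "decode-1": 0}      # instance -> resident KVCache tokens

def route_request(prompt_tokens, expected_output_tokens):
    """Pick the least-loaded instance in each pool for the two phases."""
    # Prefill is compute-bound: balance on prompt tokens to be processed.
    p = min(prefill_pool, key=prefill_pool.get)
    prefill_pool[p] += prompt_tokens
    # Decode is memory-bound: balance on the KVCache that will stay resident.
    d = min(decode_pool, key=decode_pool.get)
    decode_pool[d] += prompt_tokens + expected_output_tokens
    return p, d

print(route_request(prompt_tokens=2048, expected_output_tokens=512))
```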

KVCache Design

SGLang uses RadixAttention, a radix tree over token sequences, for prefix caching; vLLM employs PagedAttention with hash-identified prefix blocks.
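The essence of prefix caching is longest-prefix matching over token sequences. The sketch below uses a plain token trie rather than a compressed radix tree, and illustrates only the idea behind RadixAttention, not its implementation:

```python
class PrefixCacheNode:
    def __init__(self):
        self.children = {}    # token id -> child node
        self.kv_block = None  # handle to the cached K/V for the prefix ending here

def insert(root, tokens, kv_blocks):
    """Record that K/V for every prefix of `tokens` is cached."""
    node = root
    for tok, block in zip(tokens, kv_blocks):
        node = node.children.setdefault(tok, PrefixCacheNode())
        node.kv_block = block

def longest_cached_prefix(root, tokens):
    """Return how many leading tokens already have cached K/V, plus the blocks."""
    node, blocks = root, []
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        blocks.append(node.kv_block)
    return len(blocks), blocks

root = PrefixCacheNode()
insert(root, [1, 2, 3, 4], ["b0", "b1", "b2", "b3"])
hit_len, reused = longest_cached_prefix(root, [1, 2, 3, 9, 9])
# hit_len == 3: only the tokens after the shared prefix need a prefill pass.
```

A radix tree compresses chains of single-child nodes, but the contract is the same: return the cached blocks for the longest matching prefix and prefill only the remainder.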

External KVCache integration differs: SGLang’s HiCache provides hierarchical offload on a single node, and EIC extends it for multi‑instance sharing; vLLM uses a KV Transfer connector for modular external KVCache access.

Swap‑out (GPU → remote) is asynchronous; swap‑in (remote → GPU) is synchronous and must complete before computation.
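A small asyncio sketch of that asymmetry (helper names such as copy_to_remote are invented; this is not either engine's actual API): swap-out is fired off in the background while other requests keep computing, whereas swap-in is awaited before the forward pass can start.

```python
import asyncio

async def copy_to_remote(block_id):
    await asyncio.sleep(0.01)  # stands in for a DMA/RDMA transfer

async def copy_from_remote(block_id):
    await asyncio.sleep(0.01)

async def swap_out(block_ids):
    # Asynchronous: schedule the copies and return immediately; computation
    # on other requests continues while the transfers drain in the background.
    return [asyncio.create_task(copy_to_remote(b)) for b in block_ids]

async def swap_in(block_ids):
    # Synchronous from the request's point of view: the forward pass cannot
    # start until every block is back in GPU memory.
    await asyncio.gather(*(copy_from_remote(b) for b in block_ids))

async def main():
    pending = await swap_out(["blk-0", "blk-1"])  # fire and forget
    await swap_in(["blk-2", "blk-3"])             # must finish before compute
    await asyncio.gather(*pending)                # drained eventually

asyncio.run(main())
```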

Model Loading

Models are loaded from safetensors files or from sharded checkpoints. EIC's sharded format stores each rank's state_dict directly in its KV interface, enabling large-block sequential I/O. Cache keys follow the pattern:

{model_path}-pp{pp}-rank{pprank}-tp{tp}-rank{rank}
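A helper along these lines (illustrative only; the model path and parallelism values in the example are made up) builds the per-rank key, so each rank reads exactly its own shard as one large sequential object:

```python
def shard_key(model_path, pp, pp_rank, tp, tp_rank):
    """Build the cache key for one pipeline-/tensor-parallel shard,
    following the key pattern shown above."""
    return f"{model_path}-pp{pp}-rank{pp_rank}-tp{tp}-rank{tp_rank}"

# Example (hypothetical values): every rank fetches only its own shard.
key = shard_key("deepseek-r1", pp=4, pp_rank=1, tp=8, tp_rank=3)
# -> "deepseek-r1-pp4-rank1-tp8-rank3"
```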

Three Major Bottlenecks in LLM Inference

KVCache management, elastic scheduling and model loading respectively limit concurrency, elasticity and startup speed.

KVCache Challenges

GPU memory consumption grows with context length and batch size, limiting concurrent requests.

The KVCache lifecycle is per-request, so identical prompts (and shared prefixes) are recomputed rather than reused.

Swap‑out/in incurs high latency, especially when moving data back to GPU.

Scheduling and PD Separation

Multi‑instance deployment and PD separation require KVCache‑aware load balancing; Dynamo provides such capabilities.

EIC Distributed KVCache Solution

EIC transforms KVCache from a private GPU resource into a shared, cluster‑wide cache, enabling cross‑request and cross‑node reuse, reducing memory pressure and latency.

Breaks single-GPU memory limits by offloading KVCache over GDR (GPUDirect RDMA) with ultra-low latency.

Supports prefix-hash based cache lookup, so repeated prompts are computed once and shared (sketched after this list).

Traditional mode: each request computes and stores its own KVCache, wasting resources on redundant computation. EIC mode: compute once, store centrally, and let every request read as needed, i.e. "store instead of compute".
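A sketch of the prefix-hash lookup (the get/put client calls, the DictCache stand-in and the block size are assumptions, not EIC's actual interface): prompt prefixes are chain-hashed block by block, and any block already published to the shared cache is read instead of recomputed.

```python
import hashlib

BLOCK_TOKENS = 64  # tokens hashed per cache block (illustrative)

class DictCache:
    """In-memory stand-in for the remote shared cache."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def put(self, key, value):
        self._d[key] = value

def prefix_block_keys(token_ids):
    """Chain-hash fixed-size prefix blocks so equal prefixes yield equal keys."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        running.update(str(token_ids[start:start + BLOCK_TOKENS]).encode())
        keys.append(running.copy().hexdigest())
    return keys

def fetch_or_compute(token_ids, cache, compute_block):
    """Reuse shared KV blocks where possible; compute and publish the rest."""
    kv_blocks = []
    for i, key in enumerate(prefix_block_keys(token_ids)):
        block = cache.get(key)
        if block is None:                      # miss: prefill just this block...
            block = compute_block(token_ids, i)
            cache.put(key, block)              # ...and publish it for other instances
        kv_blocks.append(block)
    return kv_blocks

shared = DictCache()
fetch_or_compute(list(range(200)), shared, lambda toks, i: f"kv-{i}")  # cold: computes 3 blocks
fetch_or_compute(list(range(200)), shared, lambda toks, i: f"kv-{i}")  # warm: every block reused
```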

Performance, Cost and Ecosystem Benefits

Performance: low-latency RDMA, an optimized network model, multi-NIC topology awareness, and write-cache overhead kept below 5%.

Cost: engine-level compression and MLA optimization reduce the KVCache footprint and improve the cache hit rate.

Ecosystem: out-of-the-box support for vLLM, SGLang, Dynamo, LMCache, AIBrix and PD-separation adapters.

Model Cache Layer – Seconds‑Level Loading

EIC acts as a high‑speed model cache, turning “pull” from remote storage into “read” from local distributed memory, cutting model load time for a 640 GB DeepSeek‑R1 model to roughly 13 seconds.
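(At 640 GB in about 13 seconds, that corresponds to an aggregate read bandwidth on the order of 640 / 13 ≈ 49 GB/s.)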

Case Studies

Case 1: Accelerating SGLang AI Coding – KVCache read throughput reaches 50 GB/s, reducing total runtime from 4232 s to 2065 s (roughly a 2× speed-up).

Case 2: Enhancing Service Elasticity – Rolling restart of 100 DeepSeek‑R1 nodes shows linear speed‑up with EIC bandwidth: 200 GB/s (≈1 h), 400 GB/s (≈30 min), 800 GB/s (≈15 min), achieving over three‑fold improvement compared to NVMe‑based loading.

Future Outlook

EIC will expand to more AI scenarios, integrate advanced inference techniques, bridge training and inference, improve QoS, and strengthen disaster recovery and security, solidifying its role as a core AI infrastructure component.

Tags: vLLM, LLM inference, AI infrastructure, distributed caching, SGLang, KVCache
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
