How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance
This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.
Background: Rapid Evolution of Inference Frameworks
Large language models such as Doubao, Kimi, and DeepSeek have advanced dramatically, driving demand for efficient, stable, and cost-effective inference services. Memory bottlenecks, high-concurrency latency, and cross-node coordination have become critical challenges.
Core Principles of LLM Inference Frameworks
Open‑source frameworks like vLLM, SGLang, TensorRT‑LLM and TGI aim to achieve low latency, high throughput and low cost through system‑level optimizations. KV‑Cache is the key mechanism that stores attention keys and values to avoid recomputation.
vLLM: uses PagedAttention and continuous batching, delivering high GPU utilization, OpenAI API compatibility, and broad model support.
SGLang: high-performance runtime with advanced APIs, strong distributed deployment, built-in JSON parsing, and excellent high-concurrency performance.
TensorRT-LLM: built on NVIDIA TensorRT; supports INT8/FP8 low precision, In-Flight Batching, and Paged KV Caching for superior GPU performance.
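To make the KV-Cache mechanism concrete, here is a minimal single-head attention decode loop in NumPy. It is an illustrative sketch only (shapes and names are invented, not any framework's API): each decode step appends the new key/value pair to the cache and attends over the accumulated history, so keys and values for earlier tokens are never recomputed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(q, new_k, new_v, k_cache, v_cache):
    """Append this step's K/V to the cache, then attend over the full history.

    Without the cache, K/V for every previous token would have to be
    recomputed on each decode step.
    """
    k_cache = np.concatenate([k_cache, new_k[None, :]], axis=0)
    v_cache = np.concatenate([v_cache, new_v[None, :]], axis=0)
    scores = softmax(q @ k_cache.T / np.sqrt(q.shape[-1]))
    return scores @ v_cache, k_cache, v_cache

d = 8
k_cache, v_cache = np.zeros((0, d)), np.zeros((0, d))
for _ in range(4):  # four decode steps; the cache grows by one row each step
    q, k, v = (np.random.randn(d) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
assert k_cache.shape == (4, d)
```

Real frameworks batch this across layers, heads, and requests, which is exactly why KV-Cache memory becomes the dominant cost at long context lengths.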
Inference Framework Integration
EIC currently integrates with SGLang and vLLM; the following focuses on SGLang.
SGLang Architecture
SGLang is designed for large language and vision‑language models, separating a serving interface, scheduler and executor.
Serving Interface: API server, tokenizer, and detokenizer handle request reception and simple encoding/decoding.
Scheduler: Scheduler class with group-batch policy, disaggregation, overlap, Radix Cache, and Prefill Adder modules.
Executor: TPWorker is the core entry point, handling tensor-parallel and pipeline-parallel execution and managing the Model Runner, memory pool, and NCCL communication group.
(Figure: SGLang request flow diagram, not reproduced here.)
Inference Scheduling
Two levels of scheduling are discussed: single‑instance task scheduling (Zero‑Overhead Batch, Continuous Batching) and cluster‑wide global scheduling (Dynamo).
Zero-Overhead Batch: overlaps CPU scheduling with GPU execution, achieving up to 30% higher throughput.
Continuous Batching: dynamically admits new requests into a batch as soon as any request finishes, reducing GPU idle time.
Cluster‑wide scheduling must consider PD (Prefill‑Decode) separation and KVCache awareness to balance load across instances.
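The continuous batching idea described above can be sketched in a few lines. This is a toy simulation with invented names, not SGLang's scheduler: finished requests leave the batch and queued ones join immediately, instead of the whole batch draining before new work starts.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Simulate decoding; requests are (request_id, tokens_to_generate)."""
    queue = deque(requests)
    running, completed = [], []
    while queue or running:
        # Admit new requests as soon as slots free up.
        while queue and len(running) < max_batch:
            running.append(list(queue.popleft()))
        # One "forward pass": every running request decodes one token.
        for req in running:
            req[1] -= 1
        # Retire finished requests without stalling the others.
        completed += [r[0] for r in running if r[1] <= 0]
        running = [r for r in running if r[1] > 0]
    return completed  # ids in completion order

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
# → ['c', 'a', 'd', 'e', 'b']: "e" entered the batch the moment "c" finished
```

With static batching, "e" would have waited for all of a-d to finish; here the GPU slot is reused immediately, which is where the idle-time reduction comes from.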
PD Separation
Prefill is compute‑intensive while Decode is memory‑intensive. SGLang separates them into different server pools, improving resource allocation and overall inference efficiency.
KVCache Design
SGLang uses RadixAttention with a radix tree for prefix caching; vLLM employs PagedAttention with hashed prefix blocks.
External KVCache integration differs: SGLang’s HiCache provides hierarchical offload on a single node, and EIC extends it for multi‑instance sharing; vLLM uses a KV Transfer connector for modular external KVCache access.
Swap‑out (GPU → remote) is asynchronous; swap‑in (remote → GPU) is synchronous and must complete before computation.
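This asymmetry can be sketched with threads (the function names are hypothetical, not EIC's API): swap-out is fire-and-forget because compute no longer needs the evicted block, while swap-in must block because the attention kernel cannot start without the KV data resident.

```python
import threading

remote_store = {}  # stands in for the remote KVCache tier

def swap_out_async(key, kv_block):
    """GPU -> remote: runs in the background; compute continues immediately."""
    def _put():
        remote_store[key] = kv_block
    t = threading.Thread(target=_put)
    t.start()
    return t  # caller may join lazily, off the critical path

def swap_in_sync(key):
    """Remote -> GPU: must finish before attention can use this block."""
    return remote_store[key]

t = swap_out_async("req-1/layer-0", b"kv-bytes")
t.join()  # joined here only so the read below is deterministic
assert swap_in_sync("req-1/layer-0") == b"kv-bytes"
```

This is why swap-in latency dominates: it sits on the critical path of every cache-hit request, which motivates the low-latency RDMA transport discussed later.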
Model Loading
Models are loaded from safetensors or sharded checkpoints. EIC’s sharded format stores the state_dict directly in its KV interface, enabling large‑block sequential I/O.
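A trivial helper shows how such shard keys might be assembled from the naming pattern {model_path}-pp{pp}-rank{pprank}-tp{tp}-rank{rank}; the field semantics (pipeline/tensor parallel degrees and ranks) are my assumption, and the helper itself is illustrative, not EIC's API.

```python
def shard_key(model_path: str, pp: int, pp_rank: int, tp: int, tp_rank: int) -> str:
    """Build the per-shard cache key: one entry per (pp_rank, tp_rank) pair."""
    return f"{model_path}-pp{pp}-rank{pp_rank}-tp{tp}-rank{tp_rank}"

# e.g. shard for pipeline stage 0 of 2, tensor rank 3 of 8 (illustrative values)
print(shard_key("deepseek-r1", 2, 0, 8, 3))
# → deepseek-r1-pp2-rank0-tp8-rank3
```

Keying each rank's slice of the state_dict separately is what lets every worker fetch exactly its own shard as one large sequential read.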
Shard keys follow the naming pattern {model_path}-pp{pp}-rank{pprank}-tp{tp}-rank{rank}.
Three Major Bottlenecks in LLM Inference
KVCache management, elastic scheduling, and model loading limit concurrency, elasticity, and startup speed.
KVCache Challenges
GPU memory consumption grows with context length and batch size, limiting concurrent requests.
KVCache lifecycle is per‑request, causing redundant computation for identical prompts.
Swap‑out/in incurs high latency, especially when moving data back to GPU.
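A back-of-envelope formula makes the first point concrete: per request, the KV cache holds 2 (K and V) × layers × kv_heads × head_dim × seq_len elements. The model dimensions below are illustrative round numbers, not any specific model's configuration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size for one request: K and V, all layers, full context."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# 32 layers, 8 KV heads, head_dim 128, 32K context, FP16 (2 bytes/element)
gib = kv_cache_bytes(32, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB per request")
# → 4.0 GiB per request
```

At a few GiB per long-context request, a single GPU's memory caps concurrency at a handful of requests, which is the pressure EIC's offload is designed to relieve.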
Scheduling and PD Separation
Multi‑instance deployment and PD separation require KVCache‑aware load balancing; Dynamo provides such capabilities.
EIC Distributed KVCache Solution
EIC transforms KVCache from a private GPU resource into a shared, cluster‑wide cache, enabling cross‑request and cross‑node reuse, reducing memory pressure and latency.
Breaks single-GPU memory limits by offloading KVCache via GDR (GPU Direct RDMA) with ultra-low latency.
Supports prefix‑hash based cache lookup so repeated prompts are computed once and shared.
Traditional mode: each request computes and stores its own KVCache → resource waste and redundant computation.
EIC mode: compute once, store centrally, all requests read as needed → "store-instead-compute".
Performance, Cost and Ecosystem Benefits
Performance: low-latency RDMA, an optimized network model, multi-NIC topology, and write-cache overhead kept below 5%.
Cost: engine-level compression and MLA optimization reduce the KVCache footprint and improve the cache hit rate.
Ecosystem: out-of-the-box support for vLLM, SGLang, Dynamo, LMCache, AIBrix, and PD-separation adapters.
Model Cache Layer – Seconds‑Level Loading
EIC acts as a high‑speed model cache, turning “pull” from remote storage into “read” from local distributed memory, cutting model load time for a 640 GB DeepSeek‑R1 model to roughly 13 seconds.
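As a sanity check on the figures above, loading 640 GB in roughly 13 seconds implies an aggregate read bandwidth of about 49 GB/s across the instance's workers (a derived estimate, not a number stated in the source).

```python
# Implied aggregate read bandwidth from the quoted model size and load time.
model_gb, load_s = 640, 13
print(f"~{model_gb / load_s:.0f} GB/s aggregate read bandwidth")
# → ~49 GB/s aggregate read bandwidth
```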
Case Studies
Case 1: Accelerating SGLang AI Coding – KVCache read throughput reaches 50 GB/s, reducing total runtime from 4232 s to 2065 s.
Case 2: Enhancing Service Elasticity – Rolling restart of 100 DeepSeek‑R1 nodes shows linear speed‑up with EIC bandwidth: 200 GB/s (≈1 h), 400 GB/s (≈30 min), 800 GB/s (≈15 min), achieving over three‑fold improvement compared to NVMe‑based loading.
Future Outlook
EIC will expand to more AI scenarios, integrate advanced inference techniques, bridge training and inference, improve QoS, and strengthen disaster recovery and security, solidifying its role as a core AI infrastructure component.
Volcano Engine Developer Services
The Volcano Engine Developer Community (TOD) connects the platform with developers, offering cutting-edge technical content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.