
How 3FS Powers High‑Performance KVCache for AI Inference: Architecture, Optimizations, and Cloud‑Native Deployment

This article details the design and engineering of the 3FS distributed file system as a scalable KVCache backend for large‑language‑model inference, covering its architecture, performance tuning, reliability fixes, integration with SGLang/vLLM, and cloud‑native Kubernetes operator deployment.

Alibaba Cloud Developer

Background and Motivation

Large language model (LLM) inference follows an auto‑regressive pattern: without caching, each generated token would require recomputing the attention keys and values for the entire preceding context, creating a heavy compute bottleneck. KVCache mitigates this by storing the K and V vectors of past tokens for reuse, dramatically reducing latency and improving throughput for long‑context, high‑concurrency workloads such as multi‑turn dialogue and retrieval‑augmented generation.

3FS Overview

3FS (Fire‑Flyer File System) is an open‑source, high‑performance distributed file system built for AI workloads. Its core components are:

Mgmtd : a highly available control plane that maintains cluster topology and health.

Meta : stateless metadata service backed by FoundationDB, handling file open/close semantics.

Storage : SSD‑based data nodes exposing logical Targets linked in a Chain; data is replicated using the CRAQ (Chain Replication with Apportioned Queries) protocol, providing "write‑all, read‑any" consistency.

Client (FUSE) : a user‑space file‑system interface allowing applications to access 3FS via standard POSIX calls.

All components communicate over RDMA, delivering low‑latency, high‑bandwidth data paths.
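To make the "write‑all, read‑any" behavior concrete, here is a minimal toy model of a CRAQ chain: writes propagate from head to tail and are acknowledged back up, while any replica can serve a read, falling back to the tail only when its local copy is still dirty. All names and the in‑memory representation are illustrative, not 3FS's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Target:
    """One storage target in a replication chain (toy model)."""
    name: str
    committed: dict = field(default_factory=dict)   # block -> clean value
    pending: dict = field(default_factory=dict)     # block -> dirty value

class Chain:
    """Toy CRAQ chain: writes flow head -> tail, any replica serves reads."""
    def __init__(self, names):
        self.targets = [Target(n) for n in names]

    def write(self, block, value):
        # Propagate down the chain; each node records the version as dirty.
        for t in self.targets:
            t.pending[block] = value
        # The tail commits first; the ack travels back up, cleaning each node.
        for t in reversed(self.targets):
            t.committed[block] = t.pending.pop(block)

    def read(self, block, replica=0):
        t = self.targets[replica]
        if block in t.pending:
            # Dirty read: apportioned query asks the tail for the committed version.
            return self.targets[-1].committed.get(block)
        return t.committed.get(block)
```

Because a fully acknowledged write is clean on every replica, reads spread across the whole chain instead of hitting a single primary, which is what makes CRAQ attractive for read‑heavy KVCache traffic.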

Why 3FS for KVCache?

Compared with traditional solutions (Redis, GPU KVCache, GPFS, Ceph, JuiceFS), 3FS offers:

Capacity and Cost : PB‑scale shared storage pools built from commodity SSDs, reducing DRAM cost.

Bandwidth and Latency : Full‑stack RDMA delivers up to 6.6 TiB/s read bandwidth in a 180‑node cluster.

Read‑First Design : Optimized for read‑heavy KVCache access patterns, leveraging CRAQ for fast random reads.
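As a quick sanity check on the bandwidth figure above, the aggregate number implies roughly 37.5 GiB/s of read bandwidth per node, consistent with a handful of NVMe SSDs behind a fast RDMA NIC:

```python
# Per-node read bandwidth implied by the cluster-level figure quoted above.
total_read_bw_tib_s = 6.6        # aggregate read bandwidth (TiB/s)
nodes = 180
per_node_gib_s = total_read_bw_tib_s * 1024 / nodes
print(f"{per_node_gib_s:.1f} GiB/s per node")
```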

Engineering Upgrades

Performance Optimizations

• RDMA traffic balancing and small‑I/O parameter tuning raised 4 KiB random‑read IOPS by 150 %.

• Integration of a full‑user‑space write‑back engine reduced CPU usage and memory overhead.

• Multi‑threaded client redesign and I/O aggregation increased effective bandwidth from ~200 MiB/s to ~20 GiB/s for SGLang workloads.
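The I/O aggregation step can be sketched as a simple coalescing pass: sort the incoming small reads by offset and merge requests that are contiguous (or nearly so) into fewer, larger I/Os, capped at a maximum span. The gap and span thresholds here are hypothetical tuning knobs, not 3FS's actual parameters.

```python
def aggregate_reads(requests, max_gap=0, max_span=4 << 20):
    """Coalesce (offset, length) read requests into larger contiguous I/Os.

    Requests whose start is within max_gap bytes of the previous merged
    range are folded in, as long as the merged range stays under max_span.
    """
    merged = []
    for off, length in sorted(requests):
        if merged:
            cur_off, cur_len = merged[-1]
            end = cur_off + cur_len
            if off - end <= max_gap and (off + length) - cur_off <= max_span:
                # Extend the current aggregated I/O instead of issuing a new one.
                merged[-1] = (cur_off, max(cur_len, off + length - cur_off))
                continue
        merged.append((off, length))
    return merged
```

Turning thousands of 4 KiB reads into a few multi‑megabyte transfers is what moves effective bandwidth from the hundreds of MiB/s into the GiB/s range, since per‑request overhead stops dominating.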

Product Enhancements

• Fixed Mgmtd IP drift and storage allocation imbalance.

• Added GPU‑Direct RDMA (GDR) zero‑copy paths and multi‑tenant isolation.

• Implemented HBM‑to‑storage end‑to‑end data flow without host‑memory copies.

Cloud‑Native Management

• Developed kvc-3fs-operator (GitHub: https://github.com/aliyun/kvc-3fs-operator) providing a Kubernetes CRD ThreeFsCluster for declarative deployment.

• Operator supports one‑click cluster provisioning, rolling upgrades, fault self‑healing, elastic scaling, and monitoring dashboards via ClickHouse + Grafana.

• Sidecar injection via MutatingAdmissionWebhook automatically mounts 3FS Fuse into user pods, making storage transparent to applications.
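The sidecar injection mechanism boils down to a webhook that answers each pod‑creation AdmissionReview with a base64‑encoded JSONPatch appending a FUSE container and its volume. The sketch below shows the shape of that response; the container image, names, and mount path are placeholders, not the operator's actual values.

```python
import base64
import json

def mutate(admission_review: dict) -> dict:
    """Build an AdmissionReview response that injects a 3FS FUSE sidecar.

    Image, container, and volume names below are illustrative only.
    """
    uid = admission_review["request"]["uid"]
    patch = [
        {"op": "add", "path": "/spec/containers/-", "value": {
            "name": "threefs-fuse",
            "image": "example.registry/3fs-fuse:latest",  # hypothetical image
            "securityContext": {"privileged": True},       # FUSE mounts need this
            "volumeMounts": [{"name": "threefs-mnt",
                              "mountPath": "/mnt/3fs",
                              "mountPropagation": "Bidirectional"}],
        }},
        {"op": "add", "path": "/spec/volumes/-", "value": {
            "name": "threefs-mnt", "emptyDir": {}}},
    ]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the patch is applied at admission time, user pods never declare the mount themselves; the file system simply appears under the mount path.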

Reliability Improvements

DNS‑based Mgmtd discovery eliminates client breakage when the primary pod IP changes.

Randomized ChainTable generation and stripe‑size tuning ensure balanced storage utilization during file creation and cluster expansion.

Enhanced health‑probe logic detects storage node failures, updates routing tables, and retries I/O without manual intervention.

Multi‑network failover logic adds retries for Mgmtd primary election, preventing prolonged outages.
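The DNS‑based discovery idea can be sketched as follows: clients resolve a stable service name with bounded retries and exponential backoff instead of pinning a pod IP, so a Mgmtd failover only changes the DNS answer. The hostname, port, and retry parameters here are assumptions for illustration.

```python
import socket
import time

def resolve_mgmtd(hostname="mgmtd.threefs.svc", port=8000,
                  retries=5, backoff=0.5):
    """Resolve the Mgmtd service name to its current endpoint addresses.

    Hostname and port are placeholders; re-resolving on each connection
    attempt means a primary re-election never strands the client on a
    stale pod IP.
    """
    for attempt in range(retries):
        try:
            infos = socket.getaddrinfo(hostname, port,
                                       proto=socket.IPPROTO_TCP)
            return sorted({info[4][0] for info in infos})
        except socket.gaierror:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"Mgmtd discovery failed after {retries} attempts")
```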

Integration with Inference Engines

3FS was integrated into SGLang and vLLM through a high‑performance USRBIO connector. After multi‑threading and I/O aggregation, SGLang achieved:

Time to first token (TTFT) reduced by 78 % compared with a pure DRAM KVCache (520 % throughput gain).

Cold‑start recomputation time cut by 84 % (830 % throughput gain).

Performance results are documented at https://lmsys.org/blog/2025-09-10-sglang-hicache/.

KVCache Manager (KVCM) Integration

KVCM provides a unified HTTP/gRPC interface for global KVCache management and abstracts heterogeneous storage backends. To decouple KVCM from the 3FS Fuse dependency, a lightweight, stateless 3FS Master service was added, exposing POSIX‑compatible create/delete APIs over HTTP.

Metadata overhead was reduced by adopting a large‑file + slab allocator strategy, allowing clients to operate on a few big files while the allocator handles fine‑grained block allocation, dramatically lowering metadata service load.
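The large‑file + slab allocator idea can be illustrated with a toy allocator: instead of one file per KV entry (which hammers the metadata service), entries map to fixed‑size slots inside a handful of pre‑created big files, identified by a (file_id, slot) pair. Sizes and the free‑list policy below are simplifying assumptions, not 3FS's actual allocator.

```python
class SlabAllocator:
    """Toy slab allocator: carve fixed-size blocks out of a few big files.

    Allocation returns a (file_id, slot) pair; offset() maps a slot to its
    byte position inside that file. Freed slots are recycled before new
    ones are carved out.
    """
    def __init__(self, file_size=1 << 30, block_size=1 << 20):
        self.blocks_per_file = file_size // block_size
        self.block_size = block_size
        self.free = []        # recycled (file_id, slot) pairs
        self.next_file = 0
        self.next_slot = 0

    def alloc(self):
        if self.free:
            return self.free.pop()
        if self.next_slot == self.blocks_per_file:
            self.next_file += 1   # current big file is full; open a new one
            self.next_slot = 0
        slot = (self.next_file, self.next_slot)
        self.next_slot += 1
        return slot

    def release(self, slot):
        self.free.append(slot)

    def offset(self, slot):
        return slot[1] * self.block_size
```

With this scheme the metadata service only ever sees the big files; all fine‑grained placement is resolved client‑side by pure arithmetic.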

Future Directions

Extend the 3FS Operator CRD for richer resource configuration and QoS‑aware multi‑tenant scheduling.

Introduce native KV semantics APIs to simplify application development.

Strengthen fault‑tolerance with dynamic replica migration and self‑healing mechanisms.

Co‑design storage hardware (AliFlash SSD, AliSCM) with 3FS for deeper software‑hardware synergy.

Key Takeaways

By combining RDMA‑accelerated data paths, CRAQ replication, extensive performance tuning, and a cloud‑native operator, 3FS delivers a scalable, low‑latency KVCache storage solution that meets the demanding throughput and latency SLAs of modern AI inference workloads.

[Figure: 3FS GDR data interaction design]
[Figure: Fault‑repair I/O path]
[Figure: DNS resolution mechanism]
[Figure: 3FS Operator architecture]
[Figure: SGLang integration diagram]
Tags: cloud-native, performance optimization, AI inference, distributed storage, KVCache, 3FS
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.
