How DHPS Boosted Online Inference Throughput by 270% with RDMA
This article details the design and evolution of DHPS, Kuaishou's load-balanced, RDMA-based high-performance service architecture, and explains the network, storage, and traffic-scheduling innovations that deliver a query-throughput improvement of more than 270%, lower latency, reduced CPU usage, and five-nines (99.999%) availability for large-scale AI inference workloads.
Project Background
Online inference services move massive amounts of data in real time between compute nodes (inference services) and storage nodes (online parameter server, or PS, services). As model parameters grow, traditional distributed architectures must scale to thousands of service nodes, which drives exploding bandwidth demand, high latency, and CPU-heavy TCP communication, ultimately limiting scale-out.
To meet Kuaishou's requirements, we upgraded the traditional architecture into a high-density compute-storage distributed system that uses RDMA for inter-node communication, freeing CPU cycles, increasing GPU density, and dramatically improving network efficiency. The result is DHPS, the first load-balanced, RDMA-based high-performance service architecture deployed in an online system in China.
Technical Implementation
Overall Architecture
Network Construction: Built a four-layer, AZ-level network that supports RDMA and TCP running side by side at scale, reducing CPU usage and enabling high-performance cross-POD communication within an AZ.
Software Optimization: Developed a high-performance storage engine and an RDMA communication library, opt-rdma, to fully exploit dense server resources and boost system throughput.
Traffic Scheduling: Implemented hardware‑aware scheduling that prefers intra‑POD RDMA traffic, falls back to TCP when RDMA fails, and dynamically balances load based on real‑time latency and success metrics.
High‑Performance Storage Engine
Index Optimization: 12-way Cuckoo hash with 8-bit tag SIMD matching reduces hash collisions and shortens read paths (see the first sketch after this list).
Batch Reads: Prefetching hides memory latency, greatly increasing read throughput.
Expiration Mechanism: TTL-layered precise expiration plus forced reclamation keeps CPU overhead low (see the second sketch after this list).
Memory Management: Key‑in‑Value layout with periodic compaction limits memory fragmentation to under 5%.
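The first two optimizations compose naturally: each bucket stores one short tag per slot, so a single SIMD compare screens all 12 slots at once, and a batched lookup prefetches every target bucket before probing so the DRAM fetches overlap. Below is a minimal C++ sketch of both ideas; the bucket layout, tag padding, and function names are illustrative assumptions, not DHPS internals.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Assumed bucket layout: 12 slots per bucket, one 8-bit tag per slot,
// tags padded to 16 bytes so they fill a single SSE register.
struct alignas(16) Bucket {
    uint8_t  tags[16];    // 12 real tags + 4 bytes of padding
    uint64_t keys[12];
    uint64_t values[12];
};

// One SSE compare screens all 12 slot tags against the query tag; the
// returned bitmask marks candidate slots whose full keys still need
// verification, so most misses never touch the key array at all.
inline uint32_t match_tags(const Bucket& b, uint8_t tag) {
    __m128i slots = _mm_load_si128(reinterpret_cast<const __m128i*>(b.tags));
    __m128i query = _mm_set1_epi8(static_cast<char>(tag));
    uint32_t mask = static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(slots, query)));
    return mask & 0x0FFFu;  // keep only the 12 real slots
}

// Batched read: issue prefetches for every probed bucket first so the
// DRAM fetches overlap, then run the tag matching once the cache lines
// are (likely) resident.
void batch_match(const Bucket* table, std::size_t bucket_mask,
                 const uint64_t* hashes, const uint8_t* tags,
                 std::size_t n, uint32_t* out) {
    for (std::size_t i = 0; i < n; ++i)
        _mm_prefetch(reinterpret_cast<const char*>(
                         &table[hashes[i] & bucket_mask]),
                     _MM_HINT_T0);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = match_tags(table[hashes[i] & bucket_mask], tags[i]);
}
```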
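The expiration mechanism can be sketched in the same spirit. A minimal illustration, assuming per-second expiry buckets and TTLs under one hour (the engine's actual layering is not public):

```cpp
#include <cstdint>
#include <ctime>
#include <vector>

// Keys are filed into per-second expiry buckets when written, so the
// reclaimer visits only the bucket that just came due instead of
// scanning the whole table.
struct TtlWheel {
    static constexpr int kSlots = 3600;      // one hour of 1-second buckets
    std::vector<uint64_t> buckets[kSlots];   // keys due in each slot

    void track(uint64_t key, uint32_t ttl_sec, time_t now) {
        buckets[(now + ttl_sec) % kSlots].push_back(key);
    }

    // Run once per second: precise expiry for the due slot. A memory
    // threshold can trigger the same sweep early (forced reclamation).
    template <class EraseFn>
    void expire(time_t now, EraseFn erase) {
        auto& due = buckets[now % kSlots];
        for (uint64_t key : due) erase(key);  // erase(key) removes from the index
        due.clear();
    }
};
```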
RDMA Communication Library
Ease of Use: Wraps the complex RDMA Verbs API, providing an RPC‑style interface compatible with existing frameworks.
Compatibility: Supports both RDMA and TCP, automatically selecting the optimal path and falling back on failure (sketched after this list).
High Performance: A user-space, lock-free, zero-copy design with QP sharing and an atomics-coordinated master-worker threading model achieves tens of millions of QPS per machine.
Robustness: Reliable Connection (RC) mode guarantees lossless, in-order delivery; fallback mechanisms ensure service continuity.
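To make the fallback behavior concrete, here is a minimal sketch of the RDMA-first, TCP-second decision. `send_rdma` and `send_tcp` are stubbed stand-ins for posting a verbs work request or writing to a socket; they are not opt-rdma's actual API.

```cpp
#include <iostream>
#include <string>

enum class Transport { kRdma, kTcp };

// Stub: real code would post a work request on a (shared) QP and poll the
// completion queue; returning false models a QP error or unreachable path.
bool send_rdma(const std::string& payload) { return !payload.empty(); }

// Stub: real code would write to a connected TCP socket.
bool send_tcp(const std::string& payload) { (void)payload; return true; }

// Callers see a single RPC-style entry point; the library tries the fast
// RDMA path first and quietly degrades to TCP, so a NIC or QP fault costs
// latency rather than availability.
Transport send_with_fallback(const std::string& payload) {
    if (send_rdma(payload)) return Transport::kRdma;
    send_tcp(payload);
    return Transport::kTcp;
}

int main() {
    Transport used = send_with_fallback("get_embedding:12345");
    std::cout << (used == Transport::kRdma ? "sent via RDMA\n"
                                           : "fell back to TCP\n");
}
```

Keeping the fallback inside the library means callers never see a transport error, only a slower reply.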
Traffic Scheduling & Load Balancing
Traffic is prioritized as intra-POD RDMA first, then intra-AZ RDMA, and finally cross-AZ TCP. Dynamic flow ratios and node-level circuit breakers adjust traffic automatically based on real-time latency and availability, as sketched below.
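A minimal sketch of this tiered selection with a per-node circuit breaker, using illustrative names and thresholds (EWMA latency, a 95% success-rate floor) rather than the actual scheduler:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

enum class Tier { kIntraPodRdma = 0, kIntraAzRdma = 1, kCrossAzTcp = 2 };

struct Node {
    Tier   tier;              // locality of this replica relative to the caller
    double ewma_latency_us;   // smoothed from real-time latency samples
    double success_rate;      // rolling success ratio of recent requests
};

// Prefer the closest healthy tier; within a tier, pick the node with the
// best smoothed latency. A node whose success rate drops below the floor
// is skipped (its breaker is open) until its metrics recover.
// Returns -1 if no healthy node exists in any tier.
int pick_node(const std::vector<Node>& nodes, double min_success = 0.95) {
    for (int tier = 0; tier <= 2; ++tier) {
        int best = -1;
        double best_lat = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            const Node& n = nodes[i];
            if (static_cast<int>(n.tier) != tier) continue;
            if (n.success_rate < min_success) continue;  // breaker open: skip
            if (n.ewma_latency_us < best_lat) {
                best_lat = n.ewma_latency_us;
                best = static_cast<int>(i);
            }
        }
        if (best != -1) return best;  // found a healthy node in this tier
    }
    return -1;
}
```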
Performance Gains
DHPS improves query throughput by more than 270%, doubles update performance, reduces memory fragmentation by 40%, cuts network latency by 35%, and achieves 99.999% availability in large-scale clusters. In Kuaishou's recommendation models, CPU usage drops by 70%, latency falls from milliseconds to hundreds of microseconds, and peak throughput exceeds that of the previous storage engine by more than 270%.
Running RDMA and TCP together further raises single-machine throughput beyond what pure TCP can sustain.
Future Outlook
DHPS will continue to evolve toward region‑level RDMA coverage, deeper integration with AI workloads, and broader reuse in HPC, distributed storage, and large‑scale model inference.