PetPS: A Persistent‑Memory Parameter Server for Large‑Scale Embedding Models
PetPS is a persistent‑memory‑based parameter server that redesigns indexing around the PetHash hash table and offloads parameter aggregation to the network card through NIC Gathering. For industrial‑scale embedding models in recommendation, search, and advertising workloads, it achieves up to 1.7× higher throughput and up to 5× lower latency.
Embedding models are widely used in industry for recommendation, advertising, and search, converting high‑dimensional sparse ID features into low‑dimensional dense vectors via large embedding tables. As model sizes grow to the trillion‑parameter level, traditional DRAM‑based parameter servers become costly and suffer long recovery times.
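To make the lookup concrete, here is a minimal sketch of an embedding table as a plain dictionary, together with the back-of-the-envelope size calculation behind the cost concern. The names `make_table`, `lookup`, and `table_bytes`, the 4-dimensional vectors, and the 4-byte (fp32) parameter width are all assumptions for illustration and do not come from the paper.

```python
import random

EMBEDDING_DIM = 4  # hypothetical; chosen small for readability


def make_table(ids, dim=EMBEDDING_DIM, seed=0):
    """Build a toy embedding table: sparse ID -> dense vector."""
    rng = random.Random(seed)
    return {i: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for i in ids}


def lookup(table, ids, dim=EMBEDDING_DIM):
    """Return one dense vector per requested ID (zeros for unknown IDs)."""
    return [table.get(i, [0.0] * dim) for i in ids]


def table_bytes(num_params, bytes_per_param=4):
    """Raw parameter size, assuming 4-byte (fp32) values (an assumption)."""
    return num_params * bytes_per_param


table = make_table(ids=[10_000_019, 42, 7_654_321])
vectors = lookup(table, [42, 999])          # 999 is not in the table
trillion_param_size = table_bytes(10**12)   # bytes, before any index overhead
```

Under the fp32 assumption, a trillion parameters occupy about 4 TB of raw values, which is why keeping the whole table in DRAM becomes expensive.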
To understand the workload, the authors collected trace data from Kuaishou's online inference service and identified three key workload characteristics: read‑intensive access, stable capacity load, and batch processing of thousands of IDs per request.
The proposed system, PetPS, builds on these insights with two main innovations: (1) PetHash, a persistent‑memory‑optimized hash index featuring a single‑layer structure, hotspot‑aware migration, and prefetching to minimize PM reads; and (2) NIC Gathering, which offloads the aggregation of embedding parameters to the network interface card using scatter‑gather DMA, reducing CPU involvement.
PetHash employs a single‑layer bucket layout with open addressing, in which each bucket stores metadata such as fingerprints, version numbers, and overflow counters. A dedicated migration thread moves hot key‑value pairs to their home buckets, so that frequent lookups finish in the first bucket probed. During batch processing, a prefetch mechanism issues prefetch instructions for the next bucket while the current one is being examined, effectively hiding PM latency.
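The following is a simplified sketch of the ideas in this paragraph, not the authors' implementation. It assumes linear probing over a flat bucket array, one-byte fingerprints, an overflow counter that lets a lookup stop at the home bucket, and an explicit `migrate_hot` call in place of the background migration thread. Version numbers and hardware prefetching are omitted, and `SketchHash`, `SLOTS_PER_BUCKET`, `MAX_PROBE`, and the per-key `hits` counter are hypothetical names.

```python
import hashlib

SLOTS_PER_BUCKET = 2   # hypothetical size, kept small for illustration
MAX_PROBE = 8          # hypothetical bound on linear probing


def _hash64(key):
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class Bucket:
    def __init__(self):
        self.fps = [None] * SLOTS_PER_BUCKET     # one-byte fingerprints
        self.slots = [None] * SLOTS_PER_BUCKET   # (key, value) pairs
        self.overflow = 0  # keys that hash here but are stored elsewhere


class SketchHash:
    def __init__(self, num_buckets):
        self.buckets = [Bucket() for _ in range(num_buckets)]
        self.hits = {}  # per-key access counts (hypothetical hotness signal)

    def home_of(self, key):
        return _hash64(key) % len(self.buckets)

    def _fp(self, key):
        return (_hash64(key) >> 56) & 0xFF

    def _probe(self, key):
        """Yield (step, bucket_index) along the probe sequence of key."""
        home = self.home_of(key)
        for step in range(min(MAX_PROBE, len(self.buckets))):
            yield step, (home + step) % len(self.buckets)

    def locate(self, key):
        """Return (bucket_index, slot) of key, or None if absent."""
        fp = self._fp(key)
        for step, idx in self._probe(key):
            bucket = self.buckets[idx]
            for s in range(SLOTS_PER_BUCKET):
                # Compare the cheap fingerprint before the full key.
                if bucket.fps[s] == fp and bucket.slots[s][0] == key:
                    return idx, s
            if step == 0 and bucket.overflow == 0:
                return None  # nothing spilled out of the home bucket
        return None

    def put(self, key, value):
        loc = self.locate(key)
        if loc is not None:
            self.buckets[loc[0]].slots[loc[1]] = (key, value)
            return
        for step, idx in self._probe(key):
            bucket = self.buckets[idx]
            for s in range(SLOTS_PER_BUCKET):
                if bucket.slots[s] is None:
                    bucket.fps[s] = self._fp(key)
                    bucket.slots[s] = (key, value)
                    if step > 0:
                        self.buckets[self.home_of(key)].overflow += 1
                    return
        raise RuntimeError("no free slot within the probe bound")

    def get(self, key):
        loc = self.locate(key)
        if loc is None:
            return None
        self.hits[key] = self.hits.get(key, 0) + 1
        return self.buckets[loc[0]].slots[loc[1]][1]

    def migrate_hot(self, key):
        """Swap a displaced hot key with a colder key in its home bucket."""
        loc = self.locate(key)
        home = self.home_of(key)
        if loc is None or loc[0] == home:
            return False
        idx, s = loc
        hb, src = self.buckets[home], self.buckets[idx]
        for hs in range(SLOTS_PER_BUCKET):
            entry = hb.slots[hs]
            if entry is None:
                continue
            victim = entry[0]
            # Only swap with a colder key that shares this home bucket, so the
            # victim stays on its probe path and the overflow count is unchanged.
            if (self.home_of(victim) == home
                    and self.hits.get(victim, 0) < self.hits.get(key, 0)):
                hb.slots[hs], src.slots[s] = src.slots[s], hb.slots[hs]
                hb.fps[hs], src.fps[s] = src.fps[s], hb.fps[hs]
                return True
        return False
```

After a successful `migrate_hot`, the hot key is found in the first bucket probed, which in a PM-backed table would correspond to a single media read. Restricting the swap to victims with the same home bucket is a simplification chosen here to keep the overflow counter trivially correct.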
NIC Gathering leverages the NIC's scatter‑gather DMA capability to collect parameters directly from PM, eliminating costly CPU reads and cache misses. The system ensures DMA safety using copy‑on‑write and an epoch‑list reclamation scheme.
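The interplay between scatter-gather reads, copy-on-write, and epoch-based reclamation can be illustrated with a small simulation. This is a generic sketch under stated assumptions, not the scheme from the paper: a `bytearray` stands in for the registered PM region, `nic_gather` merely imitates what the NIC would do over DMA, and `ParamStore`, `begin_request`, `end_request`, and `reclaim` are hypothetical names.

```python
class ParamStore:
    """Toy stand-in for embeddings stored in a registered PM region."""

    def __init__(self, size):
        self.mem = bytearray(size)   # pretend this is persistent memory
        self.next_free = 0
        self.index = {}              # key -> (offset, length)
        self.epoch = 0               # global epoch, bumped on every retire
        self.active = {}             # request id -> epoch when it started
        self.retired = []            # (retire_epoch, (offset, length))

    def _alloc(self, data):
        off = self.next_free
        if off + len(data) > len(self.mem):
            raise MemoryError("region full")
        self.mem[off:off + len(data)] = data
        self.next_free = off + len(data)
        return off, len(data)

    def put(self, key, data):
        """Copy-on-write: never overwrite bytes a DMA may still be reading."""
        new_extent = self._alloc(data)
        old_extent = self.index.get(key)
        self.index[key] = new_extent
        if old_extent is not None:
            self.retired.append((self.epoch, old_extent))
            self.epoch += 1

    def begin_request(self, req_id, keys):
        """Pin the current epoch and build a scatter-gather list."""
        self.active[req_id] = self.epoch
        return [self.index[k] for k in keys]

    def end_request(self, req_id):
        del self.active[req_id]

    def reclaim(self):
        """Return extents that no in-flight request can still reference."""
        oldest = min(self.active.values(), default=self.epoch)
        free = [ext for ep, ext in self.retired if ep < oldest]
        self.retired = [(ep, ext) for ep, ext in self.retired if ep >= oldest]
        return free


def nic_gather(store, sgl):
    """Imitate the NIC walking the list of (offset, length) entries.

    On real hardware this would be DMA; the CPU would only post the list.
    """
    return b"".join(bytes(store.mem[off:off + n]) for off, n in sgl)
```

In this sketch, an update that arrives while a request is in flight writes to a fresh extent, so the scatter-gather list posted earlier still points at intact bytes. The old extent becomes reusable only after every request that started before the update has finished.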
Experimental evaluation on Intel Optane DC PM with real production workloads from Kuaishou shows that PetPS achieves 1.3–1.7× higher peak throughput and reduces median and P99 latencies by up to 5× compared to baseline parameter servers (PSLite, DashPS, KuaiPS). PetHash improves index throughput by 1.3–2.5×, and NIC Gathering cuts aggregation time from 180 µs to 14 µs (roughly 13×), yielding up to 1.2× end‑to‑end throughput gains.
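A large drop in aggregation time translating into a modest end-to-end gain is what Amdahl's law predicts when aggregation is only one part of each request. The sketch below works through that arithmetic with the two figures quoted above. The implied fraction is a back-of-the-envelope inference, not a number reported by the authors, and it assumes that both figures describe the same configuration and that throughput scales inversely with per-request time.

```python
def speedup(before, after):
    """Ratio of the old duration to the new one."""
    return before / after


def implied_fraction(stage_speedup, overall_speedup):
    """Amdahl's law solved for f, the share of time spent in the stage:

    overall = 1 / ((1 - f) + f / stage)
    """
    return (1 - 1 / overall_speedup) / (1 - 1 / stage_speedup)


aggregation_speedup = speedup(180.0, 14.0)               # microseconds
aggregation_share = implied_fraction(aggregation_speedup, 1.2)
```

Under these assumptions, `aggregation_share` estimates how much of the original request time aggregation would have to account for to produce a 1.2× overall gain.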
In summary, PetPS is the first industry‑grade persistent‑memory parameter server. It demonstrates that tailored indexing and NIC‑offloaded aggregation can mitigate PM read latency and CPU bottlenecks for massive embedding models, while reducing hardware cost by about 30% without sacrificing performance.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.