CCD‑Aware Thread Orchestration Shatters Multi‑Core CPU Vector Search Performance Ceiling
The paper presents a CCD‑level, load‑aware thread orchestration framework for AMD EPYC multi‑chiplet CPUs. It raises vector ANNS throughput by up to 3.7×, cuts latency by 30 %‑90 % (30 %‑50 % at P50, 60 %‑90 % at P999), lowers L3 cache miss rates by 6 %‑30 %, and reduces CPU stall time by 20 %‑80 %.
Background and Challenge
Approximate Nearest Neighbor Search (ANNS) underpins modern search, recommendation and advertising services. Industrial workloads at Xiaohongshu process tens of millions of QPS over thousands of vector indexes ranging from millions to hundreds of billions of vectors, using CPU‑resident indexes under strict latency SLAs.
To meet growing demand, the team adopted AMD EPYC CPUs built from CCD (Core Complex Die) chiplets (e.g., 4th‑gen Genoa: 12 CCDs, 96 cores, 32 MiB L3 per CCD). In theory the core count promises near‑linear throughput scaling, but observed HNSW throughput reached only 82 % of the theoretical peak, and IVF speed‑up was only 1.6‑2.8×, far below expectations.
Root‑cause analysis revealed that existing thread schedulers ignore CCD topology, causing massive cross‑CCD cache invalidations, cache pollution and intensified memory‑bandwidth contention.
Core Contributions
Systematic characterization and quantification of industrial‑grade ANNS workloads on CCD‑multicore CPUs, exposing access locality distribution, cross‑CCD traffic skew and tail‑latency sources.
Design of a CCD‑aware hot‑cold adaptive mapping algorithm that dynamically assigns vector tables to CCDs based on online traffic estimation, pairing hot tables with cold ones to balance memory traffic and avoid Hot‑Hot co‑location.
Introduction of a topology‑aware hierarchical task‑stealing mechanism that respects CPU physical topology, limiting cross‑CCD steals to situations where an entire CCD is idle.
CPU Cache‑Aware Scheduling Study
HNSW and IVF are the dominant ANNS algorithms (HNSW via hierarchical graphs, IVF via inverted file indexes). Existing schedulers such as OpenMP (work‑sharing) and custom thread pools like Baidu's bthread (global work stealing) do not consider CCD topology, leading to systematic bottlenecks.
CCDs differ from NUMA nodes: each die has its own independent L3 cache, yet operating systems provide little CCD‑aware scheduling support.
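Because the OS offers little help, a framework of this kind first has to discover which cores share an L3 cache. The sketch below is one way to do that on Linux, not the paper's method: it groups CPUs by the `shared_cpu_list` file that sysfs exposes per cache level. Two assumptions are made here: that `index3` is the L3 cache (typical on x86, but not guaranteed) and that one L3 domain corresponds to one CCD. The helper names `parse_cpu_list`, `group_by_l3` and `read_l3_lists` are hypothetical.

```python
def parse_cpu_list(text):
    """Expand a Linux CPU list string such as '0-7,96-103' into sorted ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def group_by_l3(shared_lists):
    """Group CPUs into L3 domains.

    shared_lists maps a CPU id to the text of its L3 shared_cpu_list.
    Returns one list of CPU ids per distinct L3 domain, ordered by lowest id.
    """
    groups = {}
    for text in shared_lists.values():
        key = tuple(parse_cpu_list(text))
        groups[key] = list(key)
    return [groups[k] for k in sorted(groups)]


def read_l3_lists(cpu_ids):
    """Read the L3 shared_cpu_list of each CPU from Linux sysfs.

    Assumption: cache/index3 is the L3 cache on this machine.
    """
    out = {}
    for cpu in cpu_ids:
        path = f"/sys/devices/system/cpu/cpu{cpu}/cache/index3/shared_cpu_list"
        with open(path) as f:
            out[cpu] = f.read()
    return out
```

With the groups in hand, a worker thread could be restricted to one group with `os.sched_setaffinity` (Linux only); whether the framework described here pins threads this way is not stated in the source.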
Proposed CCD‑Aware Framework
The CCD‑Level and Load‑Aware Thread Orchestration Framework is inserted as a drop‑in middle layer between the vector index and the underlying scheduler, requiring no changes to index code or OS scheduling.
Unified Task Submission Interface
Inter‑query parallelism for HNSW: each query runs as a single task on one core; queries for the same table are kept on the same CCD to maximize L3 cache reuse.
Intra‑query parallelism for IVF: a query is split into sub‑tasks (different cluster lists) executed in parallel across cores of the same CCD, reducing per‑query latency.
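The two submission modes above can be sketched as a small router. This is an illustrative model under stated assumptions, not the framework's actual interface: the class `CcdRouter`, its method names, and the round‑robin choice of core within a CCD are all hypothetical. What it does show is the invariant from the text: every task for a table lands on that table's CCD, whether it is a whole HNSW query or one IVF sub‑task.

```python
from collections import deque


class CcdRouter:
    """Routes tasks to per-core queues, keeping each table on one CCD."""

    def __init__(self, num_ccds, cores_per_ccd, table_to_ccd):
        self.cores_per_ccd = cores_per_ccd
        self.table_to_ccd = dict(table_to_ccd)
        # queues[ccd][core] holds the pending tasks of one worker thread.
        self.queues = [
            [deque() for _ in range(cores_per_ccd)] for _ in range(num_ccds)
        ]
        self._next = [0] * num_ccds  # round-robin cursor per CCD (assumption)

    def _pick_core(self, ccd):
        core = self._next[ccd]
        self._next[ccd] = (core + 1) % self.cores_per_ccd
        return core

    def submit_hnsw(self, table, query):
        """Inter-query parallelism: the whole query is one task on one core."""
        ccd = self.table_to_ccd[table]
        core = self._pick_core(ccd)
        self.queues[ccd][core].append(("hnsw", table, query))
        return ccd, core

    def submit_ivf(self, table, query, cluster_lists):
        """Intra-query parallelism: one sub-task per cluster list, same CCD."""
        ccd = self.table_to_ccd[table]
        placed = []
        for cluster in cluster_lists:
            core = self._pick_core(ccd)
            self.queues[ccd][core].append(("ivf", table, query, cluster))
            placed.append((ccd, core))
        return placed
```

For example, `CcdRouter(2, 4, {"a": 0, "b": 1}).submit_ivf("b", "q", [10, 11, 12])` spreads three sub‑tasks over three cores of CCD 1 and never touches CCD 0.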
Hot‑Cold Adaptive Mapping Scheduler
The scheduler estimates per‑table memory traffic in real time, sorts tables by traffic, and applies a greedy two‑ended scan: the hottest unassigned table is assigned to the CCD that currently has the lowest cumulative traffic, and the coldest unassigned table is placed on that same CCD, forming a hot‑cold pair. This balances traffic and prevents multiple hot tables from sharing a CCD.
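A minimal sketch of the greedy two‑ended scan follows, assuming traffic is given as one number per table. The function name `hot_cold_map` and the tie‑breaking rules (lowest CCD index, table name) are assumptions made for illustration; the source does not specify them.

```python
def hot_cold_map(traffic, num_ccds):
    """Assign tables to CCDs by pairing the hottest with the coldest.

    traffic maps table name -> estimated memory traffic.
    Returns (mapping, load): table -> CCD index, and cumulative load per CCD.
    """
    # Hottest first; ties broken by name so the result is deterministic.
    order = sorted(traffic, key=lambda t: (-traffic[t], t))
    load = [0.0] * num_ccds
    mapping = {}
    i, j = 0, len(order) - 1
    while i <= j:
        # CCD with the lowest cumulative traffic so far (lowest index on ties).
        ccd = min(range(num_ccds), key=lambda c: load[c])
        hot = order[i]
        mapping[hot] = ccd
        load[ccd] += traffic[hot]
        if i < j:
            # Place the coldest remaining table on the same CCD.
            cold = order[j]
            mapping[cold] = ccd
            load[ccd] += traffic[cold]
        i += 1
        j -= 1
    return mapping, load
```

With four tables of traffic 100, 90, 10 and 5 on two CCDs, the two hot tables end up on different CCDs, each paired with one cold table, and the per‑CCD loads come out at 105 and 100.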
Two sliding windows (10 s for rapid changes, 60 s for stable trends) trigger remapping when traffic deviation exceeds a threshold; versioned snapshots ensure smooth transition without latency spikes.
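The remapping trigger can be illustrated as follows. The source gives the window lengths (10 s and 60 s) but not the deviation formula or threshold, so both are assumptions here: deviation is taken as the relative gap between the short‑window and long‑window traffic rates, and the default threshold of 0.3 is arbitrary. `TrafficWindow` and `should_remap` are hypothetical names, and versioned snapshots are not modelled.

```python
from collections import deque


class TrafficWindow:
    """Sliding window over (timestamp, bytes) samples for one table."""

    def __init__(self, span_s):
        self.span_s = span_s
        self.samples = deque()
        self.total = 0.0

    def _evict(self, now):
        # Drop samples that fell out of the window (now - span_s, now].
        while self.samples and self.samples[0][0] <= now - self.span_s:
            _, old = self.samples.popleft()
            self.total -= old

    def add(self, now, nbytes):
        self.samples.append((now, nbytes))
        self.total += nbytes
        self._evict(now)

    def rate(self, now):
        """Average traffic per second over the window."""
        self._evict(now)
        return self.total / self.span_s


def should_remap(short, long, now, threshold=0.3):
    """True when the short-window rate deviates from the long-window rate.

    Assumption: relative deviation |fast - slow| / slow against a threshold.
    """
    fast, slow = short.rate(now), long.rate(now)
    if slow == 0:
        return fast > 0
    return abs(fast - slow) / slow > threshold
```

Under steady traffic both windows report the same rate and no remap fires; a sudden burst moves the 10 s rate well before the 60 s rate catches up, which is what trips the trigger in this sketch.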
Topology‑Aware Hierarchical Task Stealing
Level 1: Local dequeue – threads first pull from their own queue (zero cross‑CCD cost).
Level 2: Intra‑CCD stealing – if the local queue is empty, steal from another thread on the same CCD, preserving cache locality.
Level 3: Cross‑CCD stealing – allowed only when all threads on a CCD are idle and another CCD is heavily loaded, limiting cache‑miss penalties.
This hierarchy reduces cross‑CCD steal rates from ~75 % (HNSW) / ~80 % (IVF) to <10 % / ~5 % respectively.
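The three levels can be sketched as a single lookup that a worker runs when it needs work. This is a single‑threaded model under stated assumptions, not the framework's implementation: an empty queue stands in for an idle thread, "heavily loaded" is approximated by a backlog threshold (`heavy_threshold`, an invented parameter), and thieves take from the tail while owners take from the head, a common work‑stealing convention that the source does not confirm.

```python
from collections import deque


def next_task(queues, ccd, core, heavy_threshold=4):
    """Return (task, level) for the worker at queues[ccd][core].

    level is 1 (local), 2 (intra-CCD steal), 3 (cross-CCD steal),
    or 0 with task None when nothing may be taken.
    """
    # Level 1: the worker's own queue.
    own = queues[ccd][core]
    if own:
        return own.popleft(), 1

    # Level 2: steal from a sibling on the same CCD.
    for other, q in enumerate(queues[ccd]):
        if other != core and q:
            return q.pop(), 2

    # Level 3: reaching this point means every queue on this CCD is empty.
    # Steal across CCDs only if the busiest CCD is heavily loaded.
    victim = max(range(len(queues)), key=lambda c: sum(len(q) for q in queues[c]))
    backlog = sum(len(q) for q in queues[victim])
    if victim != ccd and backlog >= heavy_threshold:
        busiest = max(queues[victim], key=len)
        return busiest.pop(), 3
    return None, 0
```

The design point worth noting is the ordering: a cross‑CCD steal is only reachable after both cheaper levels have failed, so it cannot happen while any sibling on the same CCD still has work.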
Experimental Evaluation
Environment: AMD EPYC 96‑core Genoa (12 CCDs, 32 MiB L3/CCD) and 48‑core Rome (12 CCDs, 16 MiB L3/CCD) with 576 GB DDR5 / 512 GB DDR4 memory. Test data includes 60 HNSW tables (1 M‑10 M vectors, 64‑256 dimensions) and 15 IVF tables (10 K‑15 M vectors, same dimensions) drawn from production workloads.
Throughput: On 96‑core Genoa, the CCD‑aware V2 reaches >100 KQPS for HNSW (vs. ~70 KQPS for V0/V1) and ~35 KQPS for IVF (vs. ~25 KQPS for V1, ~10 KQPS for V0). HNSW shows near‑linear scaling as CCD count grows from 4 to 12.
Latency: P50 latency improves by 30 %‑50 %; P999 tail latency drops by 60 %‑90 %. Long‑tail reductions stem from eliminating occasional cross‑CCD steals that caused cache re‑warming.
Hardware Metrics (relative to V1 baseline):
L3 cache miss rate ↓ 6 %‑15 % (HNSW) and ↓ 15 %‑30 % (IVF).
CPU stall proportion ↓ 20 %‑40 % (HNSW) and ↓ 40 %‑80 % (IVF).
Cross‑CCD steal rate ↓ from ~75 % / ~80 % to <10 % / ~5 %.
These reductions indicate that more of the working set stays in L3 cache, fewer accesses go out to main memory, and fewer CPU cycles are lost to stalls.
Conclusion
The study systematically analyses performance bottlenecks of industrial ANNS services on CCD‑based multicore CPUs and introduces the first CCD‑aware thread orchestration framework. By combining hot‑cold load‑aware mapping with topology‑aware hierarchical stealing, the framework resolves the long‑standing conflict between cache affinity and load balancing without hardware changes or index code modifications, achieving up to 3.7× throughput gain and up to 90 % tail‑latency reduction.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.