CCD‑Aware Thread Orchestration Shatters Multi‑Core CPU Vector Search Performance Ceiling

The paper presents a CCD‑level, load‑aware thread orchestration framework for AMD EPYC multi‑chiplet CPUs. It raises vector ANNS throughput by up to 3.7×, cuts P999 tail latency by 30%‑90%, and reduces L3 cache miss rates by 6%‑30% and CPU stall time by 20%‑80%.

Xiaohongshu Tech REDtech

Background and Challenge

Approximate Nearest Neighbor Search (ANNS) underpins modern search, recommendation and advertising services. Industrial workloads at Xiaohongshu serve tens of millions of queries per second (QPS) over thousands of vector indexes, each holding from millions to hundreds of billions of vectors. The indexes are CPU‑resident and operate under strict latency SLAs.

To meet growing demand, the team adopted AMD EPYC chiplet CPUs built from CCDs (Core Complex Dies), e.g., 4th‑gen Genoa with 12 CCDs, 96 cores and 32 MiB of L3 per CCD. Although the core count theoretically promises near‑linear throughput scaling, observed HNSW throughput reached only 82 % of the theoretical peak, and the IVF speed‑up was merely 1.6‑2.8×, far below expectations.

Root‑cause analysis revealed that existing thread schedulers ignore CCD topology, causing massive cross‑CCD cache invalidations, cache pollution and intensified memory‑bandwidth contention.

Core Contributions

Systematic characterization and quantification of industrial‑grade ANNS workloads on CCD‑multicore CPUs, exposing access locality distribution, cross‑CCD traffic skew and tail‑latency sources.

Design of a CCD‑aware hot‑cold adaptive mapping algorithm that dynamically assigns vector tables to CCDs based on online traffic estimation, pairing hot tables with cold ones to balance memory traffic and avoid Hot‑Hot co‑location.

Introduction of a topology‑aware hierarchical task‑stealing mechanism that respects CPU physical topology, limiting cross‑CCD steals to situations where an entire CCD is idle.

CPU Cache‑Aware Scheduling Study

HNSW and IVF are the dominant ANNS algorithms (HNSW via hierarchical graphs, IVF via inverted file indexes). Existing schedulers such as OpenMP (work‑sharing) and custom thread pools like Baidu's bthread (global work stealing) do not consider CCD topology, leading to systematic bottlenecks.

Each CCD has its own independent L3 cache, but unlike NUMA nodes, CCD boundaries receive little scheduling support from operating systems.
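Because the OS does little here, a scheduler has to discover which cores share an L3 on its own. The sketch below is one way to do that on Linux, where each CPU reports the cores it shares a cache with in `/sys/devices/system/cpu/cpu<N>/cache/index3/shared_cpu_list`. It assumes `index3` is the L3 entry (a robust version would check the sibling `level` file) and that one L3 sharing set corresponds to one CCD, which holds for parts with a single L3 per CCD. The function names are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,96-99' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_l3(shared_lists):
    """Group CPUs by their L3 sharing set.

    shared_lists maps cpu id -> raw shared_cpu_list string.
    Returns a list of CPU groups, ordered by the CPUs in each sharing set.
    """
    groups = defaultdict(set)
    for cpu, raw in shared_lists.items():
        key = tuple(parse_cpu_list(raw))
        groups[key].add(cpu)
    return [sorted(members) for _, members in sorted(groups.items())]


def read_l3_shared_lists(root="/sys/devices/system/cpu"):
    """Read shared_cpu_list for every CPU exposing an index3 cache entry (Linux only)."""
    result = {}
    for path in Path(root).glob("cpu[0-9]*/cache/index3/shared_cpu_list"):
        cpu = int(path.parts[-4][3:])  # directory name 'cpu12' -> 12
        result[cpu] = path.read_text()
    return result
```

On a Linux host, `group_by_l3(read_l3_shared_lists())` yields one core group per L3 domain, and a worker can then be pinned to its group with `os.sched_setaffinity(0, group)`.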

Proposed CCD‑Aware Framework

The CCD‑Level and Load‑Aware Thread Orchestration Framework is inserted as a drop‑in middle layer between the vector index and the underlying scheduler, requiring no changes to index code or OS scheduling.

Unified Task Submission Interface

Inter‑query parallelism for HNSW: each query runs as a single task on one core; queries for the same table are kept on the same CCD to maximize L3 cache reuse.

Intra‑query parallelism for IVF: a query is split into sub‑tasks (different cluster lists) executed in parallel across cores of the same CCD, reducing per‑query latency.
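The two submission modes can be illustrated with a small routing sketch. It is not the framework's actual interface: `Task`, `Submitter`, `submit_hnsw` and `submit_ivf` are hypothetical names, the table‑to‑CCD mapping is assumed to be given, and cores inside a CCD are picked round‑robin purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    table: str
    query_id: int
    ccd: int              # CCD chosen from the table-to-CCD mapping
    core: int             # core index within that CCD
    clusters: tuple = ()  # IVF cluster lists to scan; empty for HNSW


class Submitter:
    """Routes tasks by table so that a table's work stays on one CCD."""

    def __init__(self, table_to_ccd, cores_per_ccd):
        self.table_to_ccd = table_to_ccd
        self.cores_per_ccd = cores_per_ccd
        self._next_core = {}  # per-CCD round-robin cursor

    def _pick_core(self, ccd):
        core = self._next_core.get(ccd, 0)
        self._next_core[ccd] = (core + 1) % self.cores_per_ccd
        return core

    def submit_hnsw(self, table, query_id):
        """Inter-query parallelism: the whole query is one task on one core."""
        ccd = self.table_to_ccd[table]
        return [Task(table, query_id, ccd, self._pick_core(ccd))]

    def submit_ivf(self, table, query_id, cluster_ids):
        """Intra-query parallelism: split cluster lists across cores of one CCD."""
        ccd = self.table_to_ccd[table]
        n = min(self.cores_per_ccd, len(cluster_ids))
        # Stripe clusters over n sub-tasks: sub-task i gets clusters i, i+n, i+2n, ...
        return [
            Task(table, query_id, ccd, self._pick_core(ccd), tuple(cluster_ids[i::n]))
            for i in range(n)
        ]
```

In both modes the CCD is determined only by the table, which is what keeps a table's index data resident in a single L3.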

Hot‑Cold Adaptive Mapping Scheduler

The scheduler estimates per‑table memory traffic in real time, sorts tables by traffic, and applies a greedy two‑ended scan: the hottest unassigned table goes to the CCD that currently has the lowest cumulative traffic, and the coldest unassigned table is placed on the same CCD, forming a hot‑cold pair. This balances traffic and prevents multiple hot tables from sharing a CCD.
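A minimal sketch of the two‑ended scan, assuming traffic estimates are already available as a plain dictionary. The function name `hot_cold_map` and the tie‑breaking rule (lowest CCD id wins) are illustrative choices, not details from the paper.

```python
def hot_cold_map(traffic, num_ccds):
    """Assign tables to CCDs by pairing the hottest with the coldest.

    traffic: dict mapping table name -> estimated memory traffic.
    Returns (mapping, load): table -> CCD id, and cumulative traffic per CCD.
    """
    order = sorted(traffic, key=traffic.get, reverse=True)  # hottest first
    load = [0.0] * num_ccds
    mapping = {}
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        # Least-loaded CCD so far; ties resolve to the lowest CCD id.
        ccd = min(range(num_ccds), key=load.__getitem__)
        hot = order[lo]
        mapping[hot] = ccd
        load[ccd] += traffic[hot]
        if hi > lo:
            # Pair the hot table with the coldest table still unassigned.
            cold = order[hi]
            mapping[cold] = ccd
            load[ccd] += traffic[cold]
        lo += 1
        hi -= 1
    return mapping, load
```

With `{"t1": 90, "t2": 80, "t3": 20, "t4": 10, "t5": 50}` on two CCDs, `t1` is paired with `t4` on CCD 0 and `t2` with `t3` on CCD 1, so the two hottest tables never share a CCD. The leftover middle table `t5` then lands on CCD 0 because of the tie‑break.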

Two sliding windows (10 s for rapid changes, 60 s for stable trends) track traffic and trigger remapping when the deviation exceeds a threshold. Versioned snapshots of the mapping keep the transition smooth and avoid latency spikes.
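The sketch below shows one way the window tracking and the snapshot swap could fit together. Several details are assumptions, since the summary does not specify them: deviation is measured relative to the rate recorded at the last mapping, the threshold value is arbitrary, and the class names `SlidingWindow` and `MappingStore` are hypothetical.

```python
from collections import deque
from types import MappingProxyType


class SlidingWindow:
    """Sum of samples observed within the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.samples = deque()  # (timestamp, value)
        self.total = 0.0

    def add(self, now, value):
        self.samples.append((now, value))
        self.total += value
        self._evict(now)

    def rate(self, now):
        self._evict(now)
        return self.total / self.span

    def _evict(self, now):
        while self.samples and self.samples[0][0] <= now - self.span:
            _, old = self.samples.popleft()
            self.total -= old


def deviates(window, baseline, now, threshold):
    """True when the windowed rate differs from baseline by more than threshold (relative)."""
    if baseline == 0:
        return window.rate(now) > 0
    return abs(window.rate(now) - baseline) / baseline > threshold


class MappingStore:
    """Holds the current table-to-CCD mapping as an immutable, versioned snapshot."""

    def __init__(self, mapping):
        self._snapshot = (0, MappingProxyType(dict(mapping)))

    def current(self):
        # Readers take one reference and use it for the whole task.
        return self._snapshot

    def publish(self, mapping):
        version = self._snapshot[0] + 1
        # One reference assignment swaps the snapshot; in-flight tasks keep the old one.
        self._snapshot = (version, MappingProxyType(dict(mapping)))
        return version
```

A short burst moves the 10 s window's rate well before the 60 s window's, so the fast window catches sudden shifts while the slow one filters out noise. Because readers hold a reference to an immutable snapshot, publishing a new mapping never changes a mapping that a running task is already using.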

Topology‑Aware Hierarchical Task Stealing

Level 1: Local dequeue – threads first pull from their own queue (zero cross‑CCD cost).

Level 2: Intra‑CCD stealing – if the local queue is empty, steal from another thread on the same CCD, preserving cache locality.

Level 3: Cross‑CCD stealing – allowed only when all threads on a CCD are idle and another CCD is heavily loaded, limiting cache‑miss penalties.

This hierarchy reduces cross‑CCD steal rates from ~75 % (HNSW) / ~80 % (IVF) to <10 % / ~5 % respectively.
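The three levels can be sketched as a single lookup function. This is an illustration under simplifying assumptions rather than the framework's implementation: a CCD counts as idle when all of its queues are empty, "heavily loaded" is modelled by a hypothetical `heavy_threshold` on a CCD's total backlog (assumed to be at least 1), and the sketch is single‑threaded, so it omits the synchronization a real work‑stealing deque needs.

```python
from collections import deque


class Worker:
    def __init__(self, wid, ccd):
        self.wid = wid
        self.ccd = ccd
        self.queue = deque()


def next_task(worker, workers, heavy_threshold=4):
    """Return (task, level) following the three-level order, or (None, 0) if idle."""
    # Level 1: own queue.
    if worker.queue:
        return worker.queue.popleft(), 1

    # Level 2: steal from the tail of the busiest sibling on the same CCD.
    siblings = [w for w in workers if w.ccd == worker.ccd and w is not worker]
    victim = max(siblings, key=lambda w: len(w.queue), default=None)
    if victim is not None and victim.queue:
        return victim.queue.pop(), 2

    # Level 3: reaching here means every queue on this CCD is empty.
    # Steal across CCDs only from a CCD whose backlog reaches the threshold.
    backlog = {}
    for w in workers:
        if w.ccd != worker.ccd:
            backlog[w.ccd] = backlog.get(w.ccd, 0) + len(w.queue)
    if backlog:
        busiest = max(backlog, key=backlog.get)
        if backlog[busiest] >= heavy_threshold:
            victim = max(
                (w for w in workers if w.ccd == busiest),
                key=lambda w: len(w.queue),
            )
            return victim.queue.pop(), 3
    return None, 0
```

Level 3 is only reachable after Levels 1 and 2 fail, so a cross‑CCD steal cannot occur while any work remains on the thief's own CCD.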

Experimental Evaluation

Environment: a 96‑core AMD EPYC Genoa (12 CCDs, 32 MiB L3 per CCD) with 576 GB of DDR5 memory, and a 48‑core AMD EPYC Rome (12 CCDs, 16 MiB L3 per CCD) with 512 GB of DDR4 memory. Test data includes 60 HNSW tables (1 M‑10 M vectors, 64‑256 dimensions) and 15 IVF tables (10 K‑15 M vectors, same dimension range), all drawn from production workloads.

Throughput: On 96‑core Genoa, the CCD‑aware V2 reaches >100 KQPS for HNSW (vs. ~70 KQPS for V0/V1) and ~35 KQPS for IVF (vs. ~25 KQPS for V1, ~10 KQPS for V0). HNSW shows near‑linear scaling as the CCD count grows from 4 to 12.

Latency: P50 latency improves by 30 %‑50 %; P999 tail latency drops by 60 %‑90 %. The long‑tail reductions stem from eliminating the occasional cross‑CCD steals that forced cache re‑warming.

Hardware Metrics (relative to V1 baseline):

L3 cache miss rate ↓ 6 %‑15 % (HNSW) and ↓ 15 %‑30 % (IVF).

CPU stall proportion ↓ 20 %‑40 % (HNSW) and ↓ 40 %‑80 % (IVF).

Cross‑CCD steal rate ↓ from ~75 % / ~80 % to <10 % / ~5 %.

These reductions indicate that more computation stays within L3 cache, memory accesses shrink, and CPU cycles are better utilized.

Conclusion

The study systematically analyses performance bottlenecks of industrial ANNS services on CCD‑based multicore CPUs and introduces the first CCD‑aware thread orchestration framework. By combining hot‑cold load‑aware mapping with topology‑aware hierarchical stealing, the framework resolves the long‑standing conflict between cache affinity and load balancing without hardware changes or index code modifications, achieving up to 3.7× throughput gain and up to 90 % tail‑latency reduction.
