NeighborHash: An Enhanced Batch Query Architecture for Real‑time Recommendation Systems
NeighborHash is a distributed batch‑query architecture for real‑time recommendation systems that combines a cache‑line‑optimized hash table—featuring Lodger Relocation, bidirectional cache‑aware probing, and inline‑chaining—with an NVMe‑backed key‑value service, versioned updates, and asynchronous memory‑access chaining to achieve sub‑microsecond, high‑throughput top‑N retrieval.
Modern recommendation systems require ultra‑low‑latency top‑N retrieval from massive data sets. A storage layer that can hold huge data volumes while sustaining high‑throughput batch queries is therefore a critical component of recall, ranking, and offline model training.
Key challenges include (1) high‑throughput batch reads for large feature sets, (2) consistency of batch queries under concurrent updates, and (3) scaling storage capacity while preserving performance.
To address these challenges, the authors designed a distributed batch‑query architecture built around NeighborHash, an optimized hash table that dramatically reduces cache‑line accesses per lookup. NeighborHash is combined with an NVMe‑backed distributed key‑value service that scales horizontally in both functionality and model size, while a versioned update protocol guarantees eventual consistency during bulk queries.
The work is published as “An Enhanced Batch Query Architecture in Real‑time Recommendation” in the Proceedings of CIKM 2024 (doi: 10.1145/3627673.3680034).
NeighborHash builds on traditional hash‑table techniques (Separate Chaining and Open Addressing) and introduces three novel optimizations:
Lodger Relocation: a dynamic Cellar‑Region strategy that minimizes probe‑sequence length by moving “lodger” entries to nearby empty buckets.
Bidirectional Cacheline‑aware Probing: probes both directions within a cache line and expands outward only when necessary, reducing average cache‑line accesses from 1.47 (linear probing) to 1.12 at 75% load.
Inline‑Chaining: encodes the next‑element offset in the high 12 bits of a bucket, allowing a compact, cache‑friendly linked list without extra pointers.
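The Inline‑Chaining encoding can be sketched in a few lines. The field widths (a 12‑bit next‑element offset in the high bits of a 64‑bit bucket) follow the description above; the helper names and the assumption that the remaining 52 bits hold the entry payload are illustrative, not taken from the paper.

```cpp
#include <cstdint>

// Sketch of inline-chaining: the high 12 bits of a 64-bit bucket store the
// offset to the next chain element, leaving the low 52 bits for the entry
// payload. No separate pointer field is needed, so the chain stays compact
// and cache-friendly.
constexpr int kOffsetBits = 12;
constexpr uint64_t kPayloadMask = (uint64_t{1} << (64 - kOffsetBits)) - 1;

inline uint64_t pack_bucket(uint64_t payload, uint64_t next_offset) {
    return (next_offset << (64 - kOffsetBits)) | (payload & kPayloadMask);
}
inline uint64_t payload_of(uint64_t bucket) { return bucket & kPayloadMask; }
inline uint64_t next_offset_of(uint64_t bucket) { return bucket >> (64 - kOffsetBits); }
```

A 12‑bit offset caps chain reach at 4095 buckets, which is ample when lodgers are relocated to nearby empty buckets.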
Experimental configuration uses uniformly distributed keys, a load factor of 0.8, large‑page memory, a read‑dominant 90% hit workload, and absl::Hash . Results show linear probing accesses 1.36 cache lines and 2.37 comparisons on average, while Separate Chaining accesses 2.46 cache lines with similar comparisons.
The design also draws inspiration from Google’s Swiss Table, which splits a 64‑bit hash into H1 (57 bits) and H2 (7 bits) plus a marker bit. This meta‑data layer enables SIMD‑accelerated batch probing: 16 meta‑data bytes are compared in parallel using SSE instructions, producing a bitmap of matching slots.
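The hash split and the group probe can be sketched as follows. The H1/H2 widths match the description above; the group comparison is written here as a scalar stand‑in that produces the same bitmap the SSE path computes in one step with `_mm_cmpeq_epi8` and `_mm_movemask_epi8`. Function names are illustrative.

```cpp
#include <cstdint>

// Swiss-Table-style split of a 64-bit hash: H1 (high 57 bits) selects the
// group, H2 (low 7 bits) is the per-slot metadata byte compared during probing.
inline uint64_t H1(uint64_t hash) { return hash >> 7; }
inline uint8_t  H2(uint64_t hash) { return hash & 0x7F; }

// Scalar stand-in for the SIMD group probe: compare 16 metadata bytes against
// h2 and return a bitmap with bit i set when slot i matches. The SSE version
// does this in two instructions (_mm_cmpeq_epi8 + _mm_movemask_epi8).
inline uint16_t group_match(const uint8_t metadata[16], uint8_t h2) {
    uint16_t bitmap = 0;
    for (int i = 0; i < 16; ++i)
        if (metadata[i] == h2) bitmap |= uint16_t{1} << i;
    return bitmap;
}
```

The bitmap lets the probe loop visit only candidate slots, so a full key comparison is paid only on metadata matches.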
Benchmarking compares NeighborHash (NB) against Linear Probing (LP), bytell Hash (BT), Coalesced Hash (CH), and a Random Access (RA) upper bound. NB consistently outperforms LP, BT, and CH, and approaches the RA upper bound, consistent with its average of 1.12 cache‑line accesses per query.
To mitigate memory‑access dependencies, the authors introduce AMAC (Asynchronous Memory Access Chain). In batch mode, each lookup maintains a state machine; while one key is waiting for data, the processor proceeds with the next key, effectively overlapping memory loads and keeping the superscalar pipeline busy.
Further SIMD acceleration mirrors Swiss Table’s approach but extends it to multiple keys simultaneously, using a residual vector to hold keys that require additional probing.
Ablation studies isolate the impact of Lodger Relocation, Cacheline‑aware Neighbor Probing, and Inline‑Chaining, showing throughput gains of roughly 20 %, 30 %, and 30 % respectively, and reducing average cache‑line accesses per lookup (APCL) from 1.72 to 1.14.
The overall system adopts a Google Mesa‑style architecture with two main components: an update subsystem (handling versioned data ingestion, rolling updates, and etcd‑stored metadata) and a batch‑query subsystem (sharded, replicated, and capable of handling hundreds of thousands of queries per second). The update path ensures strong version consistency across shards during rolling updates.
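The consistency property for batch reads can be sketched as version pinning: each bulk update publishes a fully built, immutable table version, and every batch query pins one version for its entire lifetime, so all keys in the batch observe the same snapshot even while a rolling update lands. The class and member names below are illustrative, not from the paper, and a mutex‑guarded pointer swap stands in for whatever publication mechanism the real system uses.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Versioned-read sketch: readers pin an immutable snapshot; the updater
// swaps in a new version atomically under a short critical section.
class VersionedStore {
public:
    using Table = std::unordered_map<uint64_t, uint64_t>;

    // Updater path: publish a fully built new version.
    void publish(std::shared_ptr<const Table> next, uint64_t version) {
        std::lock_guard<std::mutex> g(mu_);
        table_ = std::move(next);
        version_ = version;
    }
    // Query path: pin one version, then answer the whole batch from it.
    std::vector<uint64_t> batch_get(const std::vector<uint64_t>& keys) const {
        std::shared_ptr<const Table> snap;
        {
            std::lock_guard<std::mutex> g(mu_);
            snap = table_;  // pin: updates after this point are invisible
        }
        std::vector<uint64_t> out;
        for (uint64_t k : keys) {
            auto it = snap->find(k);
            out.push_back(it == snap->end() ? 0 : it->second);
        }
        return out;
    }
private:
    mutable std::mutex mu_;
    std::shared_ptr<const Table> table_ = std::make_shared<Table>();
    uint64_t version_ = 0;
};
```

Because old versions stay alive until the last pinned reader releases its `shared_ptr`, in‑flight batches are never torn across an update boundary.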
For cold data, the design offloads large values to NVMe while keeping keys and offsets in memory, using io_uring for efficient I/O. The cache layer stores the full key set in memory and employs lightweight frequency statistics for eviction, differing from traditional LRU schemes.
Bilibili Tech