Databases 19 min read

How NeighborHash Boosts Real‑Time Recommendation Queries with Low Latency

To meet the ultra‑low latency demands of modern recommendation systems, the authors designed a distributed batch‑query architecture featuring the NeighborHash optimization—a cache‑line‑aware hash table that reduces memory accesses, combined with NVMe‑backed storage and AMAC techniques, achieving high throughput and near‑optimal bandwidth utilization.

dbaplus Community

Jun 8, 2025

How NeighborHash Boosts Real‑Time Recommendation Queries with Low Latency

Background

Real‑time recommendation systems require ultra‑low‑latency top‑N retrieval from billions of key‑value pairs. The storage layer must provide high‑throughput batch reads, strong consistency under concurrent updates, and large capacity while keeping memory bandwidth usage minimal.

NeighborHash Design

NeighborHash is a SIMD‑accelerated hash table that reduces cache‑line accesses per lookup. Each 64‑bit hash is split into H1 (57 bits) for bucket addressing and H2 (7 bits) plus a marker that forms a 128‑bit metadata vector compared with a single SSE instruction.

Lodger Relocation : When a bucket is occupied by a “lodger” (a key whose hash maps to a different bucket), the lodger is moved to a nearby empty slot before inserting the new key. This keeps probe lengths short and simplifies insertion.

Bidirectional Cacheline‑aware Probing : Buckets that share a cache line are probed in both directions, lowering the average number of cache‑line accesses (≈1.12 at 75 % load factor) compared with classic linear probing.

Inline Chaining : The high 12 bits of the stored value encode a relative offset to the next element in the collision chain, allowing the chain to stay within the same cache line and supporting load factors >80 %.

Asynchronous Memory Access Chain (AMAC)

AMAC hides memory‑access latency in batch‑lookup scenarios where each lookup depends on the result of the previous one (chasing). A per‑key state machine issues pre‑fetch and compute instructions for the next key while the current key is waiting for data, keeping the processor’s multiple issue slots fully utilized.

Benchmark Results

Micro‑benchmarks compare NeighborHash (NB) against Linear Probing (LP), Byte‑level Hash (BT), Coalesced Hash (CH) and a Random Access (RA) baseline. NB consistently reduces cache‑line accesses and comparisons. Adding AMAC with 8 or 32 parallel state machines (AMAC‑8, AMAC‑32) narrows the gap to RA, achieving up to 2× speedup on large datasets.

Ablation studies show that each of the three innovations contributes roughly 20‑30 % throughput improvement, and together they lower the average cache‑line accesses per operation (APCL) from 1.72 to 1.14.

System Architecture

The production system follows a split architecture inspired by Google‑Mesa: an update subsystem and a batch‑query subsystem, each deployed across multiple shards and replicas.

Update Subsystem

Listener : watches for new model or user‑behavior data.

Index : converts raw data to protobuf/binary and builds indexes.

Update : performs rolling updates of replicas, restarting one replica at a time.

Metadata (table state, quotas) is stored in etcd; each table maps to a Kubernetes StatefulSet for elastic scaling of CPU, memory, and NVMe resources.

Batch‑Query Subsystem

Queries are served without downtime using rolling replica updates. Strong consistency for versioned embedding tables is achieved by embedding version metadata directly in the query protocol, avoiding external naming services.

NVMe Backend

Cold data resides on NVMe while hot keys stay in memory. The key‑offset‑length table is kept in RAM; actual value bytes are fetched from NVMe via io_uring, reducing cost without sacrificing query latency.

Open‑source implementation: https://github.com/slow-steppers/NeighborHash/commits/main/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

recommendation system hash table Distributed storage NVMe batch query

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.