How Baidu’s HybridIndexTable Redefined Inverted Index Performance
This article examines Baidu's Limitless ad‑funnel optimization, detailing how a cache‑friendly HybridIndexTable and adaptive memory containers dramatically improve inverted‑list scan speed, reduce memory overhead, and enable lock‑free updates for high‑throughput retrieval systems.
Limitless System Goals
The advertising pipeline (recall → coarse ranking → fine ranking) should behave like a "no‑cut" funnel: high read throughput, immediate update latency, and low memory consumption. Baidu’s Limitless project aims to keep the funnel as narrow as possible while satisfying all three constraints.
Technical Background – Inverted List
Before optimization the inverted index was a classic HashMap<keysign, SkipList>. Each query scans a linked list of postings for the triggered keysign. Advertising workloads generate many long and ultra‑long chains, causing severe cache‑misses because linked lists have poor spatial locality on modern CPUs with deep memory hierarchies.
Performance Bottleneck Analysis
Cache‑unfriendly linked lists incur O(n) cache‑miss cost per scan, while balanced trees give O(log n) . Prefix trees (tries) reduce cache misses to O(k/s) , where k is key length and s is fan‑out, making them far more cache‑friendly.
Solution – HybridIndexTable (HIT)
To meet the Limitless requirements Baidu designed an in‑memory data structure called HybridIndexTable (HIT) . Its key ideas are: HashMap indexes keysign.
Short posting chains are stored contiguously in a custom RowContainer (RC) . Long chains are represented by a leaf‑prefix tree based on the Adaptive Radix Tree (ART).
Both short chains and leaf nodes share the same RC layout, enabling compact storage and sequential scans.
Additional details:
Key‑order routing guarantees ordered traversal for ultra‑long chains.
RC units are scanned without internal ordering; the scanner processes RCs one by RC at a time, which yields fast sequential access.
Advantages of HIT
Read performance: contiguous memory plus prefix‑tree reduces cache misses; the design is thread‑friendly for readers.
Update latency: append‑only writes and mark‑delete semantics allow immediate material updates.
Memory overhead: adaptive algorithms keep per‑element overhead under 8 bytes.
RowContainer (RC) Design
RC_x denotes a contiguous storage region that can hold up to x elements. It supports only two operations:
Append‑Only: new elements are added at the tail.
Mark‑Delete: a bitmap marks slots as invalid without moving other elements.
Metadata includes a cursor (next free slot) and a validity bitmap. For the smallest RC (RC1) the metadata fits in 8 bytes; larger RCs reserve an extra pointer for the data array and use bit‑fields to store the bitmap, still keeping the per‑element overhead ≤ 8 bytes.
Memory allocation is slab‑based. Approximately 80 % of posting chains are short and fit into RC1, so a dedicated slab pool manages these objects, eliminating an extra pointer indirection during lookup. Larger RCs use an array‑of‑arrays layout to keep internal fragmentation low.
LeafCompaction for Sparse Keys
Even with ART, sparse key distributions can produce deep, low‑fan‑out sub‑trees, hurting scan performance. LeafCompaction merges a leaf node into its parent when the subtree is sparse and splits it back into a tree when it becomes dense. The default heuristic evaluates the average subtree size; in practice it achieves up to 90 % compression and doubles per‑element scan speed in sparse scenarios.
Read‑Copy‑Update (RCU) for Lock‑Free Reads
Traditional fine‑grained locks introduce contention that masks the benefits of cache‑friendly data layouts. Baidu adopts Linux’s RCU mechanism:
Readers access a stable snapshot without acquiring locks.
Writers copy the data structure, apply modifications, then atomically replace the old pointer.
A grace period ends when all CPUs have passed a quiescent state, after which the old version can be reclaimed.
Three properties of the retrieval service simplify RCU usage:
Single writer thread (the reload thread) performs material updates.
No transactional dependencies between recalled items.
Reader hold time is bounded by strict latency SLAs.
Learned Index & L2I – Workload‑Adaptive Scanning
Recent research on Learned Indexes (RMI, CDF models) shows that data‑driven models can replace traditional index structures. Baidu extends this idea with L2I , which:
Analyzes offline co‑occurrence of ultra‑short and short chains to merge them, leveraging HIT’s fast sequential scan.
Uses entropy‑maximizing <pid,cid> bitmaps to decide whether a query can be answered by a point‑lookup fallback.
References
Software Engineering Advice from Building Large‑Scale Distributed Systems, 2002
Cache‑conscious indexing for decision‑support in main memory, 1999
Making B+‑trees cache conscious in main memory, 2000
FAST: fast architecture‑sensitive tree search on modern CPUs and GPUs, 2010
The Adaptive Radix Tree: ARTful Indexing for Main‑Memory Databases, 2013
PATRICIA – Practical Algorithm To Retrieve Information Coded in Alphanumeric, 1968
HOT: A height‑optimized trie index for main‑memory database systems, 2018
The Case for Learned Index Structures, 2018
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
