Design Analysis of Lucene and In-Memory Inverted Index Service for Advertising Retrieval
The team analyzed Lucene’s disk‑based inverted index and built a custom in‑memory inverted‑index service for Himalaya’s ad engine, encoding terms as 64‑bit keys, supporting real‑time updates and BooleanQuery‑style and custom expression filtering, which cut query latency from ~50 ms to under 5 ms and enabled massive scaling.
Background
Himalaya's advertising engine originally used ElasticSearch (which relies on Lucene) to retrieve large numbers of ad materials. As the volume grew, Lucene's disk‑based inverted index caused significant I/O and performance bottlenecks. Because ad material IDs can be fully stored in memory, the team decided to design a high‑performance, memory‑resident inverted‑index service tailored to this scenario.
Lucene Design Analysis
Lucene stores data on disk and builds an inverted list for each Term . The core structures include a term dictionary (lexicon) and posting lists. Key components discussed are:
BytesRefHash + FreqProxPostingsArray for bidirectional term‑id mapping.
Use of slice to write doc IDs as D‑Gaps for compression.
Finite State Transducer (FST) to map terms to posting‑list file offsets, reducing memory usage.
Figures illustrate the term dictionary, posting list structures, and the overall in‑memory representation of Lucene’s inverted index.
In‑Memory Inverted Index Service Design
Index Data Storage : For millions of ad items, the service stores the entire index in memory. A bitmap (FixedBitSet or SparseFixedBitSet) is used for posting lists, with RoaringDocIdSet for sparse scenarios.
Term Dictionary Storage : Instead of strings, terms are encoded as a 64‑bit long composed of document type (high 16 bits), field type (middle 16 bits), and field value (low 32 bits). Alternative structures such as Trie, PatriciaTrie, or Radix tree are considered, but the long‑key map is chosen for fixed‑size, fast lookup.
Index Update : The service supports real‑time updates using an in‑place strategy. Updates lock at the term level to minimise contention between query and update threads.
Index Query : The query model mirrors Lucene’s BooleanQuery (MUST, SHOULD, FILTER, MUST_NOT). Example Java code:
BooleanQuery.Builder query = new BooleanQuery.Builder();
query.add(new TermQuery(new Term("content", "aa")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "bb")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("content", "cc")), BooleanClause.Occur.MUST_NOT);
query.add(new TermQuery(new Term("content", "dd")), BooleanClause.Occur.SHOULD);Boolean operators (AND, OR, NOT, +, -) are mapped to the corresponding Lucene clauses.
Complex Query Expression Design
To handle business‑logic‑heavy filtering, a custom expression language is introduced with logical operators (and, or, not) and comparison operators (==, !=, in, not in). Example:
appPositionTypeId not in (2, 3) and deliveryGoal in (2, 3)and an IF function that mimics Java conditional logic:
IF(styleId == 21, adType in (CONTENT, PICTURE_TEXT), materialId != 3)These expressions are parsed by a JavaCC‑generated lexer/parser and transformed into the internal BooleanQuery representation.
Filtering Statistics
During query execution each ad material must be counted for the filtering rule that caused its exclusion. The initial design used a HashMap<DocId, FilterCode> with AtomicInteger counters, which proved to be a CPU hotspot. The optimized design replaces the map with a per‑document counter array (or sparse 2‑D matrix) accessed via O(1) index lookup, and uses the Disruptor framework to off‑load counting to an asynchronous pipeline.
Summary
The analysis of Lucene’s storage and query mechanisms guided the design of a custom in‑memory inverted index service. After deployment, query latency dropped from ~50 ms (ElasticSearch) to < 5 ms, and the system now scales to a much larger set of ad materials, supporting future business growth.
Ximalaya Technology Team
Official account of Ximalaya's technology team, sharing distilled technical experience and insights to grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.