
Design Analysis of Lucene and In-Memory Inverted Index Service for Advertising Retrieval

The team analyzed Lucene's disk-based inverted index and built a custom in-memory inverted-index service for Ximalaya's ad engine. The service encodes terms as 64-bit keys and supports real-time updates, BooleanQuery-style filtering, and a custom expression language; after deployment, query latency fell from roughly 50 ms to under 5 ms while the system scaled to a far larger set of ad materials.

Ximalaya Technology Team

Background

Ximalaya's advertising engine originally used Elasticsearch (which is built on Lucene) to retrieve large numbers of ad materials. As volume grew, Lucene's disk-based inverted index became a significant I/O and performance bottleneck. Because all ad material IDs fit in memory, the team decided to design a high-performance, memory-resident inverted-index service tailored to this scenario.

Lucene Design Analysis

Lucene stores data on disk and builds an inverted list for each Term. The core structures are the term dictionary (lexicon) and the posting lists. The key components discussed are:

BytesRefHash + FreqProxPostingsArray for bidirectional term‑id mapping.

Slices used to write doc IDs as D-gaps (delta-encoded gaps) for compression.

Finite State Transducer (FST) to map terms to posting‑list file offsets, reducing memory usage.
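The D-gap idea above can be sketched in a few lines. This is an illustrative Java snippet, not Lucene's actual slice implementation; Lucene additionally variable-byte-encodes the gaps, which is omitted here:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative D-gap (delta) encoding of a sorted posting list.
// Small gaps compress better than raw doc IDs in a varint scheme.
class DGaps {
    // Convert ascending doc IDs to gaps: [3, 7, 12] -> [3, 4, 5].
    static List<Integer> encode(List<Integer> docIds) {
        List<Integer> gaps = new ArrayList<>();
        int prev = 0;
        for (int id : docIds) {
            gaps.add(id - prev);
            prev = id;
        }
        return gaps;
    }

    // Recover the original doc IDs by accumulating the gaps.
    static List<Integer> decode(List<Integer> gaps) {
        List<Integer> docIds = new ArrayList<>();
        int current = 0;
        for (int gap : gaps) {
            current += gap;
            docIds.add(current);
        }
        return docIds;
    }
}
```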

Figures illustrate the term dictionary, posting list structures, and the overall in‑memory representation of Lucene’s inverted index.

In‑Memory Inverted Index Service Design

Index Data Storage: For millions of ad items, the service stores the entire index in memory. A bitmap (FixedBitSet or SparseFixedBitSet) is used for posting lists, with RoaringDocIdSet for sparse scenarios.
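A minimal sketch of bitmap posting lists, using java.util.BitSet as a stand-in for Lucene's FixedBitSet so the example stays self-contained. Bit i of a term's bitmap is set iff the ad material with doc ID i contains that term, so Boolean combination is word-parallel bit algebra:

```java
import java.util.BitSet;

// Bitmap posting lists: each term owns one bitmap over all doc IDs.
class BitmapPostings {
    // Intersect two posting lists (the MUST/AND case).
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    // Union two posting lists (the SHOULD/OR case).
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }
}
```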

Term Dictionary Storage: Instead of strings, terms are encoded as a 64-bit long composed of document type (high 16 bits), field type (middle 16 bits), and field value (low 32 bits). Alternative structures such as Trie, PatriciaTrie, or Radix tree were considered, but the long-keyed map was chosen for its fixed-size keys and fast lookups.
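The 64-bit layout described above can be sketched with plain bit shifts; the method names here are illustrative, not from the actual service:

```java
// Pack/unpack the 64-bit term key: document type in the high 16 bits,
// field type in the next 16 bits, field value in the low 32 bits.
class TermKey {
    static long pack(int docType, int fieldType, long fieldValue) {
        return ((long) (docType & 0xFFFF) << 48)
             | ((long) (fieldType & 0xFFFF) << 32)
             | (fieldValue & 0xFFFFFFFFL);
    }

    static int docType(long key)     { return (int) (key >>> 48); }
    static int fieldType(long key)   { return (int) ((key >>> 32) & 0xFFFF); }
    static long fieldValue(long key) { return key & 0xFFFFFFFFL; }
}
```

Because every key is a fixed-width primitive, the term dictionary can be a long-keyed hash map with no string hashing or comparison on the hot path.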

Index Update: The service supports real-time updates using an in-place strategy. Updates lock at the term level to minimise contention between query and update threads.
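One way to sketch term-granular locking, assuming a concurrent map from 64-bit term key to posting bitmap: writers synchronize on the individual bitmap rather than the whole index. This is illustrative only; a production service would also need safe publication for concurrent readers (e.g. copy-on-write bitmaps), which is omitted here:

```java
import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;

// In-place updates with per-term locking: concurrent writers to
// different terms never contend with each other.
class InPlaceIndex {
    private final ConcurrentHashMap<Long, BitSet> postings = new ConcurrentHashMap<>();

    void add(long termKey, int docId) {
        BitSet bits = postings.computeIfAbsent(termKey, k -> new BitSet());
        synchronized (bits) {   // lock only this term's posting list
            bits.set(docId);
        }
    }

    void remove(long termKey, int docId) {
        BitSet bits = postings.get(termKey);
        if (bits != null) {
            synchronized (bits) {
                bits.clear(docId);
            }
        }
    }

    boolean contains(long termKey, int docId) {
        BitSet bits = postings.get(termKey);
        return bits != null && bits.get(docId);
    }
}
```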

Index Query: The query model mirrors Lucene's BooleanQuery (MUST, SHOULD, FILTER, MUST_NOT). Example Java code:

BooleanQuery.Builder query = new BooleanQuery.Builder();
query.add(new TermQuery(new Term("content", "aa")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "bb")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("content", "cc")), BooleanClause.Occur.MUST_NOT);
query.add(new TermQuery(new Term("content", "dd")), BooleanClause.Occur.SHOULD);

Boolean operators (AND, OR, NOT, +, -) are mapped to the corresponding Lucene clauses.

Complex Query Expression Design

To handle business‑logic‑heavy filtering, a custom expression language is introduced with logical operators (and, or, not) and comparison operators (==, !=, in, not in). Example:

appPositionTypeId not in (2, 3) and deliveryGoal in (2, 3)

and an IF function that mimics Java conditional logic:

IF(styleId == 21, adType in (CONTENT, PICTURE_TEXT), materialId != 3)

These expressions are parsed by a JavaCC‑generated lexer/parser and transformed into the internal BooleanQuery representation.
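Once parsed, comparisons like "in" and "not in" lower naturally onto the bitmap posting lists: "field in (a, b)" is the union of the per-value posting lists, and "not in" flips that union against the full doc set (the MUST_NOT case). The map layout and method names below are illustrative, not the real parser output:

```java
import java.util.BitSet;
import java.util.Map;

// Lowering "in" / "not in" comparisons to bitmap algebra.
class InClause {
    // "field in (v1, v2, ...)": union of the posting lists of each value's term key.
    static BitSet in(Map<Long, BitSet> postings, long... termKeys) {
        BitSet result = new BitSet();
        for (long key : termKeys) {
            BitSet bits = postings.get(key);
            if (bits != null) result.or(bits);
        }
        return result;
    }

    // "field not in (...)": complement of the union, restricted to all live docs.
    static BitSet notIn(Map<Long, BitSet> postings, BitSet allDocs, long... termKeys) {
        BitSet result = (BitSet) allDocs.clone();
        result.andNot(in(postings, termKeys));
        return result;
    }
}
```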

Filtering Statistics

During query execution each ad material must be counted for the filtering rule that caused its exclusion. The initial design used a HashMap<DocId, FilterCode> with AtomicInteger counters, which proved to be a CPU hotspot. The optimized design replaces the map with a per‑document counter array (or sparse 2‑D matrix) accessed via O(1) index lookup, and uses the Disruptor framework to off‑load counting to an asynchronous pipeline.
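The counter-array part of the optimized design can be sketched as follows, assuming docId and filterCode are both small dense integers so a flat array indexed by docId * numFilters + filterCode gives lock-free O(1) increments. The class and method names are illustrative; the asynchronous Disruptor hand-off the article mentions is omitted:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Per-document filtering statistics as a flat counter matrix,
// replacing the HashMap<DocId, FilterCode> hotspot with index math.
class FilterStats {
    private final int numFilters;
    private final AtomicIntegerArray counters;

    FilterStats(int maxDocs, int numFilters) {
        this.numFilters = numFilters;
        this.counters = new AtomicIntegerArray(maxDocs * numFilters);
    }

    // Record that docId was excluded by the rule identified by filterCode.
    void recordFiltered(int docId, int filterCode) {
        counters.incrementAndGet(docId * numFilters + filterCode);
    }

    int count(int docId, int filterCode) {
        return counters.get(docId * numFilters + filterCode);
    }
}
```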

Summary

The analysis of Lucene's storage and query mechanisms guided the design of a custom in-memory inverted-index service. After deployment, query latency dropped from roughly 50 ms (Elasticsearch) to under 5 ms, and the system now scales to a much larger set of ad materials, supporting future business growth.

Tags: Java, Advertising, Query Optimization, Lucene, Inverted Index, Data Structures, Memory Search