Real‑Time Inverted Index Update Techniques in the 58 Search Engine
This article explains how the 58 search engine achieves millisecond‑level real‑time inverted‑index updates by redesigning the underlying data structures: combining static and dynamic indexing, adopting a chain‑array hybrid, applying segment‑merging strategies, and using lock‑free read‑write concurrency, all while maintaining search performance.
Inverted indexes are the core technology of search engines, enabling fast retrieval over massive document collections. Traditional batch‑built static indexes provide excellent query speed but suffer from high update latency, which is unacceptable for online services that require sub‑second responses.
The article describes how the 58 search engine maintains retrieval performance while achieving real‑time index updates. It first distinguishes between static (offline) and dynamic (online) indexes: static indexes are built in bulk and stored on disk, whereas dynamic indexes reside in memory and handle newly added documents.
Two common update strategies are described: (1) periodically rebuilding the entire index from scratch, suitable only for scenarios that can tolerate stale data between rebuilds; (2) building an auxiliary dynamic index in memory and merging it with the main static index when it reaches a size threshold. The 58 system adopts the second strategy.
The architecture of the 58 index system consists of an offline indexing phase that shards documents and generates static inverted index files, and an online phase that loads these files, runs a document‑receiving thread, and creates incremental in‑memory indexes. The static index is divided into three parts: meta information, a doc‑id array, and term‑position data, optimizing cache friendliness and space usage.
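The three‑part static layout described above can be sketched as follows. This is a simplified Python illustration under assumed names (`TermMeta`, `StaticIndex`); the actual on‑disk format is not specified in the article.

```python
from dataclasses import dataclass

@dataclass
class TermMeta:
    term: str            # the indexed term
    doc_count: int       # number of documents containing the term
    posting_offset: int  # offset of this term's postings in the doc-id array

@dataclass
class StaticIndex:
    meta: list       # TermMeta entries: small, cache-friendly lookup section
    doc_ids: list    # one flat array; each term's postings stored contiguously
    positions: dict  # (term, doc_id) -> positions of the term within the doc

    def postings(self, m: TermMeta):
        # Postings are a contiguous slice, so a lookup is one offset + length read.
        start = m.posting_offset
        return self.doc_ids[start:start + m.doc_count]
```

Packing all postings into one flat doc‑id array, with the meta section holding only offsets and counts, is what gives the cache friendliness and space savings mentioned above.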
To support real‑time updates, the design replaces a pure static structure with a hybrid chain‑array model. Each term’s posting list is a linked list of arrays; the first array holds four doc‑ids, the next eight, and so on, doubling each time. Statistics from production show that 88.7% of terms have posting lengths ≤ 4, justifying the initial array size.
An example inverted index, mapping each term to the set of doc‑ids that contain it:

"a": {2}
"banana": {2}
"is": {0,1,2}
"it": {0,1,2}
"what": {0,1}

When a new document is added, the system appends entries to the tail of the linked list or expands the current array, avoiding in‑place modifications and eliminating the need for write locks. A MAXID field records the highest fully written doc‑id, allowing queries to filter out partially written documents.
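The chain‑array posting list and the MAXID filter can be sketched together. This is a minimal single‑process Python illustration (class and field names are assumptions); the production version relies on memory‑ordering guarantees rather than Python semantics for its lock‑free behavior.

```python
class Block:
    """One array node in the chain; capacities double along the chain."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.doc_ids = []
        self.next = None

class ChainArrayPostingList:
    FIRST_CAPACITY = 4  # 88.7% of terms fit entirely in this first block

    def __init__(self):
        self.head = self.tail = Block(self.FIRST_CAPACITY)

    def append(self, doc_id):
        # Writers only ever touch the tail: either fill the current block
        # or link a new, twice-as-large block. Existing entries never move.
        if len(self.tail.doc_ids) == self.tail.capacity:
            new_block = Block(self.tail.capacity * 2)
            self.tail.next = new_block
            self.tail = new_block
        self.tail.doc_ids.append(doc_id)

    def iterate(self, max_id):
        # Readers walk the chain without locks, skipping doc-ids above
        # MAXID, i.e. documents that are not yet fully written.
        block = self.head
        while block is not None:
            for doc_id in block.doc_ids:
                if doc_id <= max_id:
                    yield doc_id
            block = block.next
```

Because the writer publishes MAXID only after a document's postings are complete, a concurrent reader either sees the whole document or none of it, which is what removes the need for read‑write locks.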
The forward index uses a two‑level array: the first level stores pointers to second‑level arrays of 256 elements each, enabling dynamic growth while keeping writes confined to the tail, so it is lock‑free as well. Multi‑value fields follow a similar pattern with variable‑length entries.
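The two‑level layout can be sketched as follows. The class name is illustrative; the 256‑slot block size is the one stated above.

```python
BLOCK = 256  # elements per second-level array, per the design above

class ForwardIndex:
    def __init__(self):
        self.level1 = []  # first level: pointers to second-level arrays
        self.size = 0     # number of documents stored

    def append(self, value):
        # Growth only allocates a fresh second-level array and appends a
        # pointer; already-written slots never move, so readers need no locks.
        if self.size % BLOCK == 0:
            self.level1.append([None] * BLOCK)
        self.level1[self.size // BLOCK][self.size % BLOCK] = value
        self.size += 1

    def get(self, doc_id):
        # Two array dereferences: level-1 pointer, then offset within block.
        return self.level1[doc_id // BLOCK][doc_id % BLOCK]
```

Compared with a single growable array, this avoids the reallocate‑and‑copy step that would invalidate concurrent readers, at the cost of one extra indirection per lookup.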
Performance tests show that the new real‑time index reduces update latency to ≤ 20 ms for 99.9% of documents, while query throughput and latency remain virtually unchanged compared to the previous version. Memory consumption drops by ~3–6%, and CPU usage stays stable.
In summary, by redesigning the inverted‑index data structures to a lock‑free chain‑array hybrid and employing fine‑grained segment merging, the 58 search engine achieves near‑instant index freshness without sacrificing retrieval speed.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.