Design and Implementation of Real-Time Indexing in 58.com’s ESearch Search Engine

This article explains how 58.com’s in‑house C++ search kernel ESearch was architected to provide second‑level real‑time indexing, high‑concurrency low‑latency querying, flexible ranking models, and efficient storage structures for billions of daily queries across massive classified data.

58 Tech

58.com, China’s largest classified information platform, replaced Solr with its own C++‑based search kernel called ESearch, dramatically improving performance and customizability.

After years of optimization, ESearch now serves all search services of the 58 Group (including 58.com, Ganji, and Anjuke), handling tens of billions of queries per day. Its key capabilities include:

- second‑level real‑time indexing;
- support for massive data volumes (millions of documents per node, 8,000 QPS, millisecond latency);
- rich query capabilities (compound, spatial, facet, grouping, and deduplication queries);
- business‑specific customizations;
- an extensible ranking framework that incorporates linear weighting, custom scoring expressions, machine‑learning models, and multi‑model fusion.

The article outlines the overall architecture and focuses on the design and implementation of real‑time indexing.

The system consists of an application layer (Proxy and Builder) and a kernel layer (Merger and Searcher):

- Proxy receives front‑end queries, performs intent recognition and query rewriting, and forwards them to Merger.
- Merger distributes requests to multiple Searcher instances using consistent hashing and merges their results.
- Searcher holds the index data and ranking models, performing recall, scoring, and sorting.
- Builder constructs documents and sends them to Searcher.
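The article only names consistent hashing as Merger's distribution strategy; the sketch below shows one common way such a ring is built. The class name, node names, and virtual-replica count are illustrative assumptions, not ESearch's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical consistent-hash ring a Merger could use to route a request
// (or shard key) to one of several Searcher instances.
class ConsistentHashRing {
public:
    explicit ConsistentHashRing(const std::vector<std::string>& nodes,
                                int replicas = 64) {
        // Place each node at several virtual points for smoother balancing.
        for (const auto& n : nodes)
            for (int i = 0; i < replicas; ++i)
                ring_[hash_(n + "#" + std::to_string(i))] = n;
    }

    // Pick the first node clockwise from the key's hash position.
    std::string pick(const std::string& key) const {
        auto it = ring_.lower_bound(hash_(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }

private:
    static uint64_t hash_(const std::string& s) {
        return std::hash<std::string>{}(s);
    }
    std::map<uint64_t, std::string> ring_;  // hash point -> node name
};
```

A ring like this keeps routing stable when Searcher instances are added or removed: only the keys falling between the changed node's points move, rather than the whole keyspace being reshuffled.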

Real‑time indexing must meet two requirements: document updates become visible within seconds, and updates must not degrade query latency. ESearch adopts a read‑write‑separation design and improves the merge mechanism by partitioning each node’s index into multiple lifecycle segments (3 seconds, 15 minutes, 6 hours, 1 day, 1 month, older). New documents are merged only into the smallest segment, ensuring rapid update propagation.

Every 3 seconds a batch of new documents forms a 3‑second segment that is immediately searchable; after 3 seconds it merges into the 15‑minute segment, which contains only recent updates and can be merged in milliseconds. Higher‑level segments merge less frequently (e.g., daily or monthly) during low‑traffic periods, minimizing performance impact. Queries are executed across relevant segments in parallel, and time‑ordered data can be limited to appropriate segments for efficiency.
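The lifecycle ladder above can be sketched as a simple tier table: a segment belongs to a tier until its age exceeds that tier's bound, at which point it is merged upward. The tier names and the exact boundary values are illustrative; only the 3 s / 15 min / 6 h / 1 day / 1 month ladder comes from the article.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One tier of the segment ladder: a segment older than max_age_sec
// is merged into the next, larger tier.
struct Tier {
    const char* name;
    int64_t max_age_sec;
};

// Hypothetical encoding of the article's lifecycle ladder
// (a "month" is approximated as 30 days here).
static const std::vector<Tier> kTiers = {
    {"3s",    3},
    {"15min", 15 * 60},
    {"6h",    6 * 3600},
    {"1d",    24 * 3600},
    {"1mo",   30LL * 24 * 3600},
};

// Return the tier a segment of the given age belongs to;
// anything past the ladder falls into the "older" bucket.
std::string tier_for_age(int64_t age_sec) {
    for (const Tier& t : kTiers)
        if (age_sec <= t.max_age_sec) return t.name;
    return "older";
}
```

The payoff of the ladder is that each merge only touches data of a similar age: the 15‑minute segment stays small enough to absorb 3‑second batches in milliseconds, while the expensive daily and monthly merges run rarely and off‑peak.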

Each index segment contains four structures: a primary‑key index (a hash table mapping original keys to internal doc‑ids), a delete table (a bitmap marking removed documents), an inverted index (ordered arrays of doc‑ids per term, optionally storing term frequency, weight, position, and a bitmap of hit fields for multi‑field queries), and a forward index (a column store mapping doc‑ids to attribute values and feature vectors). Dense fields use arrays, sparse fields use hash tables, and boolean fields use bitmaps. Because columnar storage hurts cache locality when several attributes of the same document are read together, combined forward fields can be created for frequently co‑accessed attributes.
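The four per-segment structures can be sketched as one struct. The member names, the dense `int64_t` forward column, and the update-as-tombstone helpers are assumptions for illustration; only the four structures themselves come from the article.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal sketch of an ESearch-style index segment, under assumed types.
struct Segment {
    // Primary-key index: original document key -> internal doc-id.
    std::unordered_map<std::string, uint32_t> pk_index;
    // Delete table: one flag per doc-id, set when the document is removed.
    std::vector<bool> deleted;
    // Inverted index: term -> ordered array of doc-ids containing it.
    std::unordered_map<std::string, std::vector<uint32_t>> postings;
    // Forward index (column store): attribute name -> per-doc values
    // (a dense field stored as a plain array).
    std::unordered_map<std::string, std::vector<int64_t>> forward;

    // Assign the next doc-id to a new document.
    uint32_t add_doc(const std::string& key) {
        uint32_t id = static_cast<uint32_t>(deleted.size());
        pk_index[key] = id;
        deleted.push_back(false);
        return id;
    }

    // An update writes a fresh copy elsewhere and tombstones the old
    // version here; readers skip tombstoned ids via the bitmap.
    void remove(const std::string& key) {
        auto it = pk_index.find(key);
        if (it != pk_index.end()) deleted[it->second] = true;
    }

    bool is_live(uint32_t doc_id) const {
        return doc_id < deleted.size() && !deleted[doc_id];
    }
};
```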

To keep query latency low, ESearch caches results per segment; when a document is updated, the old segment’s cache is filtered using the delete table, ensuring only the latest version is visible. Feature updates that affect ranking are applied directly to the forward index, allowing rapid refresh without rebuilding segments. After retrieving cached results, a second‑stage scoring pass re‑ranks the top‑N documents using complex machine‑learning models, implementing a classic coarse‑to‑fine ranking pipeline suitable for 58’s time‑sensitive listings.
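The coarse-to-fine pass over cached, delete-filtered candidates might look like the sketch below. The function name, scorer signatures, and the use of full sorts (rather than heaps or `std::partial_sort`) are illustrative simplifications, not ESearch's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical second-stage pass: filter cached results against the delete
// table, keep the best n by the cheap coarse score, then re-order those
// survivors with the expensive fine scorer (e.g. an ML model).
std::vector<uint32_t> rerank_top_n(
    const std::vector<uint32_t>& cached_ids,
    const std::vector<bool>& deleted,                     // delete table
    const std::function<double(uint32_t)>& coarse_score,  // cheap first pass
    const std::function<double(uint32_t)>& fine_score,    // expensive model
    size_t n) {
    // Drop documents whose latest version now lives in a newer segment.
    std::vector<uint32_t> live;
    for (uint32_t id : cached_ids)
        if (id >= deleted.size() || !deleted[id]) live.push_back(id);

    // Coarse ranking: keep only the top-n candidates.
    std::sort(live.begin(), live.end(), [&](uint32_t a, uint32_t b) {
        return coarse_score(a) > coarse_score(b);
    });
    if (live.size() > n) live.resize(n);

    // Fine ranking: re-order the survivors with the expensive scorer.
    std::sort(live.begin(), live.end(), [&](uint32_t a, uint32_t b) {
        return fine_score(a) > fine_score(b);
    });
    return live;
}
```

The design point is that the expensive scorer only ever sees n documents, so its cost is bounded regardless of how many candidates the cached recall returned.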

In summary, the article presents the architecture of the ESearch kernel, detailing how real‑time indexing, segment‑based storage, and two‑stage caching and ranking are tailored to 58.com’s massive, time‑ordered classified data, and outlines ongoing optimization directions for performance, resource utilization, and ranking.
