How Message Queues Enable Near Real‑Time Incremental Indexing in Search Engines
This article examines the stringent real-time requirements of incremental data ingestion in search engines, compares three update schemes, and explains how adopting a Kafka subscription-based message-queue approach dramatically improves latency and flexibility in the Nuomi search framework.
Background
Search engines demand high real-time performance not only in indexing, recall, and ranking but also in data ingestion. Full data is static and loaded once at startup, whereas incremental data may arrive at any time, so ingesting it promptly is crucial to overall search freshness.
Design Ideas and Trade‑offs
Common incremental update solutions include:
Providing a write interface directly in the search framework.
Using files to deliver new data.
Using a message queue.
Scheme 1 – Write Interface
The framework exposes a data-write API that publishers call directly. This offers the lowest per-write latency but tightly couples producer and consumer code, making maintenance and scaling difficult. It also forces the framework itself to receive and buffer all incoming data, consuming significant resources.
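To make the coupling concrete, here is a minimal sketch of Scheme 1. All names (`SearchFramework`, `write`) are hypothetical illustrations, not the actual Nuomi framework API:

```python
# Sketch of Scheme 1: the search framework exposes a write API that
# publishers call directly. Names are illustrative only.

class SearchFramework:
    """Toy in-memory index; a real framework would build inverted indexes."""
    def __init__(self):
        self.index = {}

    def write(self, doc_id, doc):
        # Producers call this directly: fast per write, but every producer
        # now depends on this class, and the framework must absorb all
        # incoming traffic itself.
        self.index[doc_id] = doc

framework = SearchFramework()
framework.write("shop:42", {"name": "Nuomi Hotpot", "city": "Beijing"})
print(framework.index["shop:42"]["name"])  # -> Nuomi Hotpot
```

The trade-off is visible in the call site: any change to `write`'s signature ripples into every publisher, which is the coupling problem the article describes.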
Scheme 2 – File‑Based Updates
New data is first written to a file that is periodically delivered to the search process. The process monitors the local directory for inode changes and loads each file as it appears. This decouples producer and consumer but introduces latency, heavy I/O cost, and poor scalability as files accumulate.
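The inode-watching step can be sketched as follows. This is a simplified polling version, assuming files are atomically moved into place; a production system would use a kernel facility such as inotify instead of polling:

```python
import os
import tempfile
import time

def wait_for_new_file(path, last_ino, timeout=5.0, interval=0.05):
    """Poll `path` until its inode differs from `last_ino` (i.e. a new
    file was moved into place), or until the timeout expires.
    Sketch only; a real watcher would use inotify rather than polling."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            ino = os.stat(path).st_ino
            if ino != last_ino:
                return ino  # a new file version has appeared
        except FileNotFoundError:
            pass  # file not delivered yet
        time.sleep(interval)
    return None  # timed out

# Demo: deliver an increment file, then detect it.
d = tempfile.mkdtemp()
inc_path = os.path.join(d, "increment.dat")
with open(inc_path, "w") as f:
    f.write("record-1\n")
ino = wait_for_new_file(inc_path, last_ino=None)
print(ino is not None)  # -> True
```

The polling interval is exactly where the latency of Scheme 2 comes from: the consumer only notices new data on its next check.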
Scheme 3 – Message Queue
Publishers push data to a message queue and consumers pull it asynchronously. This decouples both sides, eliminates file handling, and greatly improves real-time performance. Subscription (publish-subscribe) mode is chosen because multiple search instances need the same data.
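The key property of subscription mode, that every subscriber receives the full stream, can be illustrated with a toy in-memory broker. This is a conceptual sketch only; Kafka achieves the same effect with per-group consumer offsets, not per-subscriber queues:

```python
import queue

class Broker:
    """Toy publish-subscribe broker: each subscriber gets its own queue,
    so every search instance receives every incremental record.
    Illustrative only, not how Kafka is implemented."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        q = queue.Queue()
        self.subscribers.append(q)
        return q

    def publish(self, record):
        # Fan out to all subscribers; the producer knows nothing about them.
        for q in self.subscribers:
            q.put(record)

broker = Broker()
instance_a = broker.subscribe()
instance_b = broker.subscribe()
broker.publish({"doc_id": 1, "op": "update"})
print(instance_a.get()["doc_id"], instance_b.get()["doc_id"])  # -> 1 1
```

Note that the producer calls only `publish`; it is fully decoupled from how many search instances exist, which is the scaling benefit of Scheme 3.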
Current Situation in Nuomi
Nuomi currently uses Scheme 2 (file‑based updates). Although the framework is mature and fault‑tolerant, the high‑frequency, large‑volume nature of lifestyle‑service data makes file updates too slow and inflexible.
Final Choice
Given the business characteristics, Nuomi adopts Scheme 3, employing a Kafka subscription model to achieve real‑time incremental loading.
Specific Implementation
The existing search framework runs a dedicated thread that watches file changes and loads increments via callbacks. To switch to Kafka, only this thread’s logic needs modification.
The incremental loading workflow with files is:
Read configuration (file naming rules, read offsets).
Register a callback for inode changes.
When a file appears, wake the processing thread, read from the stored offset byte‑by‑byte, process each record, and continue until the file ends.
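The offset-based reading in the steps above can be sketched as follows, assuming newline-delimited records; the function name and record format are hypothetical, not Nuomi's actual code:

```python
import os
import tempfile

def load_increments(path, offset):
    """Read newline-delimited records starting at the persisted byte
    offset; return (records, new_offset) so the offset can be saved
    for the next wake-up. Sketch of the file-based workflow only."""
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            records.append(line.rstrip(b"\n").decode())
        new_offset = f.tell()
    return records, new_offset

# Demo: first pass reads from offset 0; appended data is picked up
# on the next pass from the saved offset, so nothing is re-processed.
path = os.path.join(tempfile.mkdtemp(), "inc.log")
with open(path, "w") as f:
    f.write("rec-1\nrec-2\n")
recs, off = load_increments(path, 0)
with open(path, "a") as f:
    f.write("rec-3\n")
more, off = load_increments(path, off)
print(recs, more)  # -> ['rec-1', 'rec-2'] ['rec-3']
```

Persisting `new_offset` between wake-ups is what lets the process resume after a restart without replaying the whole file.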
Replacing file handling with a Kafka subscription involves:
Adding a subscription class that configures parameters and reads the starting offset from a local config file.
In StartSubscribe, obtain the number of sub‑channels and issue subscription requests for each.
Enter SubscribeMainloop, where an epoll_wait loop receives events.
On readable events, invoke the processing callback; on error events, attempt to re‑subscribe.
This loop repeats, loading each incremental record until the queue is empty, then blocks on epoll_wait until new data arrives.
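One iteration of such an epoll-driven loop can be sketched as below. Here a pipe stands in for the subscription channel's file descriptor, and the function and record format are illustrative assumptions (a real Kafka client library hides the descriptors behind its own API); `select.epoll` is Linux-only:

```python
import os
import select

def subscribe_mainloop_once(epoll, on_record, timeout=1.0):
    """One iteration of a SubscribeMainloop-style event loop: block in
    epoll_wait until a channel is readable, then drain it through the
    processing callback. Sketch only."""
    for fd, events in epoll.poll(timeout):
        if events & (select.EPOLLERR | select.EPOLLHUP):
            # The real loop would attempt to re-subscribe this channel.
            continue
        if events & select.EPOLLIN:
            data = os.read(fd, 4096)
            for record in data.split(b"\n"):
                if record:
                    on_record(record.decode())

# Simulate a subscription channel with a pipe: the broker side writes,
# the consumer side is registered with epoll.
rfd, wfd = os.pipe()
epoll = select.epoll()
epoll.register(rfd, select.EPOLLIN)

received = []
os.write(wfd, b"update:doc-7\n")
subscribe_mainloop_once(epoll, received.append)
print(received)  # -> ['update:doc-7']
```

When the queue is empty, `epoll.poll` simply blocks until the next readable event, which is how the loop achieves low-latency wake-ups without busy-waiting.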
Summary
The proposed Kafka subscription‑based approach replaces the file‑centric update mechanism, reducing the latency of a single data update from over half an hour to under 0.5 seconds, and allowing 1,000 records to be refreshed within 2 seconds, thus achieving near real‑time streaming suitable for wide deployment in the search framework’s incremental ingestion component.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.