How Message Queues Enable Near Real‑Time Incremental Indexing in Search Engines

This article examines the demanding real‑time requirements of incremental data ingestion in search engines, compares three update schemes, and details how a Kafka subscription‑based message‑queue approach dramatically improves latency and flexibility for the Nuomi search framework.

21CTO

Background

Search engines demand high real‑time performance not only for indexing, recall, and ranking but also for data ingestion. Full data is static and loaded once, while incremental data may arrive at any time, making its timely ingestion crucial for overall search performance.

Design Ideas and Trade‑offs

Common incremental update solutions include:

1. Providing a write interface directly in the search framework.

2. Using files to deliver new data.

3. Using a message queue.

Scheme 1 – Write Interface

The framework exposes a data‑write API that publishers call directly. This offers the fastest single write speed but tightly couples producer and consumer code, making maintenance and scaling difficult. It also requires the framework to handle full data reception, consuming significant resources.
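The coupling problem can be seen in a minimal sketch. All names here are hypothetical (the real framework is a C++ search service); the point is that the publisher holds a direct reference to the index, so any change to the write API ripples into publisher code.

```python
# Hypothetical sketch of Scheme 1: publishers call the index's write API directly.
class SearchIndex:
    def __init__(self):
        self._docs = {}

    def write(self, doc_id, fields):
        # The framework itself must accept and validate every incoming write.
        self._docs[doc_id] = fields


class Publisher:
    # Tight coupling: the publisher depends on the concrete index object,
    # so changing write()'s signature forces a change here as well.
    def __init__(self, index):
        self._index = index

    def publish(self, doc_id, fields):
        self._index.write(doc_id, fields)


index = SearchIndex()
Publisher(index).publish("shop:42", {"name": "Nuomi Cafe"})
```

This is why single writes are fast (one direct call) yet the scheme scales poorly: every producer must link against and track the consumer's interface.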

Scheme 2 – File‑Based Updates

New data is first written to a file, then periodically delivered to the search process. The process monitors the local directory’s inode changes and loads the file when it appears. This decouples producer and consumer but introduces latency, high I/O cost, and poor scalability when files accumulate.
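The production framework registers an inode‑change callback on the directory; the portable polling sketch below (function name and file layout are my own assumptions) only illustrates the detection step, and hints at the I/O cost of repeatedly scanning as files accumulate.

```python
import os

def poll_for_new_files(directory, seen_inodes):
    """Return files in `directory` whose inodes have not been loaded yet.

    A real implementation would use an inode-change notification (e.g.
    inotify) rather than rescanning; this sketch shows only the idea.
    """
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        inode = os.stat(path).st_ino
        if inode not in seen_inodes:
            seen_inodes.add(inode)       # remember so we load each file once
            new_files.append(path)
    return new_files
```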

Scheme 3 – Message Queue

Publishers push data to a message queue; consumers pull it asynchronously. This decouples both sides, eliminates file handling, and greatly improves real‑time performance. Subscription mode is chosen when multiple search instances need the same data.
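The decoupling and the subscription (fan‑out) mode can be sketched with in‑memory queues standing in for the broker. This is an assumption‑laden toy, not Kafka itself: each search instance gets its own queue, and the producer knows only the publish function, never the consumers.

```python
import queue
import threading

# Two queues simulate broker fan-out to two subscribed search instances.
subscribers = [queue.Queue(), queue.Queue()]
received = [[], []]

def publish(record):
    # The producer only touches the "broker"; it has no reference to consumers.
    for q in subscribers:
        q.put(record)

def consume(idx):
    while True:
        record = subscribers[idx].get()   # blocks until data arrives
        if record is None:                # sentinel: shut down the consumer
            return
        received[idx].append(record)

threads = [threading.Thread(target=consume, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
publish({"doc_id": "shop:42", "op": "update"})
publish(None)                             # stop both consumers
for t in threads:
    t.join()
```

Both instances receive identical copies of the update, which is exactly why subscription mode suits multiple search instances needing the same data.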

Current Situation in Nuomi

Nuomi currently uses Scheme 2 (file‑based updates). Although the framework is mature and fault‑tolerant, the high‑frequency, large‑volume nature of lifestyle‑service data makes file updates too slow and inflexible.

Final Choice

Given the business characteristics, Nuomi adopts Scheme 3, employing a Kafka subscription model to achieve real‑time incremental loading.

Specific Implementation

The existing search framework runs a dedicated thread that watches file changes and loads increments via callbacks. To switch to Kafka, only this thread’s logic needs modification.

The incremental loading workflow with files is:

Read configuration (file naming rules, read offsets).

Register a callback for inode changes.

When a file appears, wake the processing thread, read from the stored offset byte‑by‑byte, process each record, and continue until the file ends.
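The offset‑resumed read in the last step above might look like the following sketch. Newline‑delimited record framing is an assumption of mine; the real framework defines its own record format. Returning the new offset lets the caller persist it and resume after a restart.

```python
def load_increments(path, offset):
    """Read newline-delimited records from `path`, starting at `offset`.

    Returns (records, new_offset). An incomplete trailing record is left
    unread so it can be retried once the writer finishes it.
    """
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            if not line.endswith(b"\n"):
                break                     # partial record: re-read next time
            records.append(line.rstrip(b"\n"))
            offset += len(line)           # advance only past complete records
    return records, offset
```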

Replacing file handling with a Kafka subscription involves:

Adding a subscription class that configures parameters and reads the starting offset from a local config file.

In StartSubscribe, obtain the number of sub‑channels and issue subscription requests for each.

Enter SubscribeMainloop, where an epoll_wait loop receives events.

On readable events, invoke the processing callback; on error events, attempt to re‑subscribe.

This loop repeats, loading each incremental record until the queue is empty, then blocks on epoll_wait until new data arrives.
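The shape of SubscribeMainloop can be sketched as below. A local socket pair stands in for the Kafka channel, and Python's `selectors` module (which uses epoll on Linux) stands in for the raw `epoll_wait` call; the function and its parameters are hypothetical, not the framework's actual code.

```python
import selectors
import socket

def subscribe_mainloop(conn, on_record, max_events=10):
    # DefaultSelector is epoll on Linux -- analogous to the epoll_wait loop.
    sel = selectors.DefaultSelector()
    sel.register(conn, selectors.EVENT_READ)
    handled = 0
    while handled < max_events:
        for key, _ in sel.select():       # blocks until a channel is readable
            data = key.fileobj.recv(4096)
            if not data:                  # channel closed: leave the loop
                sel.unregister(conn)
                return handled
            on_record(data)               # invoke the processing callback
            handled += 1
    return handled

producer, consumer = socket.socketpair()
records = []
producer.sendall(b"doc-update-1")
producer.close()                          # triggers the empty-read exit path
n = subscribe_mainloop(consumer, records.append)
```

Error handling is elided here; the real loop re‑subscribes on error events rather than exiting.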

Summary

The Kafka subscription‑based approach replaces the file‑centric update mechanism. It reduces the latency of a single data update from over half an hour to under 0.5 seconds and refreshes 1,000 records within 2 seconds, achieving near real‑time streaming ingestion suitable for wide deployment in the search framework's incremental ingestion component.


Tags: search engine, Kafka, real-time data, message queue, incremental indexing
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
