Big Data 19 min read

How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data

This article explains why traditional databases like MySQL struggle with petabyte‑scale queries, introduces Elasticsearch’s architecture and inverted‑index advantages, and details a practical pipeline using Hive, Canal, Otter, and a custom DTS service to achieve near‑real‑time indexing for massive data volumes.

dbaplus Community
dbaplus Community
dbaplus Community
How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data

Why Use a Search Engine Instead of MySQL

MySQL is designed for massive data storage, not for complex, high‑concurrency queries on petabyte‑scale datasets; its indexing strategy leads to costly B+‑tree storage and frequent full‑table scans, especially when query patterns violate the left‑most‑prefix rule.

Example of a query that cannot use an index:

SELECT * FROM product WHERE title like '%格力空调%'

Even a slight typo forces a full scan:

SELECT * FROM product WHERE title like '%格空调%'

These limitations cause severe performance degradation for large‑scale e‑commerce search.

Elasticsearch Overview

Elasticsearch, built on Lucene, provides distributed, near‑real‑time search with three key strengths:

Complex query support : every field is indexed via an inverted index, enabling full‑text search, custom scoring, and rich tokenizers.

Scalability : horizontal scaling across hundreds of nodes handles PB‑level structured or unstructured data.

High availability : automatic replica handling and fault detection ensure continuous service.

Key concepts map to relational terms: a MySQL database corresponds to an ES index, tables to types, rows to documents, and schema to mapping.

How ES Indexes Are Built

Data originates from MySQL tables (product_sku, product_property, sku_stock, product_category). Direct joins on MySQL at tens of millions of rows cause unacceptable load, so a middle layer is introduced.

First, data is synchronized to Hive, which creates a temporary table stored in HBase. Hive then periodically dumps the full dataset to an indexing program that reads from HBase and writes to Elasticsearch.

However, Hive’s batch processing incurs a 30‑minute latency, insufficient for real‑time needs.

Introducing Wide Tables to Reduce Join Cost

By denormalizing MySQL data into a single wide table that contains all SKU‑related attributes, Hive no longer needs to perform expensive joins, reducing the pipeline latency from >30 minutes to a few minutes. Yet, this still leaves a multi‑minute delay.

Achieving Near‑Real‑Time Indexing with Binlog, Canal, and Otter

Real‑time change capture is done via MySQL binary logs (binlog). Canal mimics a MySQL slave, pulls the binlog, parses events, and forwards them to a manager component.

Otter (an Alibaba open‑source data sync system) coordinates multiple Canal instances, using ZooKeeper for configuration and node discovery. The data flow follows four stages:

Select : acquire change events from various sources.

Extract : assemble and filter raw data.

Transform : convert data to the target schema.

Load : write the transformed data to the destination (e.g., MQ, Elasticsearch).

The manager stores table‑level subscription configurations in ZooKeeper; nodes listen to binlog, process S‑E‑T‑L, and publish messages to a message queue.

Consumers read the MQ messages and update Elasticsearch indexes instantly. The same pipeline can also update the wide table, keeping both the full‑copy and the real‑time index in sync.

Handling Full‑Refresh vs. Incremental Updates

Full‑refresh jobs run once daily during low‑traffic windows to guarantee data completeness and to rebuild the index after failures. Incremental updates via the MQ cover only hot fields (price, stock) to meet latency requirements.

Conclusion

By combining MySQL binlog capture, Canal, Otter‑based DTS, and Elasticsearch’s distributed architecture, the described solution reduces index latency from tens of minutes to seconds, while still supporting periodic full‑refreshes for data integrity.

Key diagrams:

MySQL vs ES concepts
MySQL vs ES concepts
Elasticsearch distributed architecture
Elasticsearch distributed architecture
Hive to HBase to ES pipeline
Hive to HBase to ES pipeline
Canal binlog capture
Canal binlog capture
Otter architecture
Otter architecture
DTS data flow
DTS data flow
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ElasticsearchmysqlCanalreal-time indexingOtter
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.