How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data
This article explains why traditional databases like MySQL struggle with petabyte‑scale queries, introduces Elasticsearch’s architecture and inverted‑index advantages, and details a practical pipeline using Hive, Canal, Otter, and a custom DTS service to achieve near‑real‑time indexing for massive data volumes.
Why Use a Search Engine Instead of MySQL
MySQL is designed for massive data storage, not for complex, high‑concurrency queries on petabyte‑scale datasets; its indexing strategy leads to costly B+‑tree storage and frequent full‑table scans, especially when query patterns violate the left‑most‑prefix rule.
Example of a query that cannot use an index:
SELECT * FROM product WHERE title like '%格力空调%'Even a slight typo forces a full scan:
SELECT * FROM product WHERE title like '%格空调%'These limitations cause severe performance degradation for large‑scale e‑commerce search.
Elasticsearch Overview
Elasticsearch, built on Lucene, provides distributed, near‑real‑time search with three key strengths:
Complex query support : every field is indexed via an inverted index, enabling full‑text search, custom scoring, and rich tokenizers.
Scalability : horizontal scaling across hundreds of nodes handles PB‑level structured or unstructured data.
High availability : automatic replica handling and fault detection ensure continuous service.
Key concepts map to relational terms: a MySQL database corresponds to an ES index, tables to types, rows to documents, and schema to mapping.
How ES Indexes Are Built
Data originates from MySQL tables (product_sku, product_property, sku_stock, product_category). Direct joins on MySQL at tens of millions of rows cause unacceptable load, so a middle layer is introduced.
First, data is synchronized to Hive, which creates a temporary table stored in HBase. Hive then periodically dumps the full dataset to an indexing program that reads from HBase and writes to Elasticsearch.
However, Hive’s batch processing incurs a 30‑minute latency, insufficient for real‑time needs.
Introducing Wide Tables to Reduce Join Cost
By denormalizing MySQL data into a single wide table that contains all SKU‑related attributes, Hive no longer needs to perform expensive joins, reducing the pipeline latency from >30 minutes to a few minutes. Yet, this still leaves a multi‑minute delay.
Achieving Near‑Real‑Time Indexing with Binlog, Canal, and Otter
Real‑time change capture is done via MySQL binary logs (binlog). Canal mimics a MySQL slave, pulls the binlog, parses events, and forwards them to a manager component.
Otter (an Alibaba open‑source data sync system) coordinates multiple Canal instances, using ZooKeeper for configuration and node discovery. The data flow follows four stages:
Select : acquire change events from various sources.
Extract : assemble and filter raw data.
Transform : convert data to the target schema.
Load : write the transformed data to the destination (e.g., MQ, Elasticsearch).
The manager stores table‑level subscription configurations in ZooKeeper; nodes listen to binlog, process S‑E‑T‑L, and publish messages to a message queue.
Consumers read the MQ messages and update Elasticsearch indexes instantly. The same pipeline can also update the wide table, keeping both the full‑copy and the real‑time index in sync.
Handling Full‑Refresh vs. Incremental Updates
Full‑refresh jobs run once daily during low‑traffic windows to guarantee data completeness and to rebuild the index after failures. Incremental updates via the MQ cover only hot fields (price, stock) to meet latency requirements.
Conclusion
By combining MySQL binlog capture, Canal, Otter‑based DTS, and Elasticsearch’s distributed architecture, the described solution reduces index latency from tens of minutes to seconds, while still supporting periodic full‑refreshes for data integrity.
Key diagrams:
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
