How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data
Learn how to construct near‑real‑time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, leveraging inverted indexes, using Hive and wide tables, and employing binlog‑based pipelines with Canal and Otter to achieve second‑level index updates.
Preface
Hello, I am Su San. In this article I share practical experience on building real‑time indexes for petabyte‑level data, based on our company's production environment.
Our business growth has led to data volumes in the billions, requiring sub‑second query latency. Elasticsearch (ES) is the de‑facto search engine for such workloads, outperforming traditional databases like MySQL.
Why Use a Search Engine Instead of MySQL?
MySQL Shortcomings
MySQL is designed for massive data storage, not for complex, high‑volume queries. Adding many indexes does not guarantee performance because MySQL selects only one index per query, and composite indexes must follow the left‑most prefix rule. Index storage also grows dramatically; a 10 GB table can have a 30 GB index.
Some queries cannot be satisfied by indexes at all, e.g. searching product titles containing "格力空调":
SELECT * FROM product WHERE title LIKE '%格力空调%'A typo in the keyword (e.g., "%格空调%") leads to a full table scan that returns no rows, illustrating MySQL's query limitations.
ES Overview
Elasticsearch, built on Lucene, provides near‑real‑time search for petabyte data, supporting full‑text, log, and monitoring scenarios. Easy support for complex queries: every field is indexed via an inverted index, enabling arbitrary full‑text searches. Strong scalability: distributed storage across hundreds of nodes handles PB‑scale structured or unstructured data. High availability: master‑slave nodes and automatic failure detection ensure resilience.
Key concepts compared with MySQL:
Database ↔ Index
Table ↔ Type
Row ↔ Document
Schema ↔ Mapping
Manual index creation ↔ Automatic field indexing in ES
ES achieves fast queries using an inverted index, which maps terms to the documents containing them, avoiding full scans.
How to Build ES Indexes
Data typically originates from MySQL. Directly joining multiple tables (product_sku, product_property, sku_stock, product_category) and writing the result to ES becomes impractical at tens of millions of rows.
Instead, we introduce a middle layer—Hive—built on Hadoop for batch processing. Hive can transform MySQL data into a wide table stored in HBase, reducing join overhead.
However, Hive’s batch nature introduces latency (tens of minutes), which is unacceptable for real‑time user experiences.
Why Hive Cannot Provide Real‑Time Indexing
Hive runs on Hadoop’s static batch processing model, incurring high job‑submission overhead and minute‑level latency for PB‑scale data.
Maintaining a wide table also creates coupling challenges when source tables change.
PB‑Scale Near‑Real‑Time Indexing Strategy
To achieve true real‑time updates, we listen to MySQL binlog changes and propagate them to ES and the wide table.
Binlog Listening
MySQL master writes data changes to the binary log; slaves replay these events. By mimicking a MySQL slave, the open‑source project Canal captures binlog events.
Canal parses the binary log and forwards relevant changes.
Data Transfer Service (DTS)
We built a pipeline based on Alibaba’s Otter (a distributed incremental data sync system) to manage the flow:
Configure Zookeeper with the tables to monitor.
Nodes run Canal to capture binlog, then pass data through Select, Extract, Transform, Load (SETL) stages.
Transformed data is sent to a message queue (MQ).
Downstream services consume MQ messages to update ES indexes and wide tables.
The SETL stages are: Select: acquire data from sources (Canal, etc.). Extract: assemble and filter data. Transform: convert data to target formats. Load: write data to destinations such as ES or MQ.
This DTS architecture enables second‑level index updates while also supporting periodic full‑index rebuilds to ensure data completeness and recover from cluster failures.
Conclusion
We summarized our approach to building near‑real‑time Elasticsearch indexes for petabyte‑scale data, covering MySQL limitations, ES fundamentals, Hive‑based batch processing, and a binlog‑driven real‑time pipeline using Canal and Otter. Readers are encouraged to explore the underlying middleware for deeper understanding.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
