Big Data 18 min read

How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data

Learn how to construct near‑real‑time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, leveraging inverted indexes, using Hive and wide tables, and employing binlog‑based pipelines with Canal and Otter to achieve second‑level index updates.

Su San Talks Tech
Su San Talks Tech
Su San Talks Tech
How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data

Preface

Hello, I am Su San. In this article I share practical experience on building real‑time indexes for petabyte‑level data, based on our company's production environment.

Our business growth has led to data volumes in the billions, requiring sub‑second query latency. Elasticsearch (ES) is the de‑facto search engine for such workloads, outperforming traditional databases like MySQL.

Why Use a Search Engine Instead of MySQL?

MySQL Shortcomings

MySQL is designed for massive data storage, not for complex, high‑volume queries. Adding many indexes does not guarantee performance because MySQL selects only one index per query, and composite indexes must follow the left‑most prefix rule. Index storage also grows dramatically; a 10 GB table can have a 30 GB index.

Some queries cannot be satisfied by indexes at all, e.g. searching product titles containing "格力空调":

SELECT * FROM product WHERE title LIKE '%格力空调%'

A typo in the keyword (e.g., "%格空调%") leads to a full table scan that returns no rows, illustrating MySQL's query limitations.

ES Overview

Elasticsearch, built on Lucene, provides near‑real‑time search for petabyte data, supporting full‑text, log, and monitoring scenarios. Easy support for complex queries: every field is indexed via an inverted index, enabling arbitrary full‑text searches. Strong scalability: distributed storage across hundreds of nodes handles PB‑scale structured or unstructured data. High availability: master‑slave nodes and automatic failure detection ensure resilience.

Key concepts compared with MySQL:

Database ↔ Index

Table ↔ Type

Row ↔ Document

Schema ↔ Mapping

Manual index creation ↔ Automatic field indexing in ES

ES achieves fast queries using an inverted index, which maps terms to the documents containing them, avoiding full scans.

How to Build ES Indexes

Data typically originates from MySQL. Directly joining multiple tables (product_sku, product_property, sku_stock, product_category) and writing the result to ES becomes impractical at tens of millions of rows.

Instead, we introduce a middle layer—Hive—built on Hadoop for batch processing. Hive can transform MySQL data into a wide table stored in HBase, reducing join overhead.

However, Hive’s batch nature introduces latency (tens of minutes), which is unacceptable for real‑time user experiences.

Why Hive Cannot Provide Real‑Time Indexing

Hive runs on Hadoop’s static batch processing model, incurring high job‑submission overhead and minute‑level latency for PB‑scale data.

Maintaining a wide table also creates coupling challenges when source tables change.

PB‑Scale Near‑Real‑Time Indexing Strategy

To achieve true real‑time updates, we listen to MySQL binlog changes and propagate them to ES and the wide table.

Binlog Listening

MySQL master writes data changes to the binary log; slaves replay these events. By mimicking a MySQL slave, the open‑source project Canal captures binlog events.

Canal parses the binary log and forwards relevant changes.

Data Transfer Service (DTS)

We built a pipeline based on Alibaba’s Otter (a distributed incremental data sync system) to manage the flow:

Configure Zookeeper with the tables to monitor.

Nodes run Canal to capture binlog, then pass data through Select, Extract, Transform, Load (SETL) stages.

Transformed data is sent to a message queue (MQ).

Downstream services consume MQ messages to update ES indexes and wide tables.

The SETL stages are: Select: acquire data from sources (Canal, etc.). Extract: assemble and filter data. Transform: convert data to target formats. Load: write data to destinations such as ES or MQ.

This DTS architecture enables second‑level index updates while also supporting periodic full‑index rebuilds to ensure data completeness and recover from cluster failures.

Conclusion

We summarized our approach to building near‑real‑time Elasticsearch indexes for petabyte‑scale data, covering MySQL limitations, ES fundamentals, Hive‑based batch processing, and a binlog‑driven real‑time pipeline using Canal and Otter. Readers are encouraged to explore the underlying middleware for deeper understanding.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ElasticsearchHivemysqlCanalreal-time indexingOtter
Su San Talks Tech
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.