
How to Build Near Real-Time ElasticSearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive datasets, introduces ElasticSearch’s inverted‑index architecture, and details a practical pipeline using Hive, wide tables, binlog, Canal, and Otter to achieve near real‑time indexing for petabyte‑level data.

Su San Talks Tech

Dec 8, 2024


Introduction

Hello, I am Su San. In this article I share practical experience on building real‑time indexes for PB‑level data, aiming to help readers handle massive data queries with second‑level or even millisecond‑level response times.

Why Use a Search Engine Instead of MySQL?

MySQL Limitations

MySQL can store large volumes of data, but it is not designed for complex queries over massive datasets. Adding more indexes does not guarantee fast queries: the optimizer usually picks a single index per table for a given query, some predicates still fall back to a full table scan, and the indexes themselves can be expensive to store (e.g., a 10 GB table with a 30 GB index).

Some queries cannot be served by MySQL indexes at all, such as searching for product titles that contain a keyword (here 格力空调, "Gree air conditioner"):

SELECT * FROM product WHERE title LIKE '%格力空调%';

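A pattern with a leading wildcard cannot use a B-tree index, so this query scans the whole table. If the user also misspells the keyword, the full scan still runs and returns nothing, because LIKE matches only exact substrings.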

ElasticSearch Overview

ElasticSearch (ES), built on the Lucene engine, provides near real‑time queries for PB‑scale data and is widely used for full‑text search, log analysis, and monitoring. Its main strengths:

Supports complex query conditions: Every field is indexed by default (text fields through an inverted index), enabling arbitrary full‑text searches.

Strong scalability: Distributed storage with easy horizontal expansion across hundreds of nodes.

High availability: Primary and replica shards are spread across nodes, with automatic failure detection and recovery.

Key concepts compared with MySQL:

Database ↔ Index

Table ↔ Type (mapping types were deprecated in ES 6.x and removed in later versions)

Row ↔ Document

Schema ↔ Mapping

Manual index creation ↔ All fields are indexed by default in ES.

ES achieves fast queries mainly through inverted indexes, which map terms to the documents containing them, avoiding full‑document scans.
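The idea of an inverted index can be illustrated with a minimal sketch. This is not how Lucene stores its index; the whitespace tokenizer, the sample product titles, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample data.
docs = {
    1: "Gree air conditioner 1.5 HP",
    2: "Midea air conditioner inverter",
    3: "Gree electric fan",
}
index = build_inverted_index(docs)
matches = search(index, "gree air")
```

A lookup touches only the posting sets of the query terms and intersects them, so the cost depends on how many documents contain those terms rather than on the total number of documents.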

How to Build ES Indexes

Data typically originates from MySQL. Directly joining multiple tables (e.g., product_sku, product_property, sku_stock, product_category) and writing the result to ES becomes a performance bottleneck at tens of millions of rows.

Using Hive as an intermediate layer can offload the join work, but Hive’s batch‑oriented processing introduces tens of minutes of latency, which is unacceptable for real‑time needs.

Creating a wide table that denormalizes product data reduces join time, yet the resulting index update delay remains several minutes.
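As a rough sketch of what the wide table buys, the snippet below flattens the four tables named above into one row per SKU, so the indexer reads a single record instead of joining at index time. The table names come from the text; every column name and the sample rows are hypothetical.

```python
def build_wide_rows(skus, properties, stocks, categories):
    """Denormalize four normalized tables into one wide row per SKU."""
    # Pre-group the child tables by key so each SKU is assembled by lookup.
    props_by_sku = {}
    for prop in properties:
        props_by_sku.setdefault(prop["sku_id"], {})[prop["name"]] = prop["value"]
    stock_by_sku = {s["sku_id"]: s["quantity"] for s in stocks}
    category_by_id = {c["category_id"]: c["name"] for c in categories}

    wide_rows = []
    for sku in skus:
        wide_rows.append({
            "sku_id": sku["sku_id"],
            "title": sku["title"],
            "category": category_by_id.get(sku["category_id"]),
            "properties": props_by_sku.get(sku["sku_id"], {}),
            "stock": stock_by_sku.get(sku["sku_id"], 0),
        })
    return wide_rows


# Hypothetical rows from product_sku, product_property, sku_stock, product_category.
product_sku = [{"sku_id": 1, "title": "Gree air conditioner", "category_id": 10}]
product_property = [
    {"sku_id": 1, "name": "power", "value": "1.5 HP"},
    {"sku_id": 1, "name": "color", "value": "white"},
]
sku_stock = [{"sku_id": 1, "quantity": 42}]
product_category = [{"category_id": 10, "name": "Appliances"}]

wide = build_wide_rows(product_sku, product_property, sku_stock, product_category)
```

Each wide row already has the shape of the document to be indexed, which is why the join cost moves out of the indexing path.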

Real‑Time Change Capture with Binlog

To bring update latency down to near real time, we listen to MySQL binlog events. Alibaba’s open‑source Canal masquerades as a MySQL replica (slave), receives the binlog stream from the primary, and parses it into row‑change events.

For reliable delivery and further processing, we use Alibaba’s Otter (a distributed data synchronization system) together with a custom Data Transfer Service (DTS) that implements four stages:

Select : Acquire data from various sources (e.g., Canal).

Extract : Assemble and filter data from MySQL, Oracle, files, etc.

Transform : Convert data to the target schema.

Load : Write data to the destination such as MQ, ES, or databases.

The pipeline publishes change events to a message queue; downstream services consume the messages to update ES indexes and the wide table.
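The four stages can be sketched as a chain of small functions. This is an illustration only, not Otter's or Canal's API: the event dictionaries, the WATCHED_TABLES filter, the message format, and the in-memory list standing in for the message queue are all assumptions.

```python
WATCHED_TABLES = {"product_sku"}  # hypothetical: tables that feed the index


def select(source):
    """Select: pull raw change events from a source (a list stands in for Canal)."""
    yield from source


def extract(events):
    """Extract: keep only the events for tables the index depends on."""
    for event in events:
        if event["table"] in WATCHED_TABLES:
            yield event


def transform(events):
    """Transform: convert each row change into a message in the target schema."""
    for event in events:
        row = event["row"]
        if event["type"] == "DELETE":
            yield {"op": "delete", "id": row["sku_id"]}
        else:  # INSERT and UPDATE both become an upsert
            yield {"op": "upsert", "id": row["sku_id"], "doc": dict(row)}


def load(messages, queue):
    """Load: publish messages to the destination (a list stands in for the MQ)."""
    for message in messages:
        queue.append(message)


def run_pipeline(source, queue):
    load(transform(extract(select(source))), queue)


# Hypothetical parsed binlog events.
binlog_events = [
    {"table": "product_sku", "type": "INSERT",
     "row": {"sku_id": 1, "title": "Gree air conditioner"}},
    {"table": "login_log", "type": "INSERT", "row": {"id": 99}},
    {"table": "product_sku", "type": "DELETE",
     "row": {"sku_id": 2, "title": "Old fan"}},
]
mq = []
run_pipeline(binlog_events, mq)
```

Because every stage is a generator over the previous one, events flow through one at a time, and a stage can be swapped (for example, a different Load target) without touching the others.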

Full‑And‑Incremental Index Updates

Even with real‑time updates, periodic full index rebuilds are necessary to ensure data completeness and to recover from cluster failures. Full rebuilds are scheduled during low‑traffic windows (e.g., midnight) and run once per day.

If a full rebuild is in progress, real‑time messages can be held back until the rebuild finishes, so that the rebuild does not overwrite newer changes with stale data.
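One possible way to coordinate the two update paths is to defer real-time messages while a rebuild runs and replay them afterwards. The sketch below assumes a single consumer, a dictionary standing in for the ES index, and a hypothetical message format with "op", "id", and "doc" fields; none of these details come from the original system.

```python
class IndexUpdater:
    """Applies real-time messages, deferring them while a full rebuild runs."""

    def __init__(self):
        self.index = {}        # stands in for the ES index
        self.pending = []      # messages held back during a rebuild
        self.rebuilding = False

    def _apply(self, message):
        if message["op"] == "delete":
            self.index.pop(message["id"], None)
        else:
            self.index[message["id"]] = message["doc"]

    def on_message(self, message):
        """Apply immediately, or queue the message if a rebuild is running."""
        if self.rebuilding:
            self.pending.append(message)
        else:
            self._apply(message)

    def start_full_rebuild(self):
        self.rebuilding = True

    def finish_full_rebuild(self, snapshot):
        """Swap in the rebuilt data, then replay deferred messages in order."""
        self.index = dict(snapshot)
        for message in self.pending:
            self._apply(message)
        self.pending.clear()
        self.rebuilding = False


updater = IndexUpdater()
updater.on_message({"op": "upsert", "id": 1, "doc": {"stock": 5}})
updater.start_full_rebuild()
updater.on_message({"op": "upsert", "id": 1, "doc": {"stock": 4}})  # deferred
snapshot = {1: {"stock": 5}, 2: {"stock": 9}}  # taken before the change above
updater.finish_full_rebuild(snapshot)
```

Replaying after the swap means the change that arrived during the rebuild wins over the older value in the snapshot.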

Conclusion

By combining full refreshes from the MySQL wide table with a DTS‑driven real‑time pipeline (Canal + Otter + MQ → ES), we keep index updates within seconds for PB‑scale data while still running daily full rebuilds for reliability.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.


Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
