Big Data 22 min read

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains how to construct near real-time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, introducing ES fundamentals, leveraging Hive and wide tables, and employing binlog‑based tools like Canal and Otter for low‑latency data synchronization.

Su San Talks Tech
Su San Talks Tech
Su San Talks Tech
How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data
Hello everyone, I am Su San, and today I want to discuss how to build real‑time indexes based on our company's PB‑level indexing experience, which should be helpful to everyone.

With rapid business growth, data volumes have exploded to billions or even tens of billions of records, requiring query responses in seconds or milliseconds. Elasticsearch (ES) is the clear leader among search engines, consistently ranking first in DBRanking evaluations and widely adopted by large enterprises. Compared with traditional databases like MySQL, ES offers superior performance for massive data queries.

This article shares practical ideas for constructing near real‑time indexes on ES.

Why use a search engine? Isn't MySQL enough?

ES index data construction

PB‑level ES near real‑time indexing approach

Why use a search engine? Isn't MySQL enough?

MySQL shortcomings

MySQL’s architecture is not suitable for massive data queries. It stores large volumes but cannot efficiently handle complex conditional queries on huge datasets. Two main reasons:

1. Adding indexes can improve query speed, but MySQL selects only the cheapest index during execution. If no index matches the query, a full table scan occurs. Composite indexes must follow the left‑most prefix rule, which often fails under diverse query conditions, leading to index loss. Moreover, each index creates a B+ tree, increasing storage costs dramatically—for a 10 GB table, the index size may reach 30 GB.

2. Some queries cannot be satisfied by MySQL indexes, e.g., searching for products with titles containing "格力空调":

SELECT * FROM product WHERE title like '%格力空调%'

This triggers a full table scan and is error‑prone (e.g., misspelling "格力空调" as "格空间"):

SELECT * FROM product WHERE title like '%格空调%'

Thus, MySQL’s query capability is limited.

ES Overview

ES, built on the Lucene engine, is an open‑source distributed search and analytics engine that provides near‑real‑time queries for PB‑scale data, widely used for full‑text search, log analysis, and monitoring.

Key characteristics:

Supports complex query conditions : Every field is indexed (inverted index), enabling full‑text search with custom scoring, sorting, and tokenizers.

Strong scalability : Distributed storage allows horizontal expansion to hundreds or thousands of nodes, handling PB‑scale structured or unstructured data.

High availability : Master‑slave nodes and automatic fault detection ensure robust resilience.

Comparing MySQL concepts with ES:

ES vs MySQL concepts
ES vs MySQL concepts

Key analogies:

1. MySQL database ≈ ES index (logical data collection). 2. MySQL table ≈ ES type. 3. MySQL row ≈ ES document. 4. MySQL schema ≈ ES mapping (field rules, tokenization). 5. MySQL indexes are manually created; in ES, any field can be indexed via mapping.

ES achieves high efficiency through inverted indexes, avoiding full‑document scans. An inverted index maps terms to the documents containing them, allowing rapid term lookup and intersection of multiple term lists.

Inverted index illustration
Inverted index illustration

ES’s distributed architecture further supports massive data queries. An ES cluster consists of multiple nodes; each index is split into shards stored across nodes. Queries are executed in parallel on relevant shards and results are merged, distributing load and preventing single‑node bottlenecks.

ES cluster architecture
ES cluster architecture

When a new node joins, ES automatically migrates shards to achieve load balancing, a process far simpler than MySQL’s manual sharding with middleware.

How to Build ES Indexes

Data typically originates from MySQL. Directly pulling data from MySQL and writing to ES can impact online services, especially when joins across multiple tables (e.g., product, sku, stock, category) are required for comprehensive product information.

Product data model
Product data model

Performing joins on MySQL at tens of millions of rows imposes heavy load. To avoid this, an intermediate layer—Hive—is introduced.

What is Hive?

Hive is a Hadoop‑based data warehouse tool that extracts, transforms, and loads (ETL) large‑scale data (TB+). It converts Hive SQL into MapReduce jobs, enabling batch processing of massive datasets.

Data flow: MySQL → Hive (creates wide table) → HBase (stores temporary table) → Dump task reads from HBase, processes business data, and writes to ES.

Hive to ES pipeline
Hive to ES pipeline

Hive joins are time‑consuming (tens of minutes for tens of millions of rows). To reduce latency, a wide table approach is adopted: combine product_sku, product_property, sku_stock, etc., into a single denormalized table, eliminating costly joins.

Wide table design
Wide table design

Even with a few‑minute delay, real‑time requirements demand faster updates. The solution is to capture database changes via binlog.

Why can't Hive achieve real‑time import?

Hive runs on Hadoop’s batch‑oriented framework, which has high latency and overhead, making minute‑level updates infeasible.

To achieve real‑time change capture, we use MySQL binlog and tools like Canal.

Answer: binlog

Canal mimics a MySQL slave, requesting binlog dump from the master, then parses the binary log.

Canal binlog flow
Canal binlog flow

For orchestration, Alibaba’s Otter provides distributed database synchronization, handling binlog parsing, filtering, and routing.

What is Otter?

Otter is a distributed data sync system that parses incremental logs and synchronizes MySQL data across data centers.

Otter architecture
Otter architecture

We adapted Otter into a Data Transfer Service (DTS) with four stages: Select, Extract, Transform, Load (S‑E‑T‑L). Nodes run Canal to capture binlog, process data, and publish to MQ; downstream services consume MQ messages to update ES and the wide table.

DTS workflow
DTS workflow

Real‑time updates focus on sensitive fields (price, stock) while the wide table maintains complete product information. Full‑index rebuilds are still needed daily to ensure data completeness and recover from cluster failures.

Summary

This article outlined our experience building near real‑time ES indexes for PB‑scale data, covering MySQL limitations, ES fundamentals, the role of Hive and wide tables, and a binlog‑driven pipeline using Canal, Otter, and a custom DTS to achieve second‑level index updates.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ElasticsearchHiveCanalreal-time indexingOtterPB-scale
Su San Talks Tech
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.