How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data
This article explains how to construct near real-time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, introducing ES fundamentals, leveraging Hive and wide tables, and employing binlog‑based tools like Canal and Otter for low‑latency data synchronization.
Hello everyone, I am Su San, and today I want to discuss how to build real‑time indexes based on our company's PB‑level indexing experience, which should be helpful to everyone.
With rapid business growth, data volumes have exploded to billions or even tens of billions of records, requiring query responses in seconds or milliseconds. Elasticsearch (ES) is the clear leader among search engines, consistently ranking first in DBRanking evaluations and widely adopted by large enterprises. Compared with traditional databases like MySQL, ES offers superior performance for massive data queries.
This article shares practical ideas for constructing near real‑time indexes on ES.
Why use a search engine? Isn't MySQL enough?
ES index data construction
PB‑level ES near real‑time indexing approach
Why use a search engine? Isn't MySQL enough?
MySQL shortcomings
MySQL’s architecture is not suitable for massive data queries. It stores large volumes but cannot efficiently handle complex conditional queries on huge datasets. Two main reasons:
1. Adding indexes can improve query speed, but MySQL selects only the cheapest index during execution. If no index matches the query, a full table scan occurs. Composite indexes must follow the left‑most prefix rule, which often fails under diverse query conditions, leading to index loss. Moreover, each index creates a B+ tree, increasing storage costs dramatically—for a 10 GB table, the index size may reach 30 GB.
2. Some queries cannot be satisfied by MySQL indexes, e.g., searching for products with titles containing "格力空调":
SELECT * FROM product WHERE title like '%格力空调%'This triggers a full table scan and is error‑prone (e.g., misspelling "格力空调" as "格空间"):
SELECT * FROM product WHERE title like '%格空调%'Thus, MySQL’s query capability is limited.
ES Overview
ES, built on the Lucene engine, is an open‑source distributed search and analytics engine that provides near‑real‑time queries for PB‑scale data, widely used for full‑text search, log analysis, and monitoring.
Key characteristics:
Supports complex query conditions : Every field is indexed (inverted index), enabling full‑text search with custom scoring, sorting, and tokenizers.
Strong scalability : Distributed storage allows horizontal expansion to hundreds or thousands of nodes, handling PB‑scale structured or unstructured data.
High availability : Master‑slave nodes and automatic fault detection ensure robust resilience.
Comparing MySQL concepts with ES:
Key analogies:
1. MySQL database ≈ ES index (logical data collection). 2. MySQL table ≈ ES type. 3. MySQL row ≈ ES document. 4. MySQL schema ≈ ES mapping (field rules, tokenization). 5. MySQL indexes are manually created; in ES, any field can be indexed via mapping.
ES achieves high efficiency through inverted indexes, avoiding full‑document scans. An inverted index maps terms to the documents containing them, allowing rapid term lookup and intersection of multiple term lists.
ES’s distributed architecture further supports massive data queries. An ES cluster consists of multiple nodes; each index is split into shards stored across nodes. Queries are executed in parallel on relevant shards and results are merged, distributing load and preventing single‑node bottlenecks.
When a new node joins, ES automatically migrates shards to achieve load balancing, a process far simpler than MySQL’s manual sharding with middleware.
How to Build ES Indexes
Data typically originates from MySQL. Directly pulling data from MySQL and writing to ES can impact online services, especially when joins across multiple tables (e.g., product, sku, stock, category) are required for comprehensive product information.
Performing joins on MySQL at tens of millions of rows imposes heavy load. To avoid this, an intermediate layer—Hive—is introduced.
What is Hive?
Hive is a Hadoop‑based data warehouse tool that extracts, transforms, and loads (ETL) large‑scale data (TB+). It converts Hive SQL into MapReduce jobs, enabling batch processing of massive datasets.
Data flow: MySQL → Hive (creates wide table) → HBase (stores temporary table) → Dump task reads from HBase, processes business data, and writes to ES.
Hive joins are time‑consuming (tens of minutes for tens of millions of rows). To reduce latency, a wide table approach is adopted: combine product_sku, product_property, sku_stock, etc., into a single denormalized table, eliminating costly joins.
Even with a few‑minute delay, real‑time requirements demand faster updates. The solution is to capture database changes via binlog.
Why can't Hive achieve real‑time import?
Hive runs on Hadoop’s batch‑oriented framework, which has high latency and overhead, making minute‑level updates infeasible.
To achieve real‑time change capture, we use MySQL binlog and tools like Canal.
Answer: binlog
Canal mimics a MySQL slave, requesting binlog dump from the master, then parses the binary log.
For orchestration, Alibaba’s Otter provides distributed database synchronization, handling binlog parsing, filtering, and routing.
What is Otter?
Otter is a distributed data sync system that parses incremental logs and synchronizes MySQL data across data centers.
We adapted Otter into a Data Transfer Service (DTS) with four stages: Select, Extract, Transform, Load (S‑E‑T‑L). Nodes run Canal to capture binlog, process data, and publish to MQ; downstream services consume MQ messages to update ES and the wide table.
Real‑time updates focus on sensitive fields (price, stock) while the wide table maintains complete product information. Full‑index rebuilds are still needed daily to ensure data completeness and recover from cluster failures.
Summary
This article outlined our experience building near real‑time ES indexes for PB‑scale data, covering MySQL limitations, ES fundamentals, the role of Hive and wide tables, and a binlog‑driven pipeline using Canal, Otter, and a custom DTS to achieve second‑level index updates.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
