How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data
This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.
Preface
Hello everyone. Today I'd like to share some hands-on experience of building near real-time search indexes over PB-scale data.
Our business has grown rapidly, bringing explosive data growth and real challenges in querying massive datasets. Even at billions or tens of billions of rows, results must come back within seconds or milliseconds, which makes a search engine indispensable. Elasticsearch (ES) is the leading choice, having ranked first among search engines on the DB-Engines ranking for years. Compared with a traditional database like MySQL, ES offers distinct advantages; below we explore how ES indexes are generated and kept near real-time at PB scale.
Why Use a Search Engine Instead of MySQL?
MySQL Limitations
MySQL's architecture was not designed for massive-data queries: it can store huge amounts of data, but it cannot efficiently evaluate complex conditions over that data. Adding indexes improves query speed, but the optimizer typically chooses only the single cheapest index per table access, and if no index matches the query, a full table scan occurs. Composite indexes must follow the left-most prefix rule, so queries that skip the leading columns silently fall back to scans, which is a frequent trap at scale. Moreover, every index is a separate B+ tree, which multiplies storage cost: a 10 GB table can easily carry 30 GB of indexes.
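The left-most prefix rule is easy to demonstrate. The sketch below uses SQLite's planner rather than MySQL's (so the plan text differs), but the composite-index behavior is the same; the table and index names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (brand TEXT, category TEXT, price INTEGER)")
conn.execute("CREATE INDEX idx_brand_category ON product (brand, category)")

# Query on the left-most column: the planner can use the composite index.
plan_hit = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM product WHERE brand = 'Gree'"
).fetchall()

# Query that skips the left-most column: the index is unusable, full scan.
plan_miss = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM product WHERE category = 'AC'"
).fetchall()

print(plan_hit[-1][-1])   # e.g. SEARCH product USING INDEX idx_brand_category (brand=?)
print(plan_miss[-1][-1])  # e.g. SCAN product
```

The second query "loses" the index exactly as described above: the predicate exists, the index exists, but the left-most prefix is missing.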
Some queries cannot be satisfied by MySQL indexes at all. For example, searching for products whose title contains "格力空调" (Gree air conditioner):

```sql
SELECT * FROM product WHERE title LIKE '%格力空调%'
```

Because of the leading wildcard, this query cannot use any B+ tree index and triggers a full table scan. Worse, a typo such as "格空调" (a character dropped from the brand name) returns no results at all.
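For contrast, here is roughly how the same search looks in Elasticsearch query DSL. A match query is analyzed into terms and resolved against the inverted index, so no row-by-row scan is needed; whether the typo actually matches depends on the analyzer in use (e.g. a Chinese analyzer such as ik), so treat the fuzzy variant as a sketch:

```python
# The same product search expressed as Elasticsearch query DSL. A match query
# is analyzed into terms and looked up in the inverted index.
match_query = {"query": {"match": {"title": "格力空调"}}}

# With fuzziness, a near-miss such as the typo "格空调" can still be tolerated,
# subject to how the title field is analyzed.
fuzzy_query = {
    "query": {
        "match": {
            "title": {"query": "格空调", "fuzziness": "AUTO"}
        }
    }
}

print(match_query)
print(fuzzy_query)
```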
Elasticsearch Overview
Elasticsearch, built on the Lucene engine, is an open‑source distributed search and analytics engine that provides near real‑time queries on PB‑scale data. Its key features are:
Easy support for complex query conditions: Every field is indexed (inverted index), enabling full‑text search with custom scoring, sorting, and rich analyzers.
Strong scalability: Distributed storage allows horizontal expansion to hundreds or thousands of nodes, handling PB‑scale structured or unstructured data.
High availability and disaster recovery: replica shards, automatic master election, and failure detection keep the cluster available when nodes fail.
Compared with MySQL, ES indexes every field in an inverted index, so it can look up documents by keyword instead of scanning every row.
Inverted indexing works by tokenizing each document, building a mapping from terms to the documents that contain them, and ranking results with a relevance function such as TF-IDF or BM25 (the default in current Elasticsearch).
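The idea can be sketched in a few lines of Python. This is a toy index with English whitespace tokens for simplicity; real analyzers handle Chinese segmentation, stemming, stop words, and far more efficient posting-list encodings:

```python
import math
from collections import defaultdict

docs = {
    1: "gree air conditioner 2hp",
    2: "midea air conditioner",
    3: "gree heater",
}

# Build the inverted index: term -> set of doc ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # trivial whitespace tokenizer
        index[term].add(doc_id)

def idf(term):
    # Inverse document frequency: rarer terms are more discriminative.
    return math.log(len(docs) / (1 + len(index[term])))

print(sorted(index["gree"]))           # docs 1 and 3 contain "gree"
print(idf("heater") > idf("air"))      # "heater" (1 doc) outranks "air" (2 docs)
```

A keyword lookup is now a dictionary access over terms, not a scan over documents, which is the core of the MySQL-vs-ES contrast above.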
ES’s distributed architecture shards an index across multiple nodes. Queries are executed in parallel on each shard, and results are merged, providing high throughput and low latency.
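The scatter-gather query pattern can be illustrated with a small merge sketch. The shard results and scores below are made-up data; real ES coordinating nodes also handle pagination, errors, and partial failures:

```python
import heapq

# Each shard returns its local top hits, already sorted by score descending.
shard_results = [
    [("doc3", 9.1), ("doc7", 4.2)],
    [("doc5", 8.0), ("doc1", 6.5)],
    [("doc9", 7.7), ("doc2", 1.0)],
]

def merge_top_k(shard_results, k):
    # Merge the pre-sorted per-shard lists and keep the global top k.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[1])
    return list(merged)[:k]

print(merge_top_k(shard_results, 3))
# [('doc3', 9.1), ('doc5', 8.0), ('doc9', 7.7)]
```

Because each shard only returns its own top k, the coordinating node merges small sorted lists instead of raw data, which is what keeps latency low at scale.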
How to Build ES Indexes
Data is typically sourced from MySQL. Directly reading from MySQL and writing to ES can impact the production database, especially when joins are required for product, SKU, stock, and category tables. To avoid this, we introduce an intermediate layer.
We first sync MySQL data to Hive, where Hive generates a wide table that denormalizes product information, reducing the need for costly joins.
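The wide table is conceptually just a join materialized once, ahead of time. The sketch below shows the idea with SQLite for runnability (the real pipeline would use HiveQL, and the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stock    (product_id INTEGER, qty INTEGER);
INSERT INTO product  VALUES (1, 'Gree AC 2hp', 10);
INSERT INTO category VALUES (10, 'Appliances');
INSERT INTO stock    VALUES (1, 25);

-- The "wide table": one denormalized row per product. The joins are done
-- once here instead of on every search request.
CREATE TABLE product_wide AS
SELECT p.id, p.title, c.name AS category_name, s.qty AS stock_qty
FROM product p
JOIN category c ON c.id = p.category_id
JOIN stock    s ON s.product_id = p.id;
""")

print(conn.execute("SELECT * FROM product_wide").fetchone())
# (1, 'Gree AC 2hp', 'Appliances', 25)
```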
However, Hive’s batch processing introduces latency of tens of minutes, which is unacceptable for real‑time search.
Real‑Time Change Capture with Binlog
The solution is to capture MySQL binlog events in real time. Alibaba’s open‑source project Canal simulates a MySQL slave, receives binlog streams, and parses them.
Canal provides the raw change data, which is then processed by Otter, a distributed data synchronization system that parses, transforms, and loads the changes.
Otter’s workflow consists of four stages:
Select: Acquire data from various sources (Canal, etc.).
Extract: Assemble and filter data from MySQL, Oracle, files, etc.
Transform: Convert data to the target schema.
Load: Write data to the destination such as MQ, ES, or databases.
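As a conceptual sketch only (Otter itself is a Java system with its own APIs; the event shapes and table names below are hypothetical), the four stages can be modeled as a generator pipeline:

```python
def select(source):
    # Stage 1: acquire raw change events (e.g. handed over by Canal).
    yield from source

def extract(events):
    # Stage 2: keep only the tables this pipeline cares about.
    return (e for e in events if e["table"] in {"product", "stock"})

def transform(events):
    # Stage 3: map each row image onto the target (ES document) schema.
    for e in events:
        yield {"index": "product", "id": e["row"]["id"], "doc": e["row"]}

def load(actions, sink):
    # Stage 4: write to the destination (MQ, ES, another database...).
    for action in actions:
        sink.append(action)

events = [
    {"table": "product", "row": {"id": 1, "price": 2999}},
    {"table": "orders",  "row": {"id": 7}},   # filtered out in Extract
]
sink = []
load(transform(extract(select(events))), sink)
print(sink)  # one action, for the product row only
```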
We call this pipeline Data Transfer Service (DTS). After DTS is deployed, MQ messages trigger real‑time updates to ES and to the wide table, keeping both in sync.
During real‑time indexing, only sensitive fields such as price and stock are subscribed to, while the wide table subscribes to all fields. Full‑index rebuilds are still performed nightly to ensure data completeness and to recover from cluster failures.
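The split between the sensitive-field real-time path and the nightly full rebuild can be sketched as follows. This builds the "doc" form of an Elasticsearch partial-update body; the field list is an assumption based on the price/stock example above:

```python
SENSITIVE_FIELDS = ("price", "stock")   # fields the real-time path subscribes to

def build_partial_update(changed_row):
    # Keep only the sensitive fields for the real-time path; everything else
    # waits for the nightly full index rebuild. Returns the "doc" form of an
    # ES partial-update request body, or None if nothing urgent changed.
    doc = {f: changed_row[f] for f in SENSITIVE_FIELDS if f in changed_row}
    return {"doc": doc} if doc else None

print(build_partial_update({"price": 2999, "title": "Gree AC"}))
# {'doc': {'price': 2999}}
print(build_partial_update({"title": "Gree AC"}))
# None
```

Keeping the real-time path narrow bounds its write load on ES, while the nightly rebuild guarantees eventual completeness and recovery from failures.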
Conclusion
This article summarized practical approaches for building near real‑time Elasticsearch indexes on petabyte‑scale data, covering MySQL limitations, ES advantages, inverted indexing, distributed architecture, and a concrete pipeline using Hive, Canal, Otter, and DTS to achieve second‑level index updates.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.