
Building Near Real-Time Elasticsearch Indexes for PB‑Scale Data

This article explains how to construct near real‑time Elasticsearch indexes for petabyte‑level datasets by comparing MySQL limitations, describing Elasticsearch fundamentals, and detailing a pipeline that uses Hive, wide tables, MySQL binlog, Canal, and Otter to achieve second‑level index updates.


In recent years, the company’s rapid growth has caused data volumes to explode into the billions of records, making traditional relational databases such as MySQL unsuitable for fast, complex queries; Elasticsearch (ES) is introduced as a distributed search engine that can handle PB‑scale data with near‑real‑time response.

MySQL’s drawbacks are highlighted: even when a table has multiple indexes, the optimizer can use only one per query; the left‑most prefix rule often renders composite indexes unusable; and storing many indexes can consume more space than the data itself. A sample query, SELECT * FROM product WHERE title LIKE '%格力空调%' ("Gree air conditioner"), demonstrates how a leading‑wildcard LIKE forces a full‑table scan because no B‑tree index can serve it.
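The left‑most prefix rule can be made concrete with a minimal sketch (Python here, with hypothetical column names): a composite index on (category, brand, price) serves an equality query only if the queried columns start from the index’s first column.

```python
def index_usable(index_cols, query_cols):
    """A composite B-tree index can serve an equality query only if
    the queried columns cover a leading prefix of the index columns."""
    prefix_len = 0
    for col in index_cols:
        if col in query_cols:
            prefix_len += 1
        else:
            break  # the prefix is interrupted; later columns can't help
    return prefix_len > 0

idx = ["category", "brand", "price"]
print(index_usable(idx, {"category", "brand"}))  # True: leading prefix
print(index_usable(idx, {"brand", "price"}))     # False: skips 'category'
```

This is why adding more composite indexes quickly multiplies storage: each distinct leading prefix a query needs tends to demand its own index.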

ES is then described: built on Lucene, it creates an inverted index for every field, supports complex queries, scales horizontally via shards, and provides high availability through primary and replica shards spread across nodes. The article explains inverted indexing with diagrams and shows how ES distributes queries across nodes to avoid single‑node bottlenecks.
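The core of the inverted index is simple to sketch: instead of scanning every document for a term, map each term to the set of documents containing it. A minimal Python illustration (whitespace tokenization stands in for Lucene’s analyzers):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

docs = {1: "gree air conditioner", 2: "midea air conditioner", 3: "gree fan"}
index = build_inverted_index(docs)

# An AND query is just an intersection of posting lists:
hits = index["gree"] & index["conditioner"]  # -> {1}
```

In real ES each shard holds such an index for its slice of the data, so a query fans out to shards in parallel and merges the hits.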

To build ES indexes, the typical approach of pulling data directly from MySQL and performing joins is shown to be impractical for tens of millions of rows. Instead, a Hive‑based pipeline is proposed: MySQL data is synced to Hive, transformed into a temporary table stored in HBase, and then periodically dumped into ES. However, Hive’s batch nature introduces tens‑of‑minutes latency.
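The final dump step in such a pipeline typically goes through the ES bulk API. A sketch of shaping already-joined rows into bulk actions (the field and index names are illustrative, and a dict stands in for the real client payload):

```python
def rows_to_bulk_actions(rows, index_name):
    """Turn denormalized rows into Elasticsearch bulk-API action pairs:
    one metadata line followed by one document line per row."""
    actions = []
    for row in rows:
        actions.append({"index": {"_index": index_name, "_id": row["id"]}})
        actions.append({k: v for k, v in row.items() if k != "id"})
    return actions

rows = [{"id": 1, "title": "Gree AC", "price": 2999}]
actions = rows_to_bulk_actions(rows, "product")
```

Batching rows this way is what makes the periodic dump cheap per document, but the batch cadence itself is the source of the tens‑of‑minutes latency.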

To reduce latency, a wide table strategy is discussed: denormalizing product, SKU, price, stock, and category data into a single table in MySQL, which eliminates costly Hive joins and cuts the sync time to a few minutes, though it still cannot achieve true real‑time updates.
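Denormalization here means pre-joining the related tables into one flat row per SKU. A minimal sketch, with hypothetical table shapes (in production this row would live in the MySQL wide table):

```python
def build_wide_rows(products, skus, prices, stocks):
    """Pre-join product, SKU, price, and stock data into flat
    wide-table rows, one per SKU, so downstream sync needs no joins."""
    rows = []
    for sku in skus:
        product = products[sku["product_id"]]
        rows.append({
            "sku_id": sku["id"],
            "title": product["title"],
            "category": product["category"],
            "price": prices[sku["id"]],
            "stock": stocks[sku["id"]],
        })
    return rows

products = {10: {"title": "Gree AC", "category": "appliance"}}
skus = [{"id": 100, "product_id": 10}]
rows = build_wide_rows(products, skus, {100: 2999}, {100: 5})
```

The trade-off is classic: writes now fan out to the wide table, but the sync job becomes a straight table scan instead of a multi-way Hive join.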

The article then turns to MySQL binlog as the source of change data capture. Canal, an open‑source tool that mimics a MySQL slave, reads binlog events in real time. Otter, a distributed synchronization system, manages configurations via ZooKeeper and orchestrates the data flow through four stages—Select, Extract, Transform, Load (SETL)—sending processed records to a message queue.
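The four SETL stages can be sketched as a single pass over a binlog event. This is a simplified stand-in, not Otter’s actual API: the event shape, table filter, and output fields are all assumptions for illustration.

```python
def setl_pipeline(binlog_event, queue, watched_tables=("product", "sku")):
    """Sketch of the Select -> Extract -> Transform -> Load stages
    applied to one binlog change event."""
    # Select: keep only events for tables we actually index
    if binlog_event["table"] not in watched_tables:
        return False
    # Extract: pull the changed row out of the event
    row = binlog_event["row"]
    # Transform: reshape fields into the downstream document schema
    doc = {"id": row["id"], "title": row.get("title", ""), "op": binlog_event["op"]}
    # Load: hand the record to the message queue for consumers
    queue.append(doc)
    return True

queue = []
setl_pipeline({"table": "product", "op": "update",
               "row": {"id": 1, "title": "Gree AC"}}, queue)
setl_pipeline({"table": "audit_log", "op": "insert", "row": {"id": 9}}, queue)
```

Because Canal presents itself to MySQL as a replication slave, this whole flow rides on the same binlog stream that real replicas consume, adding no query load to the source database.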

These components together form a Data Transfer Service (DTS). By subscribing to the MQ, the system can instantly update ES indexes and the wide table, achieving second‑level index freshness. The need for periodic full‑index rebuilds is also explained to guarantee data completeness and recover from cluster failures.
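The consumer side is an upsert/delete loop over MQ messages. A minimal sketch in which a dict stands in for the ES index and the message fields are assumed:

```python
def apply_event(es_index, event):
    """Apply one DTS message to an index (a dict stands in for ES).
    Inserts and updates are both upserts; deletes remove the doc."""
    if event["op"] == "delete":
        es_index.pop(event["id"], None)
    else:
        es_index[event["id"]] = event["doc"]

es_index = {}
apply_event(es_index, {"op": "insert", "id": 1, "doc": {"title": "Gree AC"}})
apply_event(es_index, {"op": "update", "id": 1, "doc": {"title": "Gree AC 2P"}})
apply_event(es_index, {"op": "delete", "id": 1})
```

Because upserts are idempotent, replaying the stream after a failure converges to the same state; the periodic full rebuild then catches anything the stream missed entirely.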

Finally, the article summarizes the end‑to‑end solution: combine MySQL wide‑table full loads with DTS‑driven real‑time updates to maintain near‑real‑time ES indexes for PB‑scale data, and invites readers to explore the underlying middleware (ES, Canal, Otter) in more depth.

Tags: Big Data, Data Pipeline, Elasticsearch, Canal, Real-time indexing, Otter
Written by

Full-Stack Internet Architecture

Introducing full-stack Internet architecture technologies centered on Java
