How to Build a Scalable Billion‑Item Elasticsearch Search System from Scratch
This article walks through the end-to-end design and implementation of a search system that scales from millions to hundreds of millions of products. It covers the business background, data-center architecture, ES cluster sizing, multi-region data sync, RocketMQ-Flink pipelines, and multi-layer reconciliation, with the goal of high QPS and reliable eventual consistency between the database and ES.
Business Background
The platform's merchant recruitment management system supports TikTok e-commerce activities across many entity types, such as live-stream rooms, product recruitment, and coupon recruitment, of which product recruitment is the largest. It needs a unified data-center that can query and retrieve data across millions of items.
Key Concepts
Metric : A piece of metadata describing an attribute of an entity (e.g., product name, shop score, influencer level). Any field with clear semantics can be defined as a metric.
Collection : A set of related metrics that can be fetched together using a common identifier (e.g., product ID, shop ID). Collections express one‑to‑many relationships with metrics.
Solution : An abstraction that defines how to retrieve metrics from a specific collection.
Custom Header : The title row of a two‑dimensional data list, linked one‑to‑many with metrics.
Filter Item : A one‑to‑one mapping to a metric, used for filtering rows in a data list.
Audit View : A dynamically rendered page composed of custom headers and filter items for audit scenarios.
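The relationship between metrics, collections, and solutions can be sketched as a small data model. This is only an illustration under stated assumptions: the class layout, the field names, and the in-memory `_FAKE_PRODUCTS` source are hypothetical and are not taken from the original system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Metric:
    """One attribute of an entity, e.g. product name or shop score."""
    key: str
    label: str


@dataclass
class Collection:
    """Metrics that can be fetched together with one common identifier."""
    name: str
    id_field: str  # e.g. "product_id" or "shop_id"
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Solution:
    """Defines how to retrieve the metrics of one specific collection."""
    collection: Collection
    fetch: Callable[[str], Dict[str, object]]  # entity id -> raw record

    def get(self, entity_id: str) -> Dict[str, object]:
        record = self.fetch(entity_id)
        # Return only the metrics the collection declares; missing ones are None.
        return {m.key: record.get(m.key) for m in self.collection.metrics}


# Hypothetical in-memory source standing in for an RPC interface or an ES index.
_FAKE_PRODUCTS = {
    "p1": {"product_name": "Lamp", "shop_score": 4.8, "internal_flag": "x"},
}

product_collection = Collection(
    name="product_basic",
    id_field="product_id",
    metrics=[
        Metric("product_name", "Product name"),
        Metric("shop_score", "Shop score"),
    ],
)
product_solution = Solution(
    product_collection, lambda pid: _FAKE_PRODUCTS.get(pid, {})
)
```

Calling `product_solution.get("p1")` returns only the two declared metrics, so fields the collection does not define (here `internal_flag`) never reach the caller.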
Data‑Center Design
The data‑center is an Elasticsearch‑based service that provides configurable, extensible, and generic data acquisition for the recruitment platform. It abstracts data synchronization and query capabilities, exposing two data sources: external RPC interfaces and recruitment‑record ES indices.
Functional Design
Metrics flow through Filter Item → Custom Header → Audit View to render dynamic audit pages. Because different recruitment entities require distinct audit views, the design allows arbitrary combinations of these components.
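A minimal sketch of how filter items and custom headers might combine into an audit view follows. The class names mirror the concepts above, but the equality-only filter, the `" / "` cell separator, and the sample rows are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FilterItem:
    metric_key: str   # one-to-one with a metric
    expected: object  # keep rows whose metric equals this value

    def matches(self, row: Dict[str, object]) -> bool:
        return row.get(self.metric_key) == self.expected


@dataclass
class CustomHeader:
    title: str
    metric_keys: List[str]  # one header column can show several metrics

    def render_cell(self, row: Dict[str, object]) -> str:
        return " / ".join(str(row.get(k, "")) for k in self.metric_keys)


@dataclass
class AuditView:
    headers: List[CustomHeader]
    filters: List[FilterItem]

    def render(self, rows: List[Dict[str, object]]) -> List[List[str]]:
        table = [[h.title for h in self.headers]]  # title row
        for row in rows:
            if all(f.matches(row) for f in self.filters):
                table.append([h.render_cell(row) for h in self.headers])
        return table


rows = [
    {"product_name": "Lamp", "shop_name": "BrightCo", "shop_score": 4.8, "status": "pending"},
    {"product_name": "Desk", "shop_name": "WoodWorks", "shop_score": 4.2, "status": "approved"},
]
view = AuditView(
    headers=[
        CustomHeader("Product", ["product_name"]),
        CustomHeader("Shop", ["shop_name", "shop_score"]),
    ],
    filters=[FilterItem("status", "pending")],
)
table = view.render(rows)
```

Because a view is just a list of headers plus a list of filters, a different recruitment entity can get its own audit page by supplying a different combination, without new rendering code.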
Building the ES Cluster (0 → 1)
Two stability requirements guide the initial build: basic disaster‑recovery mechanisms and eventual data consistency between the recruitment DB and multi‑region ES.
1. Capacity Planning
Determine the number of primary shards per index based on projected data growth and query/write traffic.
Decide the number and specifications of data nodes per cluster.
Understand vertical vs. horizontal scaling and design disaster‑recovery across multiple data‑centers.
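As a rough worked example of the sizing arithmetic, the sketch below estimates primary shards from projected data volume and then derives a data-node count. The 30 GB target shard size, the shards-per-node limit, and the sample figures are assumptions chosen for illustration, not values from the original system. The sketch also sizes by data volume only, whereas a real plan would weigh query and write traffic as well.

```python
import math


def primary_shard_count(current_gb: float,
                        monthly_growth_gb: float,
                        horizon_months: int,
                        target_shard_gb: float = 30.0) -> int:
    """Estimate primary shards so each shard stays near target_shard_gb."""
    projected_gb = current_gb + monthly_growth_gb * horizon_months
    return max(1, math.ceil(projected_gb / target_shard_gb))


def data_node_count(primary_shards: int,
                    replicas: int,
                    max_shards_per_node: int) -> int:
    """Estimate data nodes needed to host all primary and replica shards."""
    total_shards = primary_shards * (1 + replicas)
    by_capacity = math.ceil(total_shards / max_shards_per_node)
    # A primary and its replicas must live on different nodes.
    return max(replicas + 1, by_capacity)


# 200 GB today, growing 25 GB per month, planned 12 months ahead:
# projected size is 200 + 25 * 12 = 500 GB.
shards = primary_shard_count(200, 25, 12)
nodes = data_node_count(shards, replicas=1, max_shards_per_node=6)
```

The primary shard count cannot be changed in place once an index is created, which is why it is derived from projected rather than current volume.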
Key Configuration Tips
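Set {"dynamic": false} in the index mapping to prevent unwanted mapping expansion.
Setting index.translog.durability=async improves write throughput, at the risk of losing recently acknowledged writes if a node crashes before the translog is fsynced.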
Default refresh_interval is 1 s, meaning data becomes searchable within one second after a successful write.
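Put together, these tips might look like the following index-creation body. The setting names (`dynamic`, `index.translog.durability`, `index.refresh_interval`) are standard Elasticsearch options, but the shard and replica counts and the three example fields are placeholder assumptions. The body is built as a plain Python dict so that it can be serialized and sent with whichever client is in use.

```python
import json

# Hypothetical index definition; field names and shard counts are placeholders.
index_body = {
    "settings": {
        "index": {
            "number_of_shards": 6,       # assumed; derive from capacity planning
            "number_of_replicas": 1,     # assumed
            "refresh_interval": "1s",    # the default, stated explicitly
            "translog": {"durability": "async"},  # faster writes, may lose recent ones
        }
    },
    "mappings": {
        "dynamic": False,  # unknown fields stay in _source but are not indexed
        "properties": {
            "product_id": {"type": "keyword"},
            "product_name": {"type": "text"},
            "shop_score": {"type": "float"},
        },
    },
}

# JSON request body for the create-index call.
request_body = json.dumps(index_body, indent=2)
```

With `dynamic` set to false, a document carrying an undeclared field is still accepted, but that field is neither mapped nor searchable, which keeps the mapping from growing as upstream data changes.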
Data Synchronization Solutions
The core challenge is keeping the recruitment DB and ES indices consistent in near‑real time.
Solution 1: Direct Multi‑Region Write (Dsyncer)
Writes directly to ES in each region. Advantages: short path, low latency. Drawbacks: weak multi‑region write success guarantees, limited bulk write performance, and difficulty ensuring ordered updates.
Solution 2: RocketMQ → Single‑Region ES + Cross‑Cluster Replication
DB changes are sent to RocketMQ, written to a single ES region, then replicated to other regions via ES’s cross‑cluster replication. Drawbacks: longer path, higher latency, and a single‑point‑of‑failure risk if the source region goes down.
Solution 3: RocketMQ + Flink → Multi‑Region ES (Chosen)
Multiple independent consumer groups consume from RocketMQ and, using Flink, write to ES in each region separately. This eliminates single‑point risk and leverages RocketMQ’s ordered key handling.
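The isolation between regions can be illustrated with a small in-memory simulation. It uses no RocketMQ or Flink APIs: `RegionConsumer`, the list-based `change_log`, and the dict-based region stores are hypothetical stand-ins that only show why one region's outage does not block another.

```python
from typing import Callable, Dict, List


class RegionConsumer:
    """Stands in for one consumer group and its writer job for a single region."""

    def __init__(self, region: str, sink: Callable[[Dict], None]):
        self.region = region
        self.sink = sink
        self.offset = 0  # each group tracks its own position in the log

    def poll(self, log: List[Dict]) -> None:
        while self.offset < len(log):
            try:
                self.sink(log[self.offset])
            except ConnectionError:
                break  # keep the offset; the same message is retried next poll
            self.offset += 1


change_log = [{"id": "p1", "price": 10}, {"id": "p1", "price": 12}]
region_a: Dict[str, Dict] = {}
region_b: Dict[str, Dict] = {}
region_b_state = {"up": False}  # simulate an outage in region B


def write_a(doc: Dict) -> None:
    region_a[doc["id"]] = doc


def write_b(doc: Dict) -> None:
    if not region_b_state["up"]:
        raise ConnectionError("region B unavailable")
    region_b[doc["id"]] = doc


consumer_a = RegionConsumer("A", write_a)
consumer_b = RegionConsumer("B", write_b)
consumer_a.poll(change_log)  # region A catches up fully
consumer_b.poll(change_log)  # region B stalls at offset 0 without affecting A
```

Once region B recovers, its consumer resumes from its own saved offset and replays the changes in order, so both regions converge without any cross-region replication step.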
Why RocketMQ Guarantees Ordering
RocketMQ assigns messages with the same Sharding Key to the same Queue, ensuring FIFO order within that queue. Consumers process messages from a given Queue in strict order, which solves most out‑of‑order issues, though extreme cases (e.g., partition failures) still require reconciliation.
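The routing idea can be sketched with a hash-modulo selector. This is a simulation rather than RocketMQ client code: the `select_queue` function, the CRC32 hash, and the queue count of 4 are assumptions, and a real producer would supply its own queue selection logic.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def select_queue(sharding_key: str, queue_count: int) -> int:
    """Map a sharding key to a queue index; the same key always gets the same queue."""
    return zlib.crc32(sharding_key.encode("utf-8")) % queue_count


QUEUE_COUNT = 4  # assumed
queues: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)

# Change events as (product id, operation), in the order the DB produced them.
events = [
    ("p1", "create"),
    ("p2", "create"),
    ("p1", "update_price"),
    ("p1", "delete"),
]
for key, op in events:
    queues[select_queue(key, QUEUE_COUNT)].append((key, op))

# All events for p1 sit in one queue, in their original order.
p1_queue = queues[select_queue("p1", QUEUE_COUNT)]
p1_ops = [op for key, op in p1_queue if key == "p1"]
```

Ordering holds per key, not globally: events for `p1` and `p2` may be consumed in any relative order, which is acceptable because they update different documents.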
Multi‑Layer Reconciliation Mechanism
Three layers ensure eventual consistency between DB and ES:
Business Check Platform (BCP) – Seconds‑level Reconciliation : Listens to the DB binlog and compares each change directly against ES in every region, so sync latency or blockage is detected quickly.
Minute‑level Reconciliation : Periodically queries DB and ES without relying on intermediate components; triggers automatic compensation when mismatches are found.
T+1 Offline Reconciliation : Syncs DB and ES data to Hive nightly, validates day‑level consistency, and initiates compensation if needed.
These layers complement each other: BCP catches real‑time issues, minute‑level fills gaps where BCP cannot see, and offline reconciliation provides a safety net.
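The comparison step of a minute-level check could be sketched as follows. It assumes that each DB row and ES document carries a comparable version number and that both sides have already been loaded as id-to-version maps; the `reconcile` function and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(db_versions: Dict[str, int],
              es_versions: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Compare id -> version maps and return (ids to re-sync, ids to delete from ES)."""
    # Missing in ES or version mismatch: re-sync from the DB, the source of truth.
    to_upsert = sorted(
        doc_id for doc_id, version in db_versions.items()
        if es_versions.get(doc_id) != version
    )
    # Present in ES but gone from the DB: stale document, remove it.
    to_delete = sorted(doc_id for doc_id in es_versions if doc_id not in db_versions)
    return to_upsert, to_delete


db_snapshot = {"p1": 3, "p2": 1, "p3": 5}
es_snapshot = {"p1": 2, "p2": 1, "p4": 7}
to_upsert, to_delete = reconcile(db_snapshot, es_snapshot)
```

Here `p1` is stale in ES, `p3` is missing from it, and `p4` no longer exists in the DB, so the compensation job would re-sync the first two and delete the third. Treating the DB as the source of truth keeps compensation idempotent: running the check twice produces the same corrections.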
Current Capacity and Observations
After the first phase, the system holds tens of millions of indexed products and handles 500-1000 read QPS per region and around 500 write QPS. However, spikes in CPU usage and query latency have been observed, which makes performance tuning the next piece of work.
dbaplus Community
