How JD Scaled Its One‑Yuan Grab Treasure System with Database Sharding and ES Aggregation
This article details JD's One‑Yuan Grab Treasure platform redesign, covering business growth drivers, database sharding estimation, hash‑plus‑range routing implementation, Elasticsearch aggregation, Canal‑based sync, historical data migration, and downgrade mechanisms to ensure high‑throughput, reliable order processing during massive sales events.
Background
One‑Yuan Grab Treasure is a JD virtual‑goods service where each product (item) is sold in multiple periods (phases). Each period has a fixed number of participants (e.g., 1000 people each paying 1 CNY). The system experienced rapid order growth, especially before the 618 promotion, leading to concerns about database capacity and latency.
Data Volume Estimation
Assuming 1 million orders per day → 3.6 billion orders per year. Each order record ≈500 bytes (10 columns, 50 bytes each). Annual storage ≈170 GB. With a 200 GB per‑node capacity, one additional node per year would satisfy raw storage, but actual capacity must also meet QPS and latency targets measured in load tests.
Sharding Design
Routing Strategy
Two classic sharding methods were considered:
Hash routing : hash(item_id) → database; evenly distributes rows but requires full data migration when a new node is added.
Range (interval) routing : range of period_id → table; supports incremental expansion but can create hotspot tables and adds a lookup step.
The final design combines both:
Hash the grab_item_id to select the target database.
Within the chosen database, use the grab_period_id range to select the specific table.
This hybrid approach keeps the DAO layer transparent to upper services while allowing dynamic addition of databases without immediate data migration.
Table Layout
Each database contains a set of tables named order_0, order_1, …, each covering a contiguous interval of grab_period_id. The interval size is configurable (e.g., 10 000 periods per table) and stored in a routing‑rule table that is cached in memory.
Aggregation and Synchronization
Aggregated queries (e.g., total participants per item) are served by Elasticsearch (ES). ES provides automatic shard scaling and a query‑only interface for the business layer.
Because ES does not guarantee strong consistency, a fallback aggregation table in MySQL stores recent aggregates (e.g., last 5 minutes) for read‑through when ES is delayed or unavailable.
Change Capture with Canal
Canal monitors the binlog of a replica MySQL instance. For each committed transaction it extracts rows that belong to the sharded order tables, builds a lightweight message {table, primary_key}, and publishes it to a message queue.
Consumers read the message, fetch the latest row from the replica, and update both ES and the MySQL aggregation table. This decouples the write path from the indexing path and enables retry on failure.
Historical Data Migration
The system originally ran on a single database. Migration to the sharded architecture proceeds in two phases:
Pre‑cutover : A batch job scans the legacy table, transforms each row to the target (db, table) based on the routing rules, and bulk‑loads it into the new shards. The job is idempotent and can be re‑run for verification.
Post‑cutover : After the cutover window, only incremental rows (inserted after the pre‑cutover snapshot) are migrated. This minimizes downtime.
Validation compares three dimensions—order_id, item_id, period_id—between source and target. Any mismatch triggers a re‑migration of the affected day.
Degradation Paths
If Canal falls behind or the replica becomes unavailable, a switch enables direct scanning of the primary database for the most recent hours and pushes the data to ES and the aggregation table.
If ES is down, the ES switch is turned off and queries fall back to the MySQL aggregation table, ensuring that aggregate results remain available.
Key Commands and Configuration Snippets
# Example of routing rule table (cached in memory)
CREATE TABLE routing_rule (
db_index INT NOT NULL,
period_start BIGINT NOT NULL,
period_end BIGINT NOT NULL,
table_name VARCHAR(64) NOT NULL,
PRIMARY KEY (db_index, period_start)
);
# Canal configuration (simplified)
instance.master.address = 10.0.1.100:3306
instance.filter.regex = ^order_.*$
instance.destination = order_syncConclusion
The combination of hash‑based database selection, range‑based table selection, Elasticsearch aggregation, Canal‑driven change capture, and a two‑stage migration process enabled the One‑Yuan Grab Treasure service to scale for massive traffic spikes while preserving data integrity and providing robust fallback mechanisms.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
