Databases 21 min read

How We Scaled an E‑commerce Order System with Sharding, Consistent Hashing, and Zero‑Downtime Migration

This article details how a rapidly growing e‑commerce platform migrated from a single MySQL instance to a 16‑shard architecture using Sharding‑Jdbc, introduced consistent‑hashing to mitigate data skew, leveraged ES+HBase for multi‑dimensional queries, and implemented zero‑downtime migration strategies such as dual‑write and Canal replication.

dbaplus Community
dbaplus Community
dbaplus Community
How We Scaled an E‑commerce Order System with Sharding, Consistent Hashing, and Zero‑Downtime Migration

Background

A few years ago an e‑commerce company saw its daily orders surge from 300,000 to 1,000,000, with peak traffic reaching 2,000 TPS (over 10,000 TPS during flash‑sale events). The system ran on a single MySQL database, which soon became a bottleneck.

Why Sharding?

When a database hits I/O or CPU limits, active connections grow until the maximum threshold is reached, causing service outages. After exhausting code, SQL, and index optimizations, the team decided to split the database.

IO bottleneck : Hot data exceeds cache, causing heavy disk reads. Solution – master‑slave replication with read‑write splitting or sharding + horizontal partitioning.

Write IO bottleneck : Frequent writes generate many disk writes. Solution – multiple master databases plus horizontal partitioning.

CPU bottleneck : Complex SQL (JOIN, GROUP BY, ORDER BY) or massive single‑table size (> 100 M rows) stresses CPU. Solution – SQL tuning, indexing, or sharding.

Sharding Options

Proxy middleware (MyCat, KingShard) – low code changes but adds a black‑box layer; failures can be hard to debug.

Lightweight Java libraries (Sharding‑Jdbc, TSharding) – requires code changes but gives developers full control.

Cloud‑native distributed databases (e.g., PolarDB‑X) – minimal code changes but locks you into the provider.

The team chose the second option, Sharding‑Jdbc, for its transparency and controllability.

Sharding Design

Target: support a two‑year growth to 10 M daily orders (10× current volume). The plan split data into 16 databases, each handling ~625 k daily orders (≈1,250 TPS). Each database contains 16 tables, giving an average of 2.85 M rows per table – a manageable size for the next 2‑3 years.

Sharding key: user_id (most frequent query) hashed to distribute rows evenly. The order_id embeds the user_id (prefix) so that the user’s orders reside in the same shard, enabling fast look‑ups.

Using 16 shards (2⁴) allows hash‑modulo to be replaced by a bitwise AND, which is faster than a modulo operation.

Handling Data Skew with Consistent Hashing

Simple hash‑modulo can cause uneven data distribution. The team introduced consistent hashing with virtual nodes, which spreads data more uniformly and simplifies scaling: adding a new shard only moves data from its immediate neighbor.

Images illustrate the virtual‑node concept and data migration on a hash ring.

Hash ring with virtual nodes
Hash ring with virtual nodes
Data distribution before and after virtual nodes
Data distribution before and after virtual nodes

Management‑Side Query Solution

After sharding, multi‑dimensional queries (e.g., orders of a specific product) become inefficient. The solution combines Elasticsearch (ES) for indexing and HBase for raw storage:

All order rows are written to HBase (fast row‑key look‑ups).

Relevant fields (merchant, product name, order date, etc.) are indexed in ES.

Query flow: ES returns matching row‑keys, then HBase fetches the full rows.

ES + HBase query flow
ES + HBase query flow

Real‑Time Data Sync

MySQL order data is streamed to HBase and ES using Canal + RocketMQ:

Canal mimics a MySQL slave, pulling binlog events.

Events are sent to RocketMQ (transactional messages).

Consumers write to HBase and update ES indexes.

Canal data pipeline
Canal data pipeline

Zero‑Downtime Migration Strategies

Three main approaches were evaluated:

Slave‑only migration : create a replica, sync, then promote. Requires a brief downtime – unsuitable for high‑traffic systems.

Dual‑write : old and new databases receive writes simultaneously; old data is batch‑migrated; traffic is gradually shifted to the new shard.

Canal‑based sync : use Canal to replicate changes, avoiding code changes for dual‑write.

Detailed steps for the dual‑write approach include code changes, migration scripts, data validation, gradual read‑traffic cut‑over, and final cleanup.

Dual‑write migration flow
Dual‑write migration flow

Scaling and Degradation

During peak promotions, the system can switch to asynchronous order processing: orders are queued in RocketMQ, temporarily stored in Redis, and batch‑written to MySQL every 100 ms or 10 orders. This reduces write pressure but may cause temporary inconsistency, which is acceptable as a degradation mode.

Async batch write flow
Async batch write flow
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ElasticsearchshardingmysqlHBaseconsistent hashingdatabase scalingzero-downtime migration
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.