How Vivo Scaled Its E‑Commerce Order System with Sharding, Migration, and Distributed Transactions

This article details how Vivo transformed its monolithic v1.0 online store into a service‑oriented order system: separating the order module, applying data archiving, caching, and read‑write splitting, sharding the database with Sharding‑Sphere, synchronizing MySQL changes to Elasticsearch, handling distributed transactions, and safely migrating to a new database cluster.

dbaplus Community

Background

Rapid growth of the Vivo official mall exposed the limitations of the monolithic v1.0 architecture: oversized modules, low development efficiency, performance bottlenecks, and difficult maintenance. Starting in 2017, a v2.0 upgrade split business lines into independent services, with the order module extracted as a dedicated order system.

System Architecture

The order system runs on its own database and provides standardized order, payment, logistics, and after‑sale services to the mall.

System architecture diagram

Technical Challenges

Data volume and high concurrency

Historical orders reached tens of millions of rows in MySQL. InnoDB’s B+‑tree index lookup time grows with data size, making simple indexing insufficient.
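A rough back‑of‑the‑envelope estimate shows why: assuming 16 KB InnoDB pages, 8‑byte bigint keys with 6‑byte page pointers in internal nodes, and roughly 1 KB rows (all illustrative figures, not measurements from Vivo's system), a B+‑tree of height 3 tops out around twenty million rows, after which every additional level adds another page read to each lookup:

```latex
\text{fanout} \approx \frac{16\,\mathrm{KB}}{8\,\mathrm{B} + 6\,\mathrm{B}} \approx 1170,
\qquad
\text{rows}_{h=3} \approx 1170 \times 1170 \times 16 \approx 2.2 \times 10^{7}
```

where 16 is the approximate number of 1 KB rows per leaf page.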

The following mitigations were adopted, each detailed below:

- Data archiving (hot vs. cold tables)
- Table partitioning / sharding
- Cache layer (Redis) to absorb read traffic
- Read‑write splitting
- Vertical and horizontal database sharding

Data archiving

Recent orders are kept in a hot table, while older orders are moved to an archive table. Application queries are adjusted to target the appropriate table, reducing the size of the primary order table.
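The table routing could look like the sketch below, assuming a three‑month hot window and the table names `orders` / `orders_archive`; both the cutoff and the names are illustrative, not Vivo's actual values.

```java
import java.time.LocalDate;
import java.time.Period;

// Hot/cold table routing sketch: queries for recent orders go to the hot
// table, everything older is served from the archive table.
public class OrderTableRouter {
    static final Period HOT_WINDOW = Period.ofMonths(3); // illustrative cutoff

    /** Pick the table a query for an order created on the given date should target. */
    public static String tableFor(LocalDate createdAt, LocalDate today) {
        return createdAt.isBefore(today.minus(HOT_WINDOW)) ? "orders_archive" : "orders";
    }
}
```

In practice the application layer makes this decision before building the SQL, which is why queries had to be adjusted when archiving was introduced.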

Cache layer

Redis is placed in front of MySQL to cache frequently accessed data. It works well for product data but has limited hit rate for order data because each order is unique.
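The cache‑aside pattern described above can be sketched as follows; a `HashMap` stands in for Redis so the snippet is self‑contained, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside sketch: read the cache first, fall back to the database on a
// miss, and populate the cache with the loaded value.
public class CacheAside {
    private final Map<String, String> cache = new HashMap<>(); // stands in for Redis
    private final Function<String, String> db;                 // fallback load from MySQL

    public CacheAside(Function<String, String> db) { this.db = db; }

    public String get(String key) {
        String v = cache.get(key);
        if (v == null) {                    // cache miss: load from DB
            v = db.apply(key);
            if (v != null) cache.put(key, v); // populate for later reads
        }
        return v;
    }
}
```

This works well for product data, where many users read the same keys; order data rarely repeats across users, which is exactly why its hit rate stays low.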

Cache diagram

Read‑write splitting

The primary database handles writes and replicates changes to multiple read replicas. Heavy write traffic during peak ordering still pressures the primary, and replication lag (typically a few milliseconds, but longer under load) can cause reads to return briefly stale data.
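The routing decision can be sketched minimally as below, assuming a `requireFresh` flag that callers set when they must read their own just‑written data (for example, showing an order immediately after placing it); the data‑source names are illustrative.

```java
// Read/write routing sketch: writes always hit the primary; reads go to a
// replica unless the caller demands read-your-own-write consistency, in
// which case the primary is used to sidestep replication lag.
public class ReadWriteRouter {
    public static String route(String sql, boolean requireFresh) {
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        if (!isRead || requireFresh) return "primary";
        return "replica";
    }
}
```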

Read‑write splitting diagram

Sharding strategy

Sharding‑Sphere (formerly Sharding‑JDBC) is used for horizontal sharding of both databases and tables. The user ID is the sharding key; its hash determines the database and table indices:

- Database index: Hash(userId) / m % n
- Table index:    Hash(userId) % m

where n is the number of databases and m the number of tables per database.
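The two formulas can be expressed directly in code; the hash function here is a stand‑in (any stable, non‑negative hash works, as long as every service instance uses the same one).

```java
// The routing formulas above, as code.
public class ShardRouter {
    static int hash(long userId) {
        return Long.hashCode(userId) & 0x7fffffff; // force non-negative
    }

    /** Database index: Hash(userId) / m % n */
    public static int dbIndex(long userId, int n, int m) {
        return hash(userId) / m % n;
    }

    /** Table index: Hash(userId) % m */
    public static int tableIndex(long userId, int m) {
        return hash(userId) % m;
    }
}
```

With, say, n = 4 databases and m = 16 tables each, a user ID therefore lands deterministically in one of 64 physical tables, and all of one user's orders stay in the same shard.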

Sharding routing diagram

Sharding‑Sphere selection

Four options were evaluated: client SDK, middleware proxy, in‑house framework, and building from scratch. The open‑source Sharding‑JDBC (now Sharding‑Sphere) was chosen for its jar‑based client‑side sharding and XA transaction support.

GitHub repository: https://github.com/sharding-sphere/


Limitations and mitigations

Sharding rules out cross‑shard joins, and SQL that the sharding layer cannot route or merge must be rewritten manually. Additional challenges include:

Global unique ID: Order numbers embed database and table indices, allowing lookup without a user ID.

Historical order IDs: A mapping table stores old order numbers and their user IDs; usage declines over time.

Admin pagination: Order data is duplicated to Elasticsearch for flexible backend queries.
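A routable order number might look like the sketch below, assuming the last four digits carry the two indices (two digits each); the digit layout is illustrative, not Vivo's actual format.

```java
// Order number that carries its own routing info: the database and table
// indices are appended to a sequence value, so an order can be located from
// the order number alone, without the user ID.
public class OrderNo {
    public static String encode(long sequence, int dbIndex, int tableIndex) {
        return String.format("%d%02d%02d", sequence, dbIndex, tableIndex);
    }

    public static int dbIndex(String orderNo) {
        return Integer.parseInt(orderNo.substring(orderNo.length() - 4, orderNo.length() - 2));
    }

    public static int tableIndex(String orderNo) {
        return Integer.parseInt(orderNo.substring(orderNo.length() - 2));
    }
}
```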

MySQL → Elasticsearch synchronization

Two approaches were considered:

Message queue (MQ): Order‑change messages are consumed by an ES update service.

Binlog: Tools like Canal act as a MySQL replica, parse binlog events, and push changes to ES.

The MQ solution was selected because ES is only used by the admin backend, so real‑time consistency is not critical. Manual sync functions are added for compensation in case of failures.

MQ sync diagram
Binlog sync diagram

Database migration strategies

Two plans were evaluated:

Non‑stop migration

1. Copy data from the old DB to the new cluster and run a real‑time sync program (e.g., binlog‑based).

2. Deploy dual‑write services, initially writing only to the old DB.

3. Enable dual‑write, stop the sync program, and run a compensation program to ensure consistency.

4. Gradually shift read traffic to the new DB.

5. After full verification, switch all reads/writes to the new DB and decommission the old one.
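The dual‑write phase above can be sketched as follows, with in‑memory lists standing in for the two databases and a runtime flag for the dual‑write switch; all names are assumptions, not Vivo's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Dual-write sketch: the old DB stays authoritative, and writes are mirrored
// to the new DB once the flag is flipped at runtime.
public class DualWriter {
    final List<String> oldDb = new ArrayList<>();
    final List<String> newDb = new ArrayList<>();
    volatile boolean writeNew = false; // flipped once the sync program stops

    void saveOrder(String order) {
        oldDb.add(order);               // old DB remains the source of truth
        if (writeNew) newDb.add(order); // mirrored write once dual-write is on
    }
}
```

Rows written before the flag flips are the sync program's job; rows written after are mirrored, which is why a compensation pass is still needed to close the gap between the two.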

Non‑stop migration flow

Stop‑the‑world migration

1. Deploy the new order system and sync orders older than two months to the new DB for audit.

2. Shut down the v1 application to freeze the old data.

3. Migrate the remaining orders and audit them.

4. Launch the v2 application; if it fails, roll back to v1 using the dual‑write switch.

Stop‑the‑world migration flow

The stop‑the‑world plan was chosen because the higher cost of non‑stop migration outweighed the limited impact of a nightly downtime.

Distributed transaction handling

E‑commerce scenarios require cross‑service consistency (e.g., notifying shipping after payment, awarding points after order confirmation). Solutions include:

Two‑phase commit (2PC) or three‑phase commit (3PC) for strong consistency.

TCC, local messages, transaction messages, and best‑effort notifications for eventual consistency.

The system uses a local message table: the pending asynchronous operation is recorded as a message row inside the same local transaction as the business write, and a scheduled job retries messages that failed to send.
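A minimal sketch of the pattern, with an in‑memory queue standing in for the message table and a pluggable send function so the retry path is visible; all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Local-message-table sketch: the business write and the pending message are
// committed together (a synchronized block stands in for the local
// transaction), and a scheduled job later drains the table, keeping failed
// sends for the next run.
public class LocalMessageTable {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Business operation + message insert, atomically. */
    public synchronized void placeOrder(String orderId) {
        // ... insert the order row here, in the same transaction ...
        pending.add("order-created:" + orderId);
    }

    /** Scheduled job: try to publish each message; keep failures for the next run. */
    public synchronized int drain(Predicate<String> send) {
        int sent = 0;
        for (int i = pending.size(); i > 0; i--) {
            String msg = pending.poll();
            if (send.test(msg)) sent++;
            else pending.add(msg); // failed: retry on a later run
        }
        return sent;
    }

    public int pendingCount() { return pending.size(); }
}
```

Because the message row commits or rolls back with the order itself, a message exists if and only if the business write succeeded, which is what gives the pattern its eventual consistency.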

Local message flow

System security and stability measures

Network isolation: Only a few third‑party APIs are exposed to the internet with signature verification; internal services communicate via intranet RPC.

Row‑level locks: Prevent concurrent updates to the same order.

Idempotent interfaces: All APIs are designed to be idempotent, avoiding side effects from retries.

Circuit breaker: Hystrix protects external calls from cascading failures.

Monitoring & alerting: Log‑based error alerts, service‑trace analysis, and middleware health checks enable rapid anomaly detection.
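The idempotency measure can be sketched as follows, assuming each request carries a unique key such as order number plus operation; the concurrent set here stands in for a unique‑keyed database table or a Redis SETNX check.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotency sketch: a handler runs at most once per request key, so MQ
// redeliveries and client retries have no extra effect.
public class IdempotentHandler {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the operation actually ran, false if it was a duplicate. */
    public boolean handle(String requestKey, Runnable operation) {
        if (!processed.add(requestKey)) return false; // already seen: skip
        operation.run();
        return true;
    }
}
```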

Pitfalls encountered

MQ ordering issue: Concurrent consumer threads could write stale data to ES. The fix was to lock the order row and perform the entire sync within a single transaction.

Sharding‑JDBC pagination bug: When the GROUP BY fields differ from the ORDER BY fields, Sharding‑JDBC (observed in version 3.1.1) rewrites the limit to Integer.MAX_VALUE, breaking pagination. The original query:

```sql
select a from temp group by a, b order by a desc limit 1,10
```

was rewritten so that the grouping itself produces the desired order, letting the ORDER BY clause be dropped (this relies on MySQL's legacy `GROUP BY ... DESC` implicit sorting, which was removed in MySQL 8.0):

```sql
select a from temp group by a desc, b limit 1,10
```

Elasticsearch pagination: When paginating ES results on a non‑unique sort field, add a unique secondary sort key (e.g., order creation time plus order ID) to avoid missing or duplicated records across pages.
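The tie‑breaking sort can be sketched as below, with illustrative field names: creation time first, then the unique order ID, so rows that share a timestamp still have one fixed relative order across page boundaries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stable pagination sketch: a unique secondary sort key makes the total
// order deterministic, so consecutive pages never skip or repeat rows.
public class StablePaging {
    static class Order {
        final long createdAt;
        final String orderId;
        Order(long createdAt, String orderId) { this.createdAt = createdAt; this.orderId = orderId; }
    }

    // Creation time first, unique order ID as the tiebreaker.
    static final Comparator<Order> PAGE_ORDER = (a, b) -> {
        int c = Long.compare(a.createdAt, b.createdAt);
        return c != 0 ? c : a.orderId.compareTo(b.orderId);
    };

    /** Return the order IDs for one page of results. */
    public static List<String> pageIds(List<Order> in, int from, int size) {
        List<Order> copy = new ArrayList<>(in);
        copy.sort(PAGE_ORDER);
        List<String> ids = new ArrayList<>();
        for (int i = from; i < Math.min(from + size, copy.size()); i++)
            ids.add(copy.get(i).orderId);
        return ids;
    }
}
```

In Elasticsearch terms, this corresponds to sorting on both fields in the query, so every page request evaluates the same total order.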

Results

One‑click rollout succeeded and has been stable for over a year.

Core service performance improved by more than tenfold.

System decoupling dramatically increased iteration speed.

The architecture can support at least five years of rapid growth for the mall.

Conclusion

The team selected solutions that matched actual business needs rather than chasing the latest hype. By anticipating future demands and planning architectural evolution, the system stayed ahead of growth challenges and achieved long‑term scalability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: backend, database-migration, distributed-transactions
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS & DAMS conferences, delivered by industry experts.