How Vivo Scaled Its E‑Commerce Order System with Sharding, Migration, and Distributed Transactions
This article details how Vivo transformed its monolithic v1.0 online store into a service‑oriented order system: separating the order module, archiving data, adding a Redis cache, splitting reads and writes, sharding databases with Sharding‑Sphere, synchronizing MySQL changes to Elasticsearch, handling distributed transactions, and safely migrating to a new database cluster.
Background
Rapid growth of the Vivo official mall exposed the limitations of the monolithic v1.0 architecture: oversized modules, low development efficiency, performance bottlenecks, and difficult maintenance. Starting in 2017, a v2.0 upgrade split business lines into independent services, with the order module extracted as a dedicated order system.
System Architecture
The order system runs on its own database and provides standardized order, payment, logistics, and after‑sale services to the mall.
Technical Challenges
Data volume and high concurrency
Historical orders reached tens of millions of rows in MySQL. Because InnoDB's B+‑tree lookup cost grows with table size, indexing alone was insufficient, so several mitigations were combined:
- Data archiving (hot vs. cold tables)
- Table partitioning / sharding
- Cache layer (Redis) to absorb read traffic
- Read‑write splitting
- Vertical and horizontal database sharding
Data archiving
Recent orders are kept in a hot table, while older orders are moved to an archive table. Application queries are adjusted to target the appropriate table, reducing the size of the primary order table.
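The routing change described above can be sketched in a few lines; the table names and the 90‑day cutoff below are illustrative assumptions, not Vivo's real values:

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=90)  # assumed cutoff; the article gives no figure


def order_table_for(create_time: datetime, now: datetime) -> str:
    """Route a query to the hot table for recent orders,
    and to the archive table for everything older."""
    return "t_order" if now - create_time <= HOT_WINDOW else "t_order_history"
```

The application-side query builder would call this before composing SQL, so the hot table stays small while old orders remain queryable.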
Cache layer
Redis is placed in front of MySQL to cache frequently accessed data. It works well for product data but has limited hit rate for order data because each order is unique.
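The cache‑aside pattern used here can be sketched as follows; a plain dict stands in for Redis and `load_from_db` is a hypothetical MySQL query:

```python
def get_product(cache: dict, load_from_db, product_id):
    """Cache-aside read: check the cache first, fall back to the
    database on a miss, and backfill so the next read hits."""
    value = cache.get(product_id)
    if value is None:
        value = load_from_db(product_id)  # miss: query MySQL
        cache[product_id] = value         # backfill the cache
    return value
```

This works well for product data, which many users share; each order, by contrast, is read by only one user, which is why the hit rate for order data stays low.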
Read‑write splitting
The primary database handles writes and replicates changes to multiple read replicas. Heavy write traffic during peak ordering still pressures the primary, and replication lag (typically on the order of milliseconds) can cause temporarily stale reads.
Sharding strategy
Sharding‑Sphere (formerly Sharding‑JDBC) is used for horizontal sharding of both databases and tables. The user ID is the sharding key; its hash determines the database and table indices:
- Database index: Hash(userId) / m % n
- Table index: Hash(userId) % m

where n is the number of databases and m is the number of tables per database.
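The two formulas can be sketched directly; CRC32 as the hash function and the values of n and m below are illustrative assumptions, not Vivo's actual topology:

```python
import zlib

N_DATABASES = 4   # n: number of databases (assumed)
M_TABLES = 16     # m: tables per database (assumed)


def shard_route(user_id: str) -> tuple:
    """Apply the routing formulas: db = Hash(userId) / m % n,
    table = Hash(userId) % m."""
    h = zlib.crc32(user_id.encode("utf-8"))  # stable across processes
    db_index = h // M_TABLES % N_DATABASES
    table_index = h % M_TABLES
    return db_index, table_index
```

Dividing by m before taking the modulus for the database index ensures the two indices vary independently, so users spread evenly across all n × m tables.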
Sharding‑Sphere selection
Four options were evaluated: client SDK, middleware proxy, in‑house framework, and building from scratch. The open‑source Sharding‑JDBC (now Sharding‑Sphere) was chosen for its jar‑based client‑side sharding and XA transaction support.
GitHub repository: https://github.com/sharding-sphere/
Limitations and mitigations
Sharding rules out cross‑shard joins, and SQL that Sharding‑Sphere does not support must be rewritten by hand. Additional challenges include:
Global unique ID: Order numbers embed the database and table indices, allowing lookup without a user ID.
Historical order IDs: A mapping table stores old order numbers and their user IDs; its usage declines over time.
Admin pagination: Order data is duplicated to Elasticsearch for flexible backend queries.
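The order‑number scheme can be illustrated as follows; the exact digit layout is an assumption for demonstration, not Vivo's real format:

```python
import random
import time


def make_order_no(db_index: int, table_index: int) -> str:
    """Build an order number whose trailing digits embed the shard
    coordinates. Assumed layout (not Vivo's actual one):
    <ms timestamp><2-digit db><2-digit table><3 random digits>."""
    ts = int(time.time() * 1000)
    return f"{ts}{db_index:02d}{table_index:02d}{random.randint(0, 999):03d}"


def route_from_order_no(order_no: str) -> tuple:
    """Recover (db index, table index) from the embedded digits,
    so an order can be located without knowing the user ID."""
    return int(order_no[-7:-5]), int(order_no[-5:-3])
```

Any service holding only the order number can now route queries to the right shard, which is exactly what admin and after‑sale lookups need.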
MySQL → Elasticsearch synchronization
Two approaches were considered:
Message Queue (MQ): Order‑change messages are consumed by an ES update service.
Binlog: Tools like Canal act as a MySQL replica, parse binlog events, and push changes to ES.
The MQ solution was selected because ES is only used by the admin backend, so real‑time consistency is not critical. Manual sync functions are added for compensation in case of failures.
Database migration strategies
Two plans were evaluated:
Non‑stop migration
1. Copy data from the old DB to the new cluster and run a real‑time sync program (e.g., binlog‑based).
2. Deploy dual‑write services, initially writing only to the old DB.
3. Enable dual‑write, stop the sync program, and run a compensation program to ensure consistency.
4. Gradually shift read traffic to the new DB.
5. After full verification, switch all reads and writes to the new DB and decommission the old one.
Stop‑the‑world migration
1. Deploy the new order system and sync two‑month‑old orders to the new DB for audit.
2. Shut down the v1 application to freeze the old data.
3. Migrate the remaining orders and audit the result.
4. Launch the v2 application; if it fails, roll back to v1 using the dual‑write switch.
The stop‑the‑world plan was chosen because the higher cost of non‑stop migration outweighed the limited impact of a nightly downtime.
Distributed transaction handling
E‑commerce scenarios require cross‑service consistency (e.g., notifying shipping after payment, awarding points after order confirmation). Solutions include:
- Two‑phase commit (2PC) or three‑phase commit (3PC) for strong consistency.
- TCC, local message tables, transactional messages, and best‑effort notifications for eventual consistency.

The team implemented a local message table: each asynchronous operation is recorded as a message row inside the same local transaction as the business write, and a scheduled job retries messages that fail to deliver.
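A minimal sketch of the local message table idea, with sqlite3 standing in for the order database and a caller‑supplied `send` callable standing in for MQ publishing (schema and names are assumptions):

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE t_order (order_no TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE t_local_msg (id INTEGER PRIMARY KEY, order_no TEXT, "
        "payload TEXT, status TEXT, retries INTEGER)"
    )


def place_order(conn, order_no: str, payload: str) -> None:
    """Business write and message record commit in ONE local
    transaction, so neither can exist without the other."""
    with conn:
        conn.execute("INSERT INTO t_order VALUES (?)", (order_no,))
        conn.execute(
            "INSERT INTO t_local_msg (order_no, payload, status, retries) "
            "VALUES (?, ?, 'PENDING', 0)",
            (order_no, payload),
        )


def deliver_pending(conn, send) -> None:
    """Scheduled job: push PENDING messages downstream; a failed
    send leaves the row PENDING for the next retry pass."""
    rows = conn.execute(
        "SELECT id, payload FROM t_local_msg WHERE status = 'PENDING'"
    ).fetchall()
    for msg_id, payload in rows:
        try:
            send(payload)  # e.g. publish to MQ
            conn.execute(
                "UPDATE t_local_msg SET status = 'DONE' WHERE id = ?", (msg_id,)
            )
        except Exception:
            conn.execute(
                "UPDATE t_local_msg SET retries = retries + 1 WHERE id = ?",
                (msg_id,),
            )
        conn.commit()
```

Because the message row is committed atomically with the order, eventual consistency reduces to "keep retrying until delivery succeeds."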
System security and stability measures
Network isolation : Only a few third‑party APIs are exposed to the internet with signature verification; internal services communicate via intranet RPC.
Row‑level locks : Prevent concurrent updates to the same order.
Idempotent interfaces : All APIs are designed to be idempotent, avoiding side effects from retries.
Circuit breaker : Hystrix protects external calls from cascading failures.
Monitoring & alerting : Log‑based error alerts, service‑trace analysis, and middleware health checks enable rapid anomaly detection.
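The idempotency rule above can be illustrated with a minimal dedup guard keyed by a request ID; the in‑memory set stands in for a persistent dedup table in Redis or MySQL:

```python
def make_idempotent(apply_effect):
    """Wrap a side-effecting handler so that retries carrying the
    same request ID are acknowledged but applied only once."""
    processed = set()  # stand-in for a persistent dedup table

    def handle(request_id, *args):
        if request_id in processed:
            return "DUPLICATE"  # retry: no second side effect
        apply_effect(*args)
        processed.add(request_id)
        return "OK"

    return handle
```

With this guard, an upstream caller (or MQ redelivery) can safely retry the same request without, say, shipping an order twice.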
Pitfalls encountered
MQ ordering issue: Concurrent consumer threads could write stale order data to ES. The fix locks the order row and performs the entire sync within a single transaction.
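The article's fix takes a row lock; the same "no stale writes" guarantee can be sketched with a version check instead (a deliberately different guard, shown here only because it is easy to demonstrate in memory; the class is a toy stand‑in for the ES index):

```python
import threading


class VersionedIndex:
    """Toy stand-in for the ES index: accept a document only if its
    version (e.g. the order's update timestamp) is newer than what
    is stored, so out-of-order consumers cannot clobber fresh data."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def index_if_newer(self, doc_id, doc, version) -> bool:
        with self._lock:
            current = self._docs.get(doc_id)
            if current is None or current[0] < version:
                self._docs[doc_id] = (version, doc)
                return True
            return False  # stale write rejected
```

Elasticsearch itself supports external versioning to the same end; the row‑lock approach instead guarantees each consumer reads and ships a fresh snapshot.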
Sharding‑JDBC pagination bug: When the GROUP BY fields differ from the ORDER BY fields, Sharding‑JDBC rewrites the limit to Integer.MAX_VALUE, breaking pagination. Original SQL:

```sql
select a from temp group by a, b order by a desc limit 1, 10
```

Corrected SQL:

```sql
select a from temp group by a desc, b limit 1, 10
```

The bug existed in Sharding‑JDBC 3.1.1.
When paginating Elasticsearch results with a non‑unique sort field, add a unique secondary sort key (e.g., order creation time plus order ID) to avoid missing or duplicate records.
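The tie‑breaking rule can be seen in plain Python; the sort tuple mirrors an ES sort clause of creation time plus order ID, and the sample data is made up:

```python
orders = [
    {"order_id": 3, "create_time": "2021-01-01 10:00"},
    {"order_id": 1, "create_time": "2021-01-01 10:00"},  # tie on create_time
    {"order_id": 2, "create_time": "2021-01-01 09:59"},
]


def page(items, page_no: int, size: int):
    """Sort by (create_time, order_id) so ties have a deterministic
    order, then slice one page. Without the unique tiebreaker, tied
    rows straddling a page boundary could be skipped or duplicated."""
    ranked = sorted(
        items,
        key=lambda o: (o["create_time"], o["order_id"]),
        reverse=True,
    )
    return ranked[page_no * size:(page_no + 1) * size]
```

The same principle applies to any paginated store, not just Elasticsearch: the full sort key must be unique for page boundaries to be stable.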
Results
One‑click rollout succeeded and has been stable for over a year.
Core service performance improved by more than tenfold.
System decoupling dramatically increased iteration speed.
The architecture can support at least five years of rapid growth for the mall.
Conclusion
The team selected solutions that matched actual business needs rather than chasing the latest hype. By anticipating future demands and planning architectural evolution, the system stayed ahead of growth challenges and achieved long‑term scalability.
dbaplus Community
