How Vivo Scaled Its Order System: Sharding, Migration, and Real‑World Lessons

This article details how Vivo transformed its monolithic e‑commerce order service into a scalable, service‑oriented system by applying data archiving, Sharding‑JDBC‑based sharding, MySQL‑to‑Elasticsearch synchronization, a planned database migration, and distributed‑transaction techniques. It also shares the pitfalls encountered and the results achieved.

vivo Internet Technology

Dec 23, 2020

Background

Rapid user growth exposed the limitations of the monolithic Vivo official mall v1.0 architecture: oversized modules, low development efficiency, performance bottlenecks, and difficult maintenance. Starting in 2017, a v2.0 upgrade split the business modules vertically so that each business line could run independently as its own service. The order module, the core of the e‑commerce transaction flow, was approaching the storage limit of a single table, which threatened support for new product releases and large‑scale promotional traffic.

System Architecture

The order module was extracted into an independent order system with its own database, providing standardized order, payment, logistics, and after‑sale services to the mall ecosystem.

Technical Challenges

Data volume and high concurrency

Data volume : Order tables in MySQL grew to tens of millions of rows. InnoDB uses a B+‑tree (O(log n) lookup), so query latency degrades as n increases. Indexing alone cannot solve the problem; the table size must be reduced. Solutions: data archiving and table splitting.

High concurrency : Order traffic surged, exceeding the capacity of a single MySQL node and risking slowdowns or crashes. Solutions: Redis cache, read/write splitting, and database sharding.

Solution Overview

Data archiving : Separate recent orders from historical ones by moving old data to a dedicated archive table and updating query code to target the appropriate table.

Cache layer : Use Redis as a front‑cache for MySQL queries. Caching works well for product data but achieves only a limited hit rate for order data, because each order is unique.

Read/write splitting : The primary handles writes and replicas serve reads. Because replicas lag slightly behind the primary (typically under 1 ms), reads can be briefly inconsistent, so UI logic may need to tolerate stale reads.

Sharding (database and table) : Apply both vertical and horizontal sharding. Horizontal sharding distributes rows across multiple databases; vertical sharding groups tables by business domain.
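
To make the data‑archiving item above concrete, here is a minimal sketch that uses Python's built‑in sqlite3 module as a stand‑in for MySQL. The table names (orders, orders_archive), the created_at column, and both function names are hypothetical; the article does not describe the actual schema or archiving job.

```python
import sqlite3

def archive_old_orders(conn, cutoff):
    """Move orders created before `cutoff` into the archive table in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM orders WHERE created_at < ?", (cutoff,))
    return cur.rowcount  # number of rows removed from the hot table

def find_order(conn, order_id):
    """Query the hot table first, then fall back to the archive table."""
    for table in ("orders", "orders_archive"):
        row = conn.execute(
            f"SELECT id, created_at FROM {table} WHERE id = ?", (order_id,)
        ).fetchone()
        if row:
            return table, row
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2019-01-05"), (2, "2020-11-30"), (3, "2020-12-01")],
)
moved = archive_old_orders(conn, "2020-01-01")
```

Copying and deleting inside one transaction keeps an order from being lost or duplicated if the job fails halfway. A production job would more likely work in small batches to limit lock time.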

Sharding‑JDBC (Sharding‑Sphere) Selection

After evaluating client‑side SDKs, middleware proxies, in‑house frameworks, and the option of building a custom solution from scratch, the team chose the open‑source Sharding‑JDBC library (now Sharding‑Sphere).

GitHub repository: https://github.com/sharding-sphere/

Features: jar‑based client‑side sharding, XA transaction support.

Sharding Strategy

The user identifier is used as the sharding key. With n databases and m tables per database, its hash value determines the target database and table:

- Database index: Hash(userId) / m % n
- Table index:    Hash(userId) % m
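
The two formulas can be sketched as follows, taking n as the number of databases and m as the number of tables per database. CRC32 is only a stand‑in for the unspecified Hash function, and the function names are hypothetical; in practice the routing would be expressed as a Sharding‑JDBC sharding rule rather than hand‑written code.

```python
import zlib

def stable_hash(user_id):
    """Deterministic non-negative hash; CRC32 stands in for the article's Hash()."""
    return zlib.crc32(str(user_id).encode("utf-8"))

def route_hash(h, n_dbs, m_tables):
    """Apply the two formulas to an already-computed hash value."""
    db_index = (h // m_tables) % n_dbs   # Hash(userId) / m % n
    table_index = h % m_tables           # Hash(userId) % m
    return db_index, table_index

def route(user_id, n_dbs, m_tables):
    """Return (database index, table index) for a user."""
    return route_hash(stable_hash(user_id), n_dbs, m_tables)

# Every one of the n * m slots is reachable: consecutive hash values
# 0 .. n*m-1 map to n*m distinct (database, table) pairs.
slots = {route_hash(h, 4, 8) for h in range(4 * 8)}
```

Dividing by m before taking the remainder modulo n makes the database index independent of the table index. If both were computed directly from the hash (Hash % n and Hash % m) with n = m = 4, the two indexes would always be equal and only 4 of the 16 tables would ever receive data.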

Limitations of Sharding

Complex queries : Sharding‑Sphere does not support every SQL statement once data is sharded, so complex cross‑shard queries have to be rewritten by hand.

Global unique ID : Auto‑increment primary keys are no longer globally unique. The solution embeds the database and table index into the order number, enabling shard location without a user ID.

Historical order numbers : A separate mapping table stores old order numbers and their corresponding user IDs.

Backend analytics : Order data is duplicated into Elasticsearch for flexible filtering and pagination.
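
The global unique ID item above can be illustrated with a short sketch. The layout used here (a 13‑digit millisecond timestamp, a 4‑digit sequence, then a 2‑digit database index and a 2‑digit table index) is purely an assumption for illustration; the article does not publish the real order‑number format, and the in‑process counter stands in for a proper distributed sequence.

```python
import itertools
import time

_seq = itertools.count()  # stand-in only; not unique across processes

def build_order_no(db_index, table_index, now=None):
    """Compose an order number whose last four digits encode the shard."""
    ts_ms = int((now if now is not None else time.time()) * 1000)
    seq = next(_seq) % 10000
    return f"{ts_ms:013d}{seq:04d}{db_index:02d}{table_index:02d}"

def shard_of(order_no):
    """Recover (database index, table index) from the order number alone."""
    return int(order_no[-4:-2]), int(order_no[-2:])

order_no = build_order_no(3, 7, now=1608681600.0)
```

Because the shard is recoverable from the order number, a lookup by order number can go straight to one table without knowing the user ID. Order numbers issued before the scheme existed carry no such suffix, which is why the mapping table for historical order numbers is still needed.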

MySQL → Elasticsearch Synchronization

Two approaches were considered:

MQ‑based : An ES update service consumes order‑change messages and updates the index.

Binlog‑based : Using Canal or similar tools, the ES service pretends to be a MySQL replica, parses binlog events, and updates the index.

The team selected the MQ approach because Elasticsearch is only used by the admin backend and real‑time consistency requirements are modest. Manual compensation sync jobs were added for failure recovery.

Database Migration Strategies

Zero‑downtime (online) migration

Copy data from the old instance to the new cluster.

Deploy a binlog‑based sync program to keep the new cluster up‑to‑date.

Deploy an order service that is capable of dual‑writing to both databases, with the switch initially set so that only the old DB is read and written.

Gradually switch reads to the new DB while verifying data consistency.

After full verification, retire the old DB and remove the dual‑write logic.
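
The dual‑write switch in the online procedure might look roughly like the sketch below. Plain dictionaries stand in for the old and new databases, and the class and flag names are hypothetical. The team ultimately chose the offline approach, so this only illustrates the mechanism.

```python
class DualWriteOrderStore:
    """Routes reads and writes between an old and a new store via two switches."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db
        self.write_new = False      # off at first: only the old DB is written
        self.read_from_new = False  # flipped gradually once data is verified

    def save(self, order_id, order):
        self.old_db[order_id] = order
        if self.write_new:
            self.new_db[order_id] = order

    def get(self, order_id):
        source = self.new_db if self.read_from_new else self.old_db
        return source.get(order_id)

    def diff(self):
        """Order IDs whose contents differ between the two stores."""
        keys = set(self.old_db) | set(self.new_db)
        return sorted(k for k in keys if self.old_db.get(k) != self.new_db.get(k))

store = DualWriteOrderStore(old_db={}, new_db={})
store.save(1, {"status": "CREATED"})   # written to the old DB only
store.write_new = True
store.save(2, {"status": "CREATED"})   # written to both
missing_in_new = store.diff()          # order 1 still has to be synced
```

A comparison routine such as diff is what makes the "verify data consistency" step checkable before reads are switched over.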

Planned‑downtime (offline) migration

Deploy the new order system, migrate orders more than two months old to the new database, and audit the migrated data.

Stop the V1 application to freeze the old DB.

Migrate the remaining orders and audit again.

Launch the V2 application; if issues arise, roll back to V1. A toggleable dual‑write switch keeps this rollback path open.

The offline approach was chosen because its cost was lower and the expected business impact during night windows was acceptable.

Distributed Transaction Handling

E‑commerce workflows require coordination across services (e.g., payment → shipping, order completion → points system). Common solutions include 2PC and 3PC for strong consistency, and TCC and local‑message or other message‑based approaches for eventual consistency.

The implemented pattern uses a local message table: the business change and a record of each pending asynchronous action are written in the same local transaction; if an action later fails, a scheduled compensation task picks up the record and retries it.
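
A minimal sketch of the local message table pattern, using sqlite3 in place of MySQL. The table layout, the GRANT_POINTS action, and the function names are all assumptions made for illustration, not the team's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE local_message (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id INTEGER,
    action TEXT,
    status TEXT DEFAULT 'PENDING',
    attempts INTEGER DEFAULT 0
);
""")

def complete_order(order_id):
    """Update the order and record the follow-up action in ONE local transaction."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO orders VALUES (?, 'COMPLETED')", (order_id,))
        conn.execute(
            "INSERT INTO local_message (order_id, action) VALUES (?, 'GRANT_POINTS')",
            (order_id,),
        )

def run_compensation(send):
    """Scheduled task: (re)send every pending message and mark successes as DONE."""
    pending = conn.execute(
        "SELECT id, order_id, action FROM local_message WHERE status = 'PENDING'"
    ).fetchall()
    for msg_id, order_id, action in pending:
        try:
            send(order_id, action)
        except Exception:
            with conn:  # leave the message PENDING and count the failed attempt
                conn.execute(
                    "UPDATE local_message SET attempts = attempts + 1 WHERE id = ?",
                    (msg_id,),
                )
            continue
        with conn:
            conn.execute("UPDATE local_message SET status = 'DONE' WHERE id = ?", (msg_id,))

calls = []
def flaky_send(order_id, action):
    """Hypothetical downstream call that fails on its first invocation."""
    calls.append((order_id, action))
    if len(calls) == 1:
        raise ConnectionError("points service unavailable")

complete_order(42)
run_compensation(flaky_send)  # first attempt fails; the message stays PENDING
run_compensation(flaky_send)  # the retry succeeds; the message becomes DONE
```

Since a message can be delivered again if the process dies after sending but before marking it DONE, the downstream action has to be idempotent.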

System Security and Stability

Network isolation : Only a few third‑party APIs are exposed externally with signature verification; internal services communicate via intranet RPC.

Row‑level locking : Order updates acquire database row locks to prevent concurrent modifications.

Idempotency : All interfaces are designed to be idempotent, mitigating retry side‑effects.

Circuit breaking : Hystrix protects downstream calls.

Monitoring & alerting : Log‑based error alerts, tracing, and middleware health checks enable rapid incident detection.
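
As one illustration of the idempotency item above, the sketch below deduplicates a payment callback by request ID. The handler, its fields, and the request‑ID scheme are hypothetical; a real service would typically enforce this with a unique key in the database rather than an in‑memory dictionary, and this version is not thread‑safe.

```python
class PaymentCallbackHandler:
    """Applies each request at most once, keyed by a caller-supplied request ID."""

    def __init__(self):
        self.processed = {}    # request_id -> result of the first execution
        self.paid_amounts = {} # order_id -> total amount recorded as paid

    def handle(self, request_id, order_id, amount):
        if request_id in self.processed:
            # Duplicate delivery or retry: return the stored result, change nothing.
            return self.processed[request_id]
        self.paid_amounts[order_id] = self.paid_amounts.get(order_id, 0) + amount
        result = {"order_id": order_id, "status": "PAID"}
        self.processed[request_id] = result
        return result

handler = PaymentCallbackHandler()
first = handler.handle("req-1", 1001, 100)
second = handler.handle("req-1", 1001, 100)  # retry of the same request
```

The retry returns the same result as the first call and the amount is recorded only once, which is the property that makes retries and compensation tasks safe.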

Pitfalls Encountered

MQ‑based ES sync ordering issue

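When two threads processed changes to the same order concurrently, a thread holding an older snapshot could write to ES after a thread holding a newer one, overwriting the newer data. The fix was to lock the row for the whole read‑then‑update‑ES sequence, so that the two threads execute serially.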
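
The serialization fix can be sketched with an in‑process lock per order. This is a single‑process analogue only: dictionaries stand in for MySQL and the Elasticsearch index, and all names are hypothetical. With several consumer instances, the same effect needs a database row lock (for example SELECT ... FOR UPDATE in MySQL), as the article describes.

```python
import threading
from collections import defaultdict

db = {}        # stand-in for MySQL: order_id -> current row
es_index = {}  # stand-in for the Elasticsearch index

_locks = defaultdict(threading.Lock)  # one lock per order ID
_locks_guard = threading.Lock()       # protects the lock table itself

def _lock_for(order_id):
    with _locks_guard:
        return _locks[order_id]

def sync_order(order_id):
    """Read the latest row and write it to the index as one serialized step."""
    with _lock_for(order_id):
        row = dict(db[order_id])    # the read happens inside the lock ...
        es_index[order_id] = row    # ... so no stale snapshot can be written later

db[7] = {"status": "PAID", "version": 2}
threads = [threading.Thread(target=sync_order, args=(7,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the read and the index write happen under the same lock, whichever thread writes last has also read last, so the index ends up with the newest row it has seen.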

Sharding‑JDBC pagination with GROUP BY

Pagination failed when the ordering in a GROUP BY query was written incorrectly. The pattern that worked, on version 3.1.1 of Sharding‑JDBC, puts the sort direction inside the GROUP BY clause: SELECT a FROM temp GROUP BY a DESC, b LIMIT 1,10.

Elasticsearch pagination with duplicate sort values

When the primary sort field has duplicates, add a unique secondary sort key (e.g., creation timestamp plus order ID) to avoid missing or duplicate records.
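
A search request body with such a tie‑breaker might be built as below. The field names create_time and order_id are assumptions; the body follows the standard Elasticsearch search DSL (from, size, query, sort) and would be passed to whichever client the service uses.

```python
def search_body(page, size):
    """Build a paginated search body with a unique secondary sort key."""
    return {
        "from": page * size,
        "size": size,
        "query": {"match_all": {}},
        "sort": [
            {"create_time": {"order": "desc"}},  # primary key, may contain duplicates
            {"order_id": {"order": "desc"}},     # unique tie-breaker
        ],
    }

body = search_body(page=2, size=10)
```

Because order_id is unique, every document has a distinct position in the sort order, so documents with equal create_time values cannot swap places between page requests.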

Results

Launched successfully on the first attempt and has run stably for over a year.

Core service performance improved more than tenfold.

System decoupling dramatically increased iteration speed.

The architecture is expected to support at least five more years of rapid growth.
