
A Comprehensive Review of Sharding Implementation and Testing in a Billion-Scale Order System

This article recounts the end‑to‑end experience of planning, executing, and reviewing a database sharding project for a high‑traffic order platform: pre‑implementation risk assessment, the testing lifecycle, gray‑release strategy, performance validation, and post‑mortem lessons learned.

转转QA

Background

Recently, many technical articles have shared test plans for sharding, and in the second half of 2021 our middle‑platform order team also performed a sharding migration. Due to project overload at year‑end, a systematic summary was delayed; this article provides an immersive review divided into "pre", "during", and "post" phases.

Pre‑Implementation

Initially, the project faced resistance because of the large change scope so close to Double‑11 and the QA team's inexperience with sharding. However, repeated load‑test rounds showed the database had become a bottleneck, with tables exceeding one billion rows, and the sharding effort was approved.

Key characteristics of the sharding project:

Added an atomic service layer for database interaction with minimal logic changes.

Changes were primarily at the infrastructure level, invisible to upper‑level services.

The project was initiated by the order tech team and required QA collaboration for regression testing.

Targeted test points included ensuring consistency between old and new service interfaces and verifying core order creation and state transition scenarios.
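To make the atomic service layer's role concrete, here is a minimal sketch of the kind of shard routing such a layer typically hides from upper‑level services. The article does not describe Zhuanzhuan's actual routing scheme; the sharding key (`order_id`), the 16‑database × 64‑table layout, and the naming are all illustrative assumptions.

```python
# Hypothetical sketch: route an order to a shard by hash-mod on its ID.
# Shard counts and names below are invented for illustration; the article
# does not specify the team's real sharding key or topology.

DB_COUNT = 16        # assumed number of physical databases
TABLES_PER_DB = 64   # assumed number of tables per database

def route(order_id: int) -> tuple:
    """Map an order ID to a (database, table) pair deterministically."""
    slot = order_id % (DB_COUNT * TABLES_PER_DB)
    db = slot // TABLES_PER_DB
    table = slot % TABLES_PER_DB
    return (f"order_db_{db:02d}", f"t_order_{table:03d}")

print(route(1_000_000_123))
```

Because routing is deterministic, upper‑level services can keep calling the same interface while the atomic layer decides which physical table serves the query.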

Preparation involved listing tasks, aligning QA and development resources, and establishing a realistic schedule with buffers and plan‑B options.

Project pacing principles included assessing which tasks could be done early, realistic workload estimation, reserving buffers, and solution‑oriented goal setting.

During Implementation

The testing lifecycle was divided into several stages: greeting, code review (CR), data gathering, the diff stage, functional testing, wrap‑up, gray release, and performance testing.

Greeting: After project kickoff, the team secured business QA resources in advance to avoid last‑minute constraints.

CR: The existing middleware needed to be replaced; a code‑review‑driven approach was used to switch from the old middleware to direct reads from the new replica, minimizing risk while maintaining quality.

Data Gathering: Business modules were broken down, test cases were written, and QA collaborated on data preparation.

Diff Stage: Teams performed interface diffs using TestNG assertions, flagging any mismatches for developer review. Critical interfaces were prioritized; the rest were deferred to functional testing.
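The core of the diff stage is calling the old and new services with the same request and flagging any field‑level mismatch. The team used TestNG assertions; the Python sketch below is only an analogue of that comparison logic, with invented field names and an assumed list of volatile fields to skip.

```python
# Hypothetical sketch of the diff-stage comparison. The real project used
# TestNG assertions in Java; field names and the ignore list are invented.

def diff_responses(old: dict, new: dict, ignore=("trace_id", "timestamp")):
    """Return a map of fields whose values differ, skipping volatile ones."""
    mismatches = {}
    for key in old.keys() | new.keys():
        if key in ignore:
            continue  # volatile fields legitimately differ between calls
        if old.get(key) != new.get(key):
            mismatches[key] = (old.get(key), new.get(key))
    return mismatches

old_resp = {"order_id": 42, "status": "PAID", "trace_id": "a1"}
new_resp = {"order_id": 42, "status": "PAID", "trace_id": "b2"}
assert diff_responses(old_resp, new_resp) == {}  # consistent: nothing flagged
```

Any non‑empty result would be handed to developers for review, matching the workflow described above.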

Daily stand‑ups tracked completed work early on and remaining issues later, assigning clear owners for each task.

Functional Testing: Multiple business sides participated; the middle‑platform focused on order flow and financial correctness, while business QA verified UI and custom interactions.

Wrap‑Up: As the project progressed, emphasis shifted to resolving hard‑to‑fix issues while maintaining quality.

Gray Release: After testing, a staged gray rollout was performed using whitelisted machines and feature switches, with error logs and synchronization tables monitored so the team could quickly roll back if needed.

Four gray phases were executed: write‑old/read‑old, write‑old/read‑new, write‑new/read‑old, and write‑new/read‑new, with continuous monitoring.
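The four phases amount to a pair of read/write switches that the gray‑release machinery flips in sequence. As a rough illustration, the sketch below encodes the phase table from the article; the `Router` class and phase numbering are assumptions, not the team's actual middleware.

```python
# Hypothetical sketch of the four gray-release phases named in the article.
# The Router abstraction and phase numbers are invented for illustration.

PHASES = {
    1: {"write": "old", "read": "old"},  # baseline, no behavior change
    2: {"write": "old", "read": "new"},  # validate reads from new shards
    3: {"write": "new", "read": "old"},  # validate writes to new shards
    4: {"write": "new", "read": "new"},  # full cutover
}

class Router:
    def __init__(self, phase: int):
        self.cfg = PHASES[phase]

    def write_target(self) -> str:
        return self.cfg["write"]

    def read_target(self) -> str:
        return self.cfg["read"]

r = Router(2)
assert (r.write_target(), r.read_target()) == ("old", "new")
```

Keeping the switch server‑side means a rollback is just a phase change, with no redeploy, which is what makes the monitoring‑and‑rollback loop described above fast.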

Performance Testing: With Double‑11 approaching, limited time remained for load testing. Existing scripts from the 618 promotion were adapted, and traffic estimates based on conversion rates guided peak order‑volume calculations. Manual monitoring and scheduled tasks helped identify bottlenecks.
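The conversion‑rate estimate mentioned above is a back‑of‑envelope sizing exercise. The article gives no figures, so every number in this sketch is invented; it only shows the shape of the calculation.

```python
# Hypothetical load-test sizing from traffic and conversion rate.
# All numbers are assumed; the article does not publish real figures.

peak_pv_per_sec = 50_000   # assumed peak detail-page views per second
conversion_rate = 0.02     # assumed fraction of views that create an order
safety_factor = 1.5        # headroom over the raw estimate

target_order_qps = peak_pv_per_sec * conversion_rate * safety_factor
print(f"load-test target: {target_order_qps:.0f} orders/sec")
```

The resulting target then drives how the adapted 618 scripts are parameterized.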

Post‑Implementation

After Double‑11, the team reflected on successes and areas for improvement:

Early involvement of the middle‑platform saved downstream effort.

Dynamic adjustments during the diff stage prevented stalls.

Stand‑up focus shifted from completed to pending items as the project matured.

Early business analysis proved essential.

Regrets included not using the interface testing platform during diff, lacking automated test generation for business code paths, and allocating excessive resources to the diff phase.
