How Graded Release Boosted Mobile Tieba Quality: A Step‑by‑Step Blueprint
This article details the design, architecture, implementation steps, and early results of a graded release system for Baidu's mobile Tieba product, showing how staged rollouts, traffic splitting, and automated validation improve release quality and user experience.
Background
For a product with billions of page views like mobile Tieba, any bug that reaches production can affect thousands of users and damage the product’s reputation, while engineers lose time juggling high‑priority fixes. The goal was to find and contain defects earlier using the development and testing capabilities the team already had.
Requirements and Goals
2.1 Requirement Overview
A typical release pushes the entire codebase to production at once and relies on QA to validate online behavior afterward. Because engineers cannot target only the newly deployed modules, online bugs are hard to catch. The team needed a way to verify new code before full rollout and to expand deployment only after confirming stability.
2.2 Detailed Requirements
After studying Facebook’s canary release and other industry solutions, the Tieba team defined two essential processes:
Enable multi‑level, staged deployment of new features.
Perform timely, effective validation of the newly released service in the live environment.
The graded release model was built around these needs.
Graded Release Scheme
Graded release stretches a single deployment into several stages, each limiting traffic and user scope, allowing independent verification at each level.
3.1 Overall Architecture
Three main components cooperate:
Release Platform – handles staged code deployment.
TIP (Test In Product) Platform – triggers online test tools after each stage.
Traffic Splitting System – routes test and production traffic to the correct environment.
The workflow for each stage:
Release Platform deploys new code to a specific machine group.
TIP Platform is notified of completion.
TIP drives automated and manual test suites to validate the service.
Traffic Splitting System directs test traffic to the newly deployed nodes.
Engineers review TIP reports and decide whether to proceed.
Release Platform either promotes to the next stage or rolls back.
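The per-stage workflow above can be sketched as a simple loop. This is a minimal illustration, not the Release Platform's actual code: the Stage type, the run_graded_release function, and the deploy, validate, and rollback callbacks are hypothetical names, and rolling back every stage deployed so far (newest first) is an assumption about what "rolls back" means.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str            # e.g. "internal", "small-traffic", "full-traffic"
    machine_group: str   # hypothetical label for the target machines


def run_graded_release(
    stages: List[Stage],
    deploy: Callable[[Stage], None],
    validate: Callable[[Stage], bool],
    rollback: Callable[[Stage], None],
) -> Tuple[List[str], bool]:
    """Deploy stage by stage; stop and roll back on the first failed validation.

    Returns the names of the stages that passed validation and a success flag.
    """
    deployed: List[Stage] = []
    for stage in stages:
        deploy(stage)              # Release Platform pushes code to this group
        deployed.append(stage)
        if not validate(stage):    # stands in for TIP tests plus engineer review
            # Assumption: undo everything deployed so far, newest first.
            for done in reversed(deployed):
                rollback(done)
            return [s.name for s in deployed[:-1]], False
    return [s.name for s in deployed], True
```

In practice the validate step would block on TIP reports and a human decision; here it is reduced to a boolean so the control flow stays visible.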
3.2 Traffic Splitting System
The core of traffic control is an Nginx‑based extension that decides routing based on request attributes:
Internal IP + pub_env=1 (custom cookie) → force internal environment.
Internal IP + pub_env=2 → force small‑traffic environment.
Regular external traffic → consistent hash routing to small‑traffic or full‑traffic pools.
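The precedence of these three rules can be illustrated in a few lines of Python. This is a sketch of the decision logic only, not the Nginx extension itself. Several details are assumptions: treating any private address as an "internal IP", falling back to hash routing when an internal request carries no recognized pub_env value, the 5% default share, and using a stable hash bucket as a stand-in for the consistent hash.

```python
import hashlib
import ipaddress
from typing import Dict


def choose_pool(client_ip: str, cookies: Dict[str, str],
                small_traffic_percent: int = 5) -> str:
    """Return the environment a request should be routed to."""
    # Assumption: "internal IP" is approximated by private address ranges.
    internal = ipaddress.ip_address(client_ip).is_private
    pub_env = cookies.get("pub_env")

    # Rules 1 and 2: the cookie override is honored only for internal IPs.
    if internal and pub_env == "1":
        return "internal"
    if internal and pub_env == "2":
        return "small-traffic"

    # Rule 3: everything else is split by a stable hash, so the same user
    # lands in the same pool on every request.
    key = cookies.get("baiduid") or client_ip
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 100
    return "small-traffic" if bucket < small_traffic_percent else "full-traffic"
```

Note that an external request carrying pub_env=1 is ignored by the first two rules, which keeps outside users from forcing themselves into a test environment.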
3.3 Platform Interaction
The Release Platform and TIP Platform exchange status, trigger deployments, and collect validation results, as illustrated below.
Implementation
4.1 Key Tasks
Infrastructure: Build the Nginx‑based traffic splitter.
Platform Support: Develop the Release and TIP platforms for staged deployment and feedback.
Application Layer: Establish online testing procedures and release workflows.
4.2 Implementation Benefits
After a three‑week trial (May 7 – May 25), the graded release showed immediate quality improvements:
Two small‑traffic rollbacks (one performance, one functional).
One internal‑environment rollback (functional issue).
During the internal stage, three functional bugs and one build‑script issue were caught.
During the small‑traffic stage, one performance problem was discovered.
One functional defect had been missed by offline testing and was only found in the live environment.
Key Technologies
1. Node Partition & Cross‑Data‑Center Access
With a dual‑data‑center deployment, UI clusters are split into internal, small‑traffic, full‑traffic‑JX, and full‑traffic‑TC nodes. To avoid cross‑data‑center access—a major reliability risk—each data center must retain at least one small‑traffic node.
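The constraint that every data center keeps a small‑traffic node is easy to check mechanically before a release. The following is a minimal sketch under assumed data: the (host, data center, role) tuple layout and the function name are hypothetical, while JX and TC are the data center names used above.

```python
from typing import List, Tuple

# Each node is described as (host, data_center, role); this layout is hypothetical.
Node = Tuple[str, str, str]


def datacenters_missing_small_traffic(nodes: List[Node]) -> List[str]:
    """Return the data centers that have no small-traffic node."""
    all_dcs = {dc for _, dc, _ in nodes}
    covered = {dc for _, dc, role in nodes if role == "small-traffic"}
    return sorted(all_dcs - covered)
```

An empty result means small‑traffic requests can always be served inside the data center they arrive in.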
2. Consistent Splitting & Load Balancing
During rollout, user experience must remain consistent: a user should not see both old and new versions intermittently, and any defect should affect only a limited, stable user segment. Therefore, random Nginx splitting is unsuitable; instead, a cookie‑based hash (e.g., baiduid) provides both consistency and better load distribution.
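To make the consistency property concrete, here is a small consistent hash ring keyed on the baiduid cookie value. It is an illustrative sketch, not the Nginx module's algorithm: the HashRing class, the use of MD5, and the choice of 100 virtual points per node are all assumptions.

```python
import bisect
import hashlib
from typing import List


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        # Each node gets several virtual points so load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [position for position, _ in self._ring]

    def node_for(self, baiduid: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        index = bisect.bisect(self._keys, _hash(baiduid)) % len(self._ring)
        return self._ring[index][1]
```

The same baiduid always maps to the same node, and removing a node only moves the users that node was serving. Random splitting offers neither guarantee, which is why it would let one user flip between old and new versions.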
3. Config Storage & One‑Click Deployment
The deployment platform relies on one‑click deployment, which requires configuration templates stored in a repository. During deployment, template variables are replaced with actual runtime values. This approach works smoothly for pipelines already practicing continuous integration.
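Template substitution of this kind can be sketched with Python's standard string.Template. The template text, the variable names (pool, host, port), and the render helper are invented for illustration and do not reflect the platform's real configuration format.

```python
from string import Template
from typing import Mapping

# A hypothetical configuration template as it might be stored in a repository.
UPSTREAM_TEMPLATE = Template(
    "upstream ${pool} {\n"
    "    server ${host}:${port};\n"
    "}\n"
)


def render(template: Template, values: Mapping[str, object]) -> str:
    """Fill in every placeholder; raises KeyError if a value is missing."""
    return template.substitute(values)
```

Using substitute rather than safe_substitute is a deliberate choice here: a missing variable fails the deployment early instead of shipping a half-rendered config.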
4. Platform Support
After the underlying infrastructure is ready, the Release and TIP platforms must operate reliably. The Release platform handles staged deployment and rollback, while the TIP platform manages the entire release lifecycle and provides Test‑In‑Product support.
5. Online Automation
Automated test cases are migrated to run in the live environment, using cookies to ensure traffic hits the correct version. Special considerations include:
Setting and clearing cookies around test execution.
Avoiding account lockout due to frequent logins.
Bypassing anti‑spam, captcha, and other verification mechanisms.
Managing the impact of test traffic on case stability.
Improving execution efficiency via distributed test execution.
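The first consideration, setting and clearing cookies around test execution, maps naturally onto a context manager. The sketch below assumes the test client exposes its cookies as a plain dictionary and reuses the pub_env cookie described earlier; the pinned_environment helper is a hypothetical name.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def pinned_environment(cookie_jar: Dict[str, str], pub_env: str) -> Iterator[Dict[str, str]]:
    """Set the pub_env cookie for the duration of a test, then restore it."""
    previous = cookie_jar.get("pub_env")
    cookie_jar["pub_env"] = pub_env
    try:
        yield cookie_jar
    finally:
        # Restore the prior state even if the test case raised.
        if previous is None:
            cookie_jar.pop("pub_env", None)
        else:
            cookie_jar["pub_env"] = previous
```

A test case would wrap its requests in `with pinned_environment(jar, "2"):` so that a failing case cannot leave the cookie behind and send later cases to the wrong environment.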
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
