Mastering Smooth and Gray Releases for Large‑Scale Internet Finance Platforms
This article details a step‑by‑step transformation of an internet finance platform's online release process, covering application architecture, public component selection, smooth deployment techniques using Dubbo weight adjustment, RocketMQ control, LTS task isolation, verification methods, and a comprehensive gray‑release strategy with practical pitfalls and future improvements.
Application Logic Architecture
The platform consists of a client layer (mobile apps, web pages, H5), a WEB layer that forwards traffic via Nginx, a Business Front End (BFE) acting as an API gateway deployed as a Tomcat WAR, an APP layer of dozens of Tomcat‑based services, and a Data layer (databases, caches, distributed file systems). Shared infrastructure includes a configuration center, task scheduler, service registry, and message queue.
Shared Components Overview
Configuration Center: Disconf (Baidu open‑source) provides runtime configuration with hot‑update support.
Task Scheduler: Light Task Scheduler (LTS) – a lightweight Crontab‑like system for unified job management.
Service Registry: Dubbo (Alibaba's open‑source RPC framework); Spring Cloud is an alternative for Spring‑heavy projects.
Message Queue: RocketMQ (Alibaba open‑source), considered reliable enough for financial‑grade messaging.
Release Practice 1.0 – Problems
The original release process for BFE and APP services (Java WAR on Tomcat) exhibited five critical issues:
Tomcat restart during APP release caused request loss and data anomalies.
LTS tasks running on a node being released failed and retried.
Active RocketMQ consumers on the node produced consumption errors, retries, or dead‑letter queues.
No immediate post‑release validation, exposing users to potential defects.
Old and new APP versions could not run side by side for extended verification.
These were grouped into smooth‑release issues (first three) and release‑validation issues (last two).
Release Practice 1.1 – Smooth Release
To achieve smooth releases, the team leveraged the shared components to gracefully take services offline before deploying new code.
Dubbo Weight Adjustment
Using Dubbo‑Admin, the weight of the target APP's provider was set to 0, so consumers stopped routing new calls to that node while other providers and services were unaffected.
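The effect of a zero weight can be illustrated with a minimal sketch of weighted random selection. This is a simplified model rather than Dubbo's actual load balancer: the provider addresses and the `pick_provider` helper are hypothetical, and returning `None` when no provider is eligible is a choice made for this example only.

```python
import random
from typing import Dict, Optional


def pick_provider(weights: Dict[str, int], rng: random.Random) -> Optional[str]:
    """Pick a provider with probability proportional to its weight.

    Providers with weight 0 are never selected. Returns None when no
    provider has a positive weight (a simplification for this sketch).
    """
    eligible = {addr: w for addr, w in weights.items() if w > 0}
    if not eligible:
        return None
    addresses = list(eligible)
    return rng.choices(addresses, weights=[eligible[a] for a in addresses], k=1)[0]


# Hypothetical provider addresses, all starting with equal weight.
weights = {"10.0.0.1:20880": 100, "10.0.0.2:20880": 100, "10.0.0.3:20880": 100}

# Drain the node that is about to be released.
weights["10.0.0.3:20880"] = 0

rng = random.Random(42)
picks = [pick_provider(weights, rng) for _ in range(1000)]
drained_hits = picks.count("10.0.0.3:20880")
print("calls routed to the drained node:", drained_hits)
```

Calls already in flight on the drained node are not covered by this model; they still need time to finish before Tomcat is restarted.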
RocketMQ Consumer Offline
Before restarting an APP node that consumes messages, the corresponding consumer group and queue bindings were removed via a custom interface added to the RocketMQ web console.
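What removing a consumer from its queue bindings achieves can be sketched as a reallocation of queues among the remaining consumers. The round‑robin allocator below is an illustrative stand‑in, not RocketMQ's allocation strategy and not the team's custom console interface; the consumer names and queue count are assumptions.

```python
from typing import Dict, List


def allocate_queues(consumers: List[str], queue_ids: List[int]) -> Dict[str, List[int]]:
    """Spread queues across consumers round-robin (simplified stand-in)."""
    ordered = sorted(consumers)
    assignment: Dict[str, List[int]] = {c: [] for c in ordered}
    if not ordered:
        return assignment
    for index, queue_id in enumerate(sorted(queue_ids)):
        assignment[ordered[index % len(ordered)]].append(queue_id)
    return assignment


queues = list(range(8))

# Hypothetical consumer instances in one consumer group.
before = allocate_queues(["app-1", "app-2", "app-3", "app-4"], queues)

# app-4 is taken out of the group before its restart.
after = allocate_queues(["app-1", "app-2", "app-3"], queues)

print("before:", before)
print("after: ", after)
```

After the reallocation every queue still has an owner, so messages keep flowing while the removed node holds no bindings and can be restarted without consumption errors.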
LTS Task Isolation
In ZooKeeper, a ZNode (e.g., machineID=offline) was created for each node that should stop receiving new tasks. The JobTracker checks for this marker and skips those nodes when assigning tasks.
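The skip logic can be sketched as a filter applied before assignment. In this minimal example an in‑memory set stands in for the offline ZNodes, and `assign_task` with its round‑robin choice is a hypothetical helper, not the LTS JobTracker's real scheduling code.

```python
from typing import Iterable, List, Set


def assignable_nodes(task_nodes: Iterable[str], offline_marks: Set[str]) -> List[str]:
    """Return nodes that carry no offline marker, in a stable order."""
    return sorted(n for n in task_nodes if n not in offline_marks)


def assign_task(run_number: int, task_nodes: Iterable[str], offline_marks: Set[str]) -> str:
    """Choose a node for one task run, skipping nodes marked offline."""
    candidates = assignable_nodes(task_nodes, offline_marks)
    if not candidates:
        raise RuntimeError("no online node available for task assignment")
    return candidates[run_number % len(candidates)]


# Hypothetical node IDs; offline_marks simulates the ZNodes created in ZooKeeper.
nodes = ["node-a", "node-b", "node-c"]
offline_marks = {"node-b"}

assignments = [assign_task(i, nodes, offline_marks) for i in range(4)]
print(assignments)
```

Because the filter only affects new assignments, a task already running on the marked node still has to finish before the node is restarted.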
Verification Mechanism
Two checks were introduced:
API calls to Dubbo, RocketMQ, and LTS to confirm that the target node is truly offline.
Monitoring checks via CAT and ELK APIs to ensure request counts and log traffic have dropped to zero.
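The monitoring check amounts to polling a metric until it reaches zero or a deadline passes. The sketch below assumes a caller‑supplied `read_request_count` function; how that function would query CAT or ELK is not described in the source, so the demo uses canned sample values.

```python
import time
from typing import Callable


def wait_until_drained(
    read_request_count: Callable[[], int],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the node's request count is zero or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if read_request_count() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Canned samples stand in for successive monitoring readings of one node.
samples = iter([12, 5, 0])
drained = wait_until_drained(
    lambda: next(samples), timeout_s=5.0, interval_s=0.0, sleep=lambda _: None
)
print("safe to restart:", drained)
```

Injecting `sleep` and `clock` keeps the check testable without real waiting; a release script would proceed to the restart only when the function returns `True`.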
With these changes, a node no longer receives RPC calls, messages, or scheduled tasks by the time it is restarted, which addresses the three smooth‑release issues listed above.
Release Practice 1.2 – Gray Release and Validation
A dedicated Wi‑Fi network (HDFB) was set up in the office. Devices on this network resolve business domains to a separate gray‑release WEB layer that mirrors production but is accessible only internally, enabling isolated validation.
Gray‑Release Architecture
GROUP tags (e.g., BLUE, GREEN) are introduced at the framework level. Each APP instance registers its GROUP in the shared components:
Disconf versioning maps to GROUP.
Dubbo service names are prefixed with GROUP.
RocketMQ topics carry a GROUP suffix.
LTS task IDs embed GROUP information to control execution nodes.
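The naming rules above can be sketched as a small helper that derives group‑scoped names for each component. The separators (`.`, `_`, `:`) and the example service, topic, and task names are assumptions made for illustration; the source does not specify the exact formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupNaming:
    """Derive group-scoped resource names for one GROUP tag."""

    group: str

    def dubbo_service(self, interface: str) -> str:
        # GROUP as a prefix of the service name (separator is hypothetical).
        return f"{self.group}.{interface}"

    def rocketmq_topic(self, topic: str) -> str:
        # GROUP as a suffix of the topic (separator is hypothetical).
        return f"{topic}_{self.group}"

    def lts_task_id(self, task: str) -> str:
        # GROUP embedded in the task ID (format is hypothetical).
        return f"{self.group}:{task}"


blue = GroupNaming("BLUE")
green = GroupNaming("GREEN")

print(blue.dubbo_service("com.example.OrderService"))
print(green.rocketmq_topic("order_paid"))
print(green.lts_task_id("daily-settlement"))
```

Because every name carries its GROUP, a BLUE instance and a GREEN instance of the same APP never share a provider list, topic, or task, which is what keeps the two versions isolated.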
During a release, a subset of APP machines is taken offline, the new code is deployed, and their GROUP is switched from BLUE to GREEN. Verification is performed via the isolated HDFB WEB entry while production traffic continues on BLUE. Once verified, the GREEN group is gradually expanded until it becomes the primary group, after which BLUE machines are updated.
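The validation phase of this flow can be sketched as a group switch plus entry‑based routing. The host names, the `ENTRY_TO_GROUP` mapping, and both helper functions are hypothetical; the sketch only models which machines each entry can reach while production still runs on BLUE.

```python
from typing import Dict, List

# During validation: the public entry serves BLUE, the internal HDFB entry serves GREEN.
ENTRY_TO_GROUP = {"production": "BLUE", "hdfb": "GREEN"}


def switch_group(machines: Dict[str, str], hosts: List[str], new_group: str) -> Dict[str, str]:
    """Return a copy of the host-to-group map with the given hosts moved."""
    updated = dict(machines)
    for host in hosts:
        if host not in updated:
            raise KeyError(host)
        updated[host] = new_group
    return updated


def backends_for(entry: str, machines: Dict[str, str]) -> List[str]:
    """List the hosts reachable from one WEB entry."""
    group = ENTRY_TO_GROUP[entry]
    return sorted(h for h, g in machines.items() if g == group)


machines = {"app-1": "BLUE", "app-2": "BLUE", "app-3": "BLUE", "app-4": "BLUE"}

# app-4 has been drained, upgraded, and re-registered under GREEN.
during_validation = switch_group(machines, ["app-4"], "GREEN")

print("production ->", backends_for("production", during_validation))
print("hdfb       ->", backends_for("hdfb", during_validation))
```

Expanding GREEN repeats the switch for further hosts; making GREEN the primary group corresponds to pointing the production entry at GREEN, which this sketch leaves out.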
Prerequisites and Constraints
Data‑layer changes that break compatibility prevent gray release.
Incompatible service interfaces between old and new APP versions also block gray release.
High traffic volumes require careful pacing; the team runs four data‑center zones with 4× redundancy and shifts 25% of traffic per release step.
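The 25% pacing can be sketched as splitting the host list into equal batches, one per release step. This example assumes each host carries an equal share of traffic, which the source does not state; the host names and the `rollout_batches` helper are hypothetical.

```python
import math
from typing import List


def rollout_batches(hosts: List[str], step_fraction: float = 0.25) -> List[List[str]]:
    """Split hosts into release batches of roughly step_fraction each."""
    if not 0 < step_fraction <= 1:
        raise ValueError("step_fraction must be in (0, 1]")
    if not hosts:
        return []
    batch_size = max(1, math.ceil(len(hosts) * step_fraction))
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Eight hypothetical hosts, two per zone.
hosts = [f"zone{z}-app{n}" for z in range(1, 5) for n in range(1, 3)]
batches = rollout_batches(hosts)

for step, batch in enumerate(batches, start=1):
    print(f"step {step}: {batch}")
```

With the hosts ordered by zone, each step here happens to cover one zone, so a failed step can be contained by keeping the remaining zones on the old version.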
Future Optimizations
Further automation to reduce manual intervention and shorten verification cycles.
Define criteria for when a hot‑fix can skip gray release.
Improve monitoring for environments running multiple versions simultaneously.
Explore using the gray‑release infrastructure for traffic replay and full‑chain stress testing.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
