
Designing a Comprehensive Stability Assurance System for Large‑Scale Internet Services at Manbang

This article explains how Manbang built a rigorous stability‑assurance framework—including strict fault grading, a "watch‑and‑protect" system, blue‑green deployments, online pressure testing, fault‑drill platforms, and runtime metadata—to ensure rapid iteration while maintaining high availability for millions of logistics users.

Manbang Technology Team

Maintaining stability for high‑speed, large‑scale internet systems is a constant challenge, and Manbang faces this daily as millions of truck drivers and shippers rely on its apps for logistics, payments, insurance, and more.

To address this, Manbang defined strict fault‑grading criteria: an outage in any of its more than 50 core user scenarios is recorded as a fault, and an outage in which 10% of users are affected for five minutes is classified as a P1 incident.
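As a rough illustration, the grading rule above can be expressed as a small classifier. The P1 threshold (10% of users, five minutes) comes from the article; the lower grades are placeholder assumptions, not Manbang's actual boundaries:

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """A single observed outage against one user scenario."""
    affected_user_ratio: float   # fraction of users impacted, 0.0-1.0
    duration_minutes: float      # how long the scenario was unavailable
    is_core_scenario: bool       # whether the scenario is on the core list


def grade_fault(outage: Outage) -> str:
    """Map an outage to a fault grade.

    Only the P1 rule (>=10% of users for >=5 minutes) is stated in the
    article; P2/P3 boundaries below are illustrative assumptions.
    """
    if not outage.is_core_scenario:
        return "NO_FAULT"
    if outage.affected_user_ratio >= 0.10 and outage.duration_minutes >= 5:
        return "P1"
    if outage.duration_minutes >= 5:
        return "P2"   # assumed boundary
    return "P3"       # assumed boundary
```

The value of encoding the criteria this way is that grading becomes mechanical during an incident, rather than a judgment call made under pressure.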

The company introduced a "watch‑and‑protect" (看护) framework. "Watch" covers alerting, monitoring, and diagnostics to quickly detect issues, while "Protect" includes release management and pressure‑testing capabilities to prevent problems.

Key capabilities include a global change‑event stream that correlates alarms with recent releases, configuration changes, and database or network modifications, enabling root‑cause identification for about 80% of incidents.
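A minimal sketch of that correlation step, assuming an in‑memory event list and a fixed look‑back window (the window size, field names, and matching rule are assumptions, not Manbang's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    kind: str       # e.g. "release", "config", "db", "network"
    service: str    # service the change was applied to
    at: datetime    # when the change happened


def correlate(alarm_service: str, alarm_at: datetime,
              events: list[ChangeEvent],
              window: timedelta = timedelta(minutes=30)) -> list[ChangeEvent]:
    """Return changes to the alarming service within the look-back
    window, newest first -- the most recent change is the prime suspect."""
    candidates = [e for e in events
                  if e.service == alarm_service
                  and timedelta(0) <= alarm_at - e.at <= window]
    return sorted(candidates, key=lambda e: e.at, reverse=True)
```

A production system would also rank candidates by change type and blast radius, but even this naive "what changed recently on this service?" query is what closes most investigations quickly.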

Manbang also implemented a full‑link blue‑green deployment mechanism that automates gray‑scale roll‑outs and enables second‑level rollbacks across RPC, MQ, JOB, and Config layers, reducing fault impact and recovery time.
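The core idea behind second‑level rollback is that traffic switching is a pointer flip, not a redeploy. A toy router (all names illustrative; the real mechanism spans RPC, MQ, JOB, and Config layers) might look like:

```python
import random


class BlueGreenRouter:
    """Route requests between the blue (current) and green (new) pools.

    Rollback is a single assignment: no instances restart, traffic
    simply stops flowing to the green build.
    """

    def __init__(self):
        self.green_ratio = 0.0   # fraction of traffic sent to green

    def advance(self, ratio: float):
        """Gray-scale roll-out: shift more traffic to the new build."""
        self.green_ratio = min(1.0, ratio)

    def rollback(self):
        """Instant rollback: all traffic returns to blue."""
        self.green_ratio = 0.0

    def pick(self, rng=random.random) -> str:
        return "green" if rng() < self.green_ratio else "blue"
```

Because rollback touches only routing state, recovery time is bounded by configuration propagation rather than by image pulls and restarts, which is what makes "second‑level" recovery feasible.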

Online pressure‑testing is achieved by tagging test users and isolating them within the production environment, allowing realistic load tests without disturbing real users.
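One common way to implement that isolation is to carry a test tag in the request context and route tagged writes to shadow storage. The flag name and `shadow_` prefix below are assumptions for illustration, not Manbang's actual convention:

```python
import contextvars

# The tag travels with the request context, so downstream calls made
# while serving a tagged request see the same flag.
PT_FLAG = contextvars.ContextVar("pt_flag", default=False)


def mark_pressure_test():
    """Tag the current request context as synthetic load-test traffic."""
    PT_FLAG.set(True)


def table_for(base_table: str) -> str:
    """Writes from tagged traffic land in shadow tables, so synthetic
    orders and payments never pollute real business data."""
    return ("shadow_" + base_table) if PT_FLAG.get() else base_table
```

In practice the tag must be propagated across every hop (RPC headers, MQ message properties, thread pools), which is why full‑link tagging is the hard part of testing in production.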

A fault‑drill platform, built on Alibaba's open‑source ChaosBlade and renamed Venom internally, injects failures into weak dependencies, providing automated chaos experiments and visual management tools.
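What such a drill does at the application level can be approximated by a latency‑injection decorator. This is a conceptual stand‑in only, not the ChaosBlade/Venom API; the real tool injects faults at the agent level without code changes:

```python
import functools
import random
import time


def inject_latency(prob: float, delay_s: float):
    """Decorator that randomly delays calls to a dependency, mimicking a
    slow downstream during a chaos drill.

    prob    -- probability each call is delayed (1.0 = always)
    delay_s -- added latency in seconds
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if random.random() < prob:
                time.sleep(delay_s)
            return fn(*args, **kwargs)
        return inner
    return wrap
```

Drills like this expose whether timeouts, circuit breakers, and fallbacks around a weak dependency actually fire before users notice.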

The runtime metadata platform collects real‑time information about JVMs, containers, and physical hosts, presenting a map that helps engineers pinpoint failures during incidents or drills.
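A toy version of that map, indexing running instances by physical host so a host failure immediately yields its blast radius (the fields are assumed, not Manbang's actual metadata schema):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    service: str        # logical service name
    container_id: str   # container running the JVM
    host: str           # physical host the container is scheduled on


def build_host_map(instances):
    """Index running instances by physical host, so that when a host
    fails, the affected services can be listed immediately."""
    by_host = defaultdict(list)
    for inst in instances:
        by_host[inst.host].append(inst)
    return by_host
```

During an incident or drill, this JVM-to-container-to-host view is what turns "host h1 is down" into "services order and pay on h1 are degraded" without manual archaeology.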

Additional practices include mandatory rollback plans, code reviews, quality metrics, entry‑case coverage, dedicated stability owners, peak‑hour on‑call rotations, routine pressure‑testing, regular fault‑drill exercises, and standardized operational procedures to avoid human error.

Looking ahead, Manbang plans to further automate fault‑source tracing and self‑healing, with continued investment in monitoring, release systems, and intelligent infrastructure.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
