How Huolala Built a Scalable Tech Stability System – Key Lessons for Reliability

This article details Huolala's journey in establishing a comprehensive technical stability framework, covering organizational challenges, risk governance, incident response, cultural initiatives, and future automation to enhance system reliability at scale.

Huolala Tech

Apr 7, 2023

Introduction

Huolala is an internet logistics company serving individuals, merchants, and enterprises. By December 2022 it operated in 360 cities in mainland China, with over 680,000 active drivers and 9.5 million active users. At this scale the company carries real social responsibility, and system stability is essential to meeting it.

Background and Challenges

Technical stability is defined as MTBF/(MTBF+MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The closer the value is to 1, the more stable the system. Stability work therefore aims for fewer failures (higher MTBF) and faster recovery (lower MTTR), while balancing cost and efficiency.
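As a worked illustration of the ratio above, the following minimal Python sketch computes it for a few scenarios. The function name and all numbers are made up for illustration and are not Huolala data.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return MTBF / (MTBF + MTTR), the stability ratio defined above."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("durations must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Illustrative numbers only: one failure every 30 days, one hour to recover.
baseline = availability(mtbf_hours=720.0, mttr_hours=1.0)

# Two ways to improve on the baseline.
faster_recovery = availability(mtbf_hours=720.0, mttr_hours=0.5)
fewer_failures = availability(mtbf_hours=1440.0, mttr_hours=1.0)

print(f"baseline:        {baseline:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
print(f"fewer failures:  {fewer_failures:.6f}")
```

In this example, halving MTTR improves the ratio exactly as much as doubling MTBF, which is one reason recovery speed receives so much attention in stability work.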

The main challenges are:

Organization : Stability spans the entire system lifecycle and involves development, testing, operations, product, and support teams. Coordinating a large technical team of over a thousand engineers requires effective cross‑team communication.

VUCA : The system exhibits volatility, uncertainty, complexity, and ambiguity, with 3,000+ services, intricate dependencies, and frequent refactoring, leading to high cognitive load and hidden risks.

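Barrel Effect : Just as a barrel holds only as much as its shortest stave allows, overall stability is limited by the weakest link. Failures arise from diverse sources (architectural flaws, external dependencies, human error, or malicious attacks), so the approach must be holistic and cover people, processes, and technology.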

Process Review

Team Building

Initially a virtual stability group was formed, but it lacked dedicated full‑time stability engineers. In mid‑2021 a full‑time stability team was created to design, build, and evolve the company‑wide stability system, which clarified responsibilities and improved collaboration.

Cognition Enhancement

Understanding business processes and system architecture is a prerequisite for stability work. Huolala mapped core business functions (e.g., order placement, driver dispatch) to service entry points and generated call‑chain maps, establishing a knowledge base for monitoring and governance.
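The mapping from business functions to entry points and call chains can be sketched as a small graph traversal. This is a minimal illustration, not Huolala's tooling: the service names, the `ENTRY_POINTS` and `CALLS` tables, and both function names are hypothetical. In practice such data would more likely come from tracing or service-registry data than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical business functions and the service each one enters through.
ENTRY_POINTS = {
    "order_placement": "order-api",
    "driver_dispatch": "dispatch-api",
}

# Hypothetical direct dependencies: caller -> services it calls.
CALLS = {
    "order-api": ["pricing-svc", "user-svc"],
    "dispatch-api": ["driver-svc", "map-svc"],
    "pricing-svc": ["map-svc"],
    "user-svc": [],
    "driver-svc": [],
    "map-svc": [],
}


def call_chain(business_function: str) -> list[str]:
    """Return every service reachable from a business function, in BFS order."""
    start = ENTRY_POINTS[business_function]
    seen = [start]
    queue = deque([start])
    while queue:
        service = queue.popleft()
        for callee in CALLS.get(service, []):
            if callee not in seen:
                seen.append(callee)
                queue.append(callee)
    return seen


def business_impact(service: str) -> list[str]:
    """Return the business functions whose call chain includes the service."""
    return sorted(f for f in ENTRY_POINTS if service in call_chain(f))


print(call_chain("order_placement"))
print(business_impact("map-svc"))
```

The reverse lookup is the part that helps during monitoring and governance: given a degraded service, it answers which core business functions are at risk.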

Resilience Governance

Risk assessments identified issues such as overly long timeouts, improper retries, gaps in idempotency and rate limiting, and poorly managed strong and weak dependencies. A risk‑governance model was defined, reviewed, and applied to the core call chains. Nearly 200 risk items were remediated, with a measurable improvement in resilience.
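Several of these risk categories interact: a retry is only safe when the operation is idempotent, and it only helps when attempts and total time are bounded. The following is a minimal Python sketch of that idea under stated assumptions, not the governance model described in the article. The function `call_with_retries`, its parameters, and their default values are all hypothetical.

```python
import random
import time
import uuid


def call_with_retries(operation, *, max_attempts=3, base_delay=0.1,
                      deadline_seconds=2.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Call operation(idempotency_key) with bounded, jittered retries.

    The same idempotency key is passed on every attempt so the callee can
    deduplicate a request that succeeded but whose response was lost.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    key = str(uuid.uuid4())
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(key)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # Exponential backoff with full jitter to avoid retry storms.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            # Give up rather than exceed the caller's overall time budget.
            if clock() - start + delay > deadline_seconds:
                raise
            sleep(delay)
```

Only transient error types are retried here. Retrying every exception, or retrying without an upper bound, is the kind of improper retry that a risk review would flag.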

Incident Response

Following SRE principles, the focus is on rapid recovery and learning. Key practices include:

Fast Recovery : Quick detection, organized response, root‑cause analysis, and execution of recovery steps.

Experience Capture : Post‑mortems, improvement tracking, and periodic fault reviews.

Change Control : Strict change windows, pre‑approval gates (CR, QA, gray‑release), and automated change‑record lookup for rapid rollback.

Preventive Measures : Early detection of minor incidents to avoid larger failures, guided by Heinrich’s law, which holds that each major accident is preceded by many minor incidents and near misses.

Exercise‑Driven Prevention : Regular emergency drills, pre‑plan rehearsals, chaos engineering injections, and disaster‑recovery simulations.
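The change-record lookup mentioned under Change Control can be illustrated with a short Python sketch: given the services involved in an incident and its start time, list the changes deployed shortly beforehand as rollback candidates. The `ChangeRecord` type, the `recent_changes` function, the two-hour lookback, and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    service: str
    description: str
    deployed_at: datetime


def recent_changes(records, affected_services, incident_start,
                   lookback=timedelta(hours=2)):
    """Return changes to affected services deployed within the lookback
    window before the incident, newest first."""
    window_start = incident_start - lookback
    candidates = [
        r for r in records
        if r.service in affected_services
        and window_start <= r.deployed_at <= incident_start
    ]
    return sorted(candidates, key=lambda r: r.deployed_at, reverse=True)


# Hypothetical change log.
records = [
    ChangeRecord("order-api", "raise pricing timeout", datetime(2023, 4, 1, 9, 40)),
    ChangeRecord("map-svc", "new routing config", datetime(2023, 4, 1, 9, 55)),
    ChangeRecord("user-svc", "profile cache rollout", datetime(2023, 4, 1, 9, 50)),
    ChangeRecord("order-api", "dependency upgrade", datetime(2023, 3, 31, 18, 0)),
]

suspects = recent_changes(
    records, {"order-api", "map-svc"}, datetime(2023, 4, 1, 10, 0)
)
for change in suspects:
    print(change.deployed_at, change.service, change.description)
```

Sorting newest first reflects a common heuristic that the most recent change is the first rollback candidate. It is a starting point for triage, not a substitute for root-cause analysis.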

Cultural Construction

Stability is a company‑wide responsibility. Huolala promoted stability culture through expert talks, design‑for‑failure guidelines, printed handbooks with exams, and a “Stability Culture Month” featuring games and rewards.

Practical Playbook for New Teams

Focus on Emergency Response : Establish core business metrics, monitoring dashboards, alerting, on‑call rotation, and rapid incident triage.

Deep System Analysis : Map critical business flows, identify service dependencies, and address shortfalls to boost robustness.

Continuous Operational Evolution : Implement long‑term governance, define role responsibilities, and use metrics to maintain healthy stability over time.
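For the first step of the playbook, alerting on a core business metric can start very simply. The sketch below tracks a sliding-window success rate (for example, of order placements) and signals when it drops below a threshold. The class name, window size, threshold, and minimum sample count are hypothetical values chosen for illustration.

```python
from collections import deque


class SuccessRateAlert:
    """Signal when the success rate over the last `window` outcomes
    falls below `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.95,
                 min_samples: int = 20):
        self.samples = deque(maxlen=window)  # oldest outcomes drop off
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success: bool) -> bool:
        """Record one outcome; return True if the alert should fire."""
        self.samples.append(success)
        if len(self.samples) < self.min_samples:
            return False  # too little data to judge
        rate = sum(self.samples) / len(self.samples)
        return rate < self.threshold
```

The minimum sample count keeps a single early failure from paging the on-call engineer. A production system would typically add time-based windows and alert deduplication on top of this.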

Conclusion and Outlook

Stability work often suffers from weak ownership and is treated as an extra task. Huolala’s experience shows that collective effort, clear processes, and cultural emphasis are vital. Future directions include more automation to reduce manual effort and AI‑driven insights to improve fault detection and mitigation, with the goal of a better user experience and of meeting the company's social responsibility.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
