How Huolala Achieved Zero Failures During Business Peaks for 3 Years

Huolala’s engineering team built a systematic, multi‑layered business‑peak assurance process—covering goal definition, project management, technical risk mitigation, cloud‑provider coordination, capacity planning, and post‑mortem analysis—that has kept its platform fault‑free for over three years of high‑traffic events.

Huolala Tech

Jun 13, 2024

Background

Business peaks cause massive traffic spikes that stress systems, much as fire season strains a fire department. Huolala's peaks come from city launches, discount flash sales, and pre‑holiday demand, all of which require high stability and risk resistance. Since the second half of 2020, the team has conducted 34 peak‑assurance events and maintained zero failures for more than three consecutive years.

How to Conduct Business Peak Assurance

1.1 Define Objectives

The primary goal is zero system failures during peaks. Teams must align with business units on a specific target (for example, whether the peak is 1 million or 2 million orders) and assess its feasibility and cost. Internal KPIs include SLA performance and incident response time. Cost management is also critical: improve staff efficiency and reduce per‑unit server cost.

1.2 Project‑Management Perspective

Peak‑assurance projects involve many participants and cannot be delayed. A dedicated project team coordinates cross‑functional efforts, sets clear milestones, and uses regular kick‑offs, weekly meetings, and communication channels to keep work on track.

1.3 Technical Assurance Details

Risk management is the core of technical assurance. It covers risks from three sources: external, internal, and third‑party. Risks are categorized by type as capacity, change, link (service call chain), and personnel. Specific actions include:

External customers: set traffic‑limiting thresholds based on system capacity.

Internal: run capacity tests, pre‑warm marketing data, and prepare degradation plans; enforce change freezes during peaks; conduct link robustness drills; schedule on‑call staff and set up a “black‑room” (a dedicated war room) for rapid communication.

Third‑party providers: secure capacity early, prepare one‑click degradation plans, coordinate change notifications, and arrange on‑site support.
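The traffic‑limiting item above can be illustrated with a token‑bucket limiter. This is a minimal sketch, not Huolala's implementation: the class `TokenBucket`, the helper `threshold_from_capacity`, and the 0.8 safety factor are all assumptions introduced for illustration. The idea is that the limit is derived from a load‑tested ceiling rather than chosen arbitrarily.

```python
import time


def threshold_from_capacity(tested_max_qps, safety_factor=0.8):
    """Derive a rate limit below the load-tested ceiling.

    safety_factor is an assumed margin, not a value from the article.
    """
    return tested_max_qps * safety_factor


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so an initial burst is admitted
        self.clock = clock          # injectable clock makes the limiter testable
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would create one bucket per external customer or API, for example `TokenBucket(rate=threshold_from_capacity(2000), capacity=200)`, and reject or queue any request for which `allow()` returns `False`.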

1.4 Key Focus Areas

The two main focuses are project organization and system capacity. Project organization relies on PMO involvement, department liaison mechanisms, and fixed meeting cadences. Capacity planning emphasizes early provisioning, monitoring the ramp‑up phase, and preparing worst‑case fallbacks such as traffic throttling or graceful degradation.
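Early provisioning usually starts from a simple sizing estimate. The sketch below assumes a per‑instance throughput measured in load tests and a headroom fraction kept in reserve; the function names and the 30% default headroom are illustrative assumptions, not figures from the article.

```python
import math


def instances_needed(peak_qps, per_instance_qps, headroom=0.3):
    """Instances required to serve peak_qps while keeping `headroom` spare capacity.

    per_instance_qps is assumed to come from load testing.
    """
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = per_instance_qps * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)


def scale_out_plan(current_instances, peak_qps, per_instance_qps, headroom=0.3):
    """Number of extra instances to provision before the peak (never negative)."""
    target = instances_needed(peak_qps, per_instance_qps, headroom)
    return max(0, target - current_instances)
```

Rounding up and reserving headroom both err toward over‑provisioning, which fits the emphasis on preparing for the worst case; the surplus can be released after the peak.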

Huolala’s Specific Practices and Results

2.1 Assurance Strategy

Adopt proven practices from industry leaders, localize them, continuously iterate, and keep team motivation high through operational incentives.

2.2 Implementation

Leverage Alibaba’s Double‑11 experience for the initial framework, then refine it for Huolala’s order‑matching model. During peaks, expand the driver search radius and, if needed, disable heavy‑weight matching logic to reduce load.
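Disabling heavy‑weight logic under load is commonly done with a degradation switch. The following is a hedged sketch only: `DegradeSwitch`, the 80% CPU threshold, the driver fields, and both matching strategies are hypothetical stand‑ins, since the article does not describe Huolala's actual matching code.

```python
from dataclasses import dataclass


@dataclass
class DegradeSwitch:
    """Decides whether to fall back to cheap logic; the threshold is an assumption."""
    cpu_threshold: float = 0.80
    forced: bool = False  # manual override, e.g. flipped by on-call staff

    def degraded(self, cpu_utilization):
        return self.forced or cpu_utilization >= self.cpu_threshold


def heavy_match(drivers):
    """Hypothetical expensive strategy: trades distance against driver idle time."""
    return min(drivers, key=lambda d: d["distance_km"] - 0.01 * d["idle_minutes"])


def light_match(drivers):
    """Cheap fallback: pick the nearest driver only."""
    return min(drivers, key=lambda d: d["distance_km"])


def match(drivers, switch, cpu_utilization):
    """Pick a driver, using the light strategy whenever the switch is degraded."""
    if not drivers:
        return None
    strategy = light_match if switch.degraded(cpu_utilization) else heavy_match
    return strategy(drivers)
```

Keeping the manual `forced` flag separate from the automatic threshold lets on‑call staff degrade pre‑emptively before a known peak instead of waiting for load to cross the threshold.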

2.3 Execution Details

An annual peak‑event calendar is prepared early, documentation is centralized, and a visual project board tracks progress. An on‑site “black‑room” provides a dedicated workspace and refreshments, and a post‑event celebration reinforces team cohesion.

2.4 Post‑Peak Review

After each peak, a thorough post‑mortem reviews goal achievement, analyzes each assurance sub‑task, gathers feedback, and feeds improvements into the next cycle.

2.5 Lessons Learned

Include non‑development changes (e.g., operations config, AB tests) in the change‑freeze scope.

Identify services whose load inversely correlates with traffic to avoid unnecessary scaling.

Optimize post‑peak scale‑in by choosing which servers to keep based on cost‑effectiveness, rather than simply releasing the newly added ones.
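The first lesson, extending the change freeze beyond code deployments, can be sketched as a gate that every change request passes through. The change‑type names, the `emergency_approved` escape hatch, and the window format are assumptions for illustration; the article only states that operations configuration and AB tests belong in the freeze scope.

```python
from datetime import datetime

# Change types blocked during a freeze. "ops_config" and "ab_test" reflect the
# lesson above; the other entries are assumed examples.
FROZEN_CHANGE_TYPES = {"code_deploy", "ops_config", "ab_test", "db_schema"}


def is_change_allowed(change_type, when, freeze_windows, emergency_approved=False):
    """Return True if a change of `change_type` may proceed at time `when`.

    freeze_windows is a list of (start, end) datetimes; end is exclusive.
    """
    if emergency_approved:
        return True  # explicit sign-off bypasses the freeze
    if change_type not in FROZEN_CHANGE_TYPES:
        return True  # types outside the freeze scope are never blocked
    return not any(start <= when < end for start, end in freeze_windows)


# Usage: a hypothetical pre-holiday freeze window.
windows = [(datetime(2024, 1, 20), datetime(2024, 1, 25))]
blocked = not is_change_allowed("ab_test", datetime(2024, 1, 22), windows)
```

Listing frozen types explicitly makes the freeze scope reviewable, so a gap such as an unlisted configuration channel shows up in the list rather than during a peak.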

Impact

Since 2020, Huolala has completed 34 peak‑assurance events with zero failures from 2021 onward, and overall annual incident counts have steadily declined.

Future Outlook

Goals include cost reduction (resource and labor) through fine‑grained resource management, tool‑driven assurance, and cross‑platform pipelines, as well as making peak assurance a daily practice to continuously improve overall system stability.
