Emergency Response Planning and Practice at Hello (哈啰) for Large‑Scale Promotions

Hello’s technical‑risk team built an emergency‑response system for large‑scale promotions. It prioritizes core scenarios, runs high‑frequency drills, models fault portraits, and defines metric‑based triggers with clear, rollback‑able actions. The system delivered zero incidents during the 930 Big Sale and covered over 80 % of core business lines; the next goal is to automate plan selection and execution.

HelloTech

Mar 30, 2023

Background : In the days before the 2022 National Day holiday, Hello (哈啰) launched its first holiday‑themed promotion (the “930 Big Sale”), covering shared bikes, e‑bikes, ride‑hailing, car‑sharing, rental, hotels, train tickets and more. Most services hit annual peaks, and the platform’s Q3 active‑user count rose to the top of the travel‑industry ranking, with the Hello app DAU surpassing 15 million for the first time.

Problem Statement : Rapid user growth and increasingly complex business systems make failures inevitable. The key question is how to minimise the impact of failures on business and revenue.

Risk‑Team Role : Hello’s technical‑risk team improves both fault‑detection capability (knowing where the problem is) and rapid‑resolution capability (emergency handling).

Why Emergency Plans Are Hard :

Ensuring that a falsely triggered plan has little impact on normal business.

Guaranteeing that a plan targets a specific abnormal scenario.

Comprehensively enumerating abnormal scenarios.

Validating plan effectiveness and execution.

Three Main Difficulties :

Scenario abundance : Hello has many business lines (two‑wheel, four‑wheel, e‑commerce, etc.) with numerous user flows (scan‑to‑unlock, card purchase, ride‑coupon, etc.).

Plan freshness : Large‑scale plan creation is labour‑intensive, leading to low update frequency.

Plan complexity : Plans must cover technical components, middleware, storage, infrastructure, etc.

Overall Solution Steps :

Business tiering – start with core business scenarios and critical paths; avoid trying to cover everything at once.

High‑frequency production drills – validate plans in low‑traffic periods or in a sandbox for loss‑less plans.

Fault‑portrait modelling – abstract common loss‑mitigation actions (circuit‑breaker, degradation, custom ops).
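The fault‑portrait idea above can be sketched as a lookup from a described failure to a pre-agreed mitigation. This is only an illustration: the `FaultPortrait` fields, the component and symptom names, and the mapping entries are all assumptions made for the example, since the article does not publish Hello's actual catalogue.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three mitigation families named in the article."""
    CIRCUIT_BREAK = "circuit-breaker"
    DEGRADE = "degradation"
    CUSTOM_OPS = "custom ops"


@dataclass(frozen=True)
class FaultPortrait:
    """A reusable description of one class of failure (hypothetical schema)."""
    component: str  # e.g. "downstream-rpc", "recommendation", "database"
    symptom: str    # e.g. "timeout", "error-rate", "host-slow"


# Illustrative entries only; names and pairings are assumptions.
PORTRAIT_TO_ACTION = {
    FaultPortrait("downstream-rpc", "timeout"): Action.CIRCUIT_BREAK,
    FaultPortrait("recommendation", "error-rate"): Action.DEGRADE,
    FaultPortrait("database", "host-slow"): Action.CUSTOM_OPS,
}


def mitigation_for(portrait: FaultPortrait) -> Action:
    """Return the pre-agreed action; unknown faults fall back to custom ops."""
    return PORTRAIT_TO_ACTION.get(portrait, Action.CUSTOM_OPS)
```

Keeping the portrait a small, hashable record means new business lines can reuse existing entries instead of drafting a plan from scratch, which is one way to address the plan-freshness problem.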

How to Build an Emergency‑Plan System from 0 to 1 :

Trigger condition : Quantifiable metric thresholds (e.g., KPI drop > X %).

Execution action : Clear, observable, rollback‑able steps.

Impact scope : Estimate user experience, data consistency, financial loss.

Operator : Designate primary and backup on‑call owners.

Sync mechanism : Define communication channels and information flow.
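The five elements above map naturally onto a plan record plus a completeness check that can run before a plan is accepted. The sketch below is a minimal illustration under assumptions: the class names, field names, and the `lint_plan` helper are hypothetical, and the "trigger must contain a number" rule is a deliberately crude stand-in for "quantifiable".

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    description: str
    rollback: str = ""  # how to undo this step; empty means not defined


@dataclass
class EmergencyPlan:
    name: str
    trigger: str                                   # quantifiable metric condition
    steps: list[Step] = field(default_factory=list)
    impact_scope: str = ""                         # UX, data consistency, financial loss
    primary_owner: str = ""
    backup_owner: str = ""
    sync_channel: str = ""                         # where status updates are posted


def lint_plan(plan: EmergencyPlan) -> list[str]:
    """Return a list of human-readable gaps; an empty list means the plan is complete."""
    problems = []
    if not any(ch.isdigit() for ch in plan.trigger):
        problems.append("trigger has no numeric threshold")
    if not plan.steps:
        problems.append("no execution steps")
    for i, step in enumerate(plan.steps, start=1):
        if not step.rollback:
            problems.append(f"step {i} has no rollback")
    if not plan.impact_scope:
        problems.append("impact scope not estimated")
    if not plan.primary_owner or not plan.backup_owner:
        problems.append("missing primary or backup owner")
    if not plan.sync_channel:
        problems.append("no sync channel defined")
    return problems
```

A check like this could run whenever a plan is created or edited, so that a plan with a vague trigger or a step without a rollback is flagged before a drill rather than during an incident.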

Four Key Points & Pitfalls in Plan Drafting (illustrated by a diagram in the original article):

Case Studies :

Case 1 – Database Failure : A high‑severity alarm fired and the NOC triggered the emergency response. The root cause was identified as a slow host, an HA switchover was executed, and service was restored. Compensation was handled afterwards as part of the post‑mortem.

Case 2 – High‑Risk Scenario : The trigger is metric X dropping by more than Y % for more than M minutes. The system is then switched to disaster recovery, user impact is assessed, customer service is informed, and affected users are compensated with coupons.

Case 3 – 930 Big Sale : Three phases were planned: a pre‑sale “pre‑plan” (disable non‑essential activities, warm up caches), an emergency plan (degrade algorithmic recommendation to manual recommendation), and a post‑sale “rollback” (re‑enable activities, scale down resources).
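The Case 2 trigger ("metric X drops > Y % for > M minutes") can be made concrete as a sustained-drop check over per-minute samples. The sketch below assumes one sample per minute and a known baseline value; the class name, field names, and the way the baseline is obtained are assumptions, not details from the article.

```python
from dataclasses import dataclass


@dataclass
class SustainedDropTrigger:
    baseline: float    # expected value of metric X (assumed to be known)
    drop_pct: float    # Y: percentage drop that counts as abnormal
    min_minutes: int   # M: the drop must last longer than this many minutes

    def fires(self, per_minute_values: list[float]) -> bool:
        """True if the latest samples have stayed below the threshold
        for more than `min_minutes` consecutive minutes."""
        threshold = self.baseline * (1 - self.drop_pct / 100)
        consecutive = 0
        # Walk backwards from the newest sample until the metric recovers.
        for value in reversed(per_minute_values):
            if value < threshold:
                consecutive += 1
            else:
                break
        return consecutive > self.min_minutes
```

Requiring the drop to persist is what keeps a single noisy sample from firing the plan, which ties back to the earlier concern about falsely triggered plans affecting normal business.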

Results of the 930 Big Sale :

Zero incidents during the promotion.

Plans covered 10+ business lines, with > 80 % core‑line coverage.

Monthly regular drills and continuous verification.

Future Direction – Plan Platform Construction :

Standardised plan management, integration with downstream systems, decision‑awareness, and emergency collaboration.

Goal: automate plan selection and one‑click execution to reduce manual errors.
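As a rough illustration of what automated selection and one-click execution might involve, the sketch below picks the highest-priority plan whose condition matches the current metrics, then runs its steps and undoes the completed ones if a step fails. Everything here (the `Plan` shape, the priority rule, the rollback policy) is an assumption for the example; the article does not describe the platform's design.

```python
from dataclasses import dataclass
from typing import Callable, Optional

StepPair = tuple[Callable[[], None], Callable[[], None]]  # (do, undo)


@dataclass
class Plan:
    name: str
    matches: Callable[[dict], bool]  # predicate over current metrics
    steps: list[StepPair]
    priority: int = 0


def select_plan(plans: list[Plan], metrics: dict) -> Optional[Plan]:
    """Pick the highest-priority plan whose trigger matches, or None."""
    candidates = [p for p in plans if p.matches(metrics)]
    return max(candidates, key=lambda p: p.priority, default=None)


def execute(plan: Plan) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    undo_stack: list[Callable[[], None]] = []
    for do, undo in plan.steps:
        try:
            do()
        except Exception:
            for rollback in reversed(undo_stack):
                rollback()
            return False
        undo_stack.append(undo)
    return True
```

Pairing every action with its undo at definition time is one way to keep execution rollback-able even when it is automated.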

Takeaway : Building plans forces a deep analysis of system architecture, revealing design gaps and prompting the adoption of self‑healing mechanisms before relying on manual emergency actions.
