How Multi‑Cloud Disaster Recovery Boosts Site Availability: Lessons from Real‑World DR Drills
This article shares a detailed case study of building multi‑cloud site disaster‑recovery and fault‑drill practices at Kaixin Network, covering high‑availability concepts, architectural redesign, pain points, automated one‑click switching, and future self‑healing with chaos engineering to improve reliability.
Background
High availability (HA) means the system keeps running without interruption. Availability is A = MTBF / (MTBF + MTTR), so raising MTBF (fewer failures) or cutting MTTR (faster repair) improves A; common industry targets are 99.9% or 99.99%.
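As a quick illustration of the formula and of what those targets mean as a downtime budget (the MTBF/MTTR numbers below are made up):

```python
# Availability A = MTBF / (MTBF + MTTR), and the yearly downtime budget implied
# by common targets. The MTBF/MTTR values here are illustrative only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_minutes(a: float) -> float:
    return (1 - a) * 365 * 24 * 60

print(f"MTBF=720h, MTTR=1h -> A = {availability(720, 1):.4%}")
for target in (0.99, 0.999, 0.9999):
    print(f"A = {target:.2%} allows ~{annual_downtime_minutes(target):.0f} min of downtime per year")
```

At 99% that budget is roughly three and a half days a year; at 99.99% it shrinks to under an hour, which is why the later sections focus on cutting switch-over time.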
Key HA dimensions considered:
Do not put all eggs in one basket – avoid single‑cloud or single‑datacenter reliance.
Eliminate single points of failure – use dual or multiple replicas for services, databases, and load balancers.
Strengthen critical nodes – reinforce network, DB, and SLB layers.
Rapid recovery after failures – have clear runbooks for downgrade, failover, or circuit‑break.
Historical availability data showed many P0 incidents and roughly 99% uptime in 2018, improving to 99.9% in 2019-2020, with 99.99% as the current target.
City‑Level Disaster Recovery (DR)
Since 2018 the team rebuilt the architecture to achieve city‑level DR, choosing a DR approach over active‑active due to lower cost and complexity. The new design adds POP points and multi‑cloud redundancy, so a single line failure no longer brings down the service.
Core datacenters (Huawei Cloud and Tencent Cloud) are linked via VPC0 to edge datacenters. Edge sites host game services that can be quickly removed without affecting core infrastructure.
The multi‑cloud layout provides sufficient redundancy – the “eggs‑in‑multiple‑baskets” strategy.
For the XY platform, the previous single‑cloud path (Tencent Guangzhou → Zhongshan) was replaced with a multi‑cloud redundant setup, using multi‑master‑multi‑slave databases.
When a failure occurs, traffic now switches to the backup cloud via DNS or a custom SLB, keeping the service available.
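A minimal sketch of that switch decision, assuming an HTTP health endpoint on the primary cloud; the hostnames and the update_dns_record() helper are hypothetical placeholders, and a real switch would call the DNS provider's or the custom SLB's API:

```python
# Probe the primary cloud's entry point; if it is unhealthy, point traffic at the
# backup cloud. Endpoints and update_dns_record() are hypothetical placeholders.

import urllib.request

PRIMARY_HEALTH = "https://primary.example.com/healthz"  # assumed health endpoint
BACKUP_IP = "203.0.113.10"                              # assumed backup-cloud entry point

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns_record(name: str, ip: str) -> None:
    # Placeholder: in practice this calls the DNS provider or the custom SLB control plane.
    print(f"pointing {name} at {ip}")

if not is_healthy(PRIMARY_HEALTH):
    update_dns_record("game.example.com", BACKUP_IP)
```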
Fault Drills
Standard Operations
Manual steps for failover were time‑consuming and error‑prone. The team introduced fault‑drill exercises to validate the redesign and uncover hidden issues.
During drills, nine major steps were executed, relying on monitoring alerts to confirm failures (e.g., DB down) and verify successful switch‑over.
Metrics showed traffic drop during a simulated outage and recovery after the switch.
Typical manual issues observed:
Lengthy manual procedures (≈16 minutes) for SLB and DB switches.
Tool availability problems – the management tool itself could fail.
Monitoring gaps – difficulty confirming service health after failover.
Pseudo‑active‑active situations where backup services were not fully synchronized.
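One concrete check that helps expose this kind of pseudo-active-active state is replication lag on the standby database. A minimal sketch, assuming MySQL-style replication and the PyMySQL client; hosts and credentials are placeholders:

```python
# Check how far the standby DB lags behind the master; a large or unknown lag means
# the "backup" is not really in sync. Assumes MySQL replication and PyMySQL.

import pymysql

def replica_lag_seconds(host: str) -> int | None:
    conn = pymysql.connect(host=host, user="monitor", password="***",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            return None if status is None else status["Seconds_Behind_Master"]
    finally:
        conn.close()

lag = replica_lag_seconds("db-backup.dc-b.internal")  # hypothetical standby host
if lag is None or lag > 30:
    print(f"standby is not safely in sync (lag={lag}); treat the switch as risky")
```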
Tool Multi‑Active
To address tool failure, the team replicated the operations tool's database across both datacenters with dual writes, making the tool itself multi-active.
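A minimal sketch of that dual-write idea, again assuming MySQL and PyMySQL; the datacenter endpoints are placeholders, and a production version would also queue and replay writes that miss one side:

```python
# Write every ops-tool change to the tool database in both datacenters, so the tool
# keeps working if either side is lost. Connection details are placeholders.

import pymysql

DC_A = dict(host="tool-db.dc-a.internal", user="ops", password="***", database="opstool")
DC_B = dict(host="tool-db.dc-b.internal", user="ops", password="***", database="opstool")

def dual_write(sql: str, params: tuple = ()) -> None:
    failures = []
    for name, cfg in (("dc-a", DC_A), ("dc-b", DC_B)):
        try:
            conn = pymysql.connect(**cfg, autocommit=True)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
            finally:
                conn.close()
        except Exception as exc:
            failures.append((name, exc))  # a real system queues this for later replay
    if len(failures) == 2:
        raise RuntimeError(f"write failed in both datacenters: {failures}")
```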
During the drill, dependencies such as workflow tasks and read‑only DBs caused additional failures, which were identified and resolved.
One‑Click Switch
A playbook‑style automation (similar to Ansible) was created, allowing operators to trigger a one‑click switch that executes ~16 steps: health checks, DB master‑slave promotion, DNS update, and traffic migration.
This reduced the switch time to about five minutes.
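The shape of such a playbook can be sketched as an ordered runbook that aborts on the first failing step, so a broken switch stops at a known point instead of half-applying. The step names follow the article; their bodies here are placeholders:

```python
# One-click switch as an ordered runbook: each step is a function that raises on
# failure. Step names follow the article; the bodies are placeholders.

from typing import Callable

def check_backup_health() -> None: ...   # probe backup-cloud SLB and services
def promote_backup_db() -> None: ...     # promote the backup-cloud slave to master
def update_dns() -> None: ...            # point public records at the backup entry
def migrate_traffic() -> None: ...       # drain the failed site, confirm recovery

RUNBOOK: list[tuple[str, Callable[[], None]]] = [
    ("health check of backup site", check_backup_health),
    ("DB master-slave promotion", promote_backup_db),
    ("DNS update", update_dns),
    ("traffic migration", migrate_traffic),
    # ...the real playbook has roughly 16 steps
]

def one_click_switch() -> None:
    for name, step in RUNBOOK:
        print(f"[switch] {name} ...")
        step()  # any exception aborts the run at a known step
        print(f"[switch] {name} done")
```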
Regular Drills
Continuous, scheduled drills are required to keep the process from degrading and to verify that the DR solution remains effective as the platform evolves.
Review
Key takeaways:
Spreading risk with multi-cloud, multi-datacenter redundancy to avoid single-basket failures.
Eliminating single points of failure across ELB/SLB, services, and databases.
Improving the robustness of critical nodes via multi-cloud paths and DR.
Accelerating post‑failure recovery with standardized procedures and one‑click automation.
Future Direction
The next goal is self‑healing and chaos engineering to proactively discover and mitigate unpredictable failures such as SSO outages, power loss, or WAF crashes.
By injecting controlled chaos, the team aims to continuously iterate on reliability improvements.
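As a toy illustration of what controlled injection can look like at the host level (to be run only on hosts that are part of a drill; the service names are placeholders):

```python
# Stop a randomly chosen drill-target service, wait, then restore it and go verify
# that monitoring and the recovery procedure reacted as expected.

import random
import subprocess
import time

CANDIDATE_SERVICES = ["game-gateway", "ops-tool-web"]  # hypothetical drill targets

def inject_and_restore(duration_s: int = 120) -> None:
    victim = random.choice(CANDIDATE_SERVICES)
    print(f"[chaos] stopping {victim} for {duration_s}s")
    subprocess.run(["systemctl", "stop", victim], check=True)
    try:
        time.sleep(duration_s)  # window in which alerts / self-healing should react
    finally:
        subprocess.run(["systemctl", "start", victim], check=True)
        print(f"[chaos] {victim} restored; now review the alert and recovery timeline")
```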