MaGe Linux Operations
Aug 19, 2025 · Big Data

Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.

Kafka · Monitoring · disaster-recovery
0 likes · 16 min read
Alibaba Cloud Infrastructure
Jan 6, 2025 · Cloud Native

Regional Disaster Recovery Architecture Using ASM Service Mesh and GTM

This guide explains how to design and implement a multi‑region disaster‑recovery solution on Alibaba Cloud: deploying identical Kubernetes clusters, configuring ASM ingress gateways with Global Traffic Manager (GTM) for automatic failover, enabling intra‑cluster traffic retention, and validating the setup with load‑testing tools.

GTM · cloud-native · disaster-recovery
0 likes · 15 min read
Efficient Ops
Nov 14, 2024 · Operations

Why Alipay Crashed: Lessons on Backup and Disaster Recovery

The recent Alipay outage during Double‑11 revealed a partial failure in its system message database, which left users facing payment errors, duplicate charges, and delayed withdrawals. The company’s response highlighted the importance of comprehensive backup, redundancy, disaster‑recovery planning, monitoring, and security measures for service continuity.

Alipay · SRE · disaster-recovery
0 likes · 10 min read
Architecture and Beyond
Jun 1, 2024 · Operations

Comprehensive Guide to Data Backup and Disaster Recovery Strategies

This article examines real-world backup failures, explains why backups are essential, outlines what data and system components should be backed up, describes backup principles, classifications, technologies, and disaster recovery planning, and offers practical guidance for building robust, multi-layered backup strategies.

Backup · Cloud Backup · IT Operations
0 likes · 13 min read
Tech Architecture Stories
Jan 25, 2024 · Operations

Why 2023 Saw a Spike in Cloud Outages: Key Lessons for High‑Availability

2023 witnessed numerous high‑profile cloud service failures—from Alibaba’s Hong Kong data‑center cooling issue to Tencent’s storage outage—highlighting how cost‑cutting, reduced staffing, and insufficient disaster‑recovery planning amplify risk, and outlining essential high‑availability, failover, and multi‑region strategies for resilient operations.

Scalability · cloud outage · disaster-recovery
0 likes · 19 min read
ITPUB
Jun 30, 2023 · Operations

How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook

This article details Tencent Search’s end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, emergency response acceleration, pre‑release interception, automated defense, and collaborative governance that together improve MTTD and MTTR by an order of magnitude.

Automation · Monitoring · Reliability
0 likes · 30 min read
Efficient Ops
Sep 14, 2020 · Cloud Native

How Dada Built a Dual‑Cloud Active‑Active Disaster Recovery Platform

This article details Dada's journey of designing and implementing a dual‑cloud active‑active architecture, covering high‑availability vs. disaster‑recovery concepts, Phase 1 and Phase 2 solutions, challenges faced, multi‑data‑center Consul deployment, bidirectional database replication, precise load‑balancing, capacity elasticity, and future plans.

Consul · cloud-native · database-replication
0 likes · 17 min read
dbaplus Community
Nov 18, 2019 · Backend Development

Designing an Off‑Heap Disaster Recovery Cache to Keep Recommendations Fast

When the recommendation service of the Mafengwo app experiences database disconnections, third‑party timeouts, or network jitter, a locally deployed off‑heap cache built with OHC and Spring Boot can return pre‑computed results, isolating business logic, reducing latency, and improving user experience during failures.

Caching · Java · Off-Heap
0 likes · 12 min read
Efficient Ops
Sep 18, 2019 · Operations

How a Bank’s Veteran Engineer Achieved Seamless Mainframe Disaster Recovery

In this interview, senior China Bank systems engineer Lu Yang shares his 34‑year journey in mainframe operations, covering the seamless disaster‑recovery switch of 2018, the importance of focus, continuous learning, and risk awareness, and future trends such as AIOps, security, and the enduring value of mainframe technology.

IT career · Mainframe · aiops
0 likes · 17 min read
Tencent Cloud Developer
Mar 12, 2019 · Cloud Native

Understanding Active-Active Disaster Recovery Architecture: Challenges and Implementation Strategies

The article argues that cold backup and active‑passive setups provide false security and outlines how true active‑active disaster‑recovery requires local‑datacenter request handling, business‑driven data sharding, and low‑latency cross‑site synchronization, recommending a staged rollout from city‑level to cross‑region architectures while weighing ROI.

Data Consistency · Network Latency · active-active-architecture
0 likes · 9 min read
Efficient Ops
May 9, 2017 · Backend Development

How Tencent Scaled QQ Red Packet to 100k QPS: Architecture & Lessons

This article details how Tencent analyzed its AMS system, estimated traffic, and redesigned the system for high availability during the QQ Spring Festival Red Packet event, covering architecture mapping, scaling strategies, overload protection, flexible availability, disaster recovery, monitoring, and practical lessons learned.

Monitoring · backend · disaster-recovery
0 likes · 25 min read
WeChat Backend Team
Jun 14, 2016 · Backend Development

How WeChat Generates Trillions of Sequence Numbers with Sub‑Millisecond Latency

This article explains how WeChat’s seqsvr service generates trillions of per‑user sequence numbers with sub‑millisecond latency, detailing its core architecture, pre‑allocation and section‑sharing strategies, engineering implementation with StoreSvr and AllocSvr, and the evolution of its disaster‑recovery designs from primary‑backup to embedded routing tables.

Scalability · Sequence · WeChat
0 likes · 20 min read