Tagged articles

13 articles

Aug 19, 2025 · Big Data

Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR (in‑sync replica) management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.

Kafka, Operations, Scaling

16 min read

Alibaba Cloud Infrastructure

Jan 6, 2025 · Cloud Native

Regional Disaster Recovery Architecture Using ASM Service Mesh and GTM

This guide explains how to design and implement a multi‑region disaster‑recovery solution on Alibaba Cloud by deploying identical Kubernetes clusters, configuring ASM ingress gateways with Global Traffic Manager (GTM) for automatic failover, enabling intra‑cluster traffic retention, and validating the setup with load‑testing tools.

GTM, cloud-native, disaster-recovery

15 min read

Efficient Ops

Nov 14, 2024 · Operations

Why Alipay Crashed: Lessons on Backup and Disaster Recovery

The recent Alipay outage during Double‑11 revealed a partial failure in its system message database, causing payment errors, duplicate charges, and delayed withdrawals for users. The company's response highlighted the importance of comprehensive backup, redundancy, disaster‑recovery planning, monitoring, and security measures for ensuring service continuity.

Alipay, SRE, disaster-recovery

10 min read

Architecture and Beyond

Jun 1, 2024 · Operations

Comprehensive Guide to Data Backup and Disaster Recovery Strategies

This article examines real-world backup failures, explains why backups are essential, outlines what data and system components should be backed up, describes backup principles, classifications, technologies, and disaster recovery planning, and offers practical guidance for building robust, multi-layered backup strategies.

Backup, Cloud Backup, Compliance

13 min read

Tech Architecture Stories

Jan 25, 2024 · Operations

Why 2023 Saw a Spike in Cloud Outages: Key Lessons for High‑Availability

2023 saw numerous high‑profile cloud service failures, from Alibaba's Hong Kong data‑center cooling issue to Tencent's storage outage. The article shows how cost‑cutting, reduced staffing, and insufficient disaster‑recovery planning amplify risk, and outlines essential high‑availability, failover, and multi‑region strategies for resilient operations.

Scalability, cloud outage, disaster-recovery

19 min read

ITPUB

Jun 30, 2023 · Operations

How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook

This article details Tencent Search's end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, faster emergency response, pre‑release interception, automated defense, and collaborative governance, which together reduce MTTD (mean time to detect) and MTTR (mean time to recover) by an order of magnitude.

Automation, Reliability, disaster-recovery

30 min read

Efficient Ops

Sep 14, 2020 · Cloud Native

How Dada Built a Dual‑Cloud Active‑Active Disaster Recovery Platform

This article details Dada's journey of designing and implementing a dual‑cloud active‑active architecture, covering high‑availability vs. disaster‑recovery concepts, Phase 1 and Phase 2 solutions, challenges faced, multi‑data‑center Consul deployment, bidirectional database replication, precise load‑balancing, capacity elasticity, and future plans.

Consul, Multi-Cloud, cloud-native

17 min read

dbaplus Community

Nov 18, 2019 · Backend Development

Designing an Off‑Heap Disaster Recovery Cache to Keep Recommendations Fast

When the Mafengwo app's recommendation service runs into database disconnections, third‑party timeouts, or network jitter, a locally deployed off‑heap cache built with OHC and Spring Boot can return pre‑computed results, isolating business logic, reducing latency, and improving user experience during failures.

Caching, Java, Off-Heap

12 min read

Efficient Ops

Sep 18, 2019 · Operations

How a Bank’s Veteran Engineer Achieved Seamless Mainframe Disaster Recovery

In this interview, senior China Bank systems engineer Lu Yang shares his 34‑year journey in mainframe operations. He details the seamless disaster‑recovery switchover in 2018 and discusses the importance of focus, continuous learning, and risk awareness, as well as future trends such as AIOps, security, and the enduring value of mainframe technology.

AIOps, IT career, Mainframe

17 min read

Tencent Cloud Developer

Mar 12, 2019 · Cloud Native

Understanding Active-Active Disaster Recovery Architecture: Challenges and Implementation Strategies

The article argues that cold backup and active‑passive setups provide false security and outlines how true active‑active disaster‑recovery requires local‑datacenter request handling, business‑driven data sharding, and low‑latency cross‑site synchronization, recommending a staged rollout from city‑level to cross‑region architectures while weighing ROI.

Data Consistency, Network Latency, active-active-architecture

9 min read

Efficient Ops

May 9, 2017 · Backend Development

How Tencent Scaled QQ Red Packet to 100k QPS: Architecture & Lessons

This article details how Tencent's AMS system was analyzed, sized through traffic estimation, and redesigned for high availability during the QQ Spring Festival Red Packet event, covering architecture mapping, scaling strategies, overload protection, flexible availability, disaster recovery, monitoring, and practical lessons learned.

Backend, Scaling, disaster-recovery

25 min read

Architects' Tech Alliance

Jul 19, 2016 · Operations

Overview of Backup Architecture, Features, and Data Protection Strategies

This article provides a comprehensive overview of backup and disaster‑recovery architectures, detailing backup methods, software components, multi‑domain management, data archiving, and the distinctions between backup and archiving, with examples from Simpana and other industry products.

Backup, archiving, data-management

14 min read

WeChat Backend Team

Jun 14, 2016 · Backend Development

How WeChat Generates Trillions of Sequence Numbers with Sub‑Millisecond Latency

This article explains how WeChat’s seqsvr service generates trillions of per‑user sequence numbers with sub‑millisecond latency, detailing its core architecture, pre‑allocation and section‑sharing strategies, engineering implementation with StoreSvr and AllocSvr, and the evolution of its disaster‑recovery designs from primary‑backup to embedded routing tables.

Scalability, Sequence, WeChat

20 min read