Tagged articles
128 articles
Raymond Ops
Jan 2, 2026 · Operations

Avoid 3 Fatal Nginx+Keepalived HA Pitfalls That 90% of Ops Engineers Miss

This article reveals three hidden traps in Nginx‑Keepalived high‑availability setups—network‑partition split‑brain, inadequate health‑check scripts, and unsafe configuration‑sync timing—explains real incidents caused by each, and provides concrete configuration changes, Bash scripts, and automation tips to prevent service outages.

Nginx · Ops · automation
0 likes · 16 min read
Ray's Galactic Tech
Dec 20, 2025 · Operations

How RocketMQ Achieves High‑Availability Storage and Fast Fault Recovery

RocketMQ ensures durable, consistent, and highly available message storage through fixed‑length append‑only files, efficient index rebuilding, checkpoint tracking, and configurable master‑slave replication, offering both synchronous and asynchronous HA modes, detailed recovery steps, performance trade‑offs, and practical operational guidelines for robust fault tolerance.

Operations · RocketMQ · fault-recovery
0 likes · 10 min read
NiuNiu MaTe
Dec 17, 2025 · Backend Development

Master Redis Distributed Locks: Prevent Race Conditions, Zombie Locks, and Expiration Issues

This guide explains how Redis implements distributed locks, outlines common pitfalls such as lock contention, zombie locks, and mismatched expiration times, and provides step‑by‑step solutions, including single‑node SET commands, the Redlock high‑availability algorithm, Lua‑based safe release, and best‑practice recommendations for real‑world deployments.

Redlock · distributed-lock · high-availability
0 likes · 15 min read
Qunar Tech Salon
Dec 4, 2025 · Backend Development

Why a Real‑Time/Offline Price Cache Is Critical for High‑Traffic Hotel Booking

The article explains why hotel booking platforms must implement a price‑cache layer, detailing performance bottlenecks, traffic spikes, and data freshness challenges, and describes a split real‑time and offline architecture with dual‑update strategies, cache‑freshness logic, and high‑availability mechanisms to ensure fast, reliable pricing.

caching · high-availability · hotel
0 likes · 14 min read
Architect's Guide
Aug 26, 2025 · Backend Development

Mastering Microservices: From Architecture Basics to Spring Cloud & Dubbo

This comprehensive guide explains microservice fundamentals, RPC frameworks, serialization, distributed transaction models (ACID, CAP, BASE, TCC), system monitoring, high‑availability strategies, load balancing, configuration management, service registration/discovery, Spring Cloud components, Dubbo fault‑tolerance clusters, and compares Spring Boot with Spring MVC, providing practical code examples and diagrams.

Configuration Management · Dubbo · Microservices
0 likes · 40 min read
MaGe Linux Operations
Aug 19, 2025 · Big Data

Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.

Kafka · Operations · disaster-recovery
0 likes · 16 min read
DevOps Operations Practice
Aug 11, 2025 · Operations

Zen Master’s Secrets to the Ultimate State of Operations

Through a series of dialogues with a Zen master, the article humorously explores the highest level of operations—automation that runs itself, balanced alerting, cloud migration, reliable backups, high‑availability, stability through chaos engineering, and the ultimate goal of making systems operate without human intervention.

Backup · Operations · automation
0 likes · 5 min read
MaGe Linux Operations
May 11, 2025 · Cloud Native

How to Build a High‑Performance, Highly‑Available Cloud‑Native Ingress Gateway

When an Ingress gateway faces traffic exceeding 100,000 QPS, this guide outlines systematic performance optimizations, configuration tweaks, distributed architecture designs, traffic management, monitoring, and disaster‑recovery strategies—including hardware scaling, kernel tuning, DPDK, rate limiting, horizontal scaling, service mesh integration, and CDN offloading—to achieve high concurrency and high availability.

Scalability · cloud-native · high-availability
0 likes · 8 min read
Alibaba Cloud Native
Dec 17, 2024 · Cloud Native

Achieving Full Cloud‑Native Migration: Hangzhou MingShitang’s Journey to 100% SLA

This case study details how Hangzhou MingShitang migrated its entire online‑education platform from self‑hosted IDC infrastructure to Alibaba Cloud, redesigning registration, configuration, micro‑service governance, safe release and gateway layers with MSE, Sentinel and cloud‑native technologies to attain 100% SLA, dramatically cut costs and boost performance.

Alibaba Cloud · cloud-native · high-availability
0 likes · 19 min read
High Availability Architecture
Sep 4, 2024 · Backend Development

Three‑High System Construction: Performance, Concurrency, and Availability – A Backend Engineering Methodology

This article presents a comprehensive backend engineering methodology for building "three‑high" systems that simultaneously achieve high performance, high concurrency, and high availability, covering performance tuning, horizontal and vertical scaling, hot‑key mitigation, fault‑tolerance mechanisms, isolation strategies, and practical DDD‑driven design.

Backend · DDD · Scalability
0 likes · 26 min read
dbaplus Community
Aug 15, 2024 · Backend Development

How a Kafka‑Proxy Boosts Cluster Scalability and Resilience

This article explains the challenges of large‑scale Kafka clusters and introduces a lightweight Kafka‑Proxy layer that provides seamless cluster switching, traffic monitoring, online offset reset, and flow‑control mechanisms, ultimately improving availability, throughput, and operational efficiency.

Backend · Offset Reset · Proxy
0 likes · 17 min read
Architecture and Beyond
Jul 28, 2024 · Frontend Development

Comprehensive Guide to Front‑End Stability: Observability, Full‑Chain Monitoring, High‑Availability Architecture, Performance Management, Risk Governance, Process Mechanisms, and Engineering Practices

This extensive article presents a systematic approach to front‑end stability, covering observability systems, full‑chain monitoring, high‑availability design, performance management, risk governance, process mechanisms, and engineering practices to ensure reliable user experiences and business continuity.

frontend · high-availability · monitoring
0 likes · 44 min read
JD Tech Talk
Jul 25, 2024 · Backend Development

Design and Architecture of JD.com’s Buffalo Distributed DAG Scheduling System

The article details the design, core technical solutions, high‑availability architecture, performance optimizations, and open capabilities of Buffalo, JD.com’s distributed DAG‑based job scheduling platform that supports massive task volumes, complex dependencies, and flexible resource management.

Backend · DAG · Distributed Scheduling
0 likes · 13 min read
Top Architect
Jul 12, 2024 · Backend Development

Traffic Governance and High Availability in Backend Systems: Circuit Breakers, Isolation, Retries, Timeouts, and Rate Limiting

This article explains how high‑availability backend systems use traffic governance techniques—including circuit breakers, various isolation strategies, retry and timeout policies, degradation mechanisms, and rate‑limiting—to maintain balanced data flow, prevent cascading failures, and ensure performance, scalability, and reliability.

Backend · Retry · Timeout
0 likes · 30 min read
Full-Stack Internet Architecture
May 24, 2024 · Databases

Redis Deployment Modes: Single Instance, Master‑Slave Replication, Sentinel, and Cluster

This article reviews the four common Redis deployment modes—single‑instance, master‑slave replication, Sentinel, and cluster—explaining their architectures, advantages, drawbacks, and suitable application scenarios, and provides a comparative table to help readers choose the appropriate setup.

Deployment · high-availability · sentinel
0 likes · 7 min read
Tencent Cloud Developer
May 14, 2024 · Backend Development

Product Middle Platform Workflow Orchestration Engine: Use Cases, Architecture, and High‑Availability Solutions

Tencent’s product middle platform employs a self‑built, stateless workflow orchestration engine—configurable via drag‑and‑drop or DSL—to coordinate massive product processing and audit tasks, using load‑balancing, retry, rate‑limiting, circuit‑breaker and service isolation strategies that ensure high availability, performance, and horizontal scalability on TKE.

Backend · DSL · Microservices
0 likes · 14 min read
Tech Architecture Stories
Jan 25, 2024 · Operations

Why 2023 Saw a Spike in Cloud Outages: Key Lessons for High‑Availability

2023 witnessed numerous high‑profile cloud service failures—from Alibaba’s Hong Kong data‑center cooling issue to Tencent’s storage outage—highlighting how cost‑cutting, reduced staffing, and insufficient disaster‑recovery planning amplify risk, and outlining essential high‑availability, failover, and multi‑region strategies for resilient operations.

Scalability · cloud outage · disaster-recovery
0 likes · 19 min read
dbaplus Community
Jan 14, 2024 · Operations

How Bilibili Achieves 99.99% Availability for Live Gift Systems

This article explains Bilibili's technical strategies—preloading, circuit breaking, sharding, multi‑active deployment, and Kubernetes auto‑scaling—that ensure the live‑gift panel, feeding flow, and supporting services maintain 99.99% uptime even during massive traffic spikes.

Microservices · circuit-breaker · high-availability
0 likes · 14 min read
dbaplus Community
Jan 2, 2024 · Operations

How Xiaohongshu Scaled Its Metrics System Tenfold with Cloud‑Native Architecture

Facing exploding metric volumes, high resource consumption, and fragile operations, Xiaohongshu's observability team completely rebuilt its metrics pipeline using VictoriaMetrics, achieving ten‑fold performance gains, minute‑level scaling, high‑availability, cost reduction, and robust multi‑cloud active‑active deployment while preserving data safety and query speed.

Metrics · Prometheus · Time Series
0 likes · 34 min read
Xiaohongshu Tech REDtech
Dec 14, 2023 · Cloud Native

Evolution of Xiaohongshu Metrics System: Cloud‑Native Observability, High Availability, and Performance Optimizations

Xiaohongshu’s observability team rebuilt its Prometheus‑based metrics platform using vmagent, dual‑active HA clusters, query push‑down, high‑cardinality governance and multi‑cloud active‑active design, delivering ten‑fold collection speed, up to 70× query capacity, massive CPU‑memory‑storage savings and fully automated scaling.

Metrics · Time Series · VictoriaMetrics
0 likes · 35 min read
Baidu Intelligent Cloud Tech Hub
Nov 27, 2023 · Databases

How GaiaDB Redefines Cloud‑Native Databases with Fusion Architecture

GaiaDB, Baidu’s cloud‑native database, combines compute‑storage separation with a fused, log‑service architecture to boost performance, simplify consistency, and deliver multi‑level high availability across zones and regions, while supporting new features such as parallel query, HTAP replicas, and serverless scaling.

cloud-native · distributed-systems · high-availability
0 likes · 17 min read
Tencent Cloud Middleware
Sep 14, 2023 · Backend Development

How Tencent Cloud’s Unitized Architecture Boosts Microservice Scalability and High Availability

This article explains the concept of unitized architecture, its characteristics and types, the performance and reliability challenges it solves for large‑scale microservice systems, and how Tencent Cloud’s TSF platform implements unit routing, gray release, and disaster‑recovery to achieve efficient, cross‑region, high‑availability deployments.

Tencent Cloud · cloud-native · high-availability
0 likes · 15 min read
Top Architect
Aug 8, 2023 · Backend Development

High‑Availability Architecture and Optimization Strategies for a Large‑Scale Membership System

This article describes the design, high‑availability solutions, traffic isolation, deep performance optimizations, caching strategies, dual‑center MySQL partitioning, seamless migration, and future fine‑grained flow‑control and degradation techniques employed to keep a billion‑user membership system stable and performant under extreme load.

Backend · Scalability · high-availability
0 likes · 20 min read
dbaplus Community
Jul 8, 2023 · Operations

How QQ Music Achieves High Availability: Architecture, Tools, and Observability

This article explains how QQ Music embraces inevitable faults by building a high‑availability architecture that combines redundant infrastructure, automated failover, stability strategies, a robust toolchain for chaos engineering and full‑link load testing, and comprehensive observability to ensure graceful fault handling at scale.

Microservices · chaos-engineering · distributed-systems
0 likes · 27 min read
Architecture & Thinking
Jun 2, 2023 · Databases

Mastering Database High Availability: From Basic Replication to Seamless Scaling

This article examines the evolution of database high‑availability architectures in large‑scale internet environments, covering basic direct‑connect setups, scale‑up/scale‑out sharding, master‑slave/master‑master with Keepalived, and advanced solutions such as MHA, Percona XtraDB Cluster, and MySQL Group Replication, plus smooth scaling steps.

Master‑Slave · high-availability · keepalived
0 likes · 7 min read
Top Architect
May 5, 2023 · Backend Development

Using Redis Sentinel for High Availability: Design and Implementation

This article introduces Redis Sentinel as the official high‑availability solution for Redis, explains its core functions, provides configuration examples, compares three ways to receive failover notifications (script, client subscription, and indirect service), and offers design recommendations for robust production deployments.

DevOps · failover · high-availability
0 likes · 10 min read
High Availability Architecture
Apr 19, 2023 · Backend Development

Designing High‑Availability Services: Architecture Boundaries, Protocols, and Push Systems

This article explains how Tencent’s internal high‑availability service curriculum emphasizes architecture boundaries, unified protocol definitions using JCE, a unified PushAPI, monitoring and feedback mechanisms, and the organizational impact of aligning system and team boundaries to achieve scalable, reliable backend services.

Backend · distributed-systems · high-availability
0 likes · 14 min read
vivo Internet Technology
Feb 8, 2023 · Operations

Design and Implementation of Vivo Jenkins Scheduler for High Availability and Resource Management

The paper presents Vivo’s Jenkins Scheduler, a master‑centric, high‑availability solution that replaces single‑master Jenkins by integrating an API gateway, event‑driven failure detection, label‑based multi‑dimensional scheduling, Redis/MySQL‑backed flow control, and callback monitoring, thereby balancing resources, enabling rapid failover, persisting queues, and improving build reliability, with plans to containerize Jenkins for Kubernetes workflows.

DevOps · Jenkins · Resource Management
0 likes · 10 min read
Top Architect
Dec 16, 2022 · Databases

Comprehensive Guide to Database Horizontal Scaling, Sharding, and High Availability with MariaDB and Keepalived

This article presents a detailed analysis and step‑by‑step implementation of horizontal database scaling, including sharding strategies, shutdown and stop‑write plans, log‑based migration, dual‑write approaches, and a smooth 2N expansion method, while also covering MariaDB master‑master configuration, dynamic data source addition, and Keepalived high‑availability setup.

MariaDB · high-availability · scaling
0 likes · 37 min read
ITPUB
Nov 27, 2022 · Operations

Designing a Scalable, High‑Availability Monitoring System with Prometheus and Thanos

This article explores the challenges of building a fault‑tolerant monitoring platform, compares open‑source solutions, details why Prometheus is preferred, and shows how to achieve high availability and horizontal scaling using Thanos, remote‑write, hash‑ring sharding, and Kubernetes integration.

Thanos · cloud-native · high-availability
0 likes · 18 min read
Huolala Tech
Sep 22, 2022 · Operations

How HuoLala Engineered a Scalable, High‑Availability Monitoring System for Multi‑Cloud

This article details the evolution of monitoring technologies, HuoLala's three‑phase monitoring architecture, the integration of Prometheus, VictoriaMetrics and SkyWalking, zero‑intrusion bytecode instrumentation, full‑link trace sampling, visual dashboards, metric‑trace‑log correlation, and future plans for root‑cause analysis and intelligent alerting.

Operations · bytecode · cloud
0 likes · 24 min read
Top Architect
Aug 26, 2022 · Operations

Comprehensive Guide to Nginx Rewrite, Anti‑Hotlinking, Static/Dynamic Separation, and Keepalived High‑Availability Configuration

This article provides a step‑by‑step tutorial on configuring Nginx rewrite rules, implementing anti‑hotlinking protection, separating static and dynamic resources, and building a high‑availability architecture using Keepalived with detailed code examples and deployment instructions.

Configuration · Nginx · anti-hotlinking
0 likes · 23 min read
High Availability Architecture
Aug 10, 2022 · Backend Development

Design and Migration of a High‑Performance Message Middleware Platform from RabbitMQ to RocketMQ

To address RabbitMQ’s scalability, reliability, and feature limitations, Vivo’s middleware team evaluated RocketMQ and Pulsar, selected RocketMQ, and built a next‑generation message middleware platform with an AMQP‑proxy gateway, metadata services, and high‑availability mechanisms, enabling seamless, high‑throughput migration and richer messaging capabilities.

Messaging · RabbitMQ · RocketMQ
0 likes · 13 min read
ITPUB
Aug 5, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

On July 13, 2021, Bilibili’s OpenResty‑based SLB suffered an outage with CPU usage pinned at 100%, caused by a bug in a Lua _gcd function that was triggered when a service’s weight was set to the string “0”; the multi‑hour incident was resolved by rebuilding the SLB clusters and disabling JIT compilation.

Load Balancer · OpenResty · SRE
0 likes · 17 min read
dbaplus Community
Jul 25, 2022 · Operations

How to Build an Enterprise‑Grade Monitoring & Alerting Platform with Prometheus, Grafana, and AlertManager

Designing a unified, enterprise‑level monitoring and alerting platform, this article analyzes the shortcomings of standard Prometheus‑Grafana‑AlertManager setups, outlines platform‑vs‑business responsibilities, details architecture, user‑scenario requirements, component selection, high‑availability strategies, and deployment models to achieve scalable, easy‑to‑use observability.

Operations · high-availability · monitoring
0 likes · 12 min read
360 Smart Cloud
Jul 14, 2022 · Cloud Computing

Auto Scaling (AS) in Cloud Services: Architecture, Use Cases, and Optimization Strategies

This article explains the concept of elastic auto scaling in cloud services, describes typical scenarios such as high‑elastic web apps and compute‑intensive workloads, details the four‑layer architecture and workflow, and outlines functional features, stability improvements, and future optimization directions.

Auto Scaling · cloud-computing · elasticity
0 likes · 12 min read
Wukong Talks Architecture
Jun 14, 2022 · Databases

MySQL High‑Availability Incident Review and Recovery Steps

The article recounts a production‑like MySQL dual‑master HA setup using Keepalived, describes how a missing binary‑log index caused replication failure, and details step‑by‑step troubleshooting commands, configuration fixes, and preventive measures to restore reliable database synchronization.

databases · high-availability · keepalived
0 likes · 9 min read
Dada Group Technology
Jun 6, 2022 · Backend Development

Evolution of JD Daojia Search System Architecture from Version 1.0 to 3.0

The article details the progressive architectural evolution of JD Daojia's search system—starting from a simple, single‑layer ES‑based 1.0 design, through the 2.0 overhaul that introduced full‑recall, independent ranking services, and index disaster‑recovery, to the 3.0 version that adds multi‑path recall, sophisticated ranking models, and automated routing for high availability.

Elasticsearch · Scalability · high-availability
0 likes · 20 min read
Top Architect
Jun 2, 2022 · Cloud Native

A Beginner's Guide to Designing, Implementing, and Deploying Microservices on Kubernetes

This article walks readers through the complete lifecycle of a microservice system—from architectural design and Java Spring Boot implementation to Kubernetes deployment, high‑availability setup, monitoring with Prometheus/Grafana, tracing with Zipkin, and flow‑control with Sentinel—providing practical code snippets and step‑by‑step instructions.

Kubernetes · Microservices · cloud-native
0 likes · 21 min read
Cloud Native Technology Community
May 31, 2022 · Cloud Native

Building High‑Performance, High‑Availability Container Networks for Banking in a Two‑Site‑Three‑Center Architecture

This article explains the challenges of container networking in banks, especially under a two‑site‑three‑center architecture, and provides practical guidance on using underlay and overlay approaches, Kube‑OVN solutions, and best‑practice recommendations to achieve high‑availability, high‑concurrency, and high‑performance cloud‑native networks.

Banking · Kube-OVN · Overlay
0 likes · 9 min read
Wukong Talks Architecture
May 17, 2022 · Databases

Implementing MySQL Master‑Master High Availability with Keepalived: A Step‑by‑Step Guide

This article provides a comprehensive, English‑language tutorial on building MySQL master‑master high availability using Keepalived, covering architecture design, Docker‑based MySQL deployment, replication configuration, Keepalived installation, virtual IP setup, failover testing, and a detailed list of encountered pitfalls and their solutions.

Docker · high-availability · master-master
0 likes · 22 min read
IT Architects Alliance
Apr 27, 2022 · Operations

High‑Availability Architecture for a Billion‑Scale Membership System: ES Dual‑Center, Redis Caching, MySQL Migration, and Flow‑Control Strategies

This article details how a membership system serving billions of users achieves high performance and high availability through a dual‑center Elasticsearch cluster, traffic‑isolated ES clusters, Redis cache with distributed locks, MySQL dual‑center partitioning, and fine‑grained flow‑control and degradation mechanisms, all while ensuring zero‑downtime migrations and consistent data.

Flow Control · distributed-systems · high-availability
0 likes · 20 min read
Top Architect
Apr 7, 2022 · Backend Development

High‑Availability Architecture and Migration Strategies for a Large‑Scale Membership System

This article details the design and implementation of a highly available membership platform, covering Elasticsearch dual‑center primary‑backup clusters, traffic‑isolation architectures, deep ES optimizations, Redis caching and dual‑center clusters, MySQL partitioned clusters, seamless SQL Server‑to‑MySQL migration, abnormal member governance, and refined flow‑control and degradation strategies.

Backend · high-availability · migration
0 likes · 20 min read
DataFunSummit
Mar 8, 2022 · Cloud Native

Design and Implementation of Cloud‑Native High‑Availability Solutions for Data Components at eBay

eBay’s data infrastructure engineers describe how they design and implement cloud‑native, multi‑cluster high‑availability architectures for stateful data components—covering background challenges, federated Kubernetes management, state handling, fault‑tolerance, backup, and chaos testing—to ensure reliable, scalable data services across global data centers.

Multi-Cluster · cloud-native · data-components
0 likes · 16 min read
ITPUB
Jan 26, 2022 · Backend Development

How to Choose the Right Distributed Unique ID Strategy for Your System

This article explains why globally unique identifiers are essential in distributed systems, outlines the key characteristics of a good ID scheme, and compares several generation methods—including UUID, database auto‑increment, segmented DB ranges, Redis INCR, Zookeeper, Meituan Leaf, Snowflake, and Baidu uid‑generator—highlighting their advantages, drawbacks, and practical implementation details.

database · distributed-id · high-availability
0 likes · 18 min read
Cloud Native Technology Community
Jan 7, 2022 · Cloud Native

Designing High‑Availability, High‑Performance Cloud‑Native Container Networks for Banking

This article examines the challenges and solutions for building high‑availability, high‑concurrency, and high‑performance cloud‑native container networks in banks, covering two‑site three‑center architectures, underlay/overlay strategies, Kube‑OVN implementation, and practical recommendations for secure, scalable networking.

Banking · Container Network · Kube-OVN
0 likes · 10 min read
IT Architects Alliance
Oct 25, 2021 · Databases

Designing a High‑Availability Redis Service with Sentinel

This article explains how to build a highly available Redis service using Redis Sentinel, discusses common failure scenarios, compares several architectural options from a single instance to a three‑node Sentinel setup, and provides practical tips such as using virtual IPs for seamless client access.

architecture · database · failover
0 likes · 11 min read
Laravel Tech Community
Oct 19, 2021 · Backend Development

Redis Scaling Strategies: Partitioning, Master‑Slave Replication, Sentinel, and Cluster

This article introduces various Redis scaling solutions—including basic partitioning, master‑slave replication, Sentinel high‑availability, and Redis Cluster—explaining their concepts, typical usage patterns, configuration commands, advantages, and drawbacks to help developers choose the right approach for high‑traffic environments.

Cluster · Partitioning · Replication
0 likes · 12 min read
Laravel Tech Community
Sep 28, 2021 · Operations

Nginx Rewrite Rules, Anti-Hotlinking, Static/Dynamic Separation, and Keepalived High‑Availability Configuration Guide

This article provides a comprehensive step‑by‑step guide on configuring Nginx rewrite rules, implementing anti‑hotlink protection, separating static and dynamic resources, and setting up Keepalived for high‑availability load balancing, complete with example configurations and shell scripts.

Nginx · anti-hotlinking · high-availability
0 likes · 21 min read
Ops Development Stories
Sep 17, 2021 · Operations

Master Keepalived: Build Reliable Linux Load‑Balancing and HA

This guide explains Keepalived’s role in Linux load‑balancing and high‑availability, covering its VRRP‑based architecture, core modules, layered operation, configuration syntax, practical deployment with Nginx, common split‑brain issues, and advanced settings such as nopreempt and multicast conflict resolution.

HA · VRRP · failover
0 likes · 21 min read
Qingyun Technology Community
Sep 16, 2021 · Databases

Why Cloud‑Native Databases Are Redefining Elasticity and Resilience

Cloud‑native databases address the elasticity, resilience, and high‑availability demands of modern cloud computing by separating compute and storage, leveraging log‑based persistence, multi‑replica consensus, and distributed architectures such as Spanner, Aurora, and TiDB, offering higher performance, lower cost, and better resource utilization.

cloud-native · databases · distributed-systems
0 likes · 13 min read
Dada Group Technology
Aug 27, 2021 · Backend Development

Evolution of JD Daojia Product System Architecture: From Simple 1.0 Design to Domain‑Driven 3.0

This article details the step‑by‑step architectural evolution of JD Daojia's product system—from the initial 1.0 monolithic design, through 2.0 high‑availability and performance enhancements, to the 3.0 domain‑driven microservice architecture—highlighting the motivations, technical solutions, and future outlook.

Backend · Domain-Driven Design · Microservices
0 likes · 17 min read
Architect
Jul 7, 2021 · Big Data

Understanding Kafka High Availability and Resolving Consumer Offset Issues

This article explains Kafka's high‑availability architecture, including multi‑replica design, ISR synchronization, leader election, and acks configuration, and shows how misconfigured replication of the __consumer_offsets topic can cause consumer outages, with practical steps to ensure reliable message delivery.

Consumer Offset · Replication · Streaming
0 likes · 8 min read
IT Architects Alliance
Jul 5, 2021 · R&D Management

System Architecture Design Overview and Principles for an Online Education Platform

This article presents a comprehensive architecture design for a rapidly growing online education platform, covering background challenges, high‑availability and scalability goals, core design principles, a multi‑layer solution including application, infrastructure, service topology, unified technology stack, standardization, modular services, micro‑service migration, and database and DevOps strategies.

DevOps · Microservices · Scalability
0 likes · 6 min read
Efficient Ops
Jun 28, 2021 · Backend Development

Why a Single Kafka Broker Failure Stops All Consumers – Understanding HA

This article explains Kafka's high‑availability mechanisms, covering multi‑replica design, ISR synchronization, leader election, the impact of the request.required.acks setting, and how the default __consumer_offsets topic can become a single point of failure, with concrete steps to fix it.

Kafka · Replication · consumer-offset
0 likes · 9 min read
ITFLY8 Architecture Home
May 24, 2021 · Operations

Designing a High‑Availability, High‑Performance, Scalable and Secure Web Architecture

This article walks through the evolution and design patterns of large‑scale web systems, covering initial single‑server setups, separation of application and data, caching strategies, server clustering, read‑write separation, CDN and reverse proxy usage, distributed storage, micro‑service decomposition, and essential considerations for performance, availability, scalability, extensibility and security.

Scalability · architecture · distributed-systems
0 likes · 19 min read
Full-Stack Internet Architecture
Apr 8, 2021 · Backend Development

Kafka Interview Guide: Concepts, Architecture, Configuration, and Performance

This article provides a comprehensive overview of Kafka, covering its role as a distributed messaging middleware, core concepts, architecture components, common interview questions, command‑line tools, producer and consumer configurations, high‑availability mechanisms, delivery semantics, and performance optimizations for backend developers.

Consumer · Distributed Messaging · Kafka
0 likes · 20 min read
dbaplus Community
Mar 23, 2021 · Operations

Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage

A production RocketMQ cluster suffered a 10‑minute message‑send timeout after a broker’s memory failure caused a server reboot. The post analyzes routing registration, the name‑server removal logic, client‑side timeout handling, and the root cause of the name‑server “dead‑lock”, then proposes deployment and code‑level fixes to prevent recurrence.

Messaging · Operations · RocketMQ
0 likes · 9 min read
Yanxuan Tech Team
Mar 5, 2021 · Backend Development

How We Built a High‑Availability Distributed ID Service for Order Management

This article explains why Yanxuan needed a distributed ID system, describes the selection of Leaf's segment mode, details architectural optimizations such as double‑buffering and dynamic step adjustment, shares operational safeguards, and outlines the pitfalls and solutions discovered during implementation.

Leaf · distributed-id · high-availability
0 likes · 13 min read
High Availability Architecture
Jan 19, 2021 · Cloud Native

Key Considerations for Building a Cloud‑Native Architecture

The article outlines the principles and practical considerations of cloud‑native architecture, covering platform‑agnostic design, container and Kubernetes foundations, microservice decomposition, CI/CD pipelines, monitoring, tracing, logging, and fault‑tolerant high‑availability strategies for building resilient distributed systems.

Microservices · ci/cd · cloud-native
0 likes · 13 min read
AntTech
Jan 5, 2021 · Cloud Native

Building Multi‑Active High‑Availability Platforms under Cloud‑Native Architecture – Insights from Ant Group’s SOFAStack

The article presents Ant Group’s SOFAStack experience in designing a cloud‑native, multi‑cluster, high‑availability platform for financial services, covering federation clusters, unified traffic governance with service mesh, unitized hybrid‑cloud evolution, and comprehensive disaster‑recovery mechanisms.

Kubernetes · SOFAStack · cloud-native
0 likes · 14 min read
Full-Stack Internet Architecture
Dec 15, 2020 · Backend Development

Designing High-Availability Caching Solutions in Production Environments

This article explains common causes of cache unavailability such as single‑point failures, cache penetration and avalanche, and provides practical high‑availability strategies—including multi‑node deployment, multi‑datacenter redundancy, consistent hashing, pre‑loading hot keys, local caches, and staggered expiration—to keep production systems resilient.

caching · distributed-systems · high-availability
0 likes · 7 min read
Full-Stack Internet Architecture
Dec 13, 2020 · Backend Development

Essential Backend Development Concepts: Distributed Systems, Caching, Asynchronous Architecture, Load Balancing, Microservices, High Availability, Security, and Big Data

This article provides a comprehensive overview of core backend engineering topics—including distributed architecture, vertical and horizontal scaling, cache strategies, asynchronous messaging, load‑balancing techniques, microservice design, high‑availability patterns, security mechanisms, and big‑data processing frameworks—aimed at helping fresh graduates and junior developers build interview‑ready knowledge.

Distributed · caching · high-availability
0 likes · 33 min read
JD Cloud Developers
Nov 17, 2020 · Databases

How JD Cloud’s JCHDB Powered the 11.11 Shopping Festival’s Massive Data Surge

This article explains how JD Cloud’s JCHDB database handled PB‑level data growth during the 11.11 shopping festival, detailing the high‑availability architecture, performance optimizations, scaling techniques, and the eight‑step preparation process that enabled millions of queries per second and terabit‑level traffic.

cloud · e‑commerce · high-availability
0 likes · 8 min read
dbaplus Community
Sep 29, 2020 · Backend Development

How JD Daojia Scaled Its Order System to Billion‑Scale: Architecture, Evolution, and High‑Availability Practices

This article details JD Daojia's order system architecture, tracing its evolution from a monolithic design to a micro‑service, multi‑cluster setup with Redis, MySQL, and Elasticsearch, and explains the high‑availability, disaster‑recovery, capacity‑planning, and alerting techniques that keep billions of orders running smoothly.

Backend · architecture · high-availability
0 likes · 26 min read
Efficient Ops
Sep 14, 2020 · Cloud Native

How Dada Built a Dual‑Cloud Active‑Active Disaster Recovery Platform

This article details Dada's journey of designing and implementing a dual‑cloud active‑active architecture, covering high‑availability vs. disaster‑recovery concepts, Phase 1 and Phase 2 solutions, challenges faced, multi‑data‑center Consul deployment, bidirectional database replication, precise load‑balancing, capacity elasticity, and future plans.

Consul · cloud-native · database-replication
0 likes · 17 min read
Architecture Digest
Aug 25, 2020 · Operations

Best Practices and Advanced Topics for Prometheus Monitoring in Kubernetes

This article provides a comprehensive guide on using Prometheus for Kubernetes monitoring, covering fundamental principles, exporter selection, Grafana dashboard creation, memory and storage optimization, high‑availability designs, query performance, cardinality management, and integration with alerting and logging systems.

Exporters · Grafana · Kubernetes
0 likes · 33 min read
Tencent Cloud Middleware
Aug 11, 2020 · Cloud Native

How Tencent’s TDMQ Achieves Cloud‑Native, High‑Performance Messaging for Finance

This article explains how Tencent’s cloud‑native message queue TDMQ, built on Apache Pulsar’s storage‑compute separation, meets financial‑grade reliability, strong consistency, horizontal scalability, and cross‑region disaster‑recovery requirements through a quorum‑based consistency model, multi‑protocol support, and read‑only broker design.

Message Queue · Pulsar · cloud-native
0 likes · 28 min read
Xiao Lou's Tech Notes
May 17, 2020 · Databases

How to Build a High‑Availability, High‑Performance Distributed ID Generator

Distributed systems need globally unique, often monotonic IDs, and this article examines common ID generation strategies—Snowflake, database auto‑increment, segment allocation, multi‑master databases, and Raft‑based consensus—evaluating each for high availability and high performance, and highlighting trade‑offs and implementation details.

Raft · database · distributed-id
0 likes · 8 min read
Ops Development Stories
May 14, 2020 · Cloud Native

How to Build a Highly Available Kubernetes 1.18 Cluster with kubeadm, HAProxy, and Keepalived

This step‑by‑step guide shows how to set up a production‑grade Kubernetes 1.18 high‑availability cluster using kubeadm, HAProxy, Keepalived, Calico networking, the Kubernetes dashboard, and metrics‑server, covering node planning, environment preparation, component installation, cluster initialization, HA testing, and post‑deployment verification.

Dashboard · HAProxy · cluster-setup
0 likes · 30 min read
Suning Technology
Dec 25, 2019 · Backend Development

How Suning’s Bargain Group Platform Achieves High Availability and Scalability

This article examines Suning's bargain‑group platform transformation, detailing its strategic shift to a platform model, high‑availability architecture, vertical and horizontal decomposition, data sharding, cache design, dual‑data‑center deployment, and link optimizations for handling massive concurrent traffic.

Scalability · bargain-group · database sharding
0 likes · 19 min read
Architecture Digest
Dec 19, 2019 · Databases

Design and Migration of Zhihu's Read Service: From Bloom Filter to TiDB

This article details Zhihu's read‑service architecture, its massive data scale and performance challenges, early Bloom‑filter and HBase solutions, the design goals of high availability, high performance and scalability, and the subsequent migration from MySQL to TiDB with cloud‑native practices.

TiDB · cloud-native · distributed database
0 likes · 25 min read
Architecture Digest
Nov 7, 2019 · Backend Development

Designing High‑Availability, High‑Performance Backend Architecture for Amap’s Real‑Time Services

This article explains how Amap (Gaode) handles billions of daily requests with sub‑millisecond latency by redesigning its gateway layer, adopting full‑asynchronous pipeline architecture, leveraging reactive frameworks like Vert.x and WebFlux, aggregating APIs, and implementing a unit‑based routing solution that paves the way for distributed sidecar and service‑mesh deployments.

Asynchronous · gateway · high-availability
0 likes · 9 min read
Qunar Tech Salon
Sep 11, 2019 · Backend Development

SIA‑Gateway: A Distributed Microservice Gateway System – Architecture, Features, and High Availability

This article introduces the evolution of software architecture toward microservices, explains the key characteristics of microservice architectures, describes microservice gateway concepts and classifications, and details the design, features, deployment, and high‑availability mechanisms of the SpringCloud‑based SIA‑Gateway solution.

Cloud Native · SpringCloud · gateway
0 likes · 14 min read
21CTO
Jul 17, 2019 · Backend Development

From Single Server to Cloud Native: How Taobao Scaled to Millions of Users

This article traces Taobao’s backend architecture evolution from a single‑server setup to a cloud‑native, micro‑service ecosystem, detailing each scaling stage—separating Tomcat and database, adding caches, load balancers, read/write splitting, sharding, NoSQL, ESB, containers, and finally public‑cloud deployment—while highlighting the associated technologies and design principles.

Backend · Microservices · Scalability
0 likes · 19 min read
Youzan Coder
Mar 27, 2019 · Databases

MySQL Slave Crash-Safe Feature Analysis

The article examines MySQL 5.6’s crash‑safe slave replication, explaining how earlier versions’ unsafe relay‑log handling could corrupt position data, describing the atomic update of mysql.slave_relay_log_info via table‑based relay‑log info and transaction coordination, and covering configuration options, recovery behavior, GTID implications, performance trade‑offs, and implementation guidance.

Binlog · Crash Safe · GTID
0 likes · 9 min read
Tencent Cloud Developer
Mar 12, 2019 · Cloud Native

Understanding Active-Active Disaster Recovery Architecture: Challenges and Implementation Strategies

The article argues that cold backup and active‑passive setups provide false security and outlines how true active‑active disaster‑recovery requires local‑datacenter request handling, business‑driven data sharding, and low‑latency cross‑site synchronization, recommending a staged rollout from city‑level to cross‑region architectures while weighing ROI.

Data Consistency · Network Latency · active-active-architecture
0 likes · 9 min read
21CTO
Oct 17, 2018 · Databases

Scaling Payment Systems: Sharding, Snowflake IDs, and High‑Availability

This article explains how to design a high‑throughput payment system using database sharding, Snowflake‑style globally unique order IDs, eventual consistency via message queues, high‑availability architectures, data tiering, and coarse‑fine traffic control to handle massive request spikes.

Data Tiering · database · eventual consistency
0 likes · 15 min read
ITFLY8 Architecture Home
Oct 16, 2018 · Databases

Scaling Payment Systems: Sharding, Snowflake IDs, and High‑Availability Databases

This article explains how a high‑throughput payment platform uses database sharding by user ID, Snowflake‑style globally unique order IDs, asynchronous replication for eventual consistency, multi‑level data caching, and coarse‑fine traffic pipelines to achieve millions of requests per second with robust high‑availability.

Data Tiering · high-availability · order ID
0 likes · 16 min read
Programmer DD
Jun 7, 2018 · Operations

How to Build a High‑Availability RabbitMQ Cluster with Load Balancing

This guide explains the principles behind RabbitMQ clustering, shows how metadata synchronization works, compares design choices, and provides step‑by‑step instructions—including component installation, node configuration, HAProxy load‑balancing setup, and a sample architecture diagram—to create a reliable, scalable RabbitMQ cluster for production use.

HAProxy · Operations · clustering
0 likes · 16 min read
ITPUB
Jan 30, 2018 · Operations

Eliminating Network Black Holes in Dell Blade Server Deployments

This article explains how misconfigured links in Dell blade server networks can create black‑hole failures, illustrates two fault scenarios, and provides step‑by‑step switch configuration techniques—including link‑dependency groups and uplink‑state groups—to ensure automatic NIC failover and maintain high availability.

Dellblade-servershigh-availability
0 likes · 13 min read
Eliminating Network Black Holes in Dell Blade Server Deployments