Tagged articles

high-availability

128 articles · Page 1 of 2

Jan 2, 2026 · Operations

Avoid 3 Fatal Nginx+Keepalived HA Pitfalls That 90% of Ops Engineers Miss

This article reveals three hidden traps in Nginx‑Keepalived high‑availability setups—network‑partition split‑brain, inadequate health‑check scripts, and unsafe configuration‑sync timing—explains real incidents caused by each, and provides concrete configuration changes, Bash scripts, and automation tips to prevent service outages.

Automation · Keepalived · NGINX

0 likes · 16 min read

Ray's Galactic Tech

Dec 20, 2025 · Operations

How RocketMQ Achieves High‑Availability Storage and Fast Fault Recovery

RocketMQ ensures durable, consistent, and highly available message storage through fixed‑length append‑only files, efficient index rebuilding, checkpoint tracking, and configurable master‑slave replication, offering both synchronous and asynchronous HA modes, detailed recovery steps, performance trade‑offs, and practical operational guidelines for robust fault tolerance.

Operations · RocketMQ · fault-recovery

0 likes · 10 min read

NiuNiu MaTe

Dec 17, 2025 · Backend Development

Master Redis Distributed Locks: Prevent Race Conditions, Zombie Locks, and Expiration Issues

This guide explains how Redis implements distributed locks, outlines common pitfalls such as lock contention, zombie locks, and mismatched expiration times, and provides step‑by‑step solutions, including single‑node SET commands, the Redlock algorithm for high availability, Lua‑based safe release, and best‑practice recommendations for real‑world deployments.

Distributed Lock · Redis · Redlock

0 likes · 15 min read

Qunar Tech Salon

Dec 4, 2025 · Backend Development

Why a Real‑Time/Offline Price Cache Is Critical for High‑Traffic Hotel Booking

The article explains why hotel booking platforms must implement a price‑cache layer, detailing performance bottlenecks, traffic spikes, and data freshness challenges, and describes a split real‑time and offline architecture with dual‑update strategies, cache‑freshness logic, and high‑availability mechanisms to ensure fast, reliable pricing.

Caching · high-availability · hotel

0 likes · 14 min read

Architect's Guide

Aug 26, 2025 · Backend Development

Mastering Microservices: From Architecture Basics to Spring Cloud & Dubbo

This comprehensive guide explains microservice fundamentals, RPC frameworks, serialization, distributed transaction and consistency concepts (ACID, CAP, BASE, TCC), system monitoring, high‑availability strategies, load balancing, configuration management, service registration/discovery, Spring Cloud components, and Dubbo fault‑tolerance clusters, and compares Spring Boot with Spring MVC, providing practical code examples and diagrams.

Dubbo · Microservices · configuration management

0 likes · 40 min read

MaGe Linux Operations

Aug 19, 2025 · Big Data

Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.

Kafka · Operations · disaster-recovery

0 likes · 16 min read

DevOps Operations Practice

Aug 11, 2025 · Operations

Zen Master’s Secrets to the Ultimate State of Operations

Through a series of dialogues with a Zen master, the article humorously explores the highest level of operations: automation that runs itself, balanced alerting, cloud migration, reliable backups, high availability, stability through chaos engineering, and the ultimate goal of systems that operate without human intervention.

Automation · Cloud · Operations

0 likes · 5 min read

MaGe Linux Operations

May 11, 2025 · Cloud Native

How to Build a High‑Performance, Highly‑Available Cloud‑Native Ingress Gateway

When an Ingress gateway faces traffic exceeding 100,000 QPS, this guide outlines systematic performance optimizations, configuration tweaks, distributed architecture designs, traffic management, monitoring, and disaster‑recovery strategies—including hardware scaling, kernel tuning, DPDK, rate limiting, horizontal scaling, service mesh integration, and CDN offloading—to achieve high concurrency and high availability.

cloud-native · high-availability · monitoring

0 likes · 8 min read

MaGe Linux Operations

Feb 18, 2025 · Operations

Mastering Keepalived: Build High‑Availability LVS Clusters with VRRP

This guide explains Keepalived’s role in providing VRRP‑based high availability for LVS, covering its architecture, VRRP operation, installation, configuration of master and backup nodes, health checks, and practical testing steps to achieve seamless failover in a Linux environment.

Keepalived · LVS · Linux

0 likes · 17 min read

Alibaba Cloud Native

Dec 17, 2024 · Cloud Native

Achieving Full Cloud‑Native Migration: Hangzhou MingShitang’s Journey to 100% SLA

This case study details how Hangzhou MingShitang migrated its entire online‑education platform from self‑hosted IDC infrastructure to Alibaba Cloud, redesigning registration, configuration, micro‑service governance, safe release and gateway layers with MSE, Sentinel and cloud‑native technologies to attain 100% SLA, dramatically cut costs and boost performance.

Alibaba Cloud · cloud-native · high-availability

0 likes · 19 min read

High Availability Architecture

Sep 4, 2024 · Backend Development

Three‑High System Construction: Performance, Concurrency, and Availability – A Backend Engineering Methodology

This article presents a comprehensive backend engineering methodology for building "three‑high" systems that simultaneously achieve high performance, high concurrency, and high availability, covering performance tuning, horizontal and vertical scaling, hot‑key mitigation, fault‑tolerance mechanisms, isolation strategies, and practical DDD‑driven design.

DDD · backend · high-availability

0 likes · 26 min read

dbaplus Community

Aug 15, 2024 · Backend Development

How a Kafka‑Proxy Boosts Cluster Scalability and Resilience

This article explains the challenges of large‑scale Kafka clusters and introduces a lightweight Kafka‑Proxy layer that provides seamless cluster switching, traffic monitoring, online offset reset, and flow‑control mechanisms, ultimately improving availability, throughput, and operational efficiency.

Offset Reset · Traffic Management · backend

0 likes · 17 min read

Architecture and Beyond

Jul 28, 2024 · Frontend Development

Comprehensive Guide to Front‑End Stability: Observability, Full‑Chain Monitoring, High‑Availability Architecture, Performance Management, Risk Governance, Process Mechanisms, and Engineering Practices

This extensive article presents a systematic approach to front‑end stability, covering observability systems, full‑chain monitoring, high‑availability design, performance management, risk governance, process mechanisms, and engineering practices to ensure reliable user experiences and business continuity.

Observability · Performance · Stability

0 likes · 44 min read

JD Tech Talk

Jul 25, 2024 · Backend Development

Design and Architecture of JD.com’s Buffalo Distributed DAG Scheduling System

The article details the design, core technical solutions, high‑availability architecture, performance optimizations, and open capabilities of Buffalo, JD.com’s distributed DAG‑based job scheduling platform that supports massive task volumes, complex dependencies, and flexible resource management.

DAG · Distributed Scheduling · Operations

0 likes · 13 min read

Top Architect

Jul 12, 2024 · Backend Development

Traffic Governance and High Availability in Backend Systems: Circuit Breakers, Isolation, Retries, Timeouts, and Rate Limiting

This article explains how high‑availability backend systems use traffic governance techniques—including circuit breakers, various isolation strategies, retry and timeout policies, degradation mechanisms, and rate‑limiting—to maintain balanced data flow, prevent cascading failures, and ensure performance, scalability, and reliability.

backend · circuit-breaker · high-availability

0 likes · 30 min read

IT Architects Alliance

Jun 25, 2024 · Operations

Traffic Governance and High‑Availability Strategies for Microservice Systems

The article explains how traffic governance—including circuit breaking, isolation, retries, degradation, timeout handling, and rate limiting—maintains the three‑high goals of high performance, high availability, and easy scalability in microservice architectures, using practical examples and formulas.

circuit-breaker · degradation · high-availability

0 likes · 29 min read

Full-Stack Internet Architecture

May 24, 2024 · Databases

Redis Deployment Modes: Single Instance, Master‑Slave Replication, Sentinel, and Cluster

This article reviews the four common Redis deployment modes—single‑instance, master‑slave replication, Sentinel, and cluster—explaining their architectures, advantages, drawbacks, and suitable application scenarios, and provides a comparative table to help readers choose the appropriate setup.

Deployment · Sentinel · high-availability

0 likes · 7 min read

Tencent Cloud Developer

May 14, 2024 · Backend Development

Product Middle Platform Workflow Orchestration Engine: Use Cases, Architecture, and High‑Availability Solutions

Tencent’s product middle platform employs a self‑built, stateless workflow orchestration engine—configurable via drag‑and‑drop or DSL—to coordinate massive product processing and audit tasks, using load‑balancing, retry, rate‑limiting, circuit‑breaker and service isolation strategies that ensure high availability, performance, and horizontal scalability on TKE.

Microservices · Orchestration · backend

0 likes · 14 min read

JD Retail Technology

May 10, 2024 · Operations

High Availability and the Dispersal Principle: Concepts, Practices, and Benefits

This article explains the concept of high availability, introduces the dispersal principle, demonstrates its application in microservice architectures and distributed storage, and outlines the benefits such as improved reliability, scalability, fault tolerance, and reduced single‑point failures.

Microservices · distributed-systems · fault-tolerance

0 likes · 10 min read

Tongcheng Travel Technology Center

May 10, 2024 · Backend Development

Design and Implementation of a Kafka‑Proxy for High Availability and Traffic Governance

This article presents a Kafka‑Proxy solution that improves cluster availability and traffic governance, supporting seamless cluster switching, near‑source production and consumption, non‑disruptive offset resetting, and message flow control through metadata sharing and a lightweight proxy layer.

Traffic Management · backend · distributed-systems

0 likes · 16 min read

Tech Architecture Stories

Jan 25, 2024 · Operations

Why 2023 Saw a Spike in Cloud Outages: Key Lessons for High‑Availability

This article reviews the numerous high‑profile cloud service failures of 2023, from Alibaba’s Hong Kong data‑center cooling issue to Tencent’s storage outage, shows how cost‑cutting, reduced staffing, and insufficient disaster‑recovery planning amplify risk, and outlines essential high‑availability, failover, and multi‑region strategies for resilient operations.

cloud outage · disaster-recovery · high-availability

0 likes · 19 min read

dbaplus Community

Jan 14, 2024 · Operations

How Bilibili Achieves 99.99% Availability for Live Gift Systems

This article explains Bilibili's technical strategies—preloading, circuit breaking, sharding, multi‑active deployment, and Kubernetes auto‑scaling—that ensure the live‑gift panel, feeding flow, and supporting services maintain 99.99% uptime even during massive traffic spikes.

Microservices · circuit-breaker · high-availability

0 likes · 14 min read

dbaplus Community

Jan 2, 2024 · Operations

How Xiaohongshu Scaled Its Metrics System Tenfold with Cloud‑Native Architecture

Facing exploding metric volumes, high resource consumption, and fragile operations, Xiaohongshu's observability team completely rebuilt its metrics pipeline using VictoriaMetrics, achieving ten‑fold performance gains, minute‑level scaling, high availability, cost reduction, and robust multi‑cloud active‑active deployment while preserving data safety and query speed.

Observability · Prometheus · cloud-native

0 likes · 34 min read

Xiaohongshu Tech REDtech

Dec 14, 2023 · Cloud Native

Evolution of Xiaohongshu Metrics System: Cloud‑Native Observability, High Availability, and Performance Optimizations

Xiaohongshu’s observability team rebuilt its Prometheus‑based metrics platform using vmagent, dual‑active HA clusters, query push‑down, high‑cardinality governance and multi‑cloud active‑active design, delivering ten‑fold collection speed, up to 70× query capacity, massive CPU‑memory‑storage savings and fully automated scaling.

VictoriaMetrics · cloud-native · high-availability

0 likes · 35 min read

Baidu Intelligent Cloud Tech Hub

Nov 27, 2023 · Databases

How GaiaDB Redefines Cloud‑Native Databases with Fusion Architecture

GaiaDB, Baidu’s cloud‑native database, combines compute‑storage separation with a fused, log‑service architecture to boost performance, simplify consistency, and deliver multi‑level high availability across zones and regions, while supporting new features such as parallel query, HTAP replicas, and serverless scaling.

Performance · cloud-native · distributed-systems

0 likes · 17 min read

Tencent Cloud Middleware

Sep 14, 2023 · Backend Development

How Tencent Cloud’s Unitized Architecture Boosts Microservice Scalability and High Availability

This article explains the concept of unitized architecture, its characteristics and types, the performance and reliability challenges it solves for large‑scale microservice systems, and how Tencent Cloud’s TSF platform implements unit routing, gray release, and disaster‑recovery to achieve efficient, cross‑region, high‑availability deployments.

Tencent Cloud · cloud-native · high-availability

0 likes · 15 min read

Top Architect

Aug 8, 2023 · Backend Development

High‑Availability Architecture and Optimization Strategies for a Large‑Scale Membership System

This article describes the design, high‑availability solutions, traffic isolation, deep performance optimizations, caching strategies, dual‑center MySQL partitioning, seamless migration, and future fine‑grained flow‑control and degradation techniques employed to keep a billion‑user membership system stable and performant under extreme load.

backend · high-availability · scalability

0 likes · 20 min read

dbaplus Community

Jul 8, 2023 · Operations

How QQ Music Achieves High Availability: Architecture, Tools, and Observability

This article explains how QQ Music embraces inevitable faults by building a high‑availability architecture that combines redundant infrastructure, automated failover, stability strategies, a robust toolchain for chaos engineering and full‑link load testing, and comprehensive observability to ensure graceful fault handling at scale.

Microservices · Observability · chaos-engineering

0 likes · 27 min read

Architecture & Thinking

Jun 2, 2023 · Databases

Mastering Database High Availability: From Basic Replication to Seamless Scaling

This article examines the evolution of database high‑availability architectures in large‑scale internet environments, covering basic direct‑connect setups, scale‑up/scale‑out sharding, master‑slave/master‑master with Keepalived, and advanced solutions such as MHA, Percona XtraDB Cluster, and MySQL Group Replication, plus smooth scaling steps.

Keepalived · Master‑Slave · MySQL

0 likes · 7 min read

Top Architect

May 5, 2023 · Backend Development

Using Redis Sentinel for High Availability: Design and Implementation

This article introduces Redis Sentinel as the official high‑availability solution for Redis, explains its core functions, provides configuration examples, compares three ways to receive failover notifications (script, client subscription, and indirect service), and offers design recommendations for robust production deployments.

Redis · Sentinel · devops

0 likes · 10 min read

High Availability Architecture

Apr 19, 2023 · Backend Development

Designing High‑Availability Services: Architecture Boundaries, Protocols, and Push Systems

This article explains how Tencent’s internal high‑availability service curriculum emphasizes architecture boundaries, unified protocol definitions using JCE, a unified PushAPI, monitoring and feedback mechanisms, and the organizational impact of aligning system and team boundaries to achieve scalable, reliable backend services.

backend · distributed-systems · high-availability

0 likes · 14 min read

dbaplus Community

Feb 26, 2023 · Backend Development

Why Does HikariCP Hang After MySQL Failover? A Deep Dive into Connection Pool Blocking

This article recounts a real‑world investigation of a MySQL high‑availability outage where HikariCP connections appeared to deadlock, detailing the architecture, observed symptoms, code analysis, thread‑dump findings, and the final fix of adding socket timeout parameters to the JDBC URL.

HikariCP · Keepalived · MySQL

0 likes · 22 min read

vivo Internet Technology

Feb 8, 2023 · Operations

Design and Implementation of Vivo Jenkins Scheduler for High Availability and Resource Management

The paper presents Vivo’s Jenkins Scheduler, a master‑centric, high‑availability solution that replaces single‑master Jenkins by integrating an API gateway, event‑driven failure detection, label‑based multi‑dimensional scheduling, Redis/MySQL‑backed flow control, and callback monitoring, thereby balancing resources, enabling rapid failover, persisting queues, and improving build reliability, with plans to containerize Jenkins for Kubernetes workflows.

CI/CD · Jenkins · Resource Management

0 likes · 10 min read

Top Architect

Dec 16, 2022 · Databases

Comprehensive Guide to Database Horizontal Scaling, Sharding, and High Availability with MariaDB and Keepalived

This article presents a detailed analysis and step‑by‑step implementation of horizontal database scaling, including sharding strategies, shutdown and stop‑write plans, log‑based migration, dual‑write approaches, and a smooth 2N expansion method, while also covering MariaDB master‑master configuration, dynamic data source addition, and Keepalived high‑availability setup.

MariaDB · high-availability · scaling

0 likes · 37 min read

ITPUB

Nov 27, 2022 · Operations

Designing a Scalable, High‑Availability Monitoring System with Prometheus and Thanos

This article explores the challenges of building a fault‑tolerant monitoring platform, compares open‑source solutions, details why Prometheus is preferred, and shows how to achieve high availability and horizontal scaling using Thanos, remote‑write, hash‑ring sharding, and Kubernetes integration.

Thanos · cloud-native · high-availability

0 likes · 18 min read

Huolala Tech

Sep 22, 2022 · Operations

How HuoLala Engineered a Scalable, High‑Availability Monitoring System for Multi‑Cloud

This article details the evolution of monitoring technologies, HuoLala's three‑phase monitoring architecture, the integration of Prometheus, VictoriaMetrics and SkyWalking, zero‑intrusion bytecode instrumentation, full‑link trace sampling, visual dashboards, metric‑trace‑log correlation, and future plans for root‑cause analysis and intelligent alerting.

Cloud · Operations · Tracing

0 likes · 24 min read

Top Architect

Aug 26, 2022 · Operations

Comprehensive Guide to Nginx Rewrite, Anti‑Hotlinking, Static/Dynamic Separation, and Keepalived High‑Availability Configuration

This article provides a step‑by‑step tutorial on configuring Nginx rewrite rules, implementing anti‑hotlinking protection, separating static and dynamic resources, and building a high‑availability architecture using Keepalived with detailed code examples and deployment instructions.

Configuration · Keepalived · NGINX

0 likes · 23 min read

High Availability Architecture

Aug 10, 2022 · Backend Development

Design and Migration of a High‑Performance Message Middleware Platform from RabbitMQ to RocketMQ

To address RabbitMQ’s scalability, reliability, and feature limitations, Vivo’s middleware team evaluated RocketMQ and Pulsar, selected RocketMQ, and built a next‑generation message middleware platform with an AMQP‑proxy gateway, metadata services, and high‑availability mechanisms, enabling seamless, high‑throughput migration and richer messaging capabilities.

Middleware · RabbitMQ · RocketMQ

0 likes · 13 min read

ITPUB

Aug 5, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

On July 13, 2021, Bilibili’s OpenResty‑based SLB suffered an outage with CPU usage at 100%, caused by a bug in a Lua _gcd function that was triggered when a service’s weight was set to the string “0”; the multi‑hour incident was resolved by rebuilding SLB clusters and disabling JIT compilation.

Load Balancer · OpenResty · SRE

0 likes · 17 min read

dbaplus Community

Jul 25, 2022 · Operations

How to Build an Enterprise‑Grade Monitoring & Alerting Platform with Prometheus, Grafana, and AlertManager

This article describes the design of a unified, enterprise‑level monitoring and alerting platform: it analyzes the shortcomings of standard Prometheus‑Grafana‑AlertManager setups, outlines platform‑vs‑business responsibilities, and details the architecture, user‑scenario requirements, component selection, high‑availability strategies, and deployment models needed to achieve scalable, easy‑to‑use observability.

Operations · high-availability · monitoring

0 likes · 12 min read

360 Smart Cloud

Jul 14, 2022 · Cloud Computing

Auto Scaling (AS) in Cloud Services: Architecture, Use Cases, and Optimization Strategies

This article explains the concept of elastic auto scaling in cloud services, describes typical scenarios such as high‑elastic web apps and compute‑intensive workloads, details the four‑layer architecture and workflow, and outlines functional features, stability improvements, and future optimization directions.

Auto Scaling · cloud-computing · elasticity

0 likes · 12 min read

Wukong Talks Architecture

Jun 29, 2022 · Operations

Understanding Keepalived: High‑Availability, VRRP Election, and Load‑Balancing Mechanisms

This article explains the principles and configuration of Keepalived, covering its role in providing high‑availability virtual IPs, the VRRP election process, traffic forwarding, load‑balancing algorithms, and practical configuration examples with vrrp_instance and vrrp_script directives.

IPVS · Operations · VRRP

0 likes · 16 min read

Practical DevOps Architecture

Jun 28, 2022 · Operations

Understanding Redis Sentinel: High‑Availability Mechanism and Failover Process

The article explains how Redis Sentinel provides high availability by monitoring master‑slave instances, detecting failures through periodic pings, distinguishing subjective and objective down states, performing quorum arbitration, and automatically promoting a slave to master to ensure continuous service.

Master‑Slave · Operations · failover

0 likes · 4 min read

Wukong Talks Architecture

Jun 14, 2022 · Databases

MySQL High‑Availability Incident Review and Recovery Steps

The article recounts an incident in a production‑like MySQL dual‑master HA setup using Keepalived, describes how a missing binary‑log index caused replication failure, and details step‑by‑step troubleshooting commands, configuration fixes, and preventive measures to restore reliable database synchronization.

Databases · Keepalived · MySQL

0 likes · 9 min read

Wukong Talks Architecture

Jun 8, 2022 · Databases

Deploying Redis Master‑Slave Architecture and Sentinel Cluster for High Availability

This guide walks through upgrading a single‑node Redis deployment to a high‑availability setup by building a master‑slave cluster, configuring Sentinel services, testing replication and failover, and enabling client auto‑detection of master changes, all using Docker containers and configuration files.

Docker · Redis · failover

0 likes · 15 min read

Dada Group Technology

Jun 6, 2022 · Backend Development

Evolution of JD Daojia Search System Architecture from Version 1.0 to 3.0

The article details the progressive architectural evolution of JD Daojia's search system—starting from a simple, single‑layer ES‑based 1.0 design, through the 2.0 overhaul that introduced full‑recall, independent ranking services, and index disaster‑recovery, to the 3.0 version that adds multi‑path recall, sophisticated ranking models, and automated routing for high availability.

Elasticsearch · Ranking · Search

0 likes · 20 min read

Top Architect

Jun 2, 2022 · Cloud Native

A Beginner's Guide to Designing, Implementing, and Deploying Microservices on Kubernetes

This article walks readers through the complete lifecycle of a microservice system—from architectural design and Java Spring Boot implementation to Kubernetes deployment, high‑availability setup, monitoring with Prometheus/Grafana, tracing with Zipkin, and flow‑control with Sentinel—providing practical code snippets and step‑by‑step instructions.

Kubernetes · Microservices · Tracing

0 likes · 21 min read

Cloud Native Technology Community

May 31, 2022 · Cloud Native

Building High‑Performance, High‑Availability Container Networks for Banking in a Two‑Site‑Three‑Center Architecture

This article explains the challenges of container networking in banks, especially under a two‑site‑three‑center architecture, and provides practical guidance on using underlay and overlay approaches, Kube‑OVN solutions, and best‑practice recommendations to achieve high‑availability, high‑concurrency, and high‑performance cloud‑native networks.

Kube-OVN · Overlay · Underlay

0 likes · 9 min read

Wukong Talks Architecture

May 17, 2022 · Databases

Implementing MySQL Master‑Master High Availability with Keepalived: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on building MySQL master‑master high availability using Keepalived, covering architecture design, Docker‑based MySQL deployment, replication configuration, Keepalived installation, virtual IP setup, failover testing, and a detailed list of encountered pitfalls and their solutions.

Docker · high-availability · master-master

0 likes · 22 min read

IT Architects Alliance

Apr 27, 2022 · Operations

High‑Availability Architecture for a Billion‑Scale Membership System: ES Dual‑Center, Redis Caching, MySQL Migration, and Flow‑Control Strategies

This article details how a membership system serving billions of users achieves high performance and high availability through a dual‑center Elasticsearch cluster, traffic‑isolated ES clusters, Redis cache with distributed locks, MySQL dual‑center partitioning, and fine‑grained flow‑control and degradation mechanisms, all while ensuring zero‑downtime migrations and consistent data.

Flow Control · distributed-systems · high-availability

0 likes · 20 min read

Top Architect

Apr 7, 2022 · Backend Development

High‑Availability Architecture and Migration Strategies for a Large‑Scale Membership System

This article details the design and implementation of a highly available membership platform, covering Elasticsearch dual‑center primary‑backup clusters, traffic‑isolation architectures, deep ES optimizations, Redis caching and dual‑center clusters, MySQL partitioned clusters, seamless SQL Server‑to‑MySQL migration, abnormal member governance, and refined flow‑control and degradation strategies.

backend · high-availability · migration

0 likes · 20 min read

DataFunSummit

Mar 8, 2022 · Cloud Native

Design and Implementation of Cloud‑Native High‑Availability Solutions for Data Components at eBay

eBay’s data infrastructure engineers describe how they design and implement cloud‑native, multi‑cluster high‑availability architectures for stateful data components—covering background challenges, federated Kubernetes management, state handling, fault‑tolerance, backup, and chaos testing—to ensure reliable, scalable data services across global data centers.

Multi-Cluster · cloud-native · data-components

0 likes · 16 min read

ITPUB

Jan 26, 2022 · Backend Development

How to Choose the Right Distributed Unique ID Strategy for Your System

This article explains why globally unique identifiers are essential in distributed systems, outlines the key characteristics of a good ID scheme, and compares several generation methods—including UUID, database auto‑increment, segmented DB ranges, Redis INCR, Zookeeper, Meituan Leaf, Snowflake, and Baidu uid‑generator—highlighting their advantages, drawbacks, and practical implementation details.

Database · Distributed ID · Snowflake

0 likes · 18 min read

Cloud Native Technology Community

Jan 7, 2022 · Cloud Native

Designing High‑Availability, High‑Performance Cloud‑Native Container Networks for Banking

This article examines the challenges and solutions for building high‑availability, high‑concurrency, and high‑performance cloud‑native container networks in banks, covering two‑site three‑center architectures, underlay/overlay strategies, Kube‑OVN implementation, and practical recommendations for secure, scalable networking.

Container Network · Kube-OVN · Overlay

0 likes · 10 min read

IT Architects Alliance

Oct 25, 2021 · Databases

Designing a High‑Availability Redis Service with Sentinel

This article explains how to build a highly available Redis service using Redis Sentinel, discusses common failure scenarios, compares several architectural options from a single instance to a three‑node Sentinel setup, and provides practical tips such as using virtual IPs for seamless client access.

Database · Sentinel · architecture

0 likes · 11 min read

Laravel Tech Community

Oct 19, 2021 · Backend Development

Redis Scaling Strategies: Partitioning, Master‑Slave Replication, Sentinel, and Cluster

This article introduces various Redis scaling solutions—including basic partitioning, master‑slave replication, Sentinel high‑availability, and Redis Cluster—explaining their concepts, typical usage patterns, configuration commands, advantages, and drawbacks to help developers choose the right approach for high‑traffic environments.

Redis · cluster · high-availability

0 likes · 12 min read

Laravel Tech Community

Sep 28, 2021 · Operations

Nginx Rewrite Rules, Anti-Hotlinking, Static/Dynamic Separation, and Keepalived High‑Availability Configuration Guide

This article provides a comprehensive step‑by‑step guide on configuring Nginx rewrite rules, implementing anti‑hotlink protection, separating static and dynamic resources, and setting up Keepalived for high‑availability load balancing, complete with example configurations and shell scripts.

Keepalived · NGINX · anti-hotlinking

0 likes · 21 min read

Ops Development Stories

Sep 17, 2021 · Operations

Master Keepalived: Build Reliable Linux Load‑Balancing and HA

This guide explains Keepalived’s role in Linux load‑balancing and high‑availability, covering its VRRP‑based architecture, core modules, layered operation, configuration syntax, practical deployment with Nginx, common split‑brain issues, and advanced settings such as nopreempt and multicast conflict resolution.

HA · Keepalived · VRRP

0 likes · 21 min read

Qingyun Technology Community

Sep 16, 2021 · Databases

Why Cloud‑Native Databases Are Redefining Elasticity and Resilience

Cloud‑native databases address the elasticity, resilience, and high‑availability demands of modern cloud computing by separating compute and storage, leveraging log‑based persistence, multi‑replica consensus, and distributed architectures such as Spanner, Aurora, and TiDB, offering higher performance, lower cost, and better resource utilization.

Databases · cloud-native · distributed-systems

0 likes · 13 min read

Dada Group Technology

Aug 27, 2021 · Backend Development

Evolution of JD Daojia Product System Architecture: From Simple 1.0 Design to Domain‑Driven 3.0

This article details the step‑by‑step architectural evolution of JD Daojia's product system—from the initial 1.0 monolithic design, through 2.0 high‑availability and performance enhancements, to the 3.0 domain‑driven microservice architecture—highlighting the motivations, technical solutions, and future outlook.

Caching · Domain-Driven Design · Microservices

0 likes · 17 min read

Architect

Jul 7, 2021 · Big Data

Understanding Kafka High Availability and Resolving Consumer Offset Issues

This article explains Kafka's high‑availability architecture, including multi‑replica design, ISR synchronization, leader election, acks configuration, and how misconfigured replication of the __consumer_offsets topic can cause consumer outages, offering practical steps to ensure reliable message delivery.

Consumer offset · Streaming · distributed-systems

0 likes · 8 min read

IT Architects Alliance

Jul 5, 2021 · R&D Management

System Architecture Design Overview and Principles for an Online Education Platform

This article presents a comprehensive architecture design for a rapidly growing online education platform, covering background challenges, high‑availability and scalability goals, core design principles, a multi‑layer solution including application, infrastructure, service topology, unified technology stack, standardization, modular services, micro‑service migration, and database and DevOps strategies.

Microservices · architecture · cloud-native

0 likes · 6 min read

Efficient Ops

Jun 28, 2021 · Backend Development

Why a Single Kafka Broker Failure Stops All Consumers – Understanding HA

This article explains Kafka's high‑availability mechanisms, covering multi‑replica design, ISR synchronization, leader election, the impact of the request.required.acks setting, and how the default __consumer_offsets topic can become a single point of failure, with concrete steps to fix it.

Kafka · Leader Election · consumer-offset

0 likes · 9 min read

ITFLY8 Architecture Home

May 24, 2021 · Operations

Designing a High‑Availability, High‑Performance, Scalable and Secure Web Architecture

This article walks through the evolution and design patterns of large‑scale web systems, covering initial single‑server setups, separation of application and data, caching strategies, server clustering, read‑write separation, CDN and reverse proxy usage, distributed storage, micro‑service decomposition, and essential considerations for performance, availability, scalability, extensibility and security.

architecture · distributed-systems · high-availability

0 likes · 19 min read

Full-Stack Internet Architecture

May 21, 2021 · Operations

Resolving Firewall and VIP Migration Issues in Keepalived on CentOS 7

This guide explains how to troubleshoot firewalld command errors, enable VRRP multicast traffic, and fix VIP migration problems in a Keepalived high‑availability setup on CentOS 7, providing step‑by‑step commands, configuration file adjustments, and verification procedures.

Linux · Network · centos7

0 likes · 6 min read

Programmer DD

May 12, 2021 · Backend Development

How Zhihu Scales Its Read Service: Architecture, Performance, and TiDB Migration

This article explains how Zhihu built a highly available, high‑performance, and easily extensible read‑service for its homepage, detailing the system architecture, caching strategies, query performance requirements, and the migration from MySQL to TiDB with TiDB 3.0 enhancements.

Caching · Performance · TiDB

0 likes · 20 min read

Full-Stack Internet Architecture

Apr 8, 2021 · Backend Development

Kafka Interview Guide: Concepts, Architecture, Configuration, and Performance

This article provides a comprehensive overview of Kafka, covering its role as a distributed messaging middleware, core concepts, architecture components, common interview questions, command‑line tools, producer and consumer configurations, high‑availability mechanisms, delivery semantics, and performance optimizations for backend developers.

Distributed Messaging · Kafka · backend-development

0 likes · 20 min read

dbaplus Community

Mar 23, 2021 · Operations

Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage

A production RocketMQ cluster suffered a 10‑minute message‑send timeout after a broker’s memory failure caused a server reboot, and the post analyzes the routing registration, name‑server removal logic, client‑side timeout handling, root cause of name‑server “dead‑lock”, and proposes deployment and code‑level fixes to prevent recurrence.

Operations · RocketMQ · best-practices

0 likes · 9 min read

Yanxuan Tech Team

Mar 5, 2021 · Backend Development

How We Built a High‑Availability Distributed ID Service for Order Management

This article explains why Yanxuan needed a distributed ID system, describes the selection of Leaf's segment mode, details architectural optimizations such as double‑buffering and dynamic step adjustment, shares operational safeguards, and outlines the pitfalls and solutions discovered during implementation.

Distributed ID · Leaf · high-availability

0 likes · 13 min read

Zhuanzhuan Tech

Feb 26, 2021 · Backend Development

Design and Architecture of a High‑Availability Instant Messaging System

This article explains the overall architecture, data structures, login and messaging flows, common real‑time, reliability and consistency challenges, and high‑availability and high‑concurrency techniques used in a production instant‑messaging service.

Data Flow · IM · architecture

0 likes · 13 min read

High Availability Architecture

Jan 19, 2021 · Cloud Native

Key Considerations for Building a Cloud‑Native Architecture

The article outlines the principles and practical considerations of cloud‑native architecture, covering platform‑agnostic design, container and Kubernetes foundations, microservice decomposition, CI/CD pipelines, monitoring, tracing, logging, and fault‑tolerant high‑availability strategies for building resilient distributed systems.

CI/CD · Microservices · Observability

0 likes · 13 min read

21CTO

Jan 8, 2021 · Backend Development

Mastering RabbitMQ: AMQP Fundamentals, Reliability Guarantees, and Cluster Architectures

This article provides a comprehensive guide to RabbitMQ, covering core AMQP concepts, exchange types, message reliability mechanisms, dead‑letter handling, QoS settings, and various high‑availability cluster modes such as mirrored, federated, and multi‑active deployments.

AMQP · Clustering · Message Queue

0 likes · 15 min read

AntTech

Jan 5, 2021 · Cloud Native

Building Multi‑Active High‑Availability Platforms under Cloud‑Native Architecture – Insights from Ant Group’s SOFAStack

The article presents Ant Group’s SOFAStack experience in designing a cloud‑native, multi‑cluster, high‑availability platform for financial services, covering federation clusters, unified traffic governance with service mesh, unitized hybrid‑cloud evolution, and comprehensive disaster‑recovery mechanisms.

Kubernetes · SOFAStack · cloud-native

0 likes · 14 min read

Full-Stack Internet Architecture

Dec 15, 2020 · Backend Development

Designing High-Availability Caching Solutions in Production Environments

This article explains common causes of cache unavailability such as single‑point failures, cache penetration and avalanche, and provides practical high‑availability strategies—including multi‑node deployment, multi‑datacenter redundancy, consistent hashing, pre‑loading hot keys, local caches, and staggered expiration—to keep production systems resilient.

Caching · distributed-systems · high-availability

0 likes · 7 min read

Full-Stack Internet Architecture

Dec 13, 2020 · Backend Development

Essential Backend Development Concepts: Distributed Systems, Caching, Asynchronous Architecture, Load Balancing, Microservices, High Availability, Security, and Big Data

This article provides a comprehensive overview of core backend engineering topics—including distributed architecture, vertical and horizontal scaling, cache strategies, asynchronous messaging, load‑balancing techniques, microservice design, high‑availability patterns, security mechanisms, and big‑data processing frameworks—aimed at helping fresh graduates and junior developers build interview‑ready knowledge.

Caching · Distributed · high-availability

0 likes · 33 min read

Practical DevOps Architecture

Dec 4, 2020 · Operations

Step-by-Step Guide to Building a Keepalived + Nginx High‑Availability Setup

This tutorial walks through preparing two servers, configuring Keepalived and Nginx on both master and backup nodes, restarting services, and testing failover to demonstrate a functional high‑availability architecture using VRRP virtual IPs.

Keepalived · NGINX · VRRP

0 likes · 4 min read

JD Cloud Developers

Nov 17, 2020 · Databases

How JD Cloud’s JCHDB Powered the 11.11 Shopping Festival’s Massive Data Surge

This article explains how JD Cloud’s JCHDB database handled PB‑level data growth during the 11.11 shopping festival, detailing the high‑availability architecture, performance optimizations, scaling techniques, and the eight‑step preparation process that enabled millions of queries per second and terabit‑level traffic.

Cloud · e-commerce · high-availability

0 likes · 8 min read

MaGe Linux Operations

Oct 27, 2020 · Operations

Build a Highly Available RabbitMQ Cluster with Docker, HAProxy, and Keepalived

This guide walks through creating a resilient RabbitMQ cluster using two disk nodes and one RAM node, Docker and docker‑compose for deployment, HAProxy for load balancing with VIP failover, and Keepalived for master‑backup high availability, including configuration, scripts, and testing steps.

HA · HAProxy · Keepalived

0 likes · 17 min read

dbaplus Community

Sep 29, 2020 · Backend Development

How JD Daojia Scaled Its Order System to Billion‑Scale: Architecture, Evolution, and High‑Availability Practices

This article details JD Daojia's order system architecture, tracing its evolution from a monolithic design to a micro‑service, multi‑cluster setup with Redis, MySQL, and Elasticsearch, and explains the high‑availability, disaster‑recovery, capacity‑planning, and alerting techniques that keep billions of orders running smoothly.

architecture · backend · high-availability

0 likes · 26 min read

Efficient Ops

Sep 14, 2020 · Cloud Native

How Dada Built a Dual‑Cloud Active‑Active Disaster Recovery Platform

This article details Dada's journey of designing and implementing a dual‑cloud active‑active architecture, covering high‑availability vs. disaster‑recovery concepts, Phase 1 and Phase 2 solutions, challenges faced, multi‑data‑center Consul deployment, bidirectional database replication, precise load‑balancing, capacity elasticity, and future plans.

Consul · Multi-Cloud · cloud-native

0 likes · 17 min read

Architecture Digest

Aug 25, 2020 · Operations

Best Practices and Advanced Topics for Prometheus Monitoring in Kubernetes

This article provides a comprehensive guide on using Prometheus for Kubernetes monitoring, covering fundamental principles, exporter selection, Grafana dashboard creation, memory and storage optimization, high‑availability designs, query performance, cardinality management, and integration with alerting and logging systems.

Exporters · Grafana · Kubernetes

0 likes · 33 min read

MaGe Linux Operations

Aug 17, 2020 · Operations

How to Build a Highly Available RabbitMQ Cluster with HAProxy and Keepalived

This guide walks through installing Erlang and RabbitMQ, configuring a mirrored RabbitMQ cluster, setting up HAProxy load balancing, and using Keepalived for automatic failover, providing a complete high‑availability solution for RabbitMQ on Linux.

HAProxy · Keepalived · Linux

0 likes · 12 min read

Tencent Cloud Middleware

Aug 11, 2020 · Cloud Native

How Tencent’s TDMQ Achieves Cloud‑Native, High‑Performance Messaging for Finance

This article explains how Tencent’s cloud‑native message queue TDMQ, built on Apache Pulsar’s storage‑compute separation, meets financial‑grade reliability, strong consistency, horizontal scalability, and cross‑region disaster‑recovery requirements through a quorum‑based consistency model, multi‑protocol support, and read‑only broker design.

Message Queue · Pulsar · cloud-native

0 likes · 28 min read

Xiao Lou's Tech Notes

May 17, 2020 · Databases

How to Build a High‑Availability, High‑Performance Distributed ID Generator

Distributed systems need globally unique, often monotonic IDs, and this article examines common ID generation strategies—Snowflake, database auto‑increment, segment allocation, multi‑master databases, and Raft‑based consensus—evaluating each for high availability and high performance, and highlighting trade‑offs and implementation details.

Database · Distributed ID · Raft

0 likes · 8 min read

Ops Development Stories

May 14, 2020 · Cloud Native

How to Build a Highly Available Kubernetes 1.18 Cluster with kubeadm, HAProxy, and Keepalived

This step‑by‑step guide shows how to set up a production‑grade Kubernetes 1.18 high‑availability cluster using kubeadm, HAProxy, Keepalived, Calico networking, the Kubernetes dashboard, and metrics‑server, covering node planning, environment preparation, component installation, cluster initialization, HA testing, and post‑deployment verification.

HAProxy · Metrics Server · cluster-setup

0 likes · 30 min read

Architecture Digest

Feb 23, 2020 · Operations

Configuring Keepalived for High Availability with Nginx Load Balancing

This guide explains how to install Keepalived, configure VRRP‑based high‑availability for Nginx load balancers, modify master and backup configuration files, test failover scenarios, and add a Bash watchdog script to ensure seamless service continuity.

Keepalived · VRRP · failover

0 likes · 8 min read

Tencent Tech

Jan 15, 2020 · Big Data

How Tencent Scales Elasticsearch for Billions of Queries: Challenges & Optimizations

This article explains how Tencent leverages Elasticsearch for real‑time log analysis, search services, and time‑series data at massive scale, detailing the application scenarios, industry use cases, key challenges, optimization techniques, and future open‑source contributions.

Cloud · Search · cost optimization

0 likes · 16 min read

Suning Technology

Dec 25, 2019 · Backend Development

How Suning’s Bargain Group Platform Achieves High Availability and Scalability

This article examines Suning's bargain‑group platform transformation, detailing its strategic shift to a platform model, high‑availability architecture, vertical and horizontal decomposition, data sharding, cache design, dual‑data‑center deployment, and link optimizations for handling massive concurrent traffic.

Redis · bargain-group · database sharding

0 likes · 19 min read

Architecture Digest

Dec 19, 2019 · Databases

Design and Migration of Zhihu's Read Service: From Bloom Filter to TiDB

This article details Zhihu's read‑service architecture, its massive data scale and performance challenges, early Bloom‑filter and HBase solutions, the design goals of high availability, high performance and scalability, and the subsequent migration from MySQL to TiDB with cloud‑native practices.

TiDB · cloud-native · distributed database

0 likes · 25 min read

Architecture Digest

Nov 7, 2019 · Backend Development

Designing High‑Availability, High‑Performance Backend Architecture for Amap’s Real‑Time Services

This article explains how Amap (Gaode) handles billions of daily requests with sub‑millisecond latency by redesigning its gateway layer, adopting full‑asynchronous pipeline architecture, leveraging reactive frameworks like Vert.x and WebFlux, aggregating APIs, and implementing a unit‑based routing solution that paves the way for distributed sidecar and service‑mesh deployments.

Asynchronous · Performance · Reactive

0 likes · 9 min read

Qunar Tech Salon

Sep 11, 2019 · Backend Development

SIA‑Gateway: A Distributed Microservice Gateway System – Architecture, Features, and High Availability

This article introduces the evolution of software architecture toward microservices, explains the key characteristics of microservice architectures, describes microservice gateway concepts and classifications, and details the design, features, deployment, and high‑availability mechanisms of the SpringCloud‑based SIA‑Gateway solution.

Cloud Native · Service Governance · SpringCloud

0 likes · 14 min read

21CTO

Jul 17, 2019 · Backend Development

From Single Server to Cloud Native: How Taobao Scaled to Millions of Users

This article traces Taobao’s backend architecture evolution from a single‑server setup to a cloud‑native, micro‑service ecosystem, detailing each scaling stage—separating Tomcat and database, adding caches, load balancers, read/write splitting, sharding, NoSQL, ESB, containers, and finally public‑cloud deployment—while highlighting the associated technologies and design principles.

Cloud · Microservices · architecture

0 likes · 19 min read

Youzan Coder

Mar 27, 2019 · Databases

MySQL Slave Crash-Safe Feature Analysis

The article examines MySQL 5.6’s crash‑safe slave replication, explaining how earlier versions’ unsafe relay‑log handling could corrupt position data, describing the atomic update of mysql.slave_relay_log_info via table‑based relay‑log info and transaction coordination, and covering configuration options, recovery behavior, GTID implications, performance trade‑offs, and implementation guidance.

Binlog · Crash Safe · Database

0 likes · 9 min read

Tencent Cloud Developer

Mar 12, 2019 · Cloud Native

Understanding Active-Active Disaster Recovery Architecture: Challenges and Implementation Strategies

The article argues that cold backup and active‑passive setups provide false security and outlines how true active‑active disaster‑recovery requires local‑datacenter request handling, business‑driven data sharding, and low‑latency cross‑site synchronization, recommending a staged rollout from city‑level to cross‑region architectures while weighing ROI.

Data Consistency · active-active-architecture · cloud-native

0 likes · 9 min read

21CTO

Oct 17, 2018 · Databases

Scaling Payment Systems: Sharding, Snowflake IDs, and High‑Availability

This article explains how to design a high‑throughput payment system using database sharding, Snowflake‑style globally unique order IDs, eventual consistency via message queues, high‑availability architectures, data tiering, and coarse‑fine traffic control to handle massive request spikes.

Data Tiering · Database · Sharding

0 likes · 15 min read

ITFLY8 Architecture Home

Oct 16, 2018 · Databases

Scaling Payment Systems: Sharding, Snowflake IDs, and High‑Availability Databases

This article explains how a high‑throughput payment platform uses database sharding by user ID, Snowflake‑style globally unique order IDs, asynchronous replication for eventual consistency, multi‑level data caching, and coarse‑fine traffic pipelines to achieve millions of requests per second with robust high‑availability.

Data Tiering · Snowflake · high-availability

0 likes · 16 min read

Programmer DD

Jun 7, 2018 · Operations

How to Build a High‑Availability RabbitMQ Cluster with Load Balancing

This guide explains the principles behind RabbitMQ clustering, shows how metadata synchronization works, compares design choices, and provides step‑by‑step instructions—including component installation, node configuration, HAProxy load‑balancing setup, and a sample architecture diagram—to create a reliable, scalable RabbitMQ cluster for production use.

Clustering · HAProxy · Operations

0 likes · 16 min read

Efficient Ops

Jun 4, 2018 · Operations

How QQ Built Multi‑Region Resilience with Set‑Based Deployment and Smart Scheduling

This article explains how QQ’s operations team designed a multi‑region, set‑based deployment architecture, tackled data synchronization, employed sharding strategies, and implemented flexible scheduling policies to ensure high availability and rapid disaster recovery for hundreds of millions of users.

Deployment · Operations · Set-Based

0 likes · 16 min read

ITPUB

Jan 30, 2018 · Operations

Eliminating Network Black Holes in Dell Blade Server Deployments

This article explains how misconfigured links in Dell blade server networks can create black‑hole failures, illustrates two fault scenarios, and provides step‑by‑step switch configuration techniques—including link‑dependency groups and uplink‑state groups—to ensure automatic NIC failover and maintain high availability.

Dell · Network · blade-servers

0 likes · 13 min read

Efficient Ops

Dec 25, 2017 · Databases

How Ele.me Achieved Cross‑Region Active‑Active MySQL: Architecture, Challenges & Lessons

This article details Ele.me's practical experience building a cross‑region active‑active database system, covering latency challenges, architectural design, extensive database refactoring, DBA operational hurdles, consistency verification tools, and future scalability plans.

DBA · DDL · Data Consistency

0 likes · 22 min read
