Tagged articles

High Availability

1447 articles · Page 1 of 15
Raymond Ops
Jun 27, 2026 · Operations

Hands‑On DNS Ops: Deploy BIND and CoreDNS with Full Troubleshooting Guide

This comprehensive guide walks you through DNS fundamentals, compares BIND, CoreDNS, PowerDNS and Unbound, provides step‑by‑step deployment scripts for BIND 9.20 and CoreDNS 1.12, explains DNSSEC configuration, caching optimizations, security hardening, high‑availability designs, monitoring, backup and recovery procedures, and advanced troubleshooting techniques.

BIND · CoreDNS · DNS
0 likes · 43 min read
Architect Chen
Jun 24, 2026 · Operations

Designing Million-Request Architecture with LVS, Keepalived, and Nginx

The article explains how to build a high‑performance system that serves millions of requests by layering LVS with Keepalived for redundancy at the entry layer and Nginx for flexible proxying, while adding health checks, logging, and auto‑scaling to ensure stability and rapid recovery.

High Availability · Keepalived · LVS
0 likes · 4 min read
Raymond Ops
Jun 20, 2026 · Operations

Eliminate Monitoring Blind Spots: Hands‑On Enterprise‑Grade Prometheus + Grafana Deployment

This comprehensive guide walks you through the end‑to‑end setup of a production‑grade Prometheus and Grafana monitoring stack, covering architecture choices, installation steps, configuration details, high‑availability designs, performance tuning, security hardening, troubleshooting, backup strategies, and best‑practice recommendations.

Alerting · High Availability · Monitoring
0 likes · 49 min read
Java Architect Handbook
Jun 18, 2026 · Cloud Native

Designing an Enterprise-Grade Message Push Architecture: A Deep Dive

The article outlines the evolution from isolated push modules to a unified framework and finally a dedicated push service, detailing functional and non‑functional requirements, component responsibilities, priority handling, and a scalable micro‑service architecture for enterprise notifications.

Enterprise Architecture · High Availability · Message Push
0 likes · 15 min read
Raymond Ops
Jun 17, 2026 · Databases

Redis Sentinel Mode Explained: Automatic Failure Detection and Master‑Slave Switching in Practice

This guide walks through Redis Sentinel’s architecture, explains subjective and objective down states, details the leader election and failover workflow, shows step‑by‑step configuration of a three‑node Sentinel cluster, client integration in Python and Java, and provides best‑practice recommendations, monitoring metrics, and troubleshooting tips.

Configuration · High Availability · Java
0 likes · 27 min read
Raymond Ops
Jun 17, 2026 · Operations

Enterprise Monitoring with Prometheus: Rule Hierarchy and Alertmanager Notification Orchestration

This guide explains how to turn a fully built Prometheus monitoring system into a closed‑loop alerting solution by designing layered PromQL rules, configuring Alertmanager routing, grouping, inhibition and silencing, integrating DingTalk and WeChat webhooks, and applying best‑practice performance, security, high‑availability, and troubleshooting techniques.

Alerting · Alertmanager · High Availability
0 likes · 34 min read
Mike Chen's Internet Architecture
Jun 16, 2026 · Operations

How Alibaba’s Two‑Region Three‑Center Design Achieves 99.99% Availability

The article explains Alibaba’s “two‑region three‑center” architecture, detailing how geographically separated primary, backup, and disaster‑recovery data centers work together to provide financial‑grade high availability and protect against single‑site failures or regional catastrophes.

Alibaba · Data Center Architecture · Disaster Recovery
0 likes · 3 min read
Coder Trainee
Jun 14, 2026 · Artificial Intelligence

Production‑Ready AI Agent Architecture: High Availability, Asynchrony, Caching, Cost & Security

After mastering core AI Agent capabilities, this article shows how to transform a prototype into a production‑grade service by covering a full architecture overview, stateless design, health‑check and graceful shutdown, asynchronous task queues, multi‑level caching, token‑cost optimization, model fallback, input/output filtering, rate limiting, monitoring, and deployment recommendations for different scales.

AI Agent · Caching · High Availability
0 likes · 15 min read
Mike Chen's Internet Architecture
Jun 12, 2026 · Industry Insights

Inside Alibaba’s Same‑City Active‑Active Architecture: A Complete Visual Guide

The article breaks down Alibaba’s same‑city active‑active high‑availability architecture, detailing its four design layers—traffic scheduling, stateless application services, data replication, and operational automation—while illustrating how each component ensures continuous service during data‑center failures.

Active-Active · Alibaba · Data Replication
0 likes · 5 min read
Architect's Guide
Jun 12, 2026 · Operations

Common Disaster Recovery Models and How to Choose Them

The article outlines the main disaster‑recovery architectures—city‑level, remote, two‑site three‑center, and active‑active data centers—explains their characteristics, compares costs and performance, and presents key selection metrics such as RPO, RTO, disaster radius and ROI, illustrated with Huawei and ZTE case studies.

Data Center · Disaster Recovery · High Availability
0 likes · 13 min read
Xiao Liu Lab
Jun 11, 2026 · Operations

Ops Engineer Core Skills: From Basic Commands to High‑Availability Architecture

This article provides a comprehensive roadmap for operations engineers, covering essential Linux commands, core system concepts, service principles, fault‑diagnosis methods, high‑availability architecture designs, data security, backup strategies, performance tuning, and automation scripts to handle both single‑machine and large‑scale cluster environments.

Automation · Docker · High Availability
0 likes · 13 min read
Java Tech Workshop
Jun 8, 2026 · Databases

Advanced SpringBoot Read‑Write Splitting: Master‑Slave Switching and Automatic Failover

In high‑concurrency internet architectures, a MySQL master‑slave setup with read‑write splitting is the baseline for high availability, but static routing suffers from node failures and lag; this article explains how ShardingSphere provides health checks, auto‑failover, load‑balancing, and degradation to achieve resilient read‑write separation.

Automatic Failover · High Availability · Read‑Write Splitting
0 likes · 14 min read
Coder Trainee
Jun 6, 2026 · Backend Development

Spring Cloud Message‑Driven Part 5: High‑Availability RocketMQ Deployment & Message Tracing

This tutorial walks through deploying a highly available RocketMQ cluster with Docker Compose, configuring master‑slave brokers, enabling message tracing, integrating Prometheus‑Grafana monitoring, setting up Spring Boot HA properties, applying performance tweaks, validating failover, and troubleshooting common issues.

Docker Compose · High Availability · Message Tracing
0 likes · 16 min read
Raymond Ops
Jun 5, 2026 · Operations

Dual‑Master Nginx + Keepalived Architecture: Eliminate Single Points of Failure

This guide walks through building a dual‑master Nginx + Keepalived high‑availability setup that doubles resource utilization, removes the idle‑backup drawback of traditional active‑passive designs, and provides step‑by‑step configuration, health‑check scripts, failover testing, best‑practice tips, and troubleshooting procedures.

High Availability · Keepalived · Linux
0 likes · 33 min read
ITPUB
Jun 4, 2026 · Backend Development

How to Ensure High Availability When Third‑Party Services Fail?

The article explains how to protect a system from unstable third‑party APIs by building an isolated defense layer that offers a unified abstraction, client‑side rate limiting and retry, comprehensive observability, and mock testing, and shows how to present these solutions in technical interviews.

High Availability · Observability · circuit breaking
0 likes · 21 min read
MaGe Linux Operations
May 20, 2026 · Operations

How to Choose Among the Four Common Load‑Balancing Solutions: LVS, Nginx, HAProxy or F5

This article explains why single‑server capacity is limited, lists typical load‑balancing problems, and provides a detailed comparison of four mainstream solutions—LVS, Nginx, HAProxy, and F5—covering their principles, architectures, configuration steps, pros, cons, suitable scenarios, a decision‑tree guide, common fault‑diagnosis procedures, and production‑risk warnings.

F5 · HAProxy · High Availability
0 likes · 38 min read
Architects' Tech Alliance
May 16, 2026 · Industry Insights

Designing a 2026 Ultra‑Large Green AI Data Center: Full Infrastructure Blueprint

This article presents a comprehensive 2026 design plan for an ultra‑large green AI data center with 5,000 cabinets, 150 MW IT load, and 200 MW capacity, detailing market drivers, core metrics, six design principles, site and power architecture, liquid‑cooling, networking, security, and AI‑driven autonomous operations.

2026 design · AI data center · DCIM
0 likes · 5 min read
AI Agent Super App
May 13, 2026 · Operations

Server Virtualization Deep Dive: Feature Comparison of VMware, KVM, Proxmox and Practical High‑Availability

This comprehensive guide walks through server virtualization fundamentals, compares major hypervisors such as VMware vSphere, KVM, Xen, Proxmox VE and Hyper‑V, and then details Linux‑level monitoring, performance tuning, backup strategies, and cross‑node high‑availability solutions for production environments.

High Availability · KVM · Monitoring
0 likes · 24 min read
Ops Community
May 9, 2026 · Operations

Achieve Seamless Nginx High Availability with Keepalived: A Practical Guide

This article walks through building a simple, cost‑effective high‑availability solution for Nginx using Keepalived’s VRRP‑based VIP failover, covering environment setup, configuration of master and backup nodes, health‑check scripts, testing procedures, troubleshooting tips, and rollback steps.

High Availability · Keepalived · Linux
0 likes · 29 min read
JD Tech
May 8, 2026 · Databases

Engineering Wisdom Behind High‑Availability Architecture for E‑Commerce Storage Layers

The article analyzes how to design a high‑availability architecture for large‑scale e‑commerce systems, detailing layered risk isolation, stateful storage strategies for flow and state data, unified document‑ID routing, multi‑replica databases, multi‑datacenter synchronization, and real‑world JD case studies that demonstrate elastic scaling and disaster recovery.

High Availability · database replication · distributed architecture
0 likes · 17 min read
Linyb Geek Road
May 7, 2026 · Operations

A Decade of E‑Commerce Ops: How to Prevent System Outages and Ensure High Availability

The article outlines why e‑commerce systems fail, presents a four‑layer high‑availability defense—including load balancing, service isolation, data protection, and fallback mechanisms—plus concrete monitoring, alerting, and emergency response practices illustrated with real‑world scenarios and code samples.

Disaster Recovery · High Availability · Monitoring
0 likes · 6 min read
dbaplus Community
Apr 28, 2026 · Backend Development

Designing High‑Availability for Unreliable Third‑Party Services

For systems that depend on unstable, slow downstream APIs, this article walks through building a dedicated defensive layer that provides a unified abstraction, client‑side governance (rate limiting, retries with idempotency checks), comprehensive observability, and mock‑based testing to keep your system highly available and interview‑ready.

High Availability · Microservices · Observability
0 likes · 22 min read
Java Backend Full-Stack
Apr 27, 2026 · Databases

Proven Redis Tuning Techniques for Production Environments

This article compiles practical, interview‑ready Redis tuning tips—from strict memory limits and eviction policies to avoiding big keys, hot keys, slow commands, and optimizing persistence, networking, and high‑availability settings—so you can confidently handle Redis performance questions in real‑world deployments.

Configuration · High Availability · Memory Management
0 likes · 9 min read
Software Engineering 3.0 Era
Apr 22, 2026 · Operations

Is This the Longest Service Outage Ever Recorded?

The article examines the Umetrip (航旅纵横) app’s service disruption, which began at 12:30 PM on April 21 and lasted until 7:47 AM on April 22, a span of over 19 hours. It asks whether this duration ranks among the longest outages ever, citing official posts and AI‑generated rankings, and closes with a reminder that high availability depends on disciplined engineering rather than tools.

AI · High Availability · downtime
0 likes · 3 min read
Lobster Programming
Apr 15, 2026 · Databases

Choosing the Right Redis Architecture: From Single Node to Cluster

This article reviews the main Redis deployment options—including single‑node, master‑slave with Sentinel, sharding via consistent hashing, and Redis Cluster—explaining their advantages, high‑availability mechanisms, scalability limits, and recommending suitable scenarios for each architecture.

High Availability · Redis · Sharding
0 likes · 7 min read
Ray's Galactic Tech
Apr 13, 2026 · Cloud Native

How to Build a Production‑Ready Kubernetes Cluster with kubeasz: From Architecture to Full Lifecycle

This guide explains how to use kubeasz and Ansible to design, deploy, scale, secure, monitor, and maintain a production‑grade Kubernetes cluster, covering control‑plane HA, etcd reliability, networking, storage, capacity planning, upgrade strategies, and disaster‑recovery practices.

Ansible · Cluster Deployment · High Availability
0 likes · 39 min read
Ops Community
Apr 9, 2026 · Operations

Mastering Nginx Reverse Proxy: From Basics to Advanced Load Balancing and High Availability

This comprehensive guide explains the fundamentals of reverse proxy, walks through Nginx configuration, load‑balancing algorithms, health‑check setups, caching strategies, session‑persistence methods, high‑availability designs, performance tuning, monitoring, and troubleshooting, providing practical code snippets for real‑world deployments.

High Availability · NGINX · Reverse Proxy
0 likes · 30 min read
Ops Community
Mar 27, 2026 · Backend Development

Master Nginx Reverse Proxy on Ubuntu 24.04 & Rocky Linux 9.4 – From Installation to Monitoring

This comprehensive guide walks you through installing Nginx 1.27 on Ubuntu 24.04 LTS and Rocky Linux 9.4, configuring reverse proxy, load balancing, SSL/TLS, WebSocket and gRPC support, tuning kernel and Nginx parameters, setting up health checks, high‑availability with Keepalived, and monitoring with Prometheus and Grafana, all with ready‑to‑use code snippets and scripts.

High Availability · Monitoring · NGINX
0 likes · 59 min read
Cognitive Technology Team
Mar 27, 2026 · Operations

How to Build a Rock‑Solid High‑Availability Architecture: Redundancy, Defense, and Smooth Deployments

This article breaks down high‑availability architecture into redundancy, defensive degradation, and release mechanisms, offering concrete techniques, real‑world failure case studies, and step‑by‑step configurations to ensure continuous service even under heavy load or component failures.

CI/CD · High Availability · circuit breaker
0 likes · 16 min read
Mike Chen's Internet Architecture
Mar 26, 2026 · Industry Insights

How Alibaba Achieves Multi‑Site High Availability: Architecture Deep Dive

This article explains Alibaba's multi‑site high‑availability architecture, covering its origins after Double 11 bottlenecks, core principles like decentralization and consistency‑availability trade‑offs, layered design from traffic routing to data storage, and a real‑world deployment example.

Alibaba · Cloud Native · High Availability
0 likes · 5 min read
Raymond Ops
Mar 19, 2026 · Operations

Zero‑Downtime HAProxy Load Balancing: Complete L4/L7 Deployment Guide

This guide walks through installing HAProxy 2.x, configuring L4 TCP and L7 HTTP/HTTPS load balancing for web, MySQL, and Redis, setting up health checks, session persistence, monitoring, high‑availability with Keepalived, performance tuning, security hardening, and step‑by‑step zero‑downtime deployment and rollback procedures.

HAProxy · High Availability · Zero Downtime
0 likes · 36 min read
MaGe Linux Operations
Mar 12, 2026 · Backend Development

How to Deploy vLLM Inference Service on Kubernetes with Ingress and Service Load Balancing

This guide walks through deploying a production‑grade vLLM inference service on Kubernetes, covering GPU resource scheduling, Service and Ingress configuration, session affinity, health checks, performance tuning, scaling, monitoring, fault‑tolerance, and best‑practice recommendations for high‑availability AI workloads.

GPU · High Availability · Ingress
0 likes · 47 min read
Cognitive Technology Team
Mar 9, 2026 · Operations

Mastering Kafka ISR: How In‑Sync Replicas Ensure Consistency and High Availability

This article explains Kafka's In‑Sync Replicas (ISR) mechanism, detailing its definitions, dynamic scaling, interaction with High Watermark, extreme unclean leader election scenarios, and practical tuning and troubleshooting tips for maintaining strong consistency and high availability in production clusters.

High Availability · ISR · Performance Tuning
0 likes · 15 min read
LuTiao Programming
Mar 5, 2026 · Cloud Native

How to Achieve 99.99% Availability in Spring Boot Microservices: 7 Essential Steps

This article outlines seven production‑grade design principles—design for failure, circuit breaking, timeout control, service isolation, automatic retries, multi‑instance deployment, and comprehensive monitoring—each illustrated with Spring Boot and Resilience4j configurations to help microservices consistently meet four‑nine availability.

High Availability · Microservices · Monitoring
0 likes · 7 min read
ITPUB
Mar 3, 2026 · Databases

Why Is Installing Modern Databases Still So Painful?

Even in 2026, installing databases like Oracle remains a complex, error‑prone process, and this article explores the historical roots, recent AI‑assisted attempts, and four key reasons why database installation still challenges engineers.

AI · Databases · High Availability
0 likes · 8 min read
Amazon Cloud Developers
Mar 3, 2026 · Cloud Computing

Designing a Resilient Direct Connect Architecture to Ensure Business Continuity

This guide explains how to build a highly resilient AWS Direct Connect network—distinguishing redundancy from true resilience, modeling failure and maintenance scenarios, applying AS‑Path prepend and route withdrawal, deploying a maximum‑resilience topology with dual connections per location, enabling BFD for sub‑second fault detection, and regularly testing failover—to keep critical workloads online during planned windows or unexpected incidents.

AWS Direct Connect · BFD · BGP
0 likes · 14 min read
Senior Xiao Ying
Feb 23, 2026 · Databases

MySQL Practical Guide #17: Building a High‑Availability Service with Master‑Slave Replication, Read‑Write Splitting, and Load Balancing

By configuring master‑slave replication, implementing read‑write splitting with ProxySQL, and selecting appropriate load‑balancing strategies, you can significantly improve MySQL’s scalability and availability while addressing replication lag through parallel or semi‑synchronous replication, hardware tuning, and monitoring.

High Availability · ProxySQL · Read‑Write Splitting
0 likes · 12 min read
ITPUB
Feb 15, 2026 · Backend Development

Mastering Message Queues: From Flash‑Sale Basics to RabbitMQ Production

This guide walks through why a high‑traffic flash‑sale system needs a message queue, explains the three core benefits of async processing, decoupling and traffic‑shaping, and then details RabbitMQ installation, common work patterns, durability, idempotency, ordering, dead‑letter handling, high‑availability clustering and advanced features such as delayed and priority queues.

Backend Development · High Availability · Message Queue
0 likes · 16 min read
ITPUB
Feb 5, 2026 · Databases

Master Oracle 19c RAC Architecture in 41 Diagrams – Quick Technical Guide

This article translates and consolidates Oracle Real Application Clusters 19c Technical Architecture, using 41 detailed diagrams to explain RAC concepts, configurations, cluster components, storage options, tools, and management commands for building and operating high‑availability Oracle databases.

ASM · Enterprise Manager · Grid Infrastructure
0 likes · 52 min read
ITPUB
Jan 31, 2026 · Databases

How OpenAI Scaled PostgreSQL to Support 800 Million Users and Millions of QPS

OpenAI’s engineering team expanded a single‑primary PostgreSQL cluster with nearly 50 read‑only replicas, migrated write‑heavy workloads to Azure Cosmos DB, and applied extensive optimizations to reliably serve the global traffic of ChatGPT and the OpenAI API for 800 million users at multi‑million queries per second.

Azure · High Availability · PostgreSQL
0 likes · 24 min read
MaGe Linux Operations
Jan 30, 2026 · Cloud Computing

Mastering Alibaba Cloud SLB: Build High‑Availability Load Balancing with Terraform

This guide walks through Alibaba Cloud SLB’s architecture, product variants, and environment prerequisites, and step‑by‑step Terraform provisioning for CLB, ALB, and NLB, covering health checks, HTTPS setup, traffic routing, performance testing, best practices, security hardening, monitoring, and disaster‑recovery procedures.

Alibaba Cloud · Cloud Computing · High Availability
0 likes · 28 min read
Java Architect Handbook
Jan 28, 2026 · Databases

How to Prevent Redis Split‑Brain Disasters with min‑replicas‑to‑write

This article explains the Redis split‑brain problem that can occur in master‑replica deployments, outlines the points interviewers look for, and details a solution based on the min‑replicas‑to‑write (or min‑slaves‑to‑write) setting, which sacrifices write availability for data consistency, along with best‑practice recommendations and common pitfalls.

Configuration · High Availability · Redis
0 likes · 12 min read
Architect Chen
Jan 26, 2026 · Databases

Mastering MySQL Master‑Slave Replication: Architecture, Threads, and Setup

This article explains MySQL master‑slave replication, covering its purpose for high availability and read‑write separation, typical one‑master‑multiple‑slaves architecture, the binlog‑based synchronization mechanism, and the roles of the master’s dump thread and the slave’s I/O and SQL threads.

Database Architecture · High Availability · binary log
0 likes · 3 min read
Ray's Galactic Tech
Jan 25, 2026 · Operations

Why Redis High Availability Fails: Split‑Brain and Replication Storm Explained

The article examines the two most dangerous production failures in Redis high‑availability—split‑brain and replication storm—explaining their causes, real‑world impact, and practical engineering safeguards such as write‑protection parameters, network isolation, backlog sizing, and cascading replication.

High Availability · Redis · Replication Storm
0 likes · 7 min read
Ops Community
Jan 22, 2026 · Operations

Master HAProxy 3.0: From System Tuning to Advanced Load‑Balancing Practices

This comprehensive guide walks you through HAProxy 3.0’s new features, hardware and OS requirements, step‑by‑step installation, detailed global, frontend, backend configurations, health‑check optimization, monitoring with Prometheus, troubleshooting tips, backup strategies, and best‑practice recommendations for high‑performance load balancing in production environments.

HAProxy · High Availability · Linux
0 likes · 29 min read
Ray's Galactic Tech
Jan 20, 2026 · Databases

Mastering Redis High Availability: Replication, Sentinel, and Cluster Deep Dive

This guide walks through Redis's evolution from single‑node replication to Sentinel and native Cluster, explaining each architecture's principles, configuration steps, advantages, drawbacks, performance trade‑offs, and practical deployment recommendations for building highly available and scalable caching systems.

High Availability · Redis · Sentinel
0 likes · 11 min read
Ray's Galactic Tech
Jan 11, 2026 · Operations

Master Elasticsearch Clusters: From Basics to Production Best Practices

This guide explains Elasticsearch clusters—from fundamental concepts and node roles to health monitoring, scaling strategies, security measures, and practical command‑line tips—helping you build, operate, and optimize a resilient, high‑performance search infrastructure.

Elasticsearch · High Availability · Monitoring
0 likes · 10 min read
java1234
Jan 10, 2026 · Backend Development

Designing a Highly Available Service Registry: Key Principles and Java Example

This article explains how to design a highly available service registry for microservice architectures, covering high‑availability mechanisms, performance optimizations, scalability strategies, core registry functions, and provides a complete Java Spring Boot implementation using Redis.

High Availability · Java · Microservices
0 likes · 6 min read
Raymond Ops
Jan 10, 2026 · Operations

Designing Enterprise‑Grade RabbitMQ HA: Architecture, Config, and Best Practices

This guide explains why high availability is critical for RabbitMQ in micro‑service environments, compares cluster modes, provides step‑by‑step commands for building a resilient three‑node cluster, and covers monitoring, failover, performance tuning, and common pitfalls to ensure reliable message delivery.

High Availability · RabbitMQ · cluster
0 likes · 12 min read
ITPUB
Jan 5, 2026 · Backend Development

How Apache Pulsar Solved Our Financial Messaging Challenges

Facing limited visibility, routing, and security in traditional MQ-based financial systems, a company evaluated its needs for identity control, routing, auditing, low latency, scalability, ordering, and replay, and chose Apache Pulsar for its multi‑cluster support, compute‑storage separation, pluggable authentication, rich API, and functions. The article outlines the practical experience gained and the solutions adopted.

Apache Pulsar · High Availability · distributed architecture
0 likes · 15 min read
Cognitive Technology Team
Dec 30, 2025 · Backend Development

How to Prevent Message Queue Reordering: 4 Proven High‑Availability Solutions

This article examines why message queue ordering failures can corrupt data and cause outages, explains four root causes such as concurrent consumption and partitioning, and presents four production‑tested high‑availability patterns—including ordered messages, pre‑condition checks, state‑machine driving, and monitoring—to reliably mitigate MQ disorder.

High Availability · Ordering · backend
0 likes · 9 min read
Ray's Galactic Tech
Dec 29, 2025 · Databases

Mastering PostgreSQL Backup & Replication: A Complete Enterprise Guide

This in‑depth enterprise guide explains why backup and replication are critical for PostgreSQL, compares physical backup, logical backup, and logical replication methods, provides step‑by‑step command examples, outlines high‑availability architectures, automation scripts, disaster‑recovery procedures, monitoring queries, and common pitfalls to ensure robust data protection.

Disaster Recovery · High Availability · PostgreSQL
0 likes · 8 min read
Xiao Liu Lab
Dec 26, 2025 · Operations

How to Achieve RabbitMQ High Availability with HAProxy: A Step‑by‑Step Guide

This tutorial explains why HAProxy is essential for RabbitMQ clusters, walks through installing HAProxy on Ubuntu, configuring load‑balancing and health‑check parameters, integrating with Java applications, and validating automatic failover to ensure high availability and efficient resource utilization.

HAProxy · High Availability · Java
0 likes · 8 min read
Xiao Liu Lab
Dec 23, 2025 · Databases

Mastering Redis Master‑Slave Replication: Core Concepts, Workflow, and Configuration

This article explains how Redis master‑slave replication provides hot backup, read‑write separation, high availability, and horizontal scaling by detailing its three‑stage workflow, full and partial synchronization mechanisms, key configuration options, and practical analogies for clear understanding.

Data synchronization · High Availability · Redis
0 likes · 11 min read
Ray's Galactic Tech
Dec 23, 2025 · Operations

20 Essential Kubernetes Ops Tips to Keep Production Clusters Stable

This guide compiles twenty practical Kubernetes operations tips drawn from real‑world production experience, covering high availability, performance tuning, monitoring, automation, security, and advanced learning to help teams build and maintain reliable, resilient clusters.

High Availability · Monitoring · Ops
0 likes · 8 min read
Raymond Ops
Dec 23, 2025 · Databases

Master MySQL in Production: From Configuration Tuning to SQL Performance Optimization

This comprehensive guide walks you through a real‑world MySQL outage, then details step‑by‑step configuration tweaks, InnoDB parameter tuning, connection and thread settings, index design, query rewrites, monitoring scripts, backup strategies, high‑availability replication, and essential tooling to keep your database fast and reliable.

Database Configuration · High Availability · Monitoring
0 likes · 13 min read
Raymond Ops
Dec 22, 2025 · Operations

Build a High‑Availability Prometheus Monitoring System from Scratch: Pitfalls & Performance Tuning

This guide walks you through constructing a production‑grade, highly available Prometheus monitoring stack, covering architecture choices, sharding strategies, common pitfalls such as memory bloat, query latency and storage growth, and provides concrete tuning steps, Kubernetes deployment examples, and advanced optimisation techniques.

Alerting · High Availability · Monitoring
0 likes · 11 min read
Raymond Ops
Dec 17, 2025 · Operations

Build a Production‑Ready Prometheus HA Architecture with Federation and Remote Storage

Learn how to design and implement a robust, production‑grade Prometheus high‑availability solution using a federated global cluster, multiple business‑level instances, remote storage with Thanos or VictoriaMetrics, Docker‑Compose deployment, health‑check scripts, performance metrics, alerting rules, and best‑practice operational guidelines.

Docker Compose · Federation · High Availability
0 likes · 17 min read
Mike Chen's Internet Architecture
Dec 16, 2025 · Databases

Designing a Million‑QPS Database Architecture: Sharding, Caching, and High Availability

This article explains how to architect a database system that can sustain a million queries per second by combining sharding, read‑write separation, multi‑layer caching, traffic shaping, and robust high‑availability strategies to keep most requests off the database and ensure reliable data storage.

High Availability · performance
0 likes · 5 min read
360 Zhihui Cloud Developer
Dec 15, 2025 · Databases

How TiDB Achieves Multi‑Datacenter High Availability with Multi‑Raft and TiCDC

This article explains TiDB's distributed, financial‑grade high‑availability architecture, covering single‑cluster same‑zone multi‑datacenter deployment, cross‑cluster DTS synchronization, underlying Raft and label mechanisms, configuration examples, performance trade‑offs, and real‑world monitoring results on the HULK cloud platform.

High Availability · TiCDC · TiDB
0 likes · 16 min read
Ray's Galactic Tech
Dec 12, 2025 · Cloud Native

Inside the Kubernetes Master: A Complete Breakdown of Core Components

Master nodes act as the brain of a Kubernetes cluster, hosting essential components such as kube‑apiserver, etcd, kube‑scheduler, kube‑controller‑manager and optionally cloud‑controller‑manager, each with distinct roles, high‑availability designs, security considerations, and operational workflows that together orchestrate and maintain cluster state.

Control Plane · Etcd · High Availability
0 likes · 8 min read
Ray's Galactic Tech
Dec 10, 2025 · Information Security

Secure Your Elasticsearch with INFINI Gateway: TLS, Auth, Multi‑Tenant & HA Guide

This guide explains why Elasticsearch often becomes a security risk, then shows how to use INFINI Gateway as a non‑intrusive front‑end proxy to add TLS encryption, basic authentication, unified entry, multi‑tenant routing, rate‑limiting, auditing, and high‑availability for any 6.x/7.x/8.x version.

Basic Auth · Elasticsearch · High Availability
0 likes · 9 min read
Java Web Project
Dec 7, 2025 · Databases

What Makes TiDB a NewSQL Powerhouse? A Deep Dive into Architecture, Features, and Use Cases

This article analyzes TiDB as a distributed NewSQL database, explaining the evolution from traditional SQL to NoSQL and NewSQL, detailing TiDB's core components, elastic scaling, ACID transactions, HTAP capabilities, high‑availability design, compatibility with MySQL, real‑world use cases, and its limitations compared to conventional databases.

HTAP · High Availability · MySQL Compatibility
0 likes · 24 min read
Ctrip Technology
Dec 5, 2025 · Databases

How Ctrip’s DRC Enables High‑Performance Cross‑Region MySQL Replication

This article explains the design and implementation of Ctrip's Data Replication Center (DRC), a MySQL‑based high‑availability system that handles cross‑region replication loops, progress tracking, concurrency, DDL handling, and conflict resolution to achieve low‑latency, reliable data replication for global travel services.

Data Replication · GTID · High Availability
0 likes · 21 min read
Architect's Journey
Dec 1, 2025 · Backend Development

Designing Three‑High Systems: Practical Performance Tuning and Fault‑Tolerant Architecture

The article breaks down the design logic and implementation steps for high‑performance, high‑concurrency, and high‑availability systems, covering bottleneck identification, read/write optimization, three‑dimensional scaling, and concrete fault‑tolerance strategies to build resilient, scalable services.

High Availability · High concurrency · fault tolerance
0 likes · 15 min read
Old Meng AI Explorer
Nov 26, 2025 · Operations

How Alertmanager Turns Chaos into Calm: Mastering Alert Management for DevOps

Alertmanager, the official Prometheus alert manager, consolidates redundant alerts, supports silencing, inhibition, multi‑channel routing, and high‑availability clustering, enabling DevOps teams to quickly pinpoint critical issues, reduce noise, and streamline incident response across large server fleets with simple YAML configuration and command‑line tools.

Alert Management · Alertmanager · High Availability
0 likes · 10 min read
Aikesheng Open Source Community
Nov 25, 2025 · Databases

How to Diagnose and Recover MySQL InnoDB Cluster Failures: Real‑World Scenarios

This article walks MySQL DBAs through common MySQL InnoDB Cluster fault scenarios—node restarts, crashes, network partitions, and full‑cluster reboots—providing step‑by‑step commands, status outputs, recovery actions, and impact analysis to ensure high availability and data safety.

Database operations · High Availability · InnoDB Cluster
0 likes · 26 min read
DevOps Coach
Nov 11, 2025 · Cloud Computing

Why the US‑East‑1 AWS Outage Happened and How to Guard Against It

On October 19‑20, a massive AWS failure in the US‑East‑1 region crippled a large portion of the internet. The article shows how a faulty internal monitoring tool, DynamoDB’s lack of cross‑region replication, and unchecked retry storms can cascade into a widespread outage, and draws concrete operational lessons for cloud teams.

AWS · Cloud Computing · DynamoDB
0 likes · 7 min read
MaGe Linux Operations
Nov 9, 2025 · Backend Development

How to Stop Redis Cache Penetration, Breakdown, and Avalanche – Proven Solutions Inside

This comprehensive guide explains the causes of Redis cache penetration, breakdown, and avalanche, and provides production‑tested solutions such as Bloom filters, distributed locks, logical expiration, random TTL, cache pre‑warming, multi‑level caching, high‑availability deployment, monitoring, and backup strategies.

High Availability · Redis · Spring Boot
0 likes · 42 min read
Ops Community
Nov 9, 2025 · Operations

How to Achieve 99.99% Uptime with Keepalived Dual‑Node HA

This guide explains how to design a high‑availability architecture using Keepalived's VRRP‑based active‑passive failover, covering technical features, applicable scenarios, environment requirements, step‑by‑step installation and configuration for services like Nginx, MySQL and Redis, plus best practices, troubleshooting, monitoring and backup strategies.

High Availability · Keepalived · NGINX
0 likes · 46 min read
Ops Community
Nov 8, 2025 · Operations

Mastering Nginx Reverse Proxy & Load Balancing: Best Practices for High‑Performance Deployments

This comprehensive guide walks you through Nginx reverse proxy and load balancing fundamentals, key features, suitable scenarios, environment prerequisites, step‑by‑step installation, core configuration, performance tuning, security hardening, high‑availability designs, troubleshooting, monitoring, backup strategies, real‑world case studies, and advanced learning paths for production‑grade deployments.

High Availability · Performance Optimization · security
0 likes · 56 min read
MaGe Linux Operations
Nov 8, 2025 · Backend Development

Mastering Redis Cache: Prevent Penetration, Breakdown, and Avalanche with Proven Solutions

This comprehensive guide explains the three major Redis cache issues—penetration, breakdown, and avalanche—detailing their causes, impacts, and production‑ready solutions such as Bloom filters, distributed locks, logical expiration, random TTL, multi‑level caching, high‑availability setups, monitoring, backup, and best‑practice recommendations.

High Availability · Performance Optimization · Redis
0 likes · 56 min read
MaGe Linux Operations
Nov 5, 2025 · Databases

Deploy Redis Sentinel for High Availability in 30 Minutes – Step‑by‑Step Guide

Learn how to set up Redis Sentinel for high‑availability caching, covering prerequisites, anti‑patterns, detailed configuration of master, replicas and Sentinel nodes, firewall rules, monitoring, failover testing, troubleshooting, performance tuning, backup, rollback and best practices—all achievable within a 30‑minute deployment.

High Availability · Linux · Redis
0 likes · 38 min read
Top Architect
Nov 3, 2025 · Operations

How to Build Nginx High Availability with Keepalived on Two VMs

This guide walks through installing Nginx on two CentOS 7 virtual machines, configuring keepalived for VRRP‑based high availability, creating a virtual IP, and demonstrating failover scenarios to ensure continuous web service availability in production environments.

High Availability · Keepalived · Linux
0 likes · 10 min read
Linux Ops Smart Journey
Nov 3, 2025 · Cloud Native

How to Build a Production-Ready High-Availability Keycloak Cluster

Learn step‑by‑step how to design and deploy a production‑grade, high‑availability Keycloak cluster using external databases, distributed session management with Infinispan, HAProxy reverse proxy, TLS termination, and Docker‑Compose orchestration, ensuring scalability, fault tolerance, and secure identity management for cloud‑native applications.

Cloud Native · Docker Compose · HAProxy
0 likes · 8 min read
Mike Chen's Internet Architecture
Nov 3, 2025 · Databases

Mastering MySQL Replication: Asynchronous, Semi‑Sync, and Full Sync Explained

This article explains MySQL master‑slave replication, covering asynchronous, semi‑synchronous, and full synchronous modes, their mechanisms, advantages, disadvantages, and ideal use cases for high availability, read/write separation, and strong consistency in large‑scale systems.

Database Architecture · High Availability · Read‑Write Separation
0 likes · 3 min read
MaGe Linux Operations
Nov 1, 2025 · Operations

Zero‑Downtime HAProxy Load Balancing: Full 4‑Layer & 7‑Layer Deployment Guide

This guide walks through installing HAProxy, configuring both layer‑4 TCP and layer‑7 HTTP/HTTPS load balancing with health checks, session persistence, advanced algorithms, high‑availability via Keepalived, monitoring with HAProxy stats and Prometheus, performance tuning, security hardening, and step‑by‑step rollback procedures for zero‑downtime deployments.

HAProxy · High Availability · Ops
0 likes · 36 min read
DataFunSummit
Oct 29, 2025 · Big Data

How Huolala Scaled to 40PB: Inside Their Evolving Big Data Storage Architecture

Huolala, founded in 2013, runs a cross‑cloud hybrid big‑data storage platform of over 40 PB across 3,000+ machines. The article traces its four online‑storage phases, robust HA design, performance‑cost optimizations, AI vector storage, and a cost‑governance system that saved more than half of its storage expenses.

AI vector storage · Big Data · High Availability
0 likes · 18 min read
Senior Brother's Insights
Oct 27, 2025 · Databases

How Does MySQL Power High‑Performance OLTP Workloads?

This article explains what OLTP (Online Transaction Processing) is, outlines its key characteristics, and details how MySQL—through ACID‑compliant transactions, the InnoDB storage engine, various indexing strategies, fast locking mechanisms, query optimization, and high‑availability features—effectively supports high‑concurrency, low‑latency transactional workloads.

Database Transactions · High Availability · Indexing
0 likes · 9 min read
Ops Community
Oct 23, 2025 · Operations

Zero‑Downtime Nginx Load Balancing: Build a 99.99% HA Architecture

This guide walks through designing and implementing a highly available Nginx load‑balancing solution—covering applicable scenarios, prerequisites, environment matrix, step‑by‑step configuration of Nginx, Keepalived, SSL termination, health checks, monitoring, performance tuning, security hardening, troubleshooting, and a concise list of best‑practice recommendations.

High Availability · Keepalived · Monitoring
0 likes · 29 min read
Ray's Galactic Tech
Oct 17, 2025 · Backend Development

Prevent Redis Cache Avalanche, Penetration & Breakdown: A Practical High‑Availability Guide

This guide explains the three major Redis cache failure patterns—avalanche, penetration, and breakdown—detailing their causes and offering concrete mitigation techniques such as staggered TTLs, empty‑object caching, Bloom filters, logical expiration, distributed locks, high‑availability clusters, and comprehensive monitoring to ensure robust high‑availability systems.

Cache · Cache Avalanche · Cache Breakdown
0 likes · 7 min read
dbaplus Community
Oct 16, 2025 · Backend Development

How to Build a Billion‑Scale Open Platform: Architecture, Caching, and Resilience

This article presents a step‑by‑step engineering guide for designing, evolving, and operating a high‑traffic open platform, covering three‑layer decoupled architecture, multi‑level caching, asynchronous message queues, distributed transaction models, high‑availability strategies, and phased rollout plans to sustain billions of daily API calls.

Caching · High Availability · High concurrency
0 likes · 20 min read
Su San Talks Tech
Oct 10, 2025 · Operations

How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies

This comprehensive guide explains how to improve system stability and reduce online incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and gray deployments, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.

Deployment Strategies · Distributed Tracing · High Availability
0 likes · 19 min read