Tagged articles

High Availability

1447 articles · Page 3 of 15
Alibaba Cloud Infrastructure
Dec 25, 2024 · Cloud Native

Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the OpenAI large‑scale Kubernetes outage, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud's architectural enhancements, observability improvements, and best‑practice guidelines for highly available, reliable operation of Kubernetes environments with thousands of nodes.

Cloud Native · High Availability · Large-Scale Clusters
0 likes · 21 min read
IT Architects Alliance
Dec 24, 2024 · Cloud Native

Unlock Scalable, Highly Available IT Architecture: Key Strategies Explained

This article examines the modern challenges of IT architecture and presents proven techniques—microservices, container orchestration, distributed caching, redundancy, load balancing, and automated fault recovery—illustrated with Amazon and Google case studies, while forecasting future AI and cloud‑native trends.

Cloud Native · High Availability · Microservices
0 likes · 10 min read
Efficient Ops
Nov 19, 2024 · Operations

Mastering System Stability: Proven SRE Practices for Reliable, High‑Availability Services

This article explains how system stability depends on architecture and code details, defines SLA and the “nines” metric, outlines Google’s SRE hierarchy, and provides practical governance steps—including development and release processes, high‑availability design, capacity planning, monitoring, incident response, and team culture—to achieve reliable, high‑availability services.

High Availability · SRE · capacity planning
0 likes · 34 min read
Bilibili Tech
Nov 19, 2024 · Operations

Building a Lightweight Disaster‑Recovery Drill System at Bilibili: Architecture, Practices, and Lessons

Bilibili’s infrastructure team created a lightweight, multi‑layered disaster‑recovery drill platform—combining an atomic fault library, scenario catalogs, chaos‑experiment orchestration, real‑time observation, and a product‑level interface—backed by standardized governance and CI‑integrated automation, cutting drill preparation from weeks to days and boosting weekly resilience testing across the organization.

Disaster Recovery · High Availability · site reliability
0 likes · 39 min read
Cognitive Technology Team
Nov 15, 2024 · Operations

Building Redundancy in Applications to Avoid Single Points of Failure

The article explains how to design resilient applications by identifying critical paths, adding redundant components, using formulas for overall availability, and applying best‑practice recommendations such as multi‑zone/region deployment, load‑balanced VMs, database replication, and thorough testing of failover mechanisms.

High Availability · cloud architecture · load balancing
0 likes · 6 min read
ByteDance Cloud Native
Nov 8, 2024 · Databases

Designing Reliable Cross-Cloud Database Disaster Recovery with Volcano Engine

This article explains how to design and implement cross-cloud database disaster recovery, covering background goals, common challenges, step-by-step migration stages, the role of Volcano Engine’s Database Transmission Service, cold-hot separation, HTAP analysis, and practical business value with real-world examples.

DTS · Disaster Recovery · High Availability
0 likes · 12 min read
Tencent Cloud Middleware
Oct 22, 2024 · Operations

Scaling Apache Pulsar on Tencent Cloud: Multi‑Network Access, Cluster Migration & HA Tips

This article details Tencent Cloud engineers' technical solutions for large‑scale Apache Pulsar deployments, covering multi‑network access challenges, a routing‑addressing redesign, product deployment models, a four‑step cluster migration process with subscription‑progress compensation, and high‑availability best practices such as rack‑aware and cross‑AZ replica distribution.

Apache Pulsar · Cluster Migration · High Availability
0 likes · 11 min read
Architect
Oct 17, 2024 · Operations

Designing Multi‑Active Distributed Systems: Key Factors and Replication Strategies

This article analyzes the architectural challenges of building large‑scale distributed systems with multi‑active (cross‑city) capabilities, focusing on data‑layer design, write latency, replication models, sharding techniques, and routing impacts to guide reliable, high‑performance infrastructure decisions.

Data Replication · High Availability · architecture
0 likes · 22 min read
ITPUB
Oct 15, 2024 · Databases

Choosing the Right Database High‑Availability Architecture: Lessons from GBase 8s

The article explores the evolution of database high‑availability architectures, compares mainstream solutions like Oracle's HA, RAC and ADG, examines domestic offerings such as GBase 8s with HAC, RHAC and SSC clusters, and provides practical guidance for selecting cost‑effective HA designs to ensure continuous business operations.

Enterprise · GBase · HA Architecture
0 likes · 14 min read
Tencent Cloud Developer
Oct 15, 2024 · Industry Insights

Why Write Latency Drives Multi‑Active Distributed Architecture Design

This article analyzes how write latency, write volume, isolation, and data replication strategies influence the design of multi‑active distributed systems, offering practical guidance on sharding, synchronous and asynchronous replication, routing, and architecture selection for high availability and performance across regions.

Data Replication · High Availability · Sharding
0 likes · 23 min read
IT Services Circle
Oct 4, 2024 · Databases

Understanding Redis Split‑Brain: Causes, Data Loss, and Prevention Strategies

This article explains Redis split‑brain behavior, describing its definition, causes such as network failures and Sentinel elections, the resulting data loss during master‑slave switches, and practical prevention measures including quorum configuration, timeout tuning, network monitoring, proxy layers, and the min‑slaves‑to‑write and min‑slaves‑max‑lag settings.

High Availability · Master‑Slave · Sentinel
0 likes · 7 min read
Alibaba Cloud Infrastructure
Sep 30, 2024 · Cloud Native

Best Practices for High Availability and Stability in Alibaba Cloud Container Service for Kubernetes (ACK)

This article presents a comprehensive overview of high‑availability design patterns and best‑practice recommendations for Alibaba Cloud Container Service for Kubernetes (ACK), covering common error scenarios, single‑cluster and multi‑cluster architectures, workload resilience, monitoring, and real‑world case studies.

ACK · Cloud Native · High Availability
0 likes · 13 min read
Open Source Linux
Sep 20, 2024 · Databases

Redis Master‑Slave Replication and Sentinel: How They Work and Scale

This article explains Redis master‑slave replication, synchronization steps, handling of network partitions, and how Sentinel provides automatic failover through monitoring, leader election, and notification, offering strategies to reduce master load and ensure high availability.

High Availability · Master‑Slave · Redis
0 likes · 9 min read
Bilibili Tech
Sep 10, 2024 · Backend Development

Design and Implementation of a Scalable Reward System for Bilibili Live Platform

The paper presents a scalable, message‑queue‑driven reward system for Bilibili Live that unifies diverse reward types and distribution scenarios through standardized APIs, layered fast/slow queues, idempotent processing, multi‑stage retries, and comprehensive monitoring to ensure low latency, no over‑issuance, and reliable delivery.

Bilibili · High Availability · Message Queue
0 likes · 16 min read
dbaplus Community
Sep 7, 2024 · Operations

What Hidden Costs Do You Face When Chasing 5‑Nines Availability?

Achieving five‑nines (99.999%) uptime demands massive capital, operational, and human investment. This article breaks down the infrastructure, monitoring, testing, and staffing expenses and explains why the marginal benefits diminish sharply as availability targets rise.

High Availability · Operational Cost · availability engineering
0 likes · 8 min read
JD Tech Talk
Sep 4, 2024 · Backend Development

Methodology and Practices for Building High‑Performance, High‑Concurrency, High‑Availability Backend Systems

This article shares a backend‑centric methodology and practical experiences for constructing systems that simultaneously achieve high performance, high concurrency, and high availability, covering performance optimization, read/write strategies, scaling techniques, fault‑tolerance mechanisms, and deployment considerations.

High Availability · High concurrency · Microservices
0 likes · 24 min read
JD Tech
Sep 3, 2024 · Backend Development

Designing High‑Performance, High‑Concurrency, High‑Availability Backend Systems: Methodologies and Practices

This article shares a backend engineer’s comprehensive methodology and practical experiences for building systems that simultaneously achieve high performance, high concurrency, and high availability, covering performance optimization, caching strategies, scaling techniques, fault tolerance, and operational best practices across application, storage, and deployment layers.

High Availability · High concurrency · System Design
0 likes · 28 min read
Volcano Engine Developer Services
Sep 2, 2024 · Operations

How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active

This article details ByteDance’s disaster‑recovery evolution—from a single‑room deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures—explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.

Disaster Recovery · High Availability · Operations
0 likes · 15 min read
macrozheng
Aug 23, 2024 · Databases

NewSQL vs Middleware Sharding: Which Architecture Truly Wins?

This article objectively compares NewSQL databases with middleware‑based sharding solutions, examining architecture, distributed transactions, CAP constraints, high availability, scaling, SQL support, storage engines, and maturity to help readers choose the right approach for their workloads.

CAP theorem · High Availability · NewSQL
0 likes · 19 min read
Linux Ops Smart Journey
Aug 22, 2024 · Cloud Native

How to Deploy a Highly Available Harbor Registry on Kubernetes with Helm

Learn step‑by‑step how to set up a production‑grade, highly available Harbor container registry on a Kubernetes cluster using Helm, covering prerequisites, architecture, chart installation, TLS certificate creation, secret management, PostgreSQL setup, and verification procedures.

Cloud Native · Container Registry · High Availability
0 likes · 10 min read
360 Zhihui Cloud Developer
Aug 8, 2024 · Big Data

How to Migrate HBase and HDFS Clusters Safely Without Downtime

This guide details a step‑by‑step migration plan for HBase and HDFS clusters, covering background, high‑availability architecture, role assignments, expansion and shrinkage of ZooKeeper and JournalNode, NameNode and DataNode migration, rolling restarts, and common upgrade pitfalls.

Big Data · Cluster Migration · HBase
0 likes · 12 min read
Liangxu Linux
Aug 3, 2024 · Operations

Build a Highly Available Web Cluster with LVS and Keepalived on CentOS

This guide explains how to create a high‑availability web load‑balancing cluster using Linux Virtual Server (LVS) and Keepalived on CentOS, covering background, terminology, environment setup, detailed configuration steps for master and backup nodes, real‑server preparation, HA testing, and final conclusions.

CentOS · High Availability · IPVS
0 likes · 12 min read
Linux Ops Smart Journey
Jul 30, 2024 · Cloud Native

Unveiling Kubernetes: Inside the Cosmic Architecture Powering Cloud Native Apps

In the era of digital transformation, Kubernetes has become essential to modern cloud computing. This article demystifies its inner workings by detailing its master and node components, service discovery, storage orchestration, networking, high availability, flexible resource management, and thriving ecosystem.

Cloud Native · Container Orchestration · High Availability
0 likes · 5 min read
Liangxu Linux
Jul 29, 2024 · Databases

How to Build a Reliable MySQL Master‑Master Cluster with Keepalived Failover

This guide walks through the complete process of creating a MySQL dual‑master replication cluster, configuring replication users, synchronizing binary logs, setting up keepalived for virtual IP failover, and testing both data consistency and high‑availability monitoring.

High Availability · Keepalived · Master-Master Replication
0 likes · 8 min read
Efficient Ops
Jul 28, 2024 · Operations

Building a Resilient, High‑Performance Website: Domains, CDN, Security & Ops

This guide outlines a comprehensive, step‑by‑step strategy for creating a highly available, secure, and scalable website—from buying and protecting multiple domains, configuring DNS and CDN, setting up image and database servers, to implementing monitoring, redundancy, high‑concurrency testing, and disaster‑recovery plans.

CDN · High Availability · Monitoring
0 likes · 13 min read
Tencent Cloud Developer
Jul 25, 2024 · Databases

Redis: Features, Use Cases, Evolution, Architecture, Data Types, Commands, and Tencent Cloud Redis

Redis is a high‑performance, in‑memory NoSQL key‑value store offering persistence, rich data types, advanced structures, and robust commands, supporting caching, session storage, pub/sub, and leaderboards, while evolving through replication, Sentinel, clustering, and multithreaded proxies, with Tencent Cloud providing scalable, highly available managed Redis services.

Cloud Services · Data Structures · High Availability
0 likes · 9 min read
JD Cloud Developers
Jul 24, 2024 · Operations

How JD.com’s Buffalo Scheduler Achieves High‑Performance, Scalable DAG Orchestration

Buffalo, JD.com’s in‑house distributed DAG scheduler, tackles massive task volumes and complex dependencies through a dual‑layer entity model, instance‑based execution, tiered scheduling, a high‑availability architecture, event‑driven processing, and in‑memory and cold‑hot data separation, delivering scalable, low‑latency ETL orchestration.

DAG scheduling · ETL orchestration · High Availability
0 likes · 12 min read
JD Tech
Jul 23, 2024 · Big Data

Design and Architecture of JD's Buffalo Distributed Workflow Scheduling System

This article examines JD's self‑developed Buffalo distributed workflow scheduling system for big‑data ETL, detailing its two‑layer entity model, instance‑based scheduling, high‑availability three‑layer architecture, performance optimizations, cold‑hot data separation, and open APIs to support massive, complex data pipelines.

Big Data · High Availability · Scheduling
0 likes · 11 min read
JD Retail Technology
Jul 22, 2024 · Big Data

Design and Architecture of JD's Buffalo Distributed Workflow Scheduling System

The article introduces JD's Buffalo distributed workflow scheduling system, detailing its dual-layer entity model, instance-based scheduling, high‑availability three‑tier architecture, performance optimizations such as horizontal scaling and event‑driven execution, as well as cold‑hot data separation and open APIs for future enhancements.

Buffalo · Distributed Scheduling · High Availability
0 likes · 10 min read
Architecture and Beyond
Jul 21, 2024 · Operations

Mastering Backend Stability: 7 Essential Practices for High Availability

This comprehensive guide outlines the seven key pillars—operations, high‑availability architecture, capacity governance, change management, risk governance, fault management, and chaos engineering—that together form a systematic approach to building and maintaining a reliable, 24‑hour backend system.

Change Management · High Availability · Operations
0 likes · 40 min read
Huolala Tech
Jul 11, 2024 · Operations

How LApiGateway Achieves 99.999% Uptime: Architecture, SLA & Risk Mitigation

LApiGateway, Huolala's internal micro‑service gateway, achieves five‑nine availability through a dual‑plane architecture, comprehensive monitoring, SLA definition, risk classification, heartbeat health checks, traffic migration strategies, strict change governance, and regular fault drills, all detailed in this technical overview.

High Availability · LApiGateway · Microservice Gateway
0 likes · 9 min read
Su San Talks Tech
Jul 6, 2024 · Backend Development

Mastering High Availability: 10 Essential Design Techniques for Scalable Systems

This article explains ten core techniques—system splitting, decoupling, asynchrony, retry, compensation, backup, multi‑active strategies, isolation, rate limiting, circuit breaking, and degradation—that together enable robust, high‑availability architectures for modern backend services.

High Availability · System Design · distributed systems
0 likes · 12 min read
Ctrip Technology
Jul 5, 2024 · Backend Development

Design and Optimization of Ctrip Ticket Booking Transaction System for Flash‑Sale Events

This article examines the challenges faced by Ctrip’s ticket reservation transaction system during flash‑sale events and details the architectural optimizations—including Redis caching, database load reduction, supplier integration, and multi‑layer traffic throttling—that ensure system stability, strong consistency, and high availability under extreme concurrency.

Data Consistency · High Availability · High concurrency
0 likes · 16 min read
Aikesheng Open Source Community
Jun 27, 2024 · Databases

Evaluation of OceanBase Arbitration Service in a 2F1A Deployment: Fault Injection Experiments and Recovery Procedures

This article presents a detailed experimental study of OceanBase's Arbitration Service in a 2F1A (two full‑function replicas plus one arbitration node) configuration, examining how the system behaves when one or both full‑function replicas fail, how log‑stream degradation and permanent offline mechanisms work, and how normal service is restored after node recovery.

Arbitration Service · Fault Injection · High Availability
0 likes · 17 min read
Top Architect
Jun 26, 2024 · Backend Development

High Availability Traffic Governance: Circuit Breakers, Isolation, Retries, Timeouts, and Rate Limiting

This article explains how to achieve high availability in microservice systems through traffic governance techniques such as circuit breakers, various isolation strategies, retry mechanisms, timeout controls, and rate limiting, illustrating each concept with examples, formulas, and pseudo‑code.

High Availability · circuit breaker · rate limiting
0 likes · 31 min read
Architect
Jun 24, 2024 · Operations

Traffic Governance and High‑Availability Strategies for Microservices

This article explains how traffic governance—including circuit breaking, isolation, retry mechanisms, degradation, timeout control, and rate limiting—helps microservice systems achieve the three‑high goals of high performance, high availability, and easy scalability, using concrete formulas, algorithms, and practical examples.

High Availability · Microservices · degradation
0 likes · 29 min read
ITPUB
Jun 15, 2024 · Databases

Resolving Oracle RAC VIP Failover and SCAN IP Load‑Balancing Issues

This article walks through real‑world Oracle RAC failures caused by misconfigured VIP failover and SCAN IP load‑balancing, explains how to diagnose the symptoms, provides correct TAF and listener settings, and highlights essential configuration tips to ensure reliable high‑availability operation.

Database Configuration · High Availability · Oracle
0 likes · 9 min read
iQIYI Technical Product Team
Jun 14, 2024 · Operations

Stability Assurance Practices for the 2024 CCTV Spring Festival Gala Live Stream

The 2024 CCTV Spring Festival Gala live stream employed comprehensive stability assurance practices across signal encoding, CDN distribution, request handling, and playback—using multi‑source encoding, multi‑level origin redundancy, multi‑cluster HA, and P2P‑augmented delivery—to handle massive QPS spikes, ensure high availability, and provide a resilient, high‑quality viewing experience.

CDN · High Availability · P2P
0 likes · 24 min read
Mike Chen's Internet Architecture
Jun 12, 2024 · Backend Development

Comprehensive Guide to Nginx Configuration, Reverse Proxy, Load Balancing, and High‑Availability Clusters

This article provides a detailed tutorial on Nginx, covering its core features, configuration file structure, and practical examples for reverse proxy, load balancing, static‑dynamic separation, and high‑availability clustering with code snippets and deployment steps.

Backend Development · High Availability · NGINX
0 likes · 11 min read
Architecture Breakthrough
Jun 11, 2024 · R&D Management

Why Your Technical Presentation Fails and How the MECE Framework Saves It

The article reveals common pitfalls engineers face when presenting technical solutions—over‑focusing on details, ignoring business value and operational concerns—and shows how applying the MECE principle across value, technology, project, and operation dimensions creates a complete, persuasive report.

High Availability · MECE framework · communication skills
0 likes · 7 min read
Tencent Cloud Developer
Jun 7, 2024 · Cloud Native

Multi-AZ High‑Availability Architecture of Tencent Cloud TDMQ for Apache Pulsar

Tencent Cloud TDMQ for Apache Pulsar achieves multi‑AZ high availability by containerizing ZooKeeper, BookKeeper and Brokers, using managed ZK, persistent cloud disks and elastic NICs, enforcing quorum and rack‑aware replicas, and planning cross‑region bidirectional replication to ensure seamless disaster recovery and continuous messaging.

Cloud Native · High Availability · Multi‑AZ
0 likes · 15 min read
Sanyou's Java Diary
Jun 3, 2024 · Backend Development

Understanding the Full Lifecycle of a RocketMQ Message: From Production to Deletion

This article walks through every stage of a RocketMQ message—from producer creation, routing, queue selection, and storage with zero‑copy techniques, through high‑availability replication, consumption modes, ordering guarantees, and finally automatic cleanup—providing code examples and architectural diagrams for each step.

Backend Development · High Availability · RocketMQ
0 likes · 26 min read
Bilibili Tech
May 31, 2024 · Backend Development

Design and High‑Availability Practices of Bilibili's Video Submission System

Bilibili’s video submission platform uses a layered micro‑service architecture with a DAG‑based scheduler, extensive observability, and HA tactics such as sharding, 64‑bit ID migration, full‑link stress tests, chaos engineering, and multi‑active data‑center deployment, while tooling like trace correlation and automated alerts ensures stability and guides future hybrid‑cloud migration.

Bilibili · DAG · High Availability
0 likes · 35 min read
Su San Talks Tech
May 30, 2024 · Backend Development

Why Single‑Server Apps Fail: Master Load Balancing with Nginx and LVS

This article walks through the evolution from a single‑Tomcat deployment to a multi‑layer load‑balancing architecture using Nginx, a gateway, LVS, and DNS, explaining static‑dynamic separation, high‑availability strategies, and performance trade‑offs for scalable backend systems.

High Availability · LVS · NGINX
0 likes · 11 min read
Efficient Ops
May 28, 2024 · Operations

How to Build a Resilient High‑Traffic Website: Domains, CDN, Monitoring, and Security

This guide outlines practical steps for creating a highly available, secure, and scalable website—including domain strategy, CDN deployment, image caching, data‑center selection, monitoring, attack mitigation, redundancy, server configuration, database replication, testing environments, disaster‑recovery planning, and high‑concurrency testing.

High Availability · Monitoring · website infrastructure
0 likes · 12 min read
ITPUB
May 24, 2024 · Databases

Master PostgreSQL High Availability with Pacemaker & Corosync: A Step‑by‑Step Guide

This tutorial walks through building a PostgreSQL high‑availability cluster using Pacemaker and Corosync, covering architecture overview, component installation, cluster status checks, data synchronization verification, failover handling, and common maintenance commands with concrete commands and screenshots.

Corosync · High Availability · Pacemaker
0 likes · 7 min read
iQIYI Technical Product Team
May 24, 2024 · Operations

High Availability and Disaster Recovery Practices of iQIYI's Video Relay Service (VRS)

iQIYI’s Video Relay Service ensures uninterrupted video playback by employing a two‑region, three‑center hybrid cloud architecture, multi‑layer storage, cross‑AZ retry mechanisms, protective rate‑limiting and degradation paths, layered monitoring, and rigorous stress‑testing and chaos engineering to achieve high availability and disaster recovery.

Cloud Native · Disaster Recovery · High Availability
0 likes · 18 min read
Laravel Tech Community
May 21, 2024 · Databases

MongoDB Replication Set and Sharding Configuration Guide

This article provides a comprehensive step‑by‑step guide to setting up MongoDB replica sets and sharded clusters, explaining the architecture, member roles, configuration files, initialization commands, and operational procedures for ensuring data redundancy, high availability, and horizontal scaling.

High Availability · MongoDB · Sharding
0 likes · 29 min read
DevOps Operations Practice
May 19, 2024 · Operations

High‑Availability Solutions for Prometheus Monitoring

Prometheus, a leading monitoring system, can achieve high availability through several common architectures—including dual-node with external storage, federated mode with external storage, and multi-node clusters combined with Thanos and object storage—each offering data persistence and load distribution to enhance system stability and performance.

External Storage · High Availability · Thanos
0 likes · 3 min read
MaGe Linux Operations
May 19, 2024 · Databases

How to Deploy Xenon: A Raft‑Based MySQL HA Solution with Semi‑Sync and Parallel Replication

This guide walks through deploying Xenon, an open‑source Raft‑based MySQL high‑availability solution, covering environment setup, installation of Go and Percona XtraBackup, configuring Xenon’s JSON, starting the cluster, monitoring status, and troubleshooting backup failures caused by misconfigured host settings.

Go · High Availability · Raft
0 likes · 8 min read
Cognitive Technology Team
May 16, 2024 · Operations

Core Principles of High‑Availability Architecture Design

These core principles—minimal dependency, weak dependency, distribution, rate limiting, degradable design, balanced risk, fault prevention and isolation, no single point of failure, self‑protection, automatic failover, and retry/idempotency/compensation—guide the design of highly available systems by reducing risk, ensuring redundancy, and protecting services at all layers.

High Availability · Operations · Reliability
0 likes · 3 min read
Selected Java Interview Questions
May 10, 2024 · Databases

Comparing NewSQL Databases with Middleware‑Based Sharding: Advantages, Limitations, and Practical Guidance

This article objectively compares NewSQL databases and middleware‑plus‑sharding architectures, examining their core principles, distributed transaction handling, high‑availability mechanisms, scaling and sharding strategies, SQL support, storage engines, and maturity to help engineers decide which solution fits their workload.

Database Architecture · High Availability · NewSQL
0 likes · 18 min read
Programmer XiaoFu
May 10, 2024 · Databases

From Single Node to Tank: 20 Diagrams of Redis Architecture Evolution

This article walks through Redis's architectural journey—from a lone instance to a high‑availability, high‑performance cluster—covering persistence (RDB, AOF, hybrid), master‑slave replication, Sentinel automatic failover, sharding strategies, and the modern Redis Cluster design.

High Availability · Persistence · Redis
0 likes · 19 min read
Sanyou's Java Diary
May 9, 2024 · Databases

From Single Node to Cluster: Mastering Redis Architecture Evolution

This article walks you through Redis’s architectural journey—from a simple single‑node setup, through persistence mechanisms, master‑slave replication, Sentinel‑driven automatic failover, and finally sharding with Redis Cluster—explaining each component’s purpose, trade‑offs, and how they collectively boost performance and reliability.

High Availability · Persistence · Redis
0 likes · 18 min read
DevOps Cloud Academy
May 6, 2024 · Cloud Native

How to Deploy a Highly Available Application on Kubernetes

This article explains key Kubernetes configurations—such as pod replicas, pod anti‑affinity, deployment strategies, graceful termination, probes, resource allocation, scaling, and disruption budgets—to achieve high availability and zero‑downtime deployments for containerized applications in production.

Cloud Native · High Availability · Probes
0 likes · 20 min read
DataFunTalk
Apr 30, 2024 · Big Data

Vivo's Evolution of Large‑Scale Distributed Messaging Middleware Architecture and Practices

This technical presentation details Vivo's end‑to‑end big‑data architecture, the evolution from Kafka to Pulsar for massive message processing, deployment strategies, high‑availability mechanisms, observability practices, and future plans for cloud‑native, containerized messaging middleware.

Distributed Messaging · High Availability · Observability
0 likes · 18 min read
JD Retail Technology
Apr 26, 2024 · Operations

How Isolation Principles Boost System High Availability: Real-World Cases

This article explains the concept of high availability, defines the isolation principle, outlines its implementation across various layers, and presents concrete case studies—including vertical data‑center redesign, dual‑cluster Elasticsearch migration, traffic grouping, and hot‑cold data segregation—to illustrate how isolation improves system resilience.

Case Study · High Availability · Operations
0 likes · 15 min read
Java Captain
Apr 26, 2024 · Databases

Choosing Between Sharding Middleware and NewSQL Distributed Databases: Advantages, Trade‑offs, and Use Cases

This article objectively compares middleware‑based sharding with modern NewSQL distributed databases, examining their architectural differences, performance, transaction support, scalability, high‑availability, and operational considerations, to help practitioners decide which approach best fits their workload and organizational constraints.

Database Architecture · High Availability · NewSQL
0 likes · 20 min read
dbaplus Community
Apr 25, 2024 · Operations

How We Built Same‑City Active‑Active Architecture for a High‑Volume Transaction Platform

This article details the background, design principles, overall architecture, concrete refactoring steps, launch process, results, and emerging challenges of implementing a same‑city active‑active solution to improve reliability, load balancing, disaster recovery, and cost efficiency for a large‑scale transaction system.

Active-Active · Blue-Green Deployment · High Availability
0 likes · 23 min read
Architect
Apr 22, 2024 · Operations

Flow Governance and High‑Availability Strategies for Microservice Systems

This article explains how to achieve high availability in microservice architectures by applying flow governance techniques such as circuit breaking, isolation, retry policies, degradation, timeout management, and rate limiting, while detailing key metrics like MTBF and MTTR and providing practical implementation guidance.

Flow Control · High Availability · Microservices
0 likes · 30 min read
Selected Java Interview Questions
Apr 21, 2024 · Backend Development

Designing an Enterprise‑Level Unified Notification Service Architecture

This article systematically outlines the requirements, evolution stages, functional and non‑functional specifications, and component design of a scalable, high‑availability enterprise notification platform that supports multi‑channel push (email, SMS, chat, WeChat, DingTalk, etc.) through a microservice‑based architecture.

High Availability · Notification · architecture
0 likes · 12 min read
Architecture Digest
Apr 19, 2024 · Databases

Comparing NewSQL Distributed Databases with Middleware‑Based Sharding: Advantages, Trade‑offs, and Use Cases

The article objectively compares NewSQL distributed databases with traditional middleware‑based sharding solutions, examining their architectural differences, distributed transaction support, performance, scalability, high‑availability mechanisms, storage engines, and practical suitability for various application scenarios.

CAP theorem · High Availability · NewSQL
0 likes · 18 min read
Efficient Ops
Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article defines stability and high availability, clarifies how the two relate, outlines key performance indicators, and provides a comprehensive framework covering fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices, to help teams build reliable, highly available systems.

High Availability · Monitoring · SRE
0 likes · 10 min read
Mike Chen's Internet Architecture
Apr 11, 2024 · Databases

Mastering Redis Sentinel: Ensuring Automatic High Availability

This article explains Redis Sentinel’s role in providing monitoring, notifications, automatic failover, and configuration updates to achieve high availability, detailing its heartbeat mechanism, master‑down detection, leader election, failover selection criteria, and the trade‑offs of using this solution.

High Availability · Monitoring · Redis
0 likes · 6 min read
Architecture & Thinking
Apr 10, 2024 · Operations

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides automatic monitoring, fault detection, and failover for Redis master‑slave clusters, enabling high availability by electing a new master when the original fails, using sdown/odown states, quorum voting, and pub/sub communication to keep services running with minimal downtime.

High Availability · Monitoring · Sentinel
0 likes · 11 min read
JD Retail Technology
Apr 8, 2024 · Backend Development

Applying the Weak Dependency Principle for High Availability in Microservices

This article explains the weak dependency principle, contrasts it with the minimal‑dependency principle, and presents concrete microservice architecture strategies (module splitting, independent deployment, asynchronous messaging, interface abstraction, fault tolerance, and governance) that improve system flexibility, scalability, and high availability.

High Availability · Microservices · architecture
0 likes · 14 min read
MaGe Linux Operations
Apr 8, 2024 · Operations

Build a Highly Available Load Balancer with LVS and Keepalived

This guide explains how to design and deploy a highly available web load‑balancing cluster using Linux Virtual Server (LVS) together with Keepalived, covering architecture, required software, configuration steps for both master and backup nodes, real‑server setup, and HA testing procedures.

High Availability · Keepalived · LVS
0 likes · 12 min read
Architect
Apr 4, 2024 · Backend Development

Mastering High Availability: 9 Essential Design Techniques for Scalable Systems

The article walks through nine practical techniques—system splitting, decoupling, asynchronous processing, retry, compensation, backup, multi‑active deployment, rate limiting, circuit breaking, and degradation—explaining why each is needed, how they are implemented in real‑world microservice architectures, and what trade‑offs to consider.

High Availability · Microservices · System Design
0 likes · 13 min read
FunTester
Mar 29, 2024 · Operations

Implementing Chaos Engineering in WeChat Pay: Practices, Challenges, and Outcomes

This article describes how WeChat Pay applied chaos engineering to improve system reliability, detailing the business scenario, challenges of controlling fault injection radius, practical solutions, risk assessment, automation, and the resulting business and tool achievements.

Fault Injection · High Availability · Operations
0 likes · 18 min read
DeWu Technology
Mar 25, 2024 · Cloud Native

Design and Implementation of Same‑City Dual‑Active Architecture for a Transaction Platform

The paper details a same‑city dual‑active architecture for a high‑traffic transaction platform, combining blue‑green and dual‑cluster deployment with zone‑aware routing, middleware transformations, and a gradual traffic‑coloring release process that achieved near‑50/50 traffic split, stable performance, minimal cost, and outlines remaining challenges.

Dual-Active · High Availability · Middleware
0 likes · 20 min read
Tencent Cloud Developer
Mar 19, 2024 · Operations

Chaos Engineering in WeChat Pay: Design, Implementation, and Results

WeChat Pay’s team adopted Netflix‑style chaos engineering, building an automated, YAML‑driven fault‑injection platform that isolates experiments in multi‑zone partitions. The platform enabled over 500 safe experiments in 2021‑2022 and uncovered critical bugs across core services while maintaining five‑nines availability and zero production incidents.

Automation · Fault Injection · High Availability
0 likes · 18 min read
dbaplus Community
Mar 18, 2024 · Operations

How to Build a Resilient, High‑Traffic Web Infrastructure: A Step‑by‑Step Ops Guide

This guide outlines a complete, practical workflow for acquiring multiple domains, configuring DNS, deploying CDN and image caches, selecting data‑center locations, setting up redundant servers, implementing monitoring, handling DDoS attacks, planning capacity, securing systems, and organizing an operations team to ensure high availability for large‑scale web services.

CDN · High Availability · Monitoring
0 likes · 12 min read
Huolala Tech
Mar 14, 2024 · Cloud Native

HuoLala’s Cost‑Effective Multi‑Zone High Availability via Multi‑Lane Architecture

This article explains how HuoLala designed a cost‑effective multi‑zone high‑availability solution called the multi‑lane architecture, detailing its goals, deployment of services across availability zones, use of Consul for service discovery, Apollo for configuration, traffic scheduling strategies, and how it differs from traditional active‑active setups.

Cloud Native · High Availability · configuration management
0 likes · 13 min read
Linux Cloud Computing Practice
Mar 13, 2024 · Databases

Unlocking Redis: Architecture, High Availability, and Persistence Explained

This article provides a comprehensive overview of Redis, covering its core concepts, deployment architectures—including single instance, high‑availability, Sentinel, and cluster setups—its replication mechanisms, gossip protocol, and the various persistence options such as RDB, AOF, and fork‑based snapshots.

High Availability · In-Memory Database · Persistence
0 likes · 17 min read
Mike Chen's Internet Architecture
Mar 6, 2024 · Cloud Computing

Understanding IaaS: Definition, Features, Core Technologies, and Application Scenarios

This article provides a comprehensive overview of IaaS, detailing its definition, core characteristics, underlying technologies such as virtualization and automation, and common use cases, while highlighting benefits like cost reduction, elasticity, high availability, and security in cloud environments.

Automation · Cloud Computing · High Availability
0 likes · 8 min read
Architects' Tech Alliance
Feb 24, 2024 · Operations

How the Two‑Site Three‑Center Disaster Recovery Model Boosts Business Continuity

The article explains the two‑site three‑center disaster‑recovery architecture—comprising a production site, a same‑city backup, and a remote backup—detailing synchronous and asynchronous data replication, failover capabilities, Oracle Data Guard implementation, and why this hybrid approach delivers superior RPO, RTO, and availability for enterprises.

Disaster Recovery · High Availability · Oracle Data Guard
0 likes · 6 min read
MaGe Linux Operations
Feb 14, 2024 · Databases

Unlocking Redis: Core Concepts, Architecture, and Persistence Explained

This article introduces Redis as an in‑memory key‑value data‑structure server, explains its primary use cases, walks through deployment options such as single instances, high‑availability, Sentinel and Cluster, and details its persistence mechanisms including RDB, AOF and forking.

Caching · High Availability · In-Memory Database
0 likes · 16 min read
Architects' Tech Alliance
Feb 13, 2024 · Operations

What Makes Enterprise Storage Systems Reliable and Scalable?

As enterprise data volumes surge, modern storage systems must deliver high availability, fault tolerance, multi‑protocol support, backup, snapshot, and cloning capabilities, often through distributed architectures that boost reliability, scalability, and cost efficiency while ensuring rapid data recovery.

Enterprise Storage · High Availability · Storage Systems
0 likes · 4 min read
ITPUB
Feb 13, 2024 · Databases

Achieve Seamless Second‑Level Database Scaling for High‑Throughput Microservices

This guide explains how to design a high‑concurrency, high‑throughput internet architecture that keeps the database highly available through dual‑master synchronization and virtual IPs, and how to shard horizontally and expand the cluster smoothly within seconds using configuration changes, reloads, and cleanup steps.

Databases · High Availability · Microservices
0 likes · 8 min read
JavaEdge
Feb 7, 2024 · Backend Development

Designing a High‑Availability Payment System: Flow, Optimization, and Fault Tolerance

This article details the end‑to‑end design of a payment system, covering transaction flow, horizontal and vertical pre‑optimizations, task scheduling, sharding strategies, data structures, high‑availability mechanisms such as channel isolation and Hystrix, and future planning for dynamic scaling and intelligent routing.

Elastic-Job · High Availability · Hystrix
0 likes · 12 min read
MaGe Linux Operations
Feb 7, 2024 · Databases

How to Build a Real‑Time Data Guard System for Dameng Database

This guide walks through setting up a Dameng data‑guard service using a primary, standby, and monitor server, covering data preparation, configuration of dm.ini, dmmal.ini, dmarch.ini, dmwatcher.ini, starting services, OGUID setup, mode switching, and monitoring to achieve high‑availability replication.

Dameng · Data Guard · Database Configuration
0 likes · 12 min read
Alibaba Cloud Developer
Feb 1, 2024 · Databases

Why Redis Dominates Modern Caching: Architecture, Strategies, and Pitfalls

This article provides a comprehensive technical overview of Redis, covering its high‑performance in‑memory design, rich data structures, persistence options, transaction support, eviction policies, common caching patterns, distributed locking techniques, and high‑availability solutions such as Sentinel and Cluster, while also comparing it with alternatives like Memcached, Tair, Guava, EVCache and ETCD.

High Availability · Persistence · Redis
0 likes · 35 min read
Baidu Geek Talk
Jan 29, 2024 · Databases

BTS (Baidu Table Storage): Architecture and Core Technologies

BTS (Baidu Table Storage) is Baidu Intelligent Cloud’s high‑performance, low‑cost semi‑structured NoSQL service that evolved from single‑table to multi‑model (wide tables, time‑series, soon documents), featuring a three‑layer compute‑storage separation architecture, multi‑level caching, hot‑backup HA, and supporting massive IoT, AI, autonomous‑driving and monitoring workloads.

BTS · Baidu Table Storage · Database Architecture
0 likes · 21 min read
Efficient Ops
Jan 23, 2024 · Operations

Why Building Truly High‑Availability Systems Is Harder Than You Think

The article examines why 2023 saw a surge in major online outages, linking layoffs and cost‑cutting to lost expertise, and explores how entropy and Murphy's law make perpetual high availability impossible without continuous, systematic investment and cultural change.

High Availability · SRE · system reliability
0 likes · 13 min read