Tagged articles
310 articles
Java High-Performance Architecture
Jan 24, 2023 · Backend Development

How to Build Highly Available Backend APIs: 10 Essential Design Principles

This article explains why high availability is crucial for backend services and outlines ten practical design principles (dependency control, avoiding single points of failure, load balancing, isolation, rate limiting, circuit breaking, asynchronous processing, degradation, gray release, and chaos engineering) to help developers create resilient APIs.

Backend · api-design · fault tolerance
0 likes · 10 min read
Architecture Digest
Jan 19, 2023 · Backend Development

Designing High‑Availability Backend Interfaces

The article explains why high availability is essential for backend services, defines its core concepts, and outlines key design principles such as minimizing dependencies, avoiding single points of failure, load balancing, resource isolation, rate limiting, circuit breaking, asynchronous processing, degradation strategies, gray releases, and chaos engineering to build resilient APIs.

Reliability · fault tolerance · service design
0 likes · 9 min read
ITPUB
Jan 12, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down the essential design and operational considerations for achieving high availability across six layers (development standards, application services, storage, product strategy, operations deployment, and incident response), providing concrete practices, metrics, and safeguards to reach four‑nines (99.99%) uptime.

Operations · System Design · capacity planning
0 likes · 25 min read
Tencent Cloud Developer
Jan 5, 2023 · Cloud Native

QQ Music High-Availability Architecture Overview

QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.

Distributed Systems · Microservices · Observability
0 likes · 24 min read
Architecture Digest
Dec 21, 2022 · Operations

Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design from development standards, capacity planning, application services, storage, product strategies, operations deployment, to incident response, presenting key concepts, architectural patterns, and practical guidelines for building resilient services.

Deployment · Operations · System Design
0 likes · 27 min read
Java High-Performance Architecture
Dec 6, 2022 · Cloud Native

How to Build Resilient Microservices: Patterns for Fault Tolerance and High Availability

Learn essential techniques for designing fault‑tolerant microservices, including graceful degradation, change management, health checks, self‑healing, failover caching, retry strategies, rate limiting, circuit breakers, and testing failures, to ensure high availability and reliability in distributed cloud‑native systems.

Operations · Reliability · cloud-native
0 likes · 15 min read
High Availability Architecture
Dec 2, 2022 · Operations

High‑Availability Design and Implementation of the BIGO Backbone Network

This article explains how BIGO’s backbone network achieves high availability through a three‑layer design: control‑plane HA using etcd‑based Raft leader election, data‑plane HA with MPLS SR‑Policy and intermediate Route‑Reflection layers, and business‑level HA that combines traffic, optimization, and fault scheduling to ensure seamless service continuity.

MPLS · SDN · SR-Policy
0 likes · 19 min read
Architecture & Thinking
Dec 2, 2022 · Cloud Native

Mastering Hystrix: A Deep Dive into Circuit Breaker, Fallback, and Isolation Strategies

This article provides a comprehensive guide to Hystrix, covering its purpose in microservice fault tolerance, the problems it addresses, core concepts like command pattern and isolation, detailed workflow steps, configuration options, and practical Java code examples for circuit breaking, fallback, and thread‑pool or semaphore isolation.

Hystrix · Java · Microservices
0 likes · 21 min read
Code Ape Tech Column
Nov 8, 2022 · Operations

Designing Resilient Microservices: Fault Tolerance, Health Checks, and Reliability Patterns

This article explains how to build highly available microservice systems by addressing the risks of distributed architectures, employing graceful degradation, change management, health checks, self‑healing, failover caching, retry and rate‑limiting strategies, bulkhead and circuit‑breaker patterns, and continuous failure testing.

Deployment Strategies · fault tolerance · health checks
0 likes · 18 min read
Open Source Linux
Oct 26, 2022 · Fundamentals

Choosing the Right RAID Level: Pros, Cons, and Best Use Cases

This guide explains what RAID is, its role in server storage, compares common RAID levels (0, 1, 5, 6, 10) in terms of fault tolerance, performance, and capacity, and offers recommendations for selecting the most suitable RAID configuration based on data safety, speed, and cost considerations.

Data Protection · RAID · fault tolerance
0 likes · 8 min read
DeWu Technology
Oct 17, 2022 · Operations

High Availability: Principles and Practices for System Stability

High availability—measured in nines of uptime—requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, capacity planning, rapid scaling, traffic shaping, resource isolation, global protection, observability, and disciplined change management to achieve stable, resilient services.

Observability · capacity planning · change management
0 likes · 10 min read
Architecture Digest
Oct 10, 2022 · Operations

Designing Fault‑Tolerant Microservices: Patterns and Practices

This article explains how to build highly available microservice systems by applying fault‑tolerance patterns such as graceful degradation, health checks, self‑healing, failover caches, retries, rate limiting, bulkhead isolation, circuit breakers, and systematic failure testing, while also covering change‑management and deployment strategies.

Microservices · circuit breaker · fault tolerance
0 likes · 14 min read
ITPUB
Oct 4, 2022 · Operations

What Makes a System Truly High‑Availability? Lessons from B‑Station’s Outage

The article examines the July 2021 outage at Bilibili (B‑Station), explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to achieve resilient systems.

MTBF · MTTR · circuit breaker
0 likes · 15 min read
Architecture Digest
Sep 25, 2022 · Cloud Native

Designing Microservices Architecture for Failure: Patterns and Practices

This article explains how to build highly available microservices by addressing the inherent risks of distributed systems and presenting fault‑tolerance patterns such as graceful degradation, change management, health checks, self‑healing, failover caching, retries, rate limiting, bulkheads, circuit breakers, and systematic failure testing.

Cloud Native · Microservices · Reliability
0 likes · 14 min read
ITFLY8 Architecture Home
Sep 20, 2022 · Cloud Native

How to Build Fault‑Tolerant Microservices: Essential Patterns and Practices

This article explains why microservice architectures increase failure risk and presents proven techniques—such as graceful degradation, change management, health checks, self‑healing, failover caches, retries, rate limiting, bulkheads, and circuit breakers—to design resilient, fault‑tolerant services.

Microservices · fault tolerance · resilience patterns
0 likes · 15 min read
Big Data Technology Architecture
Sep 18, 2022 · Backend Development

Design and Source Code Analysis of Apache DolphinScheduler

This article provides an in‑depth technical overview of Apache DolphinScheduler, covering its distributed design strategies, fault‑tolerance mechanisms, remote log access, source‑code module breakdown, API interfaces, Quartz integration, master‑worker execution flows, RPC communication, load‑balancing algorithms, logging services, and community contribution guidelines.

Distributed Scheduling · DolphinScheduler · Log Service
0 likes · 47 min read
Top Architect
Sep 4, 2022 · Backend Development

Designing Fault‑Tolerant Microservices Architecture

The article explains how to build highly available microservice systems by isolating failures, applying graceful degradation, change‑management, health checks, self‑healing, fallback caches, circuit breakers, retry policies, rate limiting and testing strategies, while acknowledging the cost and operational complexity involved.

Retry · change management · circuit breaker
0 likes · 16 min read
dbaplus Community
Aug 25, 2022 · Backend Development

Mastering Distributed Locks: From Basics to Redlock and Beyond

This comprehensive guide explains why distributed locks are needed, outlines their three essential properties, compares common implementations such as Redis, MySQL, ZooKeeper, and Redlock, discusses pitfalls like non‑atomic operations and lock expiration, and presents correct patterns using atomic commands, Lua scripts, watchdogs, and fencing tokens.

Lua · Redlock · distributed-lock
0 likes · 37 min read
DaTaobao Tech
Aug 15, 2022 · Cloud Native

Reflections on CAP Theory, ACID, BASE, and Cloud‑Native Fault Tolerance

In these reading notes, the author reviews the CAP theorem’s trade‑offs among consistency, availability, and partition tolerance, extends the discussion to ACID and BASE, proposes recasting CAP’s three properties as consistency, fault tolerance, and disaster tolerance, and examines how cloud‑native architectures, microservices, and SLA‑driven designs reshape fault tolerance and point toward future self‑healing systems.

ACID · BASE · CAP theorem
0 likes · 21 min read
IT Architects Alliance
Aug 11, 2022 · Fundamentals

Key Distributed System Concepts: Bloom Filter, Consistent Hashing, Quorum, Leader/Follower, and More

This article introduces essential distributed‑system mechanisms—including Bloom filters, consistent hashing, quorum, leader/follower roles, heartbeats, fencing, write‑ahead logs, segment logs, high‑water marks, leases, gossip protocols, failure detection, CAP/PACELC theorems, hinted handoff, read‑repair, and Merkle trees—to help engineers design scalable and fault‑tolerant services.

CAP theorem · Consistency · Data Structures
0 likes · 12 min read

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing
0 likes · 7 min read
IT Architects Alliance
Jun 20, 2022 · Cloud Native

Building Resilient Microservices: Fault Tolerance, Graceful Degradation, and Reliability Patterns

This article explains how microservice architectures can achieve high availability by using fault‑tolerant designs such as graceful degradation, health checks, failover caching, circuit breakers, bulkheads, rate limiting, and systematic change‑management practices to mitigate network, hardware, and application errors.

Microservices · Resilience · circuit breaker
0 likes · 13 min read
MaGe Linux Operations
Jun 10, 2022 · Fundamentals

Demystifying Paxos: How Distributed Systems Achieve Consensus

This article explains the Paxos consensus algorithm—its origins, core concepts, roles of proposers, acceptors and learners, safety and liveness constraints, the two-phase protocol, proposal generation, and practical variations—showing why Paxos remains a foundational solution for fault‑tolerant distributed systems.

Consensus Algorithm · Distributed Systems · Paxos
0 likes · 16 min read
Ctrip Technology
Jun 9, 2022 · Databases

Ctrip Order Database Architecture Optimization and Sharding Case Study

This article details Ctrip's comprehensive redesign of its airline ticket order database, covering the background performance bottlenecks, vertical and hot‑cold data splitting, sharding key selection, multi‑level caching, cross‑shard query optimization, dual‑write mechanisms, fault‑tolerance strategies, project planning, and the resulting improvements in scalability and cost.

Dual Write · Performance Optimization · database sharding
0 likes · 37 min read
Architects Research Society
May 22, 2022 · Operations

Designing Resilient Microservices: Fault‑Tolerance Patterns and Practices

This article explains how to build highly available microservice systems by defining clear service boundaries, employing graceful degradation, change‑management strategies, health checks, self‑healing, cache failover, retry logic, rate limiting, bulkheads, circuit breakers, and testing techniques to mitigate failures in distributed environments.

Cloud Native · change management · circuit breaker
0 likes · 15 min read
macrozheng
Apr 14, 2022 · Operations

Mastering High Availability: 4 Essential Design Techniques for Scalable Systems

This article outlines the core high‑availability techniques—system splitting, decoupling, asynchronous processing, retry, compensation, backup, multi‑active strategies, isolation, rate limiting, circuit breaking, and degradation—providing practical guidance for designing resilient, scalable backend architectures in large‑scale internet applications.

Distributed Systems · Microservices · System Design
0 likes · 13 min read
Architect
Mar 11, 2022 · Operations

Rate Limiting, Circuit Breaking, and Service Degradation: Key Fault‑Tolerance Patterns for Distributed Systems

The article explains why distributed systems need fault‑tolerance mechanisms such as rate limiting, circuit breaking, and service degradation, describes common metrics (TPS, HPS, QPS), outlines several rate‑limiting algorithms (counter, sliding window, leaky bucket, token bucket, distributed, and Hystrix‑based), and discusses circuit‑breaker states, considerations, and practical Hystrix usage.

Hystrix · Microservices · circuit breaker
0 likes · 17 min read
IT Architects Alliance
Mar 10, 2022 · Backend Development

Building Resilient Microservices: Patterns and Practices for High Availability

This article explains the risks of microservice architectures and presents a collection of reliability patterns—including graceful degradation, change management, health checks, self‑healing, failover caching, retries, rate limiting, bulkheads, and circuit breakers—to help engineers design and operate highly available backend services.

Backend · Microservices · Resilience
0 likes · 17 min read
IT Services Circle
Feb 12, 2022 · Cloud Computing

Azure Leap‑Year Outage and Leap‑Second Impacts on Cloud Systems

The article analyzes the 2012 Azure outage caused by a leap‑year date bug, explains Azure's cluster and Fabric Controller architecture, discusses common leap‑year and leap‑second pitfalls, and shows how time anomalies can cascade through DNS and other cloud services, illustrated with real code examples.

Azure · DNS · cloud computing
0 likes · 12 min read
IT Architects Alliance
Feb 3, 2022 · Cloud Native

Building a Docker‑Powered Microservice PaaS with Spring Cloud Netflix

This article explains how to design and implement a microservice‑based PaaS platform using Docker containers, Spring Cloud Netflix components such as Zuul, Eureka, and Hystrix, covering service gateway routing, registration and discovery, deployment, fault tolerance, and dynamic configuration.

Docker · Dynamic Configuration · Microservices
0 likes · 14 min read
IT Architects Alliance
Jan 23, 2022 · Operations

Microservice Monitoring, Fault Tolerance, Access Security, and Container Technology Overview

This article provides a comprehensive guide to microservice monitoring (log, tracing, and metrics approaches), fault‑tolerance isolation techniques, access‑security mechanisms such as API gateways and OAuth 2.0, and the role of container technologies like Docker in cloud‑native deployments.

Cloud Native · Containers · Microservices
0 likes · 30 min read
Code DAO
Dec 17, 2021 · Artificial Intelligence

How to Scale XGBoost with Ray for Distributed Multi‑GPU Training

XGBoost‑Ray provides a fault‑tolerant, multi‑node, multi‑GPU backend for XGBoost that integrates seamlessly with Ray Tune, supports distributed data loading, and needs only three code changes to adopt, allowing scalable training and inference on large clusters.

Distributed Training · GPU · Ray
0 likes · 8 min read
Architects Research Society
Dec 9, 2021 · Fundamentals

Key Challenges in Designing Distributed Systems

Designing a distributed system involves overcoming major challenges such as heterogeneity, transparency, openness, concurrency, security, scalability, and fault tolerance, each of which must be addressed to build a reliable, extensible, and performant system.

Distributed Systems · Scalability · Security
0 likes · 7 min read
MaGe Linux Operations
Nov 24, 2021 · Backend Development

Mastering Go Circuit Breakers: Boost System Resilience with gobreaker

This article explains how to use the Go gobreaker library to implement circuit‑breaker patterns, describing its three states, state transitions, configurable parameters, and providing full source‑code examples to help developers improve fault tolerance in micro‑service architectures.

Backend · Go · Microservices
0 likes · 9 min read
Java Architect Essentials
Nov 19, 2021 · Fundamentals

A Comprehensive Guide to Learning Distributed Systems

This article provides a thorough overview of distributed systems, explaining their definition, core concepts such as partition and replication, key challenges, essential characteristics, typical components and protocols, a practical request flow example, and a curated list of real‑world implementations to help readers build a solid learning roadmap.

Consistency · Distributed Systems · Partition
0 likes · 17 min read
Beike Product & Technology
Nov 19, 2021 · Backend Development

Implementing a Hystrix‑Style Circuit Breaker in the PHP Ecosystem: Principles, Design, and Practice

This article explains the problem of service avalanche in distributed systems, introduces the Hystrix circuit‑breaker concept and its four command modes, evaluates existing PHP implementations, and details the design and implementation of a custom hystrix‑ex Composer package that integrates with Guzzle middleware for high‑concurrency fault tolerance.

Backend · Microservices · circuit-breaker
0 likes · 14 min read
DataFunTalk
Nov 13, 2021 · Cloud Native

Designing Cloud‑Native Distributed Database Architecture: Lessons from TiDB

This article explores how to design a cloud‑native distributed database architecture by examining TiDB’s current structure, proposing a storage‑compute separation that leverages cloud services like S3 and EBS, and discussing implications for cost, scalability, fault‑tolerance, and multi‑tenant deployment.

Scalability · TiDB · architecture
0 likes · 14 min read
Full-Stack Internet Architecture
Oct 23, 2021 · Backend Development

Redis Distributed Locks: Safety Issues, Redlock Debate, and Best Practices

This article thoroughly examines how Redis distributed locks work, the safety challenges they face (including deadlocks, lock expiration, and node failures), explores the Redlock algorithm and its controversies, compares Redis with ZooKeeper implementations, and offers practical guidelines and best‑practice solutions for reliable distributed locking.

Redlock · concurrency · distributed-lock
0 likes · 32 min read
Alibaba Cloud Developer
Oct 13, 2021 · Big Data

Why “Exactly‑Once” Doesn’t Guarantee Consistency in Stream Processing

This article examines the true meaning of consistency in stream computing, clarifies common misconceptions about exactly‑once semantics, formalizes consistency challenges, and reviews how major stream engines such as Google MillWheel, Apache Flink, Kafka Streams, and Spark Streaming implement end‑to‑end consistency.

Big Data · Exactly-Once · fault tolerance
0 likes · 29 min read
dbaplus Community
Sep 23, 2021 · Cloud Native

Master Distributed System Design: Patterns, Performance & Fault Tolerance

This article provides a comprehensive overview of distributed system architecture, covering design patterns such as gateways, sidecars and service meshes, performance techniques like caching and sharding, fault‑tolerance mechanisms including rate limiting and circuit breakers, and DevOps practices for deployment and monitoring, all aimed at building resilient cloud‑native applications.

DevOps · Microservices · fault tolerance
0 likes · 15 min read
Architecture Digest
Sep 23, 2021 · Operations

High Availability Practices: From Taobao to Cloud

This talk shares practical high‑availability strategies learned from years of building Taobao’s massive e‑commerce platform and migrating to Alibaba Cloud, covering traditional IDC stability mechanisms, cache and disaster‑recovery designs, cloud‑native fault‑tolerance, capacity planning, rate‑limiting, graceful degradation, and multi‑region resilience.

Distributed Systems · caching · capacity planning
0 likes · 20 min read
NiuNiu MaTe
Sep 8, 2021 · Backend Development

Mastering Distributed Locks with Redis: From Basics to RedLock

This article explains what distributed locks are, outlines their essential properties, walks through step‑by‑step Redis implementations—from simple SETNX to Lua‑based atomic operations—and discusses reliability strategies such as master‑slave failover and RedLock while highlighting the inherent limits of any distributed lock.

Lua scripting · Redlock · distributed-lock
0 likes · 11 min read
Architecture Digest
Aug 22, 2021 · Operations

High Availability Practices: From Taobao to Cloud Migration

This talk shares practical high‑availability design experiences from Alibaba’s e‑commerce platform to its cloud services, covering traditional IDC stability mechanisms, cache and disaster‑recovery strategies, cloud‑native fault handling, capacity planning, traffic shaping, and lessons learned from real incidents.

Alibaba · Distributed Systems · cloud architecture
0 likes · 19 min read
ITFLY8 Architecture Home
Aug 20, 2021 · Operations

From Taobao to the Cloud: Secrets of Building Ultra‑High‑Availability Systems

This talk shares practical high‑availability strategies learned from Alibaba’s Taobao platform and Alibaba Cloud, covering traditional IDC stability, cache and disaster‑recovery designs, cloud‑native fault‑tolerance, performance‑capacity trade‑offs, traffic shaping, multi‑region replication, and lessons from real‑world incidents like GitLab failures.

Alibaba · Performance Optimization · cloud architecture
0 likes · 21 min read
ITFLY8 Architecture Home
Aug 17, 2021 · Backend Development

How Meituan Scaled Instant Logistics with Distributed Systems and AI

This article details Meituan's five‑year journey building a high‑availability, low‑latency instant logistics platform, describing the distributed architecture evolution, AI‑driven optimizations, fault‑tolerance techniques, and future challenges in scaling micro‑services for massive order and rider volumes.

AI logistics · Distributed Systems · Microservices
0 likes · 12 min read
Baidu Intelligent Testing
Jul 29, 2021 · Backend Development

Building High‑Availability Architecture for Baidu Feed Online Recommendation System

This article describes how Baidu engineered a flexible, multi‑level fault‑tolerant architecture (including dynamic retry scheduling, multi‑recall coordination, ranking layer degradation, and cross‑IDC multi‑master storage) to achieve five‑nines availability for its massive feed recommendation service.

Cloud Native · dynamic retry · fault tolerance
0 likes · 16 min read
21CTO
Jul 16, 2021 · Operations

What Bilibili’s Outage Teaches About Achieving True High Availability

The article analyzes Bilibili’s recent service outage, explains why high availability matters, introduces key metrics like MTBF and MTTR, and outlines practical strategies such as redundancy, rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to build resilient systems.

MTBF · MTTR · Operations
0 likes · 18 min read
vivo Internet Technology
Jul 14, 2021 · Backend Development

Hystrix Source Code Analysis: Circuit Breaker, Isolation, and Fallback Mechanisms

Analyzing Hystrix’s source code reveals how its circuit‑breaker, bulkhead isolation (semaphore or thread‑pool), timeout detection, fallback logic, and sliding‑window health metrics work together to prevent cascading failures in distributed systems, as illustrated by an e‑commerce order service calling multiple downstream services.

Distributed Systems · Hystrix · Microservices
0 likes · 20 min read
ITPUB
Jun 29, 2021 · Backend Development

Is Redis Distributed Lock Really Safe? A Deep Dive into Redlock, Pitfalls, and Alternatives

This article thoroughly examines the safety of Redis‑based distributed locks, explains basic SETNX locking, explores deadlock and lock‑release problems, presents robust solutions such as atomic SET with expiration, Lua scripts, and unique tokens, and critically compares Redlock with ZooKeeper while summarizing expert debates and best‑practice recommendations.

Lua · Redlock · fault tolerance
0 likes · 34 min read
Architects Research Society
Jun 16, 2021 · Backend Development

Common Pitfalls in Microservice Integration and How to Mitigate Them

The article explains three frequent pitfalls when adopting microservices—complex remote communication, asynchronous processing challenges, and distributed transaction difficulties—and shows how fast‑fail, retries, timeouts, compensation, lightweight workflow engines, and idempotency can reduce complexity and improve resilience.

Distributed Systems · Idempotency · fault tolerance
0 likes · 13 min read
21CTO
Jun 9, 2021 · Cloud Native

Baidu’s Low‑Intrusion, High‑Performance Service Mesh: Architecture & Lessons

This article details Baidu’s internal service‑mesh deployment, explaining why traditional RPC‑based governance fell short, how a sidecar‑based mesh decouples governance from frameworks, and the technical challenges and solutions for low‑intrusion, high‑performance, fault‑tolerant traffic management across tens of thousands of microservices.

Cloud Native · Microservices · Performance Optimization
0 likes · 18 min read
Baidu Geek Talk
Jun 9, 2021 · Cloud Native

Baidu's Internal Service Mesh Practice: Architecture, Challenges, and Performance Optimizations

Baidu created an internally‑built, Istio‑based service mesh that decouples governance from language‑specific RPC frameworks, offering low‑intrusion integration, ultra‑low latency via a brpc coroutine data plane, advanced fault‑tolerance and fine‑grained traffic scheduling, and now powers over 80% of its core microservices handling more than a trillion daily requests.

Envoy · Istio · Microservices
0 likes · 17 min read
Programmer DD
Jun 4, 2021 · Operations

Mastering Fault‑Tolerant Microservices: Patterns for Reliable Distributed Systems

This article explores essential patterns and techniques—such as graceful degradation, change management, health checks, failover caching, retry logic, rate limiting, circuit breakers, and chaos testing—to build highly available microservice architectures that can withstand network, hardware, and application failures.

Cloud Native · circuit breaker · fault tolerance
0 likes · 15 min read
Amap Tech
May 28, 2021 · Operations

System Observability Practices in Gaode Ride-Hailing: From Unified Logging to Fault Defense

Gaode Ride‑Hailing created a comprehensive 360° observability platform—standardized logging, distributed tracing, multi‑domain metrics, visual dashboards, and an incident workflow—that transforms raw data into actionable insights, accelerates root‑cause analysis, and enables automated fault defense for its large‑scale cloud‑native microservice system.

Distributed Systems · Observability · fault tolerance
0 likes · 22 min read
Architects' Tech Alliance
May 19, 2021 · Operations

Designing Microservices Architecture for Failure: Patterns and Practices

Microservice architectures must handle inevitable network, hardware, and application errors by employing fault‑tolerant patterns such as graceful degradation, change management, health checks, fail‑over caches, retry logic, rate limiting, circuit breakers, and testing strategies to maintain service reliability and user experience.

Microservices · Operations · Reliability
0 likes · 15 min read
JD Tech Talk
May 17, 2021 · Databases

Design and Optimization of Multi‑Data‑Center Redis Synchronization

This article describes the challenges of native Redis in multi‑data‑center deployments and presents the design, implementation, and performance evaluation of a custom Redis extension that adds bidirectional synchronization, rlog logging, protocol enhancements, and conflict‑resolution mechanisms to achieve reliable cross‑region active‑active operation.

Multi-Data Center · data synchronization · fault tolerance
0 likes · 16 min read
Top Architect
Apr 21, 2021 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article chronicles the evolution of an online supermarket from a simple monolithic website to a fully split microservice architecture, discussing the motivations, challenges, design patterns, monitoring, fault tolerance, testing, service discovery, and the eventual adoption of service mesh.

Backend · architecture · fault tolerance
0 likes · 23 min read
vivo Internet Technology
Apr 14, 2021 · Fundamentals

An Overview of the Raft Distributed Consensus Algorithm

Raft is a fault‑tolerant distributed consensus protocol that simplifies Paxos by electing a single leader each term to coordinate client requests, replicate logs to a majority of servers, ensure safety through up‑to‑date voting, handle failures with randomized timeouts, resolve log conflicts, and compress logs via snapshots.

Log Replication · Raft · distributed consensus
0 likes · 19 min read
Kuaishou Tech
Apr 9, 2021 · Backend Development

Design and Implementation of Red Packet Calculation and Distribution System for Spring Festival Activity

This article details the design of a red packet calculation and distribution system for a Spring Festival promotion, addressing mixed-type packet computation, seamless transition to awarding, distributed task processing, stability guarantees, and successful implementation results.

Batch Processing · Spring Festival · distributed computing
0 likes · 12 min read
IT Architects Alliance
Apr 5, 2021 · Operations

Design and Implementation of a Handcrafted Distributed Cluster (MyCluster)

This article describes how to design and build a native distributed cluster called MyCluster without using any existing frameworks, covering master‑slave architecture, leader election, split‑brain handling, centralized configuration management, custom communication protocols, state transitions, and client interfaces.

Cluster Architecture · Configuration Management · Distributed Systems
0 likes · 13 min read
DataFunTalk
Mar 28, 2021 · Big Data

Flink Stream‑Batch Integration: Layered Architecture, Unified SDK, DAG Scheduler, Shuffle, and Fault‑Tolerance

This article explains how Apache Flink has evolved into a unified stream‑batch engine by introducing a three‑layer architecture, a unified DataStream SDK, a pipeline‑region‑based DAG scheduler, a common shuffle framework, and enhanced fault‑tolerance mechanisms to address efficiency, consistency, and resource‑utilisation challenges in real‑time big‑data processing.

Apache Flink · Batch Processing · DAG scheduler
0 likes · 25 min read
Baidu Geek Talk
Mar 22, 2021 · Operations

How Baidu Achieved 99.999% Uptime for Its Massive Feed Recommendation System

This article details Baidu's Feed recommendation system architecture, explaining how a combination of dynamic retry scheduling, real‑time stop‑loss mechanisms, multi‑recall frameworks, ranking layer fallbacks, and IDC‑level multi‑master designs collectively ensure five‑nines availability across billions of daily requests.

Distributed Systems · Microservices · Operations
0 likes · 18 min read
Wukong Talks Architecture
Mar 18, 2021 · Fundamentals

Understanding Distributed Theory and Algorithms: Importance, Core Concepts, and Learning Path

This article explains why distributed theory and algorithms are crucial for architects, outlines the four foundational theories and eight key protocols, discusses their four evaluation dimensions, and provides a step‑by‑step learning roadmap illustrated with stories and practical examples.

CAP theorem · Consistency · Distributed Systems
0 likes · 10 min read
IT Architects Alliance
Mar 15, 2021 · Backend Development

Evolution of Meituan Instant Logistics Distributed System Architecture and Practices

The article details Meituan's five‑year journey in instant logistics, describing how distributed, high‑concurrency backend architectures were progressively upgraded to microservices, how AI is integrated for pricing, ETA and dispatch, and the operational techniques used to ensure scalability, fault tolerance, and high availability.

AI · Distributed Systems · Logistics
0 likes · 8 min read
Xianyu Technology
Feb 8, 2021 · Backend Development

Design and Implementation of a Cluster‑Aware Guava Cache Component for High Reliability

The paper presents a cluster‑aware Guava cache component for Alibaba’s Xianyu platform that mitigates downstream service failures by adding asynchronous reload, cluster‑wide key invalidation, and size reporting, enabling automatic fallback to refreshed local data and improving latency, with future plans for a management console, tiered storage, and disk‑backed caching.

Guava · caching · fault tolerance
0 likes · 8 min read
Sohu Tech Products
Jan 20, 2021 · Backend Development

Understanding Dubbo’s Core Architecture: Service Registration, Invocation, Routing, and Thread Dispatch Mechanisms

This article explains Dubbo’s internal architecture, covering service registration and discovery with Zookeeper, RPC invocation details including load balancing, routing, and fault‑tolerance strategies, as well as its network protocol and thread‑dispatch mechanisms, providing practical insights for backend developers.

Dubbo · Microservices · Thread Dispatch
0 likes · 13 min read
Top Architect
Jan 6, 2021 · Cloud Native

Implementing a Microservice Architecture with Spring Cloud, Docker, and PaaS

This article explains how to build a microservice‑based PaaS platform using Spring Cloud Netflix components, Docker containers, Eureka for service registration, Zuul as a gateway, Hystrix for fault tolerance, and a dynamic configuration center to achieve agile development and continuous integration.

Docker · Dynamic Configuration · Microservices
0 likes · 13 min read
Efficient Ops
Jan 5, 2021 · Operations

How to Prevent ZooKeeper Split‑Brain: Best Practices and Fault‑Tolerance Strategies

This article explains why ZooKeeper clusters should use an odd number of nodes, how the majority quorum mechanism avoids split‑brain scenarios, and outlines practical solutions such as quorums, redundant communication, fencing, arbitration, and disk‑lock techniques to ensure reliable distributed coordination.

Distributed Systems · Split-Brain · ZooKeeper
0 likes · 14 min read
Architects Research Society
Dec 30, 2020 · Fundamentals

Key Challenges in Designing Distributed Systems

Designing a distributed system involves overcoming major challenges such as heterogeneity, transparency, openness, concurrency, security, scalability, and fault tolerance, each requiring careful consideration of hardware, software, network, and management aspects to build robust, scalable, and secure architectures.

Distributed Systems · Scalability · Security
0 likes · 9 min read
FunTester
Dec 12, 2020 · Operations

Why Redundancy Is the Key to Effective Disaster Recovery in IT Systems

The article explains that disaster recovery for information systems relies on redundancy across hardware, energy, and data, classifies natural, human, and technical disasters, defines critical metrics such as RTO and RPO, and outlines the technologies, architectures, and maturity levels needed to ensure business continuity.

RPO · RTO · business continuity
0 likes · 29 min read
Architecture Digest
Dec 9, 2020 · Backend Development

Implementing Distributed Locks with Redis: Concepts, Algorithms, and Code Examples

This article explains how to implement distributed locks using Redis, covering the essential requirements of mutual exclusion, deadlock avoidance, and fault tolerance, detailing single‑instance and multi‑instance algorithms, code examples with SETNX and Lua scripts, and discussing challenges such as latency, crashes, and persistence.

Backend Development · concurrency · fault tolerance
0 likes · 10 min read
Manbang Technology Team
Nov 23, 2020 · Operations

Designing a Comprehensive Stability Assurance System for Large‑Scale Internet Services at Manbang

This article explains how Manbang built a rigorous stability‑assurance framework—including strict fault grading, a "watch‑and‑protect" system, blue‑green deployments, online pressure testing, fault‑drill platforms, and runtime metadata—to ensure rapid iteration while maintaining high availability for millions of logistics users.

fault tolerance
0 likes · 12 min read
Tencent Cloud Developer
Nov 19, 2020 · Backend Development

Kafka Message Queue Reliability Design and Implementation

The article thoroughly explains Kafka’s message‑queue reliability design and implementation, covering use‑case scenarios, core concepts, storage format, producer acknowledgment settings, broker replication mechanisms (ISR, HW, LEO), consumer delivery semantics, the epoch solution for synchronization, and practical configuration guidelines for various consistency and availability requirements.

Broker · Consistency · Consumer
0 likes · 15 min read
JavaEdge
Oct 24, 2020 · Databases

Mastering Redis Cluster: Scaling, Routing, and Fault Tolerance Explained

This article explains why Redis clusters are needed, how CLUSTER MEET builds the network, slot assignment, scaling procedures, client redirection mechanisms, batch operations, fault detection, recovery processes, and common operational pitfalls, providing practical guidance for building and maintaining a robust Redis Cluster deployment.

Cluster · fault tolerance · redis
0 likes · 22 min read
DevOps
Oct 20, 2020 · Cloud Computing

Chaos Monkey and the Simian Army: Building Resilient Cloud Systems

The article explains how Netflix uses Chaos Monkey and a suite of related tools, collectively called the Simian Army, to deliberately inject failures into their cloud infrastructure, continuously test fault‑tolerance, and ensure high availability and reliability for their streaming service.

Netflix · Operations · Simian Army
0 likes · 7 min read
IT Architects Alliance
Oct 13, 2020 · Cloud Native

Designing Fault‑Tolerant Microservices Architecture

Microservice architectures increase system complexity and failure rates, so this article explains key reliability patterns—such as graceful degradation, change management, health checks, self‑healing, fallback caches, retry logic, rate limiting, circuit breakers, and testing—to help engineers design resilient, high‑availability services.

Cloud Native · Microservices · Operations
0 likes · 23 min read
Architects' Tech Alliance
Oct 12, 2020 · Operations

Designing Resilient Microservices: Patterns for Fault Tolerance and Failure Management

This article examines the inherent risks of microservice architectures and presents practical patterns—such as graceful degradation, change management, health checks, self‑healing, fallback caching, retries, rate limiting, bulkheads, and circuit breakers—to build highly available, fault‑tolerant services.

Microservices · Resilience · bulkhead
0 likes · 15 min read
Top Architect
Oct 11, 2020 · Cloud Native

Using Hystrix for Fault Tolerance in Spring Cloud Microservices

This article explains how to integrate Netflix Hystrix into Spring Cloud applications to provide request timeout, circuit‑breaker, fallback, monitoring and resource isolation for microservice calls, including Maven setup, annotation usage, Feign client fallback configuration and disabling options.

Hystrix · Spring Cloud · circuit breaker
0 likes · 9 min read
Architect
Oct 6, 2020 · Backend Development

Implementing Hystrix for Fault Tolerance in Spring Cloud Microservices

This article explains why microservice calls need fault‑tolerance mechanisms, introduces Hystrix’s core features such as timeouts, circuit‑breaker, fallback, monitoring and resource isolation, and provides step‑by‑step code examples for integrating Hystrix and Feign in a Spring Cloud project.

Hystrix · Java · Microservices
0 likes · 8 min read
DataFunTalk
Oct 2, 2020 · Big Data

Single-Task Recovery in Flink: Design and Implementation for Real‑Time Stream Processing

This article describes ByteDance's single‑task recovery solution for Flink's real‑time computation, detailing the problem of global job restarts, the proposed network‑layer enhancements, upstream and downstream optimizations, JobManager restart strategy, implementation challenges, and the measurable latency and availability benefits achieved in production.

Flink · Single-Task Recovery · fault tolerance
0 likes · 11 min read
Xianyu Technology
Sep 27, 2020 · Backend Development

Design of an Asynchronous Component with Monitoring, Fault Tolerance, and Zero‑Cost Integration

The article presents a design for an asynchronous component that is monitorable, fault‑tolerant, and integrates with zero overhead, compares Akka, RxJava, and a custom JUC‑based implementation, and selects the latter—using extended Callables and a CountDownLatch—to track business units, handle timeouts, and provide fallback behavior.

Asynchronous · JUC · Java
0 likes · 8 min read
Architect's Tech Stack
Sep 21, 2020 · Backend Development

Overview of Tars: A High‑Performance RPC Framework and Service Governance Platform

The article introduces Tars, an open‑source high‑performance RPC framework and integrated service governance platform derived from Tencent's internal microservice architecture, detailing its design philosophy, layered architecture, core features such as the Tars protocol, load balancing, fault and overload protection, and centralized configuration management.

Backend Development · Microservices · RPC
0 likes · 11 min read
21CTO
Sep 12, 2020 · Fundamentals

Why Distributed Systems Mirror Single‑Node Concurrency and How to Avoid Common Pitfalls

This article explains how concurrency issues that appear in single‑threaded programs become amplified in distributed systems, covering consistency models, network reliability, clock synchronization, fault detection, backpressure, and cascading failures, and offers practical design and testing strategies to build resilient architectures.

Consistency · concurrency · fault tolerance
0 likes · 19 min read
Sohu Tech Products
Sep 2, 2020 · Backend Development

Implementing Distributed Locks with Redis and Redisson: Abstraction, Auto-Release, and Fault Tolerance

This article explains how to use Redis and Redisson for distributed locking, introduces an abstract DistributedLock interface for flexible implementations, demonstrates automatic lock release with functional callbacks, and discusses fallback strategies and monitoring to ensure reliability in backend systems.

Java · distributed-lock · fault tolerance
0 likes · 5 min read
Programmer DD
Jul 31, 2020 · Fundamentals

What Is Distributed Architecture and Why It Powers Modern Systems

This article explains the concept of distributed architecture, its evolution from monolithic systems, core design principles, common challenges such as network latency and data consistency, and how major tech companies adopt it to achieve high availability, scalability, and fault tolerance.

System Design · fault tolerance
0 likes · 11 min read
Programmer DD
Jul 28, 2020 · Fundamentals

What Makes Distributed Architecture Essential for Modern Systems?

Distributed architecture, built on distributed computing technologies like J2EE, transforms monolithic systems into multi‑layer, fault‑tolerant platforms by decoupling services, ensuring high availability, scalability, and resilience, while addressing challenges such as network latency, data consistency, and system complexity, as illustrated by real‑world case studies.

Case Study · System Design · fault tolerance
0 likes · 11 min read
Qunar Tech Salon
Jun 16, 2020 · Operations

Qunar's Multi-IDC Deployment and Fault Self‑Healing Architecture

This article describes how Qunar scaled its IDC infrastructure, introduced multi‑IDC deployment, automated DNS‑based load balancing, open‑source DNSDB, and an IDC proxy built on Squid to achieve rapid fault self‑healing and transparent traffic switching for both user and third‑party access.

DNS · Operations · Proxy
0 likes · 8 min read
Top Architect
Jun 10, 2020 · Fundamentals

A Comprehensive Guide to Learning Distributed Systems

This article provides a thorough overview of distributed systems, explaining their definition, core challenges, key characteristics, essential components, common protocols, and practical implementations to help readers build a solid, structured learning path for mastering distributed architectures.

Distributed Systems · System Design · fault tolerance
0 likes · 16 min read