Tagged articles
310 articles
Xiaokun's Architecture Exploration Notes
May 29, 2020 · Fundamentals

How the Byzantine Generals Problem Shapes Modern Distributed Consensus

This article explains the Byzantine Generals Problem, maps its concepts to distributed consensus, distinguishes consensus from consistency, outlines oral‑message and signed‑message solutions, and analyzes their applicability and limitations in fault‑tolerant distributed systems.

Byzantine Generals · consensus algorithms · distributed consensus
0 likes · 21 min read
58 Tech
May 22, 2020 · Backend Development

Design and Implementation of a Distributed Retry System Based on Distributed Scheduling

This article presents a comprehensive distributed retry system that leverages a distributed scheduling mechanism to ensure eventual consistency, reduce manual recovery costs, and provide flexible retry strategies, automatic recovery detection, visual management, rate limiting, and intelligent retry for backend services.

Backend Development · fault tolerance · intelligent retry
0 likes · 13 min read
Big Data Technology & Architecture
Apr 13, 2020 · Fundamentals

Understanding Replication, Consistency, Fault Tolerance, and the CAP Theorem in Distributed Systems

This article explains the core concepts of replication, consistency, and fault tolerance in distributed systems, discusses strong and asynchronous replication methods, and details the CAP theorem with its consistency, availability, and partition tolerance trade‑offs, illustrating AP and CP scenarios such as Eureka and Zookeeper clusters.

CAP theorem · Consistency · fault tolerance
0 likes · 7 min read
ITFLY8 Architecture Home
Mar 23, 2020 · Fundamentals

15 Timeless Architecture Principles Every Engineer Should Follow

This article outlines how to create solid software architectures by presenting a process for forming design principles, detailing fifteen universal architecture guidelines, and explaining service‑splitting and key design rules that together help build scalable, maintainable, and resilient systems.

Microservices · Scalability · Software Architecture
0 likes · 16 min read
360 Tech Engineering
Mar 10, 2020 · Fundamentals

Introduction to Raft: A Comprehensive Overview of the Distributed Consensus Algorithm

This article provides a thorough introduction to the Raft consensus algorithm, explaining its purpose, core components such as state machine replication, log and consensus module, leader‑follower model, client interaction, fault‑tolerance considerations, the CAP trade‑off, and why Go is a suitable implementation language.

Go · Raft · State Machine Replication
0 likes · 11 min read
Youzan Coder
Feb 28, 2020 · Big Data

Flink Checkpoint Principle Analysis and Failure Cause Investigation

The article thoroughly explains Apache Flink’s checkpoint mechanism—including state types, coordinator workflow, exactly‑once versus at‑least‑once semantics, common failure sources such as code exceptions, storage or network issues, and practical configuration tips like interval settings, local recovery and externalized checkpoints.

Apache Flink · Checkpoint · Exactly-Once
0 likes · 15 min read
Architects' Tech Alliance
Feb 10, 2020 · Fundamentals

Mastering Distributed System Fundamentals: Models, Replication, Consistency, and Protocols

This article provides a comprehensive overview of distributed system fundamentals, covering node modeling, replica concepts, consistency levels, data distribution strategies, centralized and decentralized replica protocols, lease mechanisms, quorum, two‑phase commit, MVCC, Paxos, and the CAP theorem, while analyzing their trade‑offs in availability, consistency, and partition tolerance.

Consistency · Distributed Systems · Replication
0 likes · 55 min read
Architects' Tech Alliance
Feb 4, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the transformation of an online supermarket from a simple monolithic website to a fully fledged microservice architecture, highlighting the motivations, design decisions, common pitfalls, and essential components such as monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing strategies, and service mesh adoption.

Deployment · Microservices · Service Mesh
0 likes · 22 min read
Mike Chen's Internet Architecture
Jan 7, 2020 · Backend Development

Spring Cloud vs Dubbo: Protocol Handling, Performance, Load Balancing, Fault Tolerance, and Routing in Microservice Architecture

This article compares Spring Cloud and Dubbo across protocol handling, performance tuning, load‑balancing strategies, fault‑tolerance mechanisms, and routing/traffic‑shaping features, highlighting their flexibility, configuration complexity, and suitability for different microservice scenarios.

Dubbo · Microservices · Performance Optimization
0 likes · 6 min read
Efficient Ops
Dec 17, 2019 · Operations

How Alibaba Scales Flink: Lessons in Big Data Operations

This article details Alibaba's massive Flink deployment, covering its historical background, the operational challenges of managing tens of thousands of nodes, the design of a comprehensive Flink management platform, and the automated solutions for fault handling, resource allocation, and performance testing in a large‑scale big‑data environment.

Automation · Big Data Operations · Cluster Management
0 likes · 20 min read
21CTO
Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance
0 likes · 13 min read
Java High-Performance Architecture
Nov 12, 2019 · Backend Development

How Kafka Consumer Groups Boost Performance and Fault Tolerance

Kafka consumer groups enable multiple consumers to share partition workloads, ensuring exclusive consumption within a group, flexible consumption patterns like broadcast and unicast, and automatic fault‑tolerance through rebalancing, ultimately improving throughput, scalability, and resilience of streaming applications.

Backend Development · Kafka · consumer groups
0 likes · 4 min read
Architecture Digest
Nov 9, 2019 · Backend Development

Design and Implementation of eBay's Next‑Generation Million‑TPS Core Accounting System

The article details eBay's 2018‑2020 design, performance testing, and fault‑tolerance architecture of a next‑generation core accounting system capable of handling millions of transactions per second, covering system goals, multi‑region deployment, event‑sourcing, Raft consensus, scalability optimizations, and the planned open‑source release.

Event Sourcing · High TPS · Raft consensus
0 likes · 24 min read
Architects' Tech Alliance
Oct 2, 2019 · Operations

Understanding Disaster Tolerance, Fault Tolerance, and Disaster Recovery: Concepts, Differences, and Implementation Strategies

This article explains the definitions of disaster tolerance, fault tolerance, and disaster recovery, compares their purposes, discusses backup versus disaster‑tolerance solutions, outlines key metrics such as RTO and RPO, and presents common architectural and investment considerations for building resilient enterprise systems.

Backup · IT Operations · RPO
0 likes · 8 min read
Alibaba Cloud Native
Sep 25, 2019 · Cloud Native

Mastering Distributed System Design: Patterns, Performance, and Fault Tolerance

This article provides a comprehensive overview of distributed system architecture, covering essential design patterns such as gateways, sidecars, and service meshes, performance techniques like caching and async communication, fault‑tolerance mechanisms including rate limiting and circuit breakers, and practical DevOps practices for deployment and monitoring.

Cloud Native · architecture design · caching
0 likes · 13 min read
Architecture Digest
Sep 23, 2019 · Operations

Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System

The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 7×24 operation.

backend operations · fault tolerance · high availability
0 likes · 23 min read
Java Captain
Sep 19, 2019 · Backend Development

A Comprehensive Overview of Microservice Architecture and Its Evolution

This article presents a detailed, step‑by‑step illustration of microservice architecture, covering its motivations, component breakdown, migration from monoliths, common pitfalls, monitoring, tracing, logging, gateway, service discovery, resilience patterns, testing strategies, frameworks, and the emerging service‑mesh approach.

Service Mesh · fault tolerance · monitoring
0 likes · 23 min read
AntTech
Sep 11, 2019 · Artificial Intelligence

ElasticDL: An Open‑Source Elastic Deep Learning System Built on TensorFlow 2.0 and Kubernetes

ElasticDL, the first industry‑level open‑source system for elastic deep learning on TensorFlow, leverages Kubernetes‑native scheduling, fault‑tolerance, and TensorFlow 2.0 Eager Execution to dramatically improve cluster utilization, simplify distributed training, and integrate seamlessly with tools like Kubeflow and SQLFlow.

Distributed Deep Learning · ElasticDL · Kubernetes
0 likes · 13 min read
dbaplus Community
Sep 10, 2019 · Big Data

Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)

This article explores the two toughest problems in distributed stream processing—exactly‑once message handling and ordering—by dissecting the underlying impossibility of perfect failure detectors, the liveness‑vs‑safety trade‑off, zombie processes, and the practical solutions employed by systems such as Flink, Kafka Streams, MillWheel, and Spark.

Consensus · Distributed Systems · Exactly-Once
0 likes · 81 min read
Big Data Technology & Architecture
Jun 19, 2019 · Big Data

Understanding Spark Structured Streaming StateStore: Architecture, Operations, and Fault Recovery

This article explains the design and implementation of Spark Structured Streaming's StateStore module, covering its distributed architecture, state sharding, versioning, batch read/write, migration, update/query APIs, maintenance compaction, and fault‑tolerance mechanisms that enable incremental continuous queries with exactly‑once guarantees.

Big Data · Spark · StateStore
0 likes · 8 min read
ITFLY8 Architecture Home
Jun 5, 2019 · Fundamentals

Understanding Paxos: A Beginner’s 30‑Minute Guide with Real‑World Analogy

This article explains the Paxos consensus algorithm in plain terms, using a relatable travel‑planning analogy to illustrate how proposers, acceptors, and majority voting achieve fault‑tolerant agreement in distributed systems, and connects the concept to real‑world implementations like Google’s Chubby and ZooKeeper.

Distributed Systems · Paxos · algorithm
0 likes · 13 min read
dbaplus Community
May 25, 2019 · Backend Development

Mastering Thread‑Pool Isolation: Prevent Cascading Failures in Java Services

This article explains the concept of fault tolerance in software architecture, illustrates why thread‑pool isolation is essential for preventing cascading failures, and provides concrete Java implementations—including code examples, pros and cons, and practical guidance for applying the technique in real‑world backend systems.

Backend · Isolation · Java
0 likes · 10 min read
21CTO
Apr 29, 2019 · Big Data

How EasyScheduler Powers Scalable Big Data Workflow Management

EasyScheduler is an open‑source big‑data workflow scheduler that uses a decentralized architecture with Master and Worker nodes coordinated via ZooKeeper, supporting DAG‑based task definitions, various task types, fault tolerance, priority handling, distributed locks, and remote logging, all illustrated with detailed component diagrams.

Big Data · DAG · Distributed Systems
0 likes · 17 min read
Architecture Digest
Apr 29, 2019 · Big Data

EasyScheduler: An Open‑Source Big Data Workflow Scheduling System – Architecture and Design Overview

This article introduces EasyScheduler, an open‑source big data workflow scheduling system, explaining its core terminology, decentralized architecture, distributed lock implementation, thread‑shortage handling, fault‑tolerance mechanisms, task‑retry and priority designs, as well as its logging solution using Logback and gRPC.

DAG · Scheduler · fault tolerance
0 likes · 14 min read
ITPUB
Mar 26, 2019 · Operations

How to Build a 99.99% High‑Availability Service: Practices and Architecture Evolution

This article explains the essential requirements for achieving 99.99% service availability—consistency, eliminating single points, placement groups, traffic isolation, same‑city active‑active, N+1 redundancy, and multi‑region active‑active—illustrated with a step‑by‑step Yum repository service case study and evolving architecture diagrams.

Deployment · architecture · cloud operations
0 likes · 9 min read
iQIYI Technical Product Team
Mar 15, 2019 · Cloud Computing

Design and Architecture of QLive Large‑Scale Live Streaming Service

The QLive service powers iQIYI's massive live-streaming events, such as the Spring Festival Gala, by combining vertical and horizontal scaling, a three-layer architecture with dual data-center isolation, multi-level caching, circuit-breaker and degradation controls, and a Flume-Kafka-Hive monitoring pipeline to sustain over 400k QPS and 99.9999% availability.

Vertical Scaling · caching · fault tolerance
0 likes · 9 min read
Big Data Technology & Architecture
Mar 13, 2019 · Big Data

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Apache Flink · Big Data · Exactly-Once
0 likes · 15 min read
JD Tech
Mar 6, 2019 · Backend Development

Understanding Hystrix: Why It’s Needed and How to Use It for Dependency Isolation

This article explains why Hystrix is essential for large distributed systems, describes its dependency isolation mechanisms—including command, group, thread‑pool, and semaphore isolation—covers circuit‑breaker behavior, fallback strategies, and provides detailed Java code examples for configuration and usage.

Hystrix · Java · Microservices
0 likes · 12 min read
Java Architect Essentials
Feb 25, 2019 · Backend Development

Service Isolation Design: Principles, Methods, and Best Practices

The article explains service isolation in system architecture, its origins, why it matters, two main isolation approaches (by service and by user), their advantages and drawbacks, and key considerations to ensure fault containment and improve overall system availability.

Microservices · backend design · fault tolerance
0 likes · 7 min read
dbaplus Community
Feb 18, 2019 · Databases

How Do Fault‑Tolerant Transactions Work? Exploring Raft, KV Engines, and Concurrency Control

This article examines multiple fault‑tolerant transaction designs—RSM‑based KV, RSM‑based transactions, shared‑storage approaches, high‑availability KV layers, and single‑node engine extensions—comparing their replication strategies, lock handling, and performance trade‑offs while raising open questions about ordering and consistency.

Distributed Transactions · KV Store · Raft
0 likes · 15 min read
Architects Research Society
Jan 19, 2019 · Cloud Native

Three Common Microservices Integration Pitfalls and Their Mitigation Strategies

This article examines three frequent pitfalls encountered when integrating microservices—complex communication, asynchronous challenges, and distributed transaction difficulties—and proposes mitigation techniques such as rapid failure handling, workflow engines, timeout management, and compensation patterns to improve resilience and reduce system complexity.

Cloud Native · Microservices · fault tolerance
0 likes · 13 min read
Programmer DD
Dec 21, 2018 · Backend Development

How Circuit Breakers Safeguard Distributed Systems from Cascading Failures

This article explains the concept of circuit breaking in distributed systems, outlines a four‑step implementation process with strategies for detecting unhealthy services, cutting off calls, probing recovery, and restoring normal operation, and shares best‑practice tips to minimize downtime and improve resilience.

Distributed Systems · circuit breaker · fault tolerance
0 likes · 10 min read
Architect's Tech Stack
Dec 5, 2018 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

The article shares a comprehensive, experience‑driven guide on building fault‑tolerant systems—covering retry mechanisms, dynamic node removal, timeout settings, service degradation, decoupling, and business‑level safeguards—to enable a platform that scales from millions to billions of daily requests without relying on manual fire‑fighting.

Operations · System Design · fault tolerance
0 likes · 21 min read
Programmer DD
Oct 30, 2018 · Fundamentals

What Is Paxos? A Storytelling Guide to Distributed Consensus

This article uses a vivid allegorical story to introduce the Paxos algorithm, then explains its roles, two-phase protocol, fault assumptions, and why majority and multiple acceptors are essential for achieving reliable consensus in distributed systems.

Distributed Systems · Paxos · algorithm
0 likes · 10 min read
UC Tech Team
Oct 23, 2018 · Operations

Understanding Faults and Fault Isolation Strategies in Distributed Systems

The article explains what constitutes a fault, introduces key metrics such as RPO and RTO, and describes various fault isolation principles, patterns, and practical examples—including dependency degradation, failover, dynamic adjustment, fast‑fail, caching, rate limiting, and resource isolation—to improve system reliability.

Operations · RPO · RTO
0 likes · 14 min read
Java Backend Technology
Aug 18, 2018 · Backend Development

Why Service Isolation Is Essential for Fault‑Tolerant Backend Systems

The article explains the concept of service isolation, its origins in shipbuilding, why it’s crucial for reducing fault impact in software systems, practical approaches such as functional and user‑based isolation, their trade‑offs, and key design principles to ensure reliable, maintainable back‑end architectures.

Backend Architecture · Microservices · fault tolerance
0 likes · 7 min read
Meitu Technology
Aug 2, 2018 · Big Data

Spark Streaming vs Flink – Architecture, Scheduling & Fault Tolerance

This article compares Spark Streaming and Flink across runtime models, component roles, programming APIs, task scheduling, time semantics, dynamic Kafka partition detection, fault‑tolerance mechanisms, exactly‑once guarantees, and back‑pressure handling, providing code examples and practical insights for real‑time data processing.

Dynamic Partition Detection · Exactly-Once · Flink
0 likes · 23 min read
ITFLY8 Architecture Home
Jul 20, 2018 · Backend Development

Mastering Service Discovery and Communication in Microservices

This article explains how microservices use service registries for discovery, registration, health checks, and deregistration, compares third‑party and self‑registration, explores server‑side and client‑side call mechanisms, discusses API gateways, synchronous vs asynchronous messaging, and outlines fault‑tolerance patterns such as timeouts, circuit breakers, and bulkheads.

Backend · Microservices · api-gateway
0 likes · 21 min read
Architecture Digest
Jul 19, 2018 · Operations

How to Prevent System Failures: Suspect Third‑Party Services, Guard Consumers, and Strengthen Your Own Service

The article presents practical strategies for avoiding service failures by treating third‑party dependencies as unreliable, designing robust APIs for consumers, and applying solid engineering principles such as degradation plans, timeout settings, traffic control, and resource‑limiting techniques.

Reliability · Resource Management · api-design
0 likes · 16 min read
ITPUB
Jun 6, 2018 · Cloud Native

How to Build a Cloud‑Native Microservices PaaS with Spring Cloud Netflix

This article explains how to construct a PaaS cloud platform using microservice architecture and Docker containers, detailing the roles of Spring Cloud Netflix components such as Zuul, Eureka, Hystrix, and Config Server, and covering gateway routing, service discovery, deployment, fault tolerance, and dynamic configuration.

Microservices · Spring Cloud · fault tolerance
0 likes · 13 min read
Meituan Technology Team
May 31, 2018 · Operations

High‑Availability Practices for Account Services at Meituan/Dianping

Meituan/Dianping ensures its critical account service stays online by combining real‑time business monitoring, circuit‑breaker‑driven graceful degradation, and active‑active cross‑region deployment with isolated dependencies, versioned data sync, and automated cache updates, dramatically extending MTBF while cutting MTTR and latency.

data synchronization · fault tolerance · high availability
0 likes · 13 min read
Efficient Ops
May 27, 2018 · Operations

Mastering High Availability and High Concurrency: Principles and Practical Techniques

This article outlines guiding principles, high‑availability strategies, and high‑concurrency techniques—covering stateless design, resource isolation, quota management, monitoring, degradation, rollback, and scaling—to help engineers build resilient, scalable systems while balancing cost and performance.

Operations · Scalability · System Design
0 likes · 21 min read
Efficient Ops
May 21, 2018 · Databases

Why Do Database Failures Happen and How to Prevent Them?

This article examines common hardware and network failures in data centers, analyzes real‑world outage cases, classifies fault domains, and presents comprehensive strategies for database fault handling—including logging, checkpointing, backup, replication, and high‑availability architectures—to improve reliability and reduce downtime.

Backup · Distributed Systems · database
0 likes · 22 min read
ITFLY8 Architecture Home
May 19, 2018 · Backend Development

How to Structure Functional Teams and Service Patterns for Scalable Microservices

This article explains how Conway's law guides functional team division in microservice architectures, describes decentralized governance, outlines various interaction and composition patterns, discusses fault‑tolerance mechanisms such as isolation, circuit breaking, rate limiting, and provides guidance on choosing appropriate service granularity.

Microservices · Team Organization · architecture
0 likes · 32 min read
ITFLY8 Architecture Home
Mar 24, 2018 · Operations

How Service Degradation and Fault‑Tolerance Keep Large‑Scale Systems Resilient

This article explains how setting low timeouts for non‑core services, decoupling and physically isolating micro‑services, separating light and heavy workloads, and implementing automated configuration checks together enhance system reliability and reduce both technical and human errors in high‑traffic environments.

Configuration Management · fault tolerance · system reliability
0 likes · 9 min read
ITFLY8 Architecture Home
Mar 22, 2018 · Operations

How Simple Retry Can Crash Your System and Smarter Alternatives

This article examines the pitfalls of naive retry mechanisms, explores active‑standby service switching, dynamic removal of unhealthy nodes, proper timeout configuration, and anti‑reentrancy strategies to improve system reliability and prevent cascading failures in large‑scale backend operations.

Retry · Timeout · fault tolerance
0 likes · 14 min read
ITFLY8 Architecture Home
Mar 4, 2018 · Operations

Mastering Service Fault Tolerance: Key Patterns for Resilient Microservices

Effective fault tolerance is crucial for microservice stability, and this article explores core design principles and classic patterns—such as timeout retries, rate limiting, bulkhead isolation, circuit breakers, and fallback strategies—guiding developers to choose and combine the right approaches for high‑availability systems.

Microservices · bulkhead · circuit breaker
0 likes · 8 min read
Efficient Ops
Feb 23, 2018 · Operations

What a Decade of Ops Taught Me: Key Strategies for Scalable Infrastructure

This article reflects on ten years of Tencent's operations experience, sharing the author's career journey, the evolution of large‑scale service management, the design of the L5 fault‑tolerant system, unified frameworks, resource packaging, CMDB virtual mirrors, and automated deployment practices that together enable reliable, efficient, and scalable infrastructure.

Automation · CMDB · Operations
0 likes · 11 min read
Tencent TDS Service
Feb 1, 2018 · Backend Development

How a TV App’s Waterfall Layout Boosted User Engagement and Efficiency

This article details the redesign of a TV app from a horizontal layout to a waterfall flow, explaining the project timeline, advantages, new seven‑layer architecture, CMS‑driven configuration, compatibility handling, pagination strategies, caching, and fault‑tolerance measures that together improved user conversion and system robustness.

Backend Architecture · CMS · TV app
0 likes · 36 min read
21CTO
Nov 20, 2017 · Operations

Mastering High Availability and Concurrency: Core Principles and Practical Techniques

This article distills essential guiding principles, high‑availability strategies, and high‑concurrency techniques for building resilient, scalable systems, covering stateless design, fault‑handling phases, replication, isolation, rate limiting, caching, async processing, multithreading, and scaling approaches.

System Design · fault tolerance · high availability
0 likes · 21 min read
Efficient Ops
Nov 15, 2017 · Big Data

How Tencent Built a 10 TB‑Per‑Day Full‑Link Log Monitoring Platform

This article explains how Tencent's ZhiYun full‑link log monitoring platform handles massive daily logs, overcomes challenges of diverse log formats, high throughput, fault‑tolerant design, and provides scalable storage, query, and alerting capabilities for distributed micro‑service environments.

Big Data · Distributed Systems · Log Monitoring
0 likes · 10 min read
21CTO
Oct 22, 2017 · Operations

How to Build Highly Available Systems: Fault Tolerance and Scalability Strategies

This article explains why high availability is critical for internet services, outlines key techniques such as stateless design, service discovery, heartbeat checks, idempotent operations, load balancing, throttling, caching, and micro‑service architecture, and discusses the operational challenges and monitoring tools needed to maintain resilient, scalable systems.

Idempotency · Microservices · Scalability
0 likes · 8 min read
Architecture Digest
Oct 15, 2017 · Operations

High Concurrency and High Availability Design Principles

This article outlines essential high‑concurrency and high‑availability principles—including stateless design, service decomposition, caching strategies, message queues, data heterogeneity, degradation, rate limiting, traffic switching, and rollback mechanisms—to help architects build scalable, reliable, and resilient systems.

Scalability · System Design · architecture
0 likes · 12 min read
21CTO
Sep 26, 2017 · Operations

Why You Should Never Trust Any Component in Your System—and How to Protect It

In programming and operations, every element, from services and dependencies to requests, machines, data centers, power, networks, and humans, can fail unexpectedly, so you must trust nothing by default and implement defensive measures such as monitoring, redundancy, rate limiting, fallback strategies, backups, and automated deployment.

Operations · Reliability · Security
0 likes · 9 min read
21CTO
Aug 11, 2017 · Operations

Alibaba’s Double 11 Playbook: Scaling Architecture and Real‑Time Fault Tolerance

Alibaba’s eight‑year evolution of Double 11 shows how to deliver the best possible user experience and massive throughput at limited cost by moving from a centralized 3.0 distributed architecture to multi‑active zones and by using capacity planning, full‑link stress testing, fine‑grained dependency governance, and dynamic traffic scheduling to ensure high availability.

capacity planning · fault tolerance · large-scale e-commerce
0 likes · 12 min read
Architecture Digest
Jul 16, 2017 · Operations

Fault Governance in Distributed Systems: Dependency Failures, Strong/Weak Dependency, and Fault‑Injection Practices

This article presents a comprehensive overview of fault governance in large‑scale distributed systems, covering classic dependency failures, the concept of strong and weak dependencies, experimental observations, the evolution of fault‑injection techniques, and best practices for building reliable fault‑drill platforms.

Distributed Systems · Operations · chaos engineering
0 likes · 20 min read
Architecture Digest
Jul 6, 2017 · Fundamentals

PacificA: Microsoft’s General Replication Framework for Large‑Scale Distributed Storage Systems

PacificA is Microsoft’s generic replication framework for large‑scale distributed storage systems that provides strong consistency, separates configuration management from data replication, and uses a primary‑secondary model with lease‑based fault detection to ensure availability, correctness, and efficient operation across heterogeneous nodes.

Consistency · Distributed Systems · PacificA
0 likes · 14 min read
Suning Technology
May 18, 2017 · Big Data

Why Apache Flink Beats Spark and Storm in Stream Processing

This article examines Apache Flink's stream‑processing architecture, compares its native streaming model, fault‑tolerance, performance and SQL capabilities with Spark and Storm, and concludes that Flink offers a more powerful and efficient solution despite some maturity gaps.

Apache Flink · Spark · Storm
0 likes · 12 min read
Alibaba Cloud Developer
May 12, 2017 · Operations

How Alibaba Engineers Fault Governance and Chaos Engineering for E‑commerce

This article recounts Alibaba's middleware team's QCon Beijing 2017 presentation on fault governance and fault‑drill practices, covering distributed‑system dependency failures, strong/weak dependency concepts, multi‑stage technical evolution, and the design of their chaos‑engineering platform for large‑scale e‑commerce.

Alibaba · Operations · chaos engineering
0 likes · 21 min read
DevOps
May 8, 2017 · Backend Development

Key Technical Concerns and Core Components of Microservices Architecture

Microservices architecture introduces technical concerns such as service registration, discovery, load balancing, health checks, front-end routing, fault tolerance, dynamic configuration, and framework selection, with common solutions ranging from centralized and in-process load balancers to Netflix and Spring Cloud components.

Microservices · fault tolerance · frameworks
0 likes · 16 min read
Alibaba Cloud Developer
Apr 21, 2017 · Big Data

How Alibaba Tackles Real-Time Stream and Graph Computing at Scale

In his ASPLOS keynote, Alibaba’s Vice President Zhou Jingren detailed the company’s large‑scale stream and graph computing platforms, highlighting fault‑tolerance innovations, real‑time data challenges, and upcoming advances in graph analytics and massive machine‑learning workloads.

AI · Alibaba · Big Data
0 likes · 7 min read
Architecture Digest
Apr 16, 2017 · Operations

Common Load‑Balancing Strategies and Their Reliability Analysis in Distributed Systems

The article reviews hardware and software load‑balancing, explains classic strategies such as round‑robin, random, minimum‑response‑time, least‑connections and hash, and quantitatively evaluates their fault‑tolerance using probability formulas and example scenarios in distributed systems.

Distributed Systems · Least Connections · Round Robin
0 likes · 10 min read
Qunar Tech Salon
Feb 23, 2017 · Backend Development

Microservice Fault Tolerance: Timeout, Retry, Circuit Breaker, Rate Limiting, and Service Degradation

This article explains microservice fault‑tolerance techniques, including timeout settings, retry strategies, circuit‑breaker logic, rate limiting, resource isolation, and service degradation, from both micro and macro perspectives, illustrating how to design resilient service chains and avoid cascading failures.

Retry · Timeout · circuit breaker
0 likes · 11 min read
Tencent Cloud Developer
Feb 14, 2017 · Databases

TDSQL Audit Capability: Architecture, Kafka Integration, and Consistency Hash Implementation

TDSQL’s cloud‑based audit solution combines a three‑proxy high‑availability layer, Kafka’s O(1) persistent messaging, and a distributed audit‑server that uses consistent hashing and multi‑coroutine processing to consume data within seconds, while fault‑tolerant offsets, majority acknowledgments, and Tencent Cloud MongoDB storage ensure secure, ordered, scalable, and highly reliable audit logging.

Kafka · MongoDB · TDSQL
0 likes · 7 min read
Tencent Cloud Developer
Feb 9, 2017 · Backend Development

Backend Design and Implementation of QQ Game Spring Festival Red Packet System

The article details the QQ Game Spring Festival Red Packet backend, describing its multi‑phase architecture that handles 80 k RPS, uses CDN‑served static gift data, two‑level sorting, CMEM caching, RocketMQ buffering for throttled delivery, idempotent order tracking for fault tolerance, and unified real‑time monitoring.

Backend · asynchronous processing · fault tolerance
0 likes · 16 min read
Efficient Ops
Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems
0 likes · 25 min read
dbaplus Community
Jan 15, 2017 · Databases

How JD’s JIMDB Achieves Zero‑Downtime Scaling and Automatic Failover for Massive Caches

JIMDB is JD’s in‑house distributed cache platform that combines automatic fault detection, seamless online scaling, multi‑language support, and containerized deployment to replace traditional Memcached/Redis solutions, offering features such as one‑click cluster creation, elastic expansion, lossless scaling, and comprehensive monitoring for high‑traffic e‑commerce services.

Cache · Distributed Systems · elastic scaling
0 likes · 23 min read
ITFLY8 Architecture Home
Nov 27, 2016 · Backend Development

Why Microservices? Benefits, Features, and Real-World Frameworks Explained

This article explains why microservices are chosen over monolithic architectures, outlines their key characteristics, reviews popular frameworks such as Netflix's stack and Spring Cloud, and discusses essential concerns like service discovery, load balancing, security, configuration, monitoring, and fault tolerance.

Microservices · fault tolerance · service discovery
0 likes · 10 min read
High Availability Architecture
Nov 25, 2016 · Backend Development

Disque: An Experimental Distributed In‑Memory Message Queue – Design and Usage Overview

Disque is an experimental, distributed, fault‑tolerant in‑memory message queue built in C that extends Redis concepts with synchronous replication, configurable delivery semantics, explicit acknowledgments, fast‑ack support, dead‑letter handling, and optional disk persistence for robust backend messaging workloads.

Backend · Disque · In-Memory
0 likes · 17 min read
Meituan Technology Team
Nov 11, 2016 · Operations

Common Service Fault Tolerance Patterns

The article explains how Meituan‑Dianping applies classic fault‑tolerance patterns—timeout and retry, rate limiting/load shedding, circuit breaker, bulkhead isolation, and fallback—to design for failure, prevent cascading service outages, and enhance system stability and high‑availability in a service‑oriented architecture.

Distributed Systems · Fallback · Retry
0 likes · 14 min read
MaGe Linux Operations
Nov 7, 2016 · Big Data

How HDFS Achieves Low Cost, High Reliability, and Fault Tolerance

This article explains how HDFS, inspired by Google’s GFS, provides a low‑cost, highly reliable, fault‑tolerant, and high‑performance distributed file system for big‑data workloads by using replication, standby NameNodes, block storage, rack awareness, and compute‑close‑to‑data strategies.

Big Data · Distributed File System · HDFS
0 likes · 7 min read
ITFLY8 Architecture Home
Oct 14, 2016 · Fundamentals

Understanding Paxos: How Distributed Systems Reach Consensus

This article provides a vivid explanation of the Paxos algorithm, illustrating how it achieves reliable consensus among unreliable processors through a two‑phase prepare/promise and propose/accept process, using distributed auction analogies, message sequencing, and read/write operations to ensure consistency in distributed systems.

Read/Write · algorithm · distributed consensus
0 likes · 15 min read
21CTO
Jul 30, 2016 · Operations

Building a 3‑Minute Fault Detection, 5‑Minute Recovery HA System for Games

This article explains how Alibaba’s NineGame platform achieved ultra‑high availability by shifting from system‑centric to business‑centric design, defining measurable goals (3‑minute issue detection, 5‑minute recovery, bi‑monthly incidents) and implementing a layered, automated, visual monitoring, client‑side retry, HTTP‑DNS, functional isolation, and multi‑site active‑active architecture.

Operations · business‑centric design · fault tolerance
0 likes · 22 min read

Designing a Business‑Oriented High Availability Architecture for a Game Access System

The article presents a business‑centric high‑availability solution for a large‑scale game access platform, detailing measurable goals, a three‑dimensional architecture that includes client‑side retry, HTTP‑DNS, functional separation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid problem detection, recovery, and minimal outage frequency.

Distributed Systems · business continuity · fault tolerance
0 likes · 23 min read
Qunar Tech Salon
Jul 7, 2016 · Backend Development

Design and Practices of Qunar's Self‑Developed High‑Availability Message Middleware

This article shares Qunar's architecture and practical experience in designing a self‑developed high‑availability message middleware, covering its role in transaction processing, consistency guarantees, fault‑tolerance mechanisms, isolation, monitoring, and consumer design, and discusses trade‑offs and operational considerations.

Consistency · fault tolerance
0 likes · 16 min read
WeChat Backend Team
Jun 22, 2016 · Fundamentals

How PhxPaxos Turns Paxos Theory into a Production‑Grade Consensus Library

This article provides a beginner-friendly, engineering-focused overview of the production‑grade Paxos library PhxPaxos, explaining the consensus protocol, its roles, instance management, state‑machine integration, performance optimizations, multi‑group deployment, and practical considerations such as disk durability, leader election, and log checkpointing.

Paxos · distributed consensus · fault tolerance
0 likes · 30 min read
Architecture Digest
Apr 8, 2016 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

This article shares the author’s experience building fault‑tolerance for Tencent’s activity operations platform, covering retry strategies, automatic removal of unhealthy machines, timeout tuning, asynchronous processing, anti‑replay mechanisms, service degradation, service decoupling, and business‑level safeguards to reduce manual alarm handling and improve system robustness.

Distributed Systems · Operations · Retry
0 likes · 21 min read
21CTO
Apr 5, 2016 · Operations

How Tencent’s AMS Achieved Fault Tolerance at Billion‑Request Scale

This article shares Tencent’s experience building fault‑tolerant mechanisms for the AMS activity platform, covering retry strategies, automatic machine exclusion, timeout tuning, service isolation, asynchronous processing, anti‑replay safeguards, and operational best practices that transformed a million‑request service into an 800‑million‑request system.

Operations · Retry · System Design
0 likes · 24 min read
21CTO
Mar 25, 2016 · Operations

How Different Load‑Balancing Strategies Impact Reliability in Distributed Systems

This article examines common load‑balancing algorithms—round‑robin, random, minimum response time, minimum concurrency, and hash—analyzing their fault‑tolerance in distributed clusters, deriving success‑rate formulas, and showing why strategies like minimum concurrency outperform simple methods under node failures.

Distributed Systems · Round Robin · fault tolerance
0 likes · 12 min read
ITPUB
Jan 15, 2016 · Backend Development

Understanding Microservices: Concepts, Benefits, and Practical Implementation

This article explains what microservices are, compares them with monolithic architecture, outlines their key characteristics, and provides practical guidance on client access, inter‑service communication, service discovery, fault tolerance, and deployment considerations.

Backend · Microservices · api-gateway
0 likes · 13 min read
Qunar Tech Salon
Dec 15, 2015 · Big Data

Real-Time Computing with Apache Storm: Architecture, Code Samples, and Fault Tolerance

This article explains the principles of real-time computing, compares it with offline batch processing, and demonstrates a practical solution using Kafka for ingestion, Apache Storm for continuous computation, and various storage options, while also covering streaming concepts and Storm's high‑availability mechanisms.

Apache Storm · Kafka · Real‑Time Computing
0 likes · 8 min read

Designing a Business‑Oriented High‑Availability Architecture for Game Access Systems

The article presents a comprehensive, business‑centric high‑availability architecture for a game access platform, detailing measurable goals, a three‑layered design, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid issue detection, recovery, and minimal downtime.

Distributed Systems · business reliability · fault tolerance
0 likes · 23 min read
21CTO
Oct 12, 2015 · Databases

How NoSQL Databases Achieve Scalability: Distributed Strategies Explained

This article systematically explores the distributed characteristics of NoSQL databases, covering data consistency, placement, peer systems, anti‑entropy protocols, eventual consistency data types, sharding, fault detection, and coordinator election, illustrating how these strategies balance scalability, availability, latency, and fault tolerance.

NoSQL · Replication · fault tolerance
0 likes · 33 min read

Understanding Storm: A Distributed Real-Time Computation System

The article explains the need for low‑latency, high‑performance, distributed real‑time processing, outlines the challenges such systems must address, and introduces Storm as a Hadoop‑like framework for stream processing, detailing its architecture, fault‑tolerance mechanisms, transactional topology, and large‑scale deployment at Taobao.

Big Data · Distributed Systems · Real-time Processing
0 likes · 14 min read