Tagged articles
310 articles
Page 1 of 4
Coder Trainee
May 18, 2026 · Cloud Native

Spring Cloud Microservices Tutorial – Sentinel for Fault Tolerance and Rate Limiting

This article walks through adding Alibaba Sentinel to a Spring Cloud microservice suite to protect against service outages, traffic spikes, and slow calls by configuring rate limiting, circuit breaking, and fallback mechanisms across user, order, and gateway services, with full Docker‑compose setup and testing steps.

Microservices · Spring Cloud · circuit breaker
0 likes · 14 min read
Architect's Guide
May 13, 2026 · Big Data

Next‑Gen Visual Drag‑Drop Data Flow Platform: Features, Architecture, and Performance

The article introduces a visual drag‑and‑drop data flow platform that unifies stream and batch processing, offers version control, automatic fault tolerance, configurable data permissions, comprehensive monitoring, data alignment, and query templates, and presents single‑instance performance benchmarks of over 30k and 60k ops/s.

Data Alignment · Data Flow · Drag-and-Drop
0 likes · 7 min read
IT Services Circle
Apr 30, 2026 · Backend Development

How a Single Front‑end Change Dragged Four Backend Teams – The BFF Solution

A tiny UI tweak that required a meeting with four backend groups exposed the pain of calling many micro‑services from the front‑end, and the article shows how introducing a Backend‑For‑Frontend (BFF) layer can aggregate, transform, and simplify those calls while improving reliability and performance.

API Aggregation · BFF · Backend For Frontend
0 likes · 21 min read
Coder Trainee
Apr 27, 2026 · Cloud Native

Spring Cloud Microservices Practice #6: Sentinel for Service Fault Tolerance and Rate Limiting

This article explains why service fault tolerance is essential in micro‑service architectures, compares Sentinel with Hystrix and Resilience4j, and provides step‑by‑step guidance on integrating Sentinel for circuit breaking, QPS and concurrency limiting, hot‑parameter control, system protection, and dynamic rule management with Nacos.

Circuit Breaking · Microservices · Nacos
0 likes · 14 min read
Architecture and Beyond
Apr 25, 2026 · Artificial Intelligence

Practical Insights on Recent AI Engineering Deployments

The article examines how large language models function as probabilistic components within deterministic software, discusses fault‑tolerance limits for viable AI use cases, and offers detailed engineering guidance on RAG pipelines, tool‑calling determinism, agent fragility, testing, monitoring, and privacy‑conscious deployment in finance.

AI Engineering · Agent Architecture · LLM
0 likes · 14 min read
AI Tech Publishing
Apr 21, 2026 · Artificial Intelligence

Why Your AI Agent Stays a Toy: Six Production‑Readiness Gaps and How to Bridge Them

Moving an AI agent from a controlled demo to an unattended production environment introduces six critical gaps—fault handling, state persistence, observability, credential security, cost control, and human supervision—each requiring specific infrastructure, practices, and a comprehensive readiness checklist to avoid costly failures.

AI Agents · Cost Management · Observability
0 likes · 15 min read
JD Tech
Apr 15, 2026 · Artificial Intelligence

How OpenClaw Powers Multi‑Channel AI Agents with Skills and Sub‑Agents

The article provides an in‑depth analysis of OpenClaw’s architecture, explaining why it was created, its layered design, the core ReAct loop, the Skill system, sub‑agent creation and management, fault‑tolerance mechanisms, tool policies, and how it extends the pi‑mono engine to support robust, multi‑channel AI agents.

AI Agents · OpenClaw · ReAct loop
0 likes · 20 min read
AI Insight Log
Apr 8, 2026 · Artificial Intelligence

Anthropic Blocks Third‑Party Agents, Then Launches Claude Managed Agents to Disrupt the Startup Scene

Anthropic’s Claude Managed Agents is a hosted platform that offers sandboxed execution, long‑running sessions, multi‑agent coordination, MCP integration and immutable session persistence, delivering up to 90% latency reduction and fault‑tolerant design, while early adopters like Notion, Rakuten, Asana and Sentry showcase real‑world production use.

AI agent orchestration · Agent Architecture · Anthropic
0 likes · 7 min read
AI Architecture Hub
Feb 25, 2026 · Artificial Intelligence

How OpenClaw Turns AI Agents into Production‑Ready Infrastructure

This article analyzes OpenClaw’s engineering‑focused architecture, detailing its three‑layer component boundaries, gateway‑centric session management, concurrency controls, fault‑self‑healing mechanisms, context handling, multi‑agent routing, and practical deployment scenarios for building stable, auditable AI agent systems.

AI Agents · OpenClaw · fault tolerance
0 likes · 20 min read
Amap Tech
Feb 3, 2026 · Artificial Intelligence

Building a Scalable AI Agent Smart Task Framework for Offline & Event‑Driven Use

As LLM applications moved into a harder, more mature phase, developers realized that agents must go beyond passive Q&A to support asynchronous, long‑running, and subscribable tasks; this article details the design, architecture, and engineering challenges of the “Xiao Gao Teacher AI Agent” smart‑task system, from event‑driven logic to fault‑tolerant deployment.

AI Agent · Event-Driven Architecture · LLM
0 likes · 19 min read
Architect's Journey
Dec 3, 2025 · Cloud Native

Microservice Governance Guide: From Stable Operations to Maximum Efficiency

This comprehensive guide breaks down microservice governance into four pillars—node management, load balancing, routing, and fault tolerance—providing concrete configurations, algorithm choices, and service‑mesh recommendations to achieve 99.99% availability, cut wasted resources by over 30%, and halve iteration cycles.

Microservices · Service Mesh · fault tolerance
0 likes · 16 min read
Architect's Journey
Dec 1, 2025 · Backend Development

Designing Three‑High Systems: Practical Performance Tuning and Fault‑Tolerant Architecture

The article breaks down the design logic and implementation steps for high‑performance, high‑concurrency, and high‑availability systems, covering bottleneck identification, read/write optimization, three‑dimensional scaling, and concrete fault‑tolerance strategies to build resilient, scalable services.

System Architecture · fault tolerance · high availability
0 likes · 15 min read
JD Tech
Sep 26, 2025 · Operations

Avoiding High‑Availability Pitfalls: Real‑World JD Lessons and Solutions

This article examines common high‑availability challenges across applications, databases, caches, message queues, containers, and GC, presenting real JD engineering cases, root‑cause analyses, and practical mitigation strategies to help engineers design more resilient systems.

Message Queue · database · fault tolerance
0 likes · 37 min read
Ops Community
Sep 17, 2025 · Operations

Mastering System Fault Tolerance: From Theory to Production‑Ready High‑Availability

This comprehensive guide explores the philosophy, core patterns, and practical techniques for designing fault‑tolerant, highly available systems, covering circuit breakers, retries, rate limiting, monitoring, cloud‑native deployment, and real‑world case studies to help engineers build resilient production architectures.

Cloud Native · circuit breaker · fault tolerance
0 likes · 24 min read
Efficient Ops
Sep 9, 2025 · Fundamentals

Inside 3FS: How Distributed File Systems Hide Complexity and Scale

3FS is an open‑source distributed file system that abstracts multiple machines into a single namespace, offering massive scalability, fault tolerance, and high throughput through components like Meta, Mgmtd, Storage, and Client, and leveraging the CRAQ protocol for strong consistency and efficient reads and writes.

3FS · CRAQ · Distributed File System
0 likes · 12 min read
NiuNiu MaTe
Sep 4, 2025 · Operations

Mastering Multi‑Active Distributed Systems: From Single Server to Global Fault Tolerance

This article walks developers through the evolution of distributed system architectures—from single‑machine deployments to master‑slave, same‑city active‑active, and finally true multi‑active setups—explaining core concepts, replication strategies, conflict resolution, fault detection, switch mechanisms, recovery methods, and interview tips for high‑availability design.

CAP theorem · Distributed Systems · Interview Preparation
0 likes · 26 min read
JD Tech Talk
Sep 4, 2025 · Operations

Avoid Common High‑Availability Pitfalls: Real‑World JD Practices and Solutions

This article analyzes the multi‑dimensional challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—by sharing real JD engineering scenarios, common failure patterns, and concrete mitigation strategies to help engineers design more resilient services.

Backend · Distributed Systems · fault tolerance
0 likes · 36 min read
JD Cloud Developers
Sep 4, 2025 · Operations

Mastering High‑Availability: JD Real‑World Pitfalls & Fixes for Apps, DBs, Cache & MQ

This article shares JD's practical high‑availability architecture lessons, detailing common pitfalls across applications, databases, caches, RPC frameworks, containers, data centers, GC, and message queues, and provides concrete troubleshooting steps and optimization techniques to help engineers design more resilient, fault‑tolerant systems.

Backend · System Design · fault tolerance
0 likes · 36 min read
JD Retail Technology
Sep 4, 2025 · Operations

Mastering High Availability: Real-World Pitfalls and Solutions from JD's Production Systems

This article walks through the challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—using JD’s production experiences to highlight common pitfalls, root‑cause analyses, and practical mitigation strategies for engineers seeking resilient architecture.

Cache · Distributed Systems · JDK
0 likes · 37 min read
Architect's Guide
Aug 25, 2025 · Fundamentals

19 Essential Distributed System Design Patterns You Must Know

This article explores nineteen core design patterns for distributed systems—including Bloom filters, consistent hashing, quorum, leader‑follower, heartbeat, fencing, WAL, segmented logs, high‑water mark, leases, gossip, Phi accrual detection, split‑brain handling, checksums, CAP and PACELC theorems, hinted handoff, read repair, and Merkle trees—explaining their purpose, operation, and typical use cases.

Consistency · Distributed Systems · fault tolerance
0 likes · 14 min read
Tech Freedom Circle
Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Java · Microservices · capacity planning
0 likes · 34 min read
Tech Freedom Circle
Jul 27, 2025 · Interview Experience

Designing a Payment Middle Platform from Scratch – Core Challenges (Interview Answer)

This article provides a comprehensive guide to designing a payment middle platform from zero, covering its definition, classic middle‑platform types, core architecture, functional modules, fault‑tolerance, security measures, distributed‑transaction strategies, and detailed Java pseudocode, offering interview‑ready insights for architects.

Microservices · Security · architecture
0 likes · 39 min read
JakartaEE China Community
Jul 15, 2025 · Cloud Native

Choosing a Technology Stack for Cloud‑Native Microservices: MicroProfile vs Spring

This article explains why cloud‑native microservices are beneficial, defines their key characteristics, and provides a detailed, side‑by‑side comparison of MicroProfile and Spring frameworks—including REST APIs, dependency injection, configuration, fault tolerance, security, health checks, metrics, and tracing—along with concrete code examples and starter resources.

Cloud Native · Configuration · MicroProfile
0 likes · 27 min read
IT Architects Alliance
Jul 7, 2025 · Backend Development

Avoid the 5 Fatal Architecture Mistakes That Cost Millions

This article analyzes five common architectural design errors—over‑pursuing cutting‑edge tech, single points of failure, mishandling data consistency, fragmented performance tuning, and neglecting security—illustrating their costly impacts with real‑world cases and offering practical principles to prevent them.

Microservices · fault tolerance · performance
0 likes · 13 min read
Cognitive Technology Team
Jun 21, 2025 · Fundamentals

Understanding Faults, Failures, and Fault Tolerance in Distributed Systems

This tutorial explains the definitions of faults and failures in distributed systems, explores their types and root causes, and presents fault‑tolerance mechanisms such as replication, checkpointing, redundancy, error detection, load balancing, and consensus algorithms to build resilient architectures.

Distributed Systems · consensus algorithms · data replication
0 likes · 10 min read
Linux Kernel Journey
Jun 16, 2025 · Cloud Computing

How Tencent’s TGW Achieves Seamless Fast Migration and Self‑Healing Fault Recovery

The paper presents Tencent’s TGW cloud gateway architecture, highlighting a 2.9× forwarding performance boost, lossless state migration within 4 seconds, sub‑minute fault detection, multi‑level fault‑tolerance mechanisms, and operational best practices that enable 100 % availability for massive online services.

Cloud Gateway · DPDK · State Migration
0 likes · 16 min read
Tencent Cloud Developer
May 20, 2025 · Cloud Computing

Efficient and Resilient Cloud Gateway at Scale: Architecture, Key Technologies, and Operational Practices of Tencent TGW

The article presents a comprehensive analysis of Tencent's TGW cloud gateway, detailing its modular architecture, high‑performance forwarding plane, lossless state migration, rapid fault recovery, multi‑level redundancy, operational best practices, and security mechanisms that enable ultra‑low latency and high availability for large‑scale internet services.

Cloud Gateway · State Migration · fault tolerance
0 likes · 13 min read
Tencent Technical Engineering
May 19, 2025 · Cloud Native

How Tencent’s TGW Delivers 3× Faster Throughput and Near‑Zero Downtime at Scale

The USENIX‑selected paper on Tencent’s TGW cloud gateway reveals how a modular, multi‑layer architecture achieves up to 2.9‑fold throughput gains, seconds‑level elastic scaling, loss‑less hot migration, and sub‑second fault recovery, offering a blueprint for resilient large‑scale cloud networking.

Cloud Gateway · State Migration · Tencent
0 likes · 16 min read
Xiaokun's Architecture Exploration Notes
May 11, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems suffer from network unreliability—including packet loss, out‑of‑order delivery, variable latency, and ambiguous node failures—making timeout settings and fault detection challenging, and this article explains these issues, compares synchronous and asynchronous networks, and discusses strategies to balance latency and resource utilization.

Distributed Systems · Network Reliability · asynchronous network
0 likes · 8 min read
Cognitive Technology Team
Apr 8, 2025 · Backend Development

Design and Implementation of RocketMQ NameServer: Core Functions, Architecture, and Optimization Strategies

The article explains RocketMQ NameServer's lightweight, stateless design, its core routing and metadata management functions, AP‑oriented architecture, fault‑tolerant mechanisms, scalability features, and practical optimization techniques for high availability and low operational cost.

Distributed Messaging · NameServer · RocketMQ
0 likes · 6 min read
DataFunSummit
Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructures for thousand‑card and ten‑thousand‑card models, detailing Baidu Baige’s metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.

Distributed Systems · Large-Scale Training · checkpointing
0 likes · 22 min read
Baidu Geek Talk
Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · Distributed Systems · Infrastructure
0 likes · 22 min read
Baidu Intelligent Cloud Tech Hub
Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI training · Large-Scale Clusters · checkpointing
0 likes · 25 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

chaos engineering · circuit breaker · fault tolerance
0 likes · 11 min read
IT Services Circle
Feb 9, 2025 · Big Data

Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability

This article explains how HDFS, the Hadoop Distributed File System, splits large files into blocks, replicates them for fault tolerance, organizes the cluster into NameNode and DataNode components, and provides high‑availability and scalability mechanisms such as standby NameNode and federation, enabling reliable big‑data storage and access.

Big Data · DataNode · Distributed File System
0 likes · 11 min read
Architect
Jan 23, 2025 · Operations

Designing High‑Availability Systems: Architecture, Capacity Planning, and Fault‑Tolerance Guide

This article presents a comprehensive guide to building high‑availability systems, covering availability metrics, fault prevention, detection and recovery, capacity evaluation, layered architecture design, service tiering, resilience mechanisms, and operational best practices for reliable service delivery.

Operations · System Architecture · capacity planning
0 likes · 34 min read
MaGe Linux Operations
Jan 17, 2025 · Databases

Understanding Redis Cluster: Architecture, Data Distribution, and Fault Tolerance

This article covers Redis Cluster, a scalable, fault‑tolerant distributed Redis solution, explaining why it is needed, its architecture, virtual slot partitioning, data distribution methods, limitations, smart client optimization, and automatic failover mechanisms, while highlighting key operational considerations for high‑performance deployments.

Cluster · Virtual Slots · data sharding
0 likes · 11 min read
IT Architects Alliance
Jan 14, 2025 · Backend Development

Microservice Architecture: Common Problems and Solutions

Microservice architecture, once a buzzword, breaks monolithic applications into independent services, but introduces challenges such as service governance, communication, gateway management, fault tolerance, and tracing; the article outlines these issues and presents practical solutions like Consul/Eureka, REST/RPC, API gateways, Hystrix, and tracing tools.

Backend Architecture · Distributed Tracing · api-gateway
0 likes · 11 min read
High Availability Architecture
Jan 13, 2025 · Operations

Comprehensive Guide to High‑Availability System Architecture and Practices

This article provides a systematic overview of high‑availability system design, covering availability metrics, fault prevention, detection, recovery, capacity planning, service tiering, data layer resilience, monitoring, and the responsibilities of architects, SREs, and developers to ensure reliable, scalable services.

System Architecture · capacity planning · fault tolerance
0 likes · 30 min read
Tencent Cloud Developer
Jan 7, 2025 · Operations

Designing High‑Availability Systems: Principles, Architecture, and Operations

This comprehensive guide explains how to design, build, and operate high‑availability systems by covering availability metrics, fault‑tolerance strategies, capacity planning, code and data layer architecture, automated testing, monitoring, and clear role responsibilities to ensure services stay reliable and resilient under load.

Cloud Native · SRE · System Design
0 likes · 32 min read
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Reliability · fault tolerance · monitoring
0 likes · 18 min read
BirdNest Tech Talk
Dec 29, 2024 · Fundamentals

Unlocking Distributed System Design: 20 Core Patterns Explained

This article distills the key design patterns behind distributed systems—covering replication, partitioning, consensus, and fault‑tolerance—by presenting each pattern’s problem statement, concrete solution, trade‑offs, and technical considerations, all illustrated with real‑world examples from projects like Kafka and Cassandra.

Consensus · Design Patterns · Distributed Systems
0 likes · 18 min read
DevOps Cloud Academy
Dec 2, 2024 · Artificial Intelligence

Key Kubernetes Features that Benefit AI Inference Workloads

This article explains how Kubernetes’ native scalability, resource optimization, performance tuning, portability, and fault‑tolerance features align with the demands of AI inference, helping organizations run large ML models efficiently, cost‑effectively, and reliably across diverse environments.

AI inference · Kubernetes · Portability
0 likes · 15 min read
Sanyou's Java Diary
Nov 25, 2024 · Cloud Native

Designing Resilient Stateful Distributed Systems: From Theory to Microservice Architecture

This article explores the fundamentals of distributed systems, compares stateful and stateless services, examines monolithic, SOA, and microservice models, and provides practical guidance on access layers, fault tolerance, service discovery, scaling, and data storage for building robust cloud‑native architectures.

Cloud Native · Microservices · Scalability
0 likes · 29 min read
Zhuanzhuan Tech
Nov 20, 2024 · Backend Development

Design and Implementation of a High‑Performance Message Notification System

This article presents a comprehensive design of a high‑performance, fault‑tolerant message notification system, covering service partitioning, system architecture, idempotent processing, dynamic error detection, thread‑pool management, retry mechanisms, and stability measures such as traffic‑spike handling, resource isolation, third‑party protection, monitoring, and active‑active deployment.

Backend Architecture · Distributed Systems · Java
0 likes · 16 min read
Tencent Cloud Developer
Oct 22, 2024 · Industry Insights

Designing Stateful Distributed Systems: Core Principles and Architecture Patterns

This article analyzes the motivations, benefits, and challenges of building stateful distributed systems, compares monolithic, SOA, and microservice models, and provides detailed guidance on access layers, service discovery, fault tolerance, scaling, and data storage for cloud‑native architectures.

Cloud Native · Distributed Systems · Microservices
0 likes · 29 min read
JavaEdge
Oct 21, 2024 · Operations

Why Move Beyond Microservices? Unlocking Resilience with Unitized Architecture

This article explores the advantages of unitized architecture over traditional microservices, detailing how its modular design, dedicated routing layer, and tailored observability practices enhance system resilience, fault‑tolerance, and operational insight for large‑scale distributed applications.

Distributed Systems · Resilience · fault tolerance
0 likes · 17 min read
Baidu Geek Talk
Oct 9, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Architecture Redefines AI Compute Efficiency

This article analyzes Baidu's Baige 4.0 AI infrastructure, detailing its four‑layer architecture, XMAN 5.0 hardware, HPN network, BCCL communication library, and AIAK inference upgrades, and explains how these innovations address large‑model training and inference challenges while boosting performance, utilization, and cost efficiency.

AI Infrastructure · Cluster Management · GPU Acceleration
0 likes · 16 min read
IT Services Circle
Oct 4, 2024 · Databases

Understanding Redis Split‑Brain: Causes, Data Loss, and Prevention Strategies

This article explains Redis split‑brain behavior, describing its definition, causes such as network failures and Sentinel elections, the resulting data loss during master‑slave switches, and practical prevention measures including quorum configuration, timeout tuning, network monitoring, proxy layers, and the min‑slaves‑to‑write and min‑slaves‑max‑lag settings.

Master‑Slave · Split-Brain · database
0 likes · 7 min read
FunTester
Sep 19, 2024 · Fundamentals

Software Antifragility: Rethinking Error Handling and Reliability

This paper introduces the concept of software antifragility, drawing on Taleb’s theory to argue that embracing errors through fault tolerance, automatic runtime repair, and fault injection can transform software systems into self‑improving, more robust entities, and discusses implications for development processes and product reliability.

antifragility · chaos engineering · fault tolerance
0 likes · 13 min read
Top Architect
Aug 15, 2024 · Backend Development

Handling Interface‑Level Failures: Degradation, Circuit Breaking, Rate Limiting, and Queuing

The article explains how interface‑level faults—where the system stays up but business performance degrades—can be mitigated through four core techniques (degradation, circuit breaking, rate limiting, and queuing), detailing their principles, implementation patterns, and practical trade‑offs for backend services.

Backend · circuit breaker · degradation
0 likes · 20 min read
dbaplus Community
Aug 13, 2024 · Artificial Intelligence

Why Kubernetes Is the Ideal Platform for AI Inference: 5 Key Benefits

Kubernetes aligns perfectly with AI inference demands by offering built‑in scalability, resource and performance optimization, seamless portability across clouds, and robust fault‑tolerance, making it a cost‑effective, high‑availability foundation for deploying large‑scale machine‑learning models.

AI inference · Kubernetes · Resource Optimization
0 likes · 10 min read
MaGe Linux Operations
Aug 9, 2024 · Operations

Mastering Elasticsearch Data Sync and Cluster Architecture: Strategies & Best Practices

This article explains how to keep MySQL and Elasticsearch data in sync using synchronous calls, asynchronous notifications, or binlog listeners, and dives deep into Elasticsearch cluster design, node roles, distributed storage, query phases, split‑brain handling, and fault‑tolerance mechanisms.

Cluster Architecture · Distributed Query · Elasticsearch
0 likes · 8 min read
Top Architecture Tech Stack
Jul 16, 2024 · Cloud Native

Designing Fault‑Tolerant Microservices Architecture: Patterns and Practices

The article explains how to build reliable microservices by isolating failures, applying graceful degradation, change‑management, health checks, self‑healing, fallback caching, retry strategies, rate limiting, fast‑fail principles, circuit breakers, and failure‑testing to ensure high availability in distributed cloud‑native systems.

Cloud Native · Microservices · Operations
0 likes · 14 min read
Su San Talks Tech
Jul 6, 2024 · Backend Development

Mastering High Availability: 10 Essential Design Techniques for Scalable Systems

This article explains ten core techniques—system splitting, decoupling, asynchrony, retry, compensation, backup, multi‑active strategies, isolation, rate limiting, circuit breaking, and degradation—that together enable robust, high‑availability architectures for modern backend services.

Distributed Systems · System Design · fault tolerance
0 likes · 12 min read
Ctrip Technology
Jun 20, 2024 · Backend Development

Design and Architecture of Ctrip Service Registration Center

The article explains Ctrip's service registration center architecture, including its two‑layer Data and Session design, multi‑sharding, fault‑tolerance mechanisms, Redis‑based cluster discovery, design trade‑offs such as proxy versus Smart SDK, hashing strategy, and operational considerations for burst traffic and future scaling.

Distributed Systems · Redis discovery · fault tolerance
0 likes · 16 min read
Mike Chen's Internet Architecture
May 31, 2024 · Backend Development

Mastering Microservice Splitting: 6 Essential Design Principles

This article outlines six fundamental microservice splitting principles—including single responsibility, appropriate granularity, interface segregation, product impact avoidance, scalability, and fault tolerance—to help architects design maintainable, decoupled, and resilient services.

Microservices · Scalability · fault tolerance
0 likes · 5 min read
Alibaba Cloud Big Data AI Platform
May 24, 2024 · Artificial Intelligence

How DeepRec Extension Boosts Distributed Sparse Model Training with Elasticity and Fault Tolerance

DeepRec Extension enhances large‑scale sparse model training by adding automatic elastic training, resource‑aware scheduling, real‑time monitoring, and efficient fault‑tolerance mechanisms, enabling lower cost, higher throughput, and more reliable distributed training for AI workloads.

AI Infrastructure · DeepRec · Sparse Models
0 likes · 13 min read
Tongcheng Travel Technology Center
Apr 17, 2024 · Backend Development

In-Depth Analysis of Apache RocketMQ Architecture, Operation Principles, and High‑Throughput Mechanisms

This article provides a comprehensive overview of Apache RocketMQ, detailing its core components, producer and consumer workflows, storage strategies, master‑slave synchronization, Raft‑based half‑write and leader election mechanisms, and best‑practice recommendations for high‑throughput, fault‑tolerant messaging systems.

Backend Development · Distributed Systems · High Throughput
0 likes · 22 min read
Architects' Tech Alliance
Apr 6, 2024 · Artificial Intelligence

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

Distributed Systems · GPU clusters · LLM training
0 likes · 15 min read
Architect
Apr 4, 2024 · Backend Development

Mastering High Availability: 9 Essential Design Techniques for Scalable Systems

The article walks through nine practical techniques—system splitting, decoupling, asynchronous processing, retry, compensation, backup, multi‑active deployment, rate limiting, circuit breaking, and degradation—explaining why each is needed, how they are implemented in real‑world microservice architectures, and what trade‑offs to consider.

Distributed Systems · Microservices · System Design
0 likes · 13 min read
Architecture & Thinking
Mar 5, 2024 · Databases

How Database Middleware Solves High‑Traffic Challenges: Connection Pools, Sharding, and More

This article examines how database middleware tackles the demanding needs of large‑scale internet services by providing centralized connection‑pool management, transparent read‑write splitting, diverse load‑balancing algorithms, sharding support, automatic failover, security controls, comprehensive monitoring, and flexible backup‑recovery mechanisms.

Connection Pool · fault tolerance · monitoring
0 likes · 9 min read
Linux Cloud Computing Practice
Mar 4, 2024 · Operations

Building a High‑Performance, Highly Available Membership System with ES, Redis & MySQL

To ensure the massive, multi‑platform membership service remains fast and reliable, this article details a multi‑center architecture using Elasticsearch for unified member data, Redis caching, and MySQL partitioning, along with traffic isolation, fault‑tolerant syncing, and fine‑grained flow‑control and degradation strategies.

System Architecture · fault tolerance · mysql
0 likes · 23 min read
Architect's Guide
Mar 2, 2024 · Fundamentals

RabbitMQ vs Kafka: Core Differences and When to Use Each

This article compares RabbitMQ and Apache Kafka across architecture, message ordering, routing, timing, retention, fault handling, scalability, and consumer complexity, and provides guidance on which platform suits specific use‑cases such as flexible routing, strict ordering, long‑term retention, or high throughput.

Kafka · Message Ordering · Message Queue
0 likes · 19 min read
Architecture & Thinking
Dec 25, 2023 · Databases

How to Detect, Analyze, and Prevent Redis Hot Keys to Avoid Outages

This article explains what Redis hot keys are, the scenarios that generate them, their risks, and provides practical monitoring methods and mitigation strategies—including cache pre‑warming, distributed caching, rate limiting, and secondary caches—to keep production systems stable.

Hot Key · fault tolerance · monitoring
0 likes · 11 min read
ITPUB
Dec 5, 2023 · Cloud Native

Prevent Massive K8s Outages: Scale, Redundancy, and Embrace Restarts

The article analyzes the November 27 Didi outage caused by an aggressive Kubernetes upgrade, then presents four engineering principles—controlling cluster size, eliminating single points of failure, treating restarts as normal, and decoupling data and control planes—to build more resilient cloud‑native systems.

Cloud Native · Cluster Upgrade · Kubernetes
0 likes · 13 min read
Spring Full-Stack Practical Cases
Dec 1, 2023 · Backend Development

Resilience4j Essentials: Circuit Breaker, TimeLimiter, Bulkhead & RateLimiter

This article introduces Resilience4j, a lightweight fault‑tolerance library for Spring Boot, explaining its core decorators—CircuitBreaker, TimeLimiter, Bulkhead, and RateLimiter—along with configuration examples, annotation usage, fallback handling, and practical test code to improve system stability and resilience.

Java · Spring Boot · circuit breaker
0 likes · 16 min read
Open Source Linux
Nov 23, 2023 · Operations

Mastering RAID Fault Tolerance: Consistency, Hot Spare, Rebuild & More

This article explains RAID fault tolerance mechanisms—including redundancy levels of RAID 1,5,6,10,50,60—covers consistency checks, hot‑spare and emergency backup, data reconstruction, read/write policies, power‑loss protection, striping, mirroring, foreign configurations, energy‑saving and JBOD, providing a comprehensive guide for storage administrators.

Data Protection · RAID · Storage Management
0 likes · 15 min read
Open Source Linux
Nov 21, 2023 · Fundamentals

Understanding RAID Levels: Choose the Right Storage Solution for Performance and Reliability

RAID combines multiple physical disks into virtual drives, offering various levels—RAID 0, 1, 1ADM, 5, 6, 10, 10ADM, 1E, 50, and 60—each balancing performance, fault tolerance, and capacity, with detailed processing flows, storage calculations, and best‑practice recommendations for optimal deployment.

RAID · data redundancy · fault tolerance
0 likes · 20 min read
Sanyou's Java Diary
Nov 20, 2023 · Operations

Mastering High Availability: 10 Essential Design Techniques for Scalable Systems

This article outlines ten practical techniques—including system splitting, decoupling, asynchronous processing, retry strategies, compensation, backup, multi‑active deployment, isolation, rate limiting, circuit breaking, and degradation—to help engineers design highly available, resilient architectures for large‑scale internet applications.

Microservices · System Design · fault tolerance
0 likes · 14 min read
Architects' Tech Alliance
Nov 5, 2023 · Fundamentals

Understanding RAID Fault Tolerance, Consistency Checks, Hot Spare, Rebuild, and Data Protection Features

This article explains RAID fault‑tolerance mechanisms, consistency verification, hot‑spare and emergency backup, rebuild processes, virtual‑disk read/write policies, power‑loss protection, disk striping, mirroring, foreign configurations, power‑saving and pass‑through features, providing a comprehensive overview of modern storage system capabilities.

RAID · disk striping · fault tolerance
0 likes · 16 min read
Alibaba Cloud Native
Oct 13, 2023 · Cloud Native

Why Microservice Governance Matters and How OpenSergo Tackles Its Challenges

The article explains the stability challenges of modern microservice architectures, outlines the three governance domains (development/testing, change, runtime), and introduces OpenSergo’s open, cloud‑native specifications, control‑plane, and data‑plane solutions for traffic routing, gray‑release, and fault‑tolerance.

OpenSergo · fault tolerance · gray release
0 likes · 18 min read
dbaplus Community
Oct 7, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down high‑availability system design into six critical layers—architecture, development standards, application services, storage, product safeguards, and operations—offering concrete practices such as capacity planning, fault‑tolerant patterns, monitoring, and incident‑response strategies to achieve four‑nines (99.99%) uptime.

Operations · System Design · capacity planning
0 likes · 26 min read
MaGe Linux Operations
Aug 29, 2023 · Operations

How to Effectively Monitor and Recover a Kafka Cluster

This guide explains essential Kafka monitoring techniques, third‑party tools, custom scripts, key metrics, and practical strategies for high availability, fault detection, rapid recovery, and ongoing testing to keep Kafka clusters stable and performant.

Operations · distributed-systems · fault tolerance
0 likes · 7 min read
JD Retail Technology
Aug 14, 2023 · Backend Development

Implementing a Lightweight Distributed Scheduling Solution to Replace TBSchedule

To improve stability and reduce costs during high‑traffic events, we replaced the Zookeeper‑dependent TBSchedule framework with a lightweight, Redis‑based distributed scheduler that decentralizes task execution, uses thread pools instead of timers, and supports dynamic scaling and seamless degradation for reliable order processing.

Distributed Scheduling · Microservices · fault tolerance
0 likes · 4 min read
JD Cloud Developers
Aug 9, 2023 · Backend Development

Mastering Hystrix: Implementing Circuit Breakers in Spring Cloud Microservices

This article explains why circuit breakers are essential in microservice architectures, introduces Netflix's Hystrix library, details its design principles, shows step‑by‑step demos for Ribbon and Feign integration, and covers dashboards, Turbine, isolation strategies, request merging, caching, and related Spring Boot SPI mechanisms.

Hystrix · Java · Microservices
0 likes · 29 min read
Architect
Aug 4, 2023 · Fundamentals

What Exactly Is Software Architecture? A Deep Dive into Systems, Modules, and Design Principles

The article systematically defines software architecture, distinguishes systems, subsystems, modules, and components, compares frameworks with architectures, explores TOGAF and RUP classifications, traces the evolution from monoliths to micro‑services, and presents concrete design principles and common pitfalls for building scalable, maintainable systems.

Microservices · Scalability · Software Architecture
0 likes · 25 min read
Architects Research Society
Jul 13, 2023 · Operations

Five Patterns to Make Your Microservice Fault‑Tolerant

This article explains essential fault‑tolerance patterns for microservices—including timeouts, retries, circuit breakers, distributed deadlines, and rate limiting—detailing their basic forms, drawbacks, and practical implementation strategies to improve reliability and prevent cascading failures.

Microservices · circuit breaker · fault tolerance
0 likes · 12 min read
58UXD
Apr 27, 2023 · Product Management

Designing Fault‑Tolerant Products: Lessons from AI and Human Error

The article explores how AI's rise highlights human error, argues that fault‑tolerant design is essential for user‑centric products, outlines practical guidelines such as anticipatory guidance and constructive error recovery, and validates the approach with a VR‑tool experiment.

AI integration · Error Handling · Product Design
0 likes · 9 min read
Zhuanzhuan Tech
Apr 26, 2023 · Backend Development

Design and Implementation of an Automated Payment Channel Management System

This article describes the design, technology choices, architecture, and implementation details of an automated payment channel management system that uses Redis‑based time‑series storage, custom circuit‑breaker logic, and monitoring to achieve fast fault detection, accurate alerting, and future automated failover.

Backend · circuit breaker · fault tolerance
0 likes · 10 min read
Programmer DD
Mar 16, 2023 · Operations

Why High Availability Matters: Building Fault‑Tolerant Cloud Systems

The article explains how system failures like bugs, security breaches, and cloud outages can cripple businesses, and outlines the concepts of fault tolerance and disaster recovery as essential components of high‑availability architectures to ensure continuous service and protect revenue.

disaster recovery · fault tolerance · high availability
0 likes · 7 min read
Tencent Cloud Developer
Mar 13, 2023 · Cloud Computing

Design Principles for High‑Availability System Architecture

The article outlines a comprehensive high‑availability architecture framework across six layers—development standards, application services, storage, product fallback, operations deployment, and emergency response—detailing design principles such as stateless services, elastic scaling, redundant storage, robust monitoring, gray releases, and chaos engineering to ensure resilient, continuously available systems.

Deployment · Scalability · System Architecture
0 likes · 25 min read
DaTaobao Tech
Mar 1, 2023 · Game Development

Design and Implementation of Taobao Dou Dizhu Endgame Mode

The article describes the design and implementation of Taobao Dou Dizhu’s new single‑player endgame mode, which generates daily unique puzzles with a guaranteed single solution, manages activity triggers, Redis caching, AI interaction, fault tolerance, consistency, and reward idempotency, boosting user retention during promotions.

AI · Backend Architecture · caching
0 likes · 15 min read