Tagged articles

fault tolerance

317 articles · Page 1 of 4

Jul 1, 2026 · Operations

When One Timeout Triggers a Platform‑Wide Outage

The article explains how unbounded retries, replication fan‑out, and naïve autoscaling can amplify a single timeout into a cascade of failures, and it proposes bounded retry policies, load‑aware scaling, and layered persistence as safeguards for reliable API‑centric systems.

autoscaling · bounded retries · distributed systems

0 likes · 12 min read


Lobster Programming

Jun 1, 2026 · Backend Development

How ZooKeeper Implements Distributed Locks: Mechanism and Pitfalls

The article explains ZooKeeper's herd effect, how ephemeral sequential nodes and chain watching reduce notification storms, how client failures are handled, and why most projects use Curator to simplify fault‑tolerant distributed lock implementations.

Chain Watching · Curator · Distributed Lock

0 likes · 5 min read


IT Learning Made Simple

May 31, 2026 · Backend Development

What Journey to the West Teaches About Distributed System Architecture

Using the classic tale Journey to the West, the article maps each disciple to a microservice, explains the shift from monolith to microservices, and illustrates service governance, load balancing, service discovery, fault tolerance, and distributed transactions through vivid analogies and concrete examples.

Microservices · Service Governance · distributed systems

0 likes · 7 min read


Java Tech Workshop

May 31, 2026 · Backend Development

Spring Boot Service Circuit Breaking and Degradation with Sentinel: A Practical Guide

This article explains how microservice architectures suffer from cascading failures and demonstrates how to use Sentinel for rate limiting, circuit breaking, and degradation—including architecture, configuration, code examples, and best‑practice tips—to achieve high‑availability Spring Boot services.

Sentinel · Spring Boot · circuit breaking

0 likes · 16 min read


FunTester

May 21, 2026 · Artificial Intelligence

How Anthropic Solves Agent Forgetfulness with Event Persistence

The article explains why in‑memory state is unreliable for long‑running or parallel agents, defines event persistence, shows how persisted event records enable checkpoint‑restart, observability, and experience extraction, and outlines practical guidelines for what to record.

AI · Agent · Observability

0 likes · 10 min read


Coder Trainee

May 18, 2026 · Cloud Native

Spring Cloud Microservices Tutorial – Sentinel for Fault Tolerance and Rate Limiting

This article walks through adding Alibaba Sentinel to a Spring Cloud microservice suite to protect against service outages, traffic spikes, and slow calls by configuring rate limiting, circuit breaking, and fallback mechanisms across user, order, and gateway services, with full Docker‑compose setup and testing steps.

Feign · Microservices · Sentinel

0 likes · 14 min read


Architect's Guide

May 13, 2026 · Big Data

Next‑Gen Visual Drag‑Drop Data Flow Platform: Features, Architecture, and Performance

The article introduces a visual drag‑and‑drop data flow platform that unifies stream and batch processing, offers version control, automatic fault tolerance, configurable data permissions, comprehensive monitoring, data alignment, and query templates, and presents single‑instance performance benchmarks of over 30k and 60k ops/s.

Data Alignment · Data Flow · Drag-and-Drop

0 likes · 7 min read


PaperAgent

May 8, 2026 · Artificial Intelligence

Jeff Dean’s Decoupled DiLoCo Shatters the Million‑Chip LLM Pre‑training Bottleneck

The article explains how Google’s Decoupled DiLoCo architecture breaks the scalability wall of million‑chip LLM pre‑training by partitioning the cluster into independent learners, using an asynchronous syncer, and achieving up to 88% effective compute while preserving model quality.

AI · Google · LLM

0 likes · 7 min read


Linyb Geek Road

May 6, 2026 · Artificial Intelligence

Ensuring High Availability and Robustness for LLM Agents: Key Strategies and Pitfalls

The article breaks down the unique hard and soft failure modes of LLM‑driven agents and proposes a four‑layer defense—LLM call handling, tool execution isolation, execution‑chain checkpointing, and semantic‑level safeguards—plus observability practices to keep production agents stable and reliable.

Agent · Checkpoint · LLM

0 likes · 15 min read


IT Services Circle

Apr 30, 2026 · Backend Development

How a Single Front‑end Change Dragged Four Backend Teams – The BFF Solution

A tiny UI tweak that required a meeting with four backend groups exposed the pain of calling many micro‑services from the front‑end, and the article shows how introducing a Backend‑For‑Frontend (BFF) layer can aggregate, transform, and simplify those calls while improving reliability and performance.

API aggregation · BFF · Backend For Frontend

0 likes · 21 min read


Golang Shines

Apr 28, 2026 · Backend Development

Essential Go Packages for Production Environments

This article compiles a curated list of production‑ready Go packages covering testing, logging, error handling, caching, databases, HTTP routing, HTTP clients, fault tolerance, Kafka, and various utility libraries, explaining their key features, concrete code examples, and why they are preferred in real‑world services.

Caching · Go · HTTP

0 likes · 15 min read


Coder Trainee

Apr 27, 2026 · Cloud Native

Spring Cloud Microservices Practice #6: Sentinel for Service Fault Tolerance and Rate Limiting

This article explains why service fault tolerance is essential in micro‑service architectures, compares Sentinel with Hystrix and Resilience4j, and provides step‑by‑step guidance on integrating Sentinel for circuit breaking, QPS and concurrency limiting, hot‑parameter control, system protection, and dynamic rule management with Nacos.

Microservices · Sentinel · circuit breaking

0 likes · 14 min read


Architecture and Beyond

Apr 25, 2026 · Artificial Intelligence

Practical Insights on Recent AI Engineering Deployments

The article examines how large language models function as probabilistic components within deterministic software, discusses fault‑tolerance limits for viable AI use cases, and offers detailed engineering guidance on RAG pipelines, tool‑calling determinism, agent fragility, testing, monitoring, and privacy‑conscious deployment in finance.

AI Engineering · LLM · RAG

0 likes · 14 min read


AI Tech Publishing

Apr 21, 2026 · Artificial Intelligence

Why Your AI Agent Stays a Toy: Six Production‑Readiness Gaps and How to Bridge Them

Moving an AI agent from a controlled demo to an unattended production environment introduces six critical gaps—fault handling, state persistence, observability, credential security, cost control, and human supervision—each requiring specific infrastructure, practices, and a comprehensive readiness checklist to avoid costly failures.

AI Agents · Observability · cost management

0 likes · 15 min read


JD Tech

Apr 15, 2026 · Artificial Intelligence

How OpenClaw Powers Multi‑Channel AI Agents with Skills and Sub‑Agents

The article provides an in‑depth analysis of OpenClaw’s architecture, explaining why it was created, its layered design, the core ReAct loop, the Skill system, sub‑agent creation and management, fault‑tolerance mechanisms, tool policies, and how it extends the pi‑mono engine to support robust, multi‑channel AI agents.

AI Agents · OpenClaw · ReAct loop

0 likes · 20 min read


AI Insight Log

Apr 8, 2026 · Artificial Intelligence

Anthropic Blocks Third‑Party Agents, Then Launches Claude Managed Agents to Disrupt the Startup Scene

Anthropic’s Claude Managed Agents is a hosted platform that offers sandboxed execution, long‑running sessions, multi‑agent coordination, MCP integration and immutable session persistence, delivering up to 90% latency reduction and fault‑tolerant design, while early adopters like Notion, Rakuten, Asana and Sentry showcase real‑world production use.

AI agent orchestration · Anthropic · Claude Managed Agents

0 likes · 7 min read


AI Architecture Hub

Feb 25, 2026 · Artificial Intelligence

How OpenClaw Turns AI Agents into Production‑Ready Infrastructure

This article analyzes OpenClaw’s engineering‑focused architecture, detailing its three‑layer component boundaries, gateway‑centric session management, concurrency controls, fault‑self‑healing mechanisms, context handling, multi‑agent routing, and practical deployment scenarios for building stable, auditable AI agent systems.

AI Agents · OpenClaw · fault tolerance

0 likes · 20 min read


AI Large Model Application Practice

Feb 9, 2026 · Artificial Intelligence

Inside OpenClaw: How Its Agent Engine Powers Scalable, Fault‑Tolerant AI Agents

This article dissects OpenClaw’s core Agent engine, explaining its workspace layout, overall architecture, scheduling and concurrency mechanisms, high‑availability safeguards, and context‑guard strategies that together enable robust, production‑grade AI agents.

AI Agent Architecture · Concurrency Control · Context Guard

0 likes · 13 min read


Amap Tech

Feb 3, 2026 · Artificial Intelligence

Building a Scalable AI Agent Smart Task Framework for Offline & Event‑Driven Use

As LLM adoption entered a more demanding phase, developers realized that agents must go beyond passive Q&A to support asynchronous, long‑running, and subscribable tasks; this article details the design, architecture, and engineering challenges of the “Xiao Gao Teacher AI Agent” smart‑task system, from event‑driven logic to fault‑tolerant deployment.

AI Agent · Event-Driven Architecture · LLM

0 likes · 19 min read


Architect's Journey

Dec 3, 2025 · Cloud Native

Microservice Governance Guide: From Stable Operations to Maximum Efficiency

This comprehensive guide breaks down microservice governance into four pillars—node management, load balancing, routing, and fault tolerance—providing concrete configurations, algorithm choices, and service‑mesh recommendations to achieve 99.99% availability, cut wasted resources by over 30%, and halve iteration cycles.

Governance · Microservices · Routing

0 likes · 16 min read


Architect's Journey

Dec 1, 2025 · Backend Development

Designing Three‑High Systems: Practical Performance Tuning and Fault‑Tolerant Architecture

The article breaks down the design logic and implementation steps for high‑performance, high‑concurrency, and high‑availability systems, covering bottleneck identification, read/write optimization, three‑dimensional scaling, and concrete fault‑tolerance strategies to build resilient, scalable services.

High Availability · High concurrency · fault tolerance

0 likes · 15 min read


IT Architects Alliance

Nov 9, 2025 · Operations

How to Build Fault‑Tolerant Distributed Systems: Principles, Patterns, and Code

This article explains core fault‑tolerance principles for distributed systems, covering isolation, redundancy, health checks, failure detection, automatic recovery, consistency trade‑offs, Saga transactions, monitoring, prediction, and team practices to create resilient, maintainable architectures.

Microservices · fault tolerance · kubernetes

0 likes · 10 min read


JD Tech

Sep 26, 2025 · Operations

Avoiding High‑Availability Pitfalls: Real‑World JD Lessons and Solutions

This article examines common high‑availability challenges across applications, databases, caches, message queues, containers, and GC, presenting real JD engineering cases, root‑cause analyses, and practical mitigation strategies to help engineers design more resilient systems.

High Availability · Message Queue · Redis

0 likes · 37 min read


Ops Community

Sep 17, 2025 · Operations

Mastering System Fault Tolerance: From Theory to Production‑Ready High‑Availability

This comprehensive guide explores the philosophy, core patterns, and practical techniques for designing fault‑tolerant, highly available systems, covering circuit breakers, retries, rate limiting, monitoring, cloud‑native deployment, and real‑world case studies to help engineers build resilient production architectures.

Cloud Native · High Availability · circuit breaker

0 likes · 24 min read


IT Architects Alliance

Sep 14, 2025 · Operations

How to Build Truly High‑Availability Systems: Principles, Patterns & Code

This article explores the core concepts, design principles, and practical code examples for building high‑availability architectures, covering fault isolation, load balancing, data replication, monitoring, and cost‑benefit considerations to keep large‑scale services running reliably.

Cloud Native · High Availability · Monitoring

0 likes · 11 min read


Efficient Ops

Sep 9, 2025 · Fundamentals

Inside 3FS: How Distributed File Systems Hide Complexity and Scale

3FS is an open‑source distributed file system that abstracts multiple machines into a single namespace, offering massive scalability, fault tolerance, and high throughput through components like Meta, Mgmtd, Storage, and Client, and leveraging the CRAQ protocol for strong consistency and efficient reads and writes.

3FS · CRAQ · Distributed File System

0 likes · 12 min read


NiuNiu MaTe

Sep 4, 2025 · Operations

Mastering Multi‑Active Distributed Systems: From Single Server to Global Fault Tolerance

This article walks developers through the evolution of distributed system architectures—from single‑machine deployments to master‑slave, same‑city active‑active, and finally true multi‑active setups—explaining core concepts, replication strategies, conflict resolution, fault detection, switch mechanisms, recovery methods, and interview tips for high‑availability design.

CAP theorem · Data Replication · distributed systems

0 likes · 26 min read


JD Tech Talk

Sep 4, 2025 · Operations

Avoid Common High‑Availability Pitfalls: Real‑World JD Practices and Solutions

This article analyzes the multi‑dimensional challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—by sharing real JD engineering scenarios, common failure patterns, and concrete mitigation strategies to help engineers design more resilient services.

High Availability · backend · distributed systems

0 likes · 36 min read


JD Cloud Developers

Sep 4, 2025 · Operations

Mastering High‑Availability: JD Real‑World Pitfalls & Fixes for Apps, DBs, Cache & MQ

This article shares JD's practical high‑availability architecture lessons, detailing common pitfalls across applications, databases, caches, RPC frameworks, containers, data centers, GC, and message queues, and provides concrete troubleshooting steps and optimization techniques to help engineers design more resilient, fault‑tolerant systems.

High Availability · System Design · backend

0 likes · 36 min read


JD Retail Technology

Sep 4, 2025 · Operations

Mastering High Availability: Real-World Pitfalls and Solutions from JD's Production Systems

This article walks through the challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—using JD’s production experiences to highlight common pitfalls, root‑cause analyses, and practical mitigation strategies for engineers seeking resilient architecture.

Cache · High Availability · JDK

0 likes · 37 min read


Architect's Guide

Aug 25, 2025 · Fundamentals

19 Essential Distributed System Design Patterns You Must Know

This article explores nineteen core design patterns for distributed systems—including Bloom filters, consistent hashing, quorum, leader‑follower, heartbeat, fencing, WAL, segmented logs, high‑water mark, leases, gossip, Phi accrual detection, split‑brain handling, checksums, CAP and PACELC theorems, hinted handoff, read repair, and Merkle trees—explaining their purpose, operation, and typical use cases.

consistency · distributed systems · fault tolerance

0 likes · 14 min read


Tech Freedom Circle

Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Java · Microservices · Monitoring

0 likes · 34 min read


Tech Freedom Circle

Jul 27, 2025 · Interview Experience

Designing a Payment Middle Platform from Scratch – Core Challenges (Interview Answer)

This article provides a comprehensive guide to designing a payment middle platform from zero, covering its definition, classic middle‑platform types, core architecture, functional modules, fault‑tolerance, security measures, distributed‑transaction strategies, and detailed Java pseudocode, offering interview‑ready insights for architects.

Microservices · architecture · distributed transaction

0 likes · 39 min read


JakartaEE China Community

Jul 15, 2025 · Cloud Native

Choosing a Technology Stack for Cloud‑Native Microservices: MicroProfile vs Spring

This article explains why cloud‑native microservices are beneficial, defines their key characteristics, and provides a detailed, side‑by‑side comparison of MicroProfile and Spring frameworks—including REST APIs, dependency injection, configuration, fault tolerance, security, health checks, metrics, and tracing—along with concrete code examples and starter resources.

Cloud Native · Configuration · MicroProfile

0 likes · 27 min read


Big Data Technology Tribe

Jul 8, 2025 · Operations

Mastering Retry Strategies: Why Exponential Backoff Is Essential for Reliable Systems

This article explains the purpose of retry mechanisms, why exponential backoff is crucial for handling transient failures, compares common backoff strategies, details key parameters such as base delay, max delay, multiplier and jitter, and provides a Java example that demonstrates their practical effects.

Java · distributed systems · exponential backoff

0 likes · 6 min read


IT Architects Alliance

Jul 7, 2025 · Backend Development

Avoid the 5 Fatal Architecture Mistakes That Cost Millions

This article analyzes five common architectural design errors—over‑pursuing cutting‑edge tech, single points of failure, mishandling data consistency, fragmented performance tuning, and neglecting security—illustrating their costly impacts with real‑world cases and offering practical principles to prevent them.

Microservices · fault tolerance · performance

0 likes · 13 min read


Cognitive Technology Team

Jun 21, 2025 · Fundamentals

Understanding Faults, Failures, and Fault Tolerance in Distributed Systems

This tutorial explains the definitions of faults and failures in distributed systems, explores their types and root causes, and presents fault‑tolerance mechanisms such as replication, checkpointing, redundancy, error detection, load balancing, and consensus algorithms to build resilient architectures.

Data Replication · consensus algorithms · distributed systems

0 likes · 10 min read


Linux Kernel Journey

Jun 16, 2025 · Cloud Computing

How Tencent’s TGW Achieves Seamless Fast Migration and Self‑Healing Fault Recovery

The paper presents Tencent’s TGW cloud gateway architecture, highlighting a 2.9× forwarding performance boost, lossless state migration within 4 seconds, sub‑minute fault detection, multi‑level fault‑tolerance mechanisms, and operational best practices that enable 100% availability for massive online services.

Cloud Gateway · DPDK · State Migration

0 likes · 16 min read


Tencent Cloud Developer

May 20, 2025 · Cloud Computing

Efficient and Resilient Cloud Gateway at Scale: Architecture, Key Technologies, and Operational Practices of Tencent TGW

The article presents a comprehensive analysis of Tencent's TGW cloud gateway, detailing its modular architecture, high‑performance forwarding plane, lossless state migration, rapid fault recovery, multi‑level redundancy, operational best practices, and security mechanisms that enable ultra‑low latency and high availability for large‑scale internet services.

Cloud Gateway · State Migration · fault tolerance

0 likes · 13 min read


Tencent Technical Engineering

May 19, 2025 · Cloud Native

How Tencent’s TGW Delivers 3× Faster Throughput and Near‑Zero Downtime at Scale

The USENIX‑selected paper on Tencent’s TGW cloud gateway reveals how a modular, multi‑layer architecture achieves up to 2.9‑fold throughput gains, seconds‑level elastic scaling, loss‑less hot migration, and sub‑second fault recovery, offering a blueprint for resilient large‑scale cloud networking.

Cloud Gateway · High Availability · Network Architecture

0 likes · 16 min read


Xiaokun's Architecture Exploration Notes

May 18, 2025 · Fundamentals

How Distributed Consensus Overcomes the FLP Impossibility Theorem

This article explores how to build fault‑tolerant distributed systems by formalizing consensus, outlines its core properties, explains the FLP impossibility theorem, and shows how algorithms like Raft sidestep its limits through timing constraints and recovery mechanisms.

Consensus · FLP theorem · High Availability

0 likes · 8 min read


Xiaokun's Architecture Exploration Notes

May 11, 2025 · Fundamentals

How Fencing Tokens Ensure Safety and Liveness in Distributed Lock Services

This article explores how fencing tokens can provide safety and liveness guarantees in distributed lock services, illustrating fault scenarios, token-based conflict resolution, and abstract system models that help engineers prioritize correctness while tolerating temporary unavailability.

distributed systems · fault tolerance · fencing tokens

0 likes · 8 min read


Xiaokun's Architecture Exploration Notes

May 11, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems suffer from network unreliability—including packet loss, out‑of‑order delivery, variable latency, and ambiguous node failures—making timeout settings and fault detection challenging, and this article explains these issues, compares synchronous and asynchronous networks, and discusses strategies to balance latency and resource utilization.

Network Reliability · asynchronous network · distributed systems

0 likes · 8 min read


Cognitive Technology Team

Apr 8, 2025 · Backend Development

Design and Implementation of RocketMQ NameServer: Core Functions, Architecture, and Optimization Strategies

The article explains RocketMQ NameServer's lightweight, stateless design, its core routing and metadata management functions, AP‑oriented architecture, fault‑tolerant mechanisms, scalability features, and practical optimization techniques for high availability and low operational cost.

Distributed Messaging · NameServer · RocketMQ

0 likes · 6 min read


DataFunSummit

Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructures for thousand‑card and ten‑thousand‑card models, detailing Baidu Baige’s metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.

Large‑Scale Training · checkpointing · distributed systems

0 likes · 22 min read


Baidu Geek Talk

Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · distributed systems · eBPF

0 likes · 22 min read


Baidu Intelligent Cloud Tech Hub

Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI training · Large-Scale Clusters · checkpointing

0 likes · 25 min read


FunTester

Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Monitoring · chaos engineering · circuit breaker

0 likes · 11 min read


IT Services Circle

Feb 9, 2025 · Big Data

Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability

This article explains how HDFS, the Hadoop Distributed File System, splits large files into blocks, replicates them for fault tolerance, organizes the cluster into NameNode and DataNode components, and provides high‑availability and scalability mechanisms such as standby NameNode and federation, enabling reliable big‑data storage and access.

Big Data · DataNode · Distributed File System

0 likes · 11 min read

Architect

Jan 23, 2025 · Operations

Designing High‑Availability Systems: Architecture, Capacity Planning, and Fault‑Tolerance Guide

This article presents a comprehensive guide to building high‑availability systems, covering availability metrics, fault prevention, detection and recovery, capacity evaluation, layered architecture design, service tiering, resilience mechanisms, and operational best practices for reliable service delivery.

High Availability · Operations · capacity planning

0 likes · 34 min read


MaGe Linux Operations

Jan 17, 2025 · Databases

Understanding Redis Cluster: Architecture, Data Distribution, and Fault Tolerance

Redis Cluster provides a scalable, fault‑tolerant distributed Redis solution; the article explains why it is needed and covers its architecture, virtual slot partitioning, data distribution methods, limitations, smart client optimization, and automatic failover mechanisms, while highlighting key operational considerations for high‑performance deployments.

Redis · Virtual Slots · cluster

0 likes · 11 min read


IT Architects Alliance

Jan 14, 2025 · Backend Development

Microservice Architecture: Common Problems and Solutions

Microservice architecture, once a buzzword, breaks monolithic applications into independent services, but introduces challenges such as service governance, communication, gateway management, fault tolerance, and tracing; the article outlines these issues and presents practical solutions like Consul/Eureka, REST/RPC, API gateways, Hystrix, and tracing tools.

API Gateway · Distributed Tracing · Service Governance

0 likes · 11 min read


High Availability Architecture

Jan 13, 2025 · Operations

Comprehensive Guide to High‑Availability System Architecture and Practices

This article provides a systematic overview of high‑availability system design, covering availability metrics, fault prevention, detection, recovery, capacity planning, service tiering, data layer resilience, monitoring, and the responsibilities of architects, SREs, and developers to ensure reliable, scalable services.

capacity planning · fault tolerance · system architecture

0 likes · 30 min read


Tencent Cloud Developer

Jan 7, 2025 · Operations

Designing High‑Availability Systems: Principles, Architecture, and Operations

This comprehensive guide explains how to design, build, and operate high‑availability systems by covering availability metrics, fault‑tolerance strategies, capacity planning, code and data layer architecture, automated testing, monitoring, and clear role responsibilities to ensure services stay reliable and resilient under load.

Cloud Native · High Availability · SRE

0 likes · 32 min read

IT Architects Alliance

Jan 6, 2025 · Big Data

How Distributed Architecture Tames Massive Data: Strategies, Benefits, and Real‑World Cases

In an era of exploding data volumes, distributed architecture offers unparalleled scalability, fault tolerance, and parallel performance through sharding, replication, batch and stream processing, with real‑world examples from e‑commerce and social media giants illustrating its practical impact.

Big Data · data sharding · distributed architecture

0 likes · 12 min read

IT Architects Alliance

Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Monitoring · Reliability · fault tolerance

0 likes · 18 min read

BirdNest Tech Talk

Dec 29, 2024 · Fundamentals

Unlocking Distributed System Design: 20 Core Patterns Explained

This article distills the key design patterns behind distributed systems—covering replication, partitioning, consensus, and fault‑tolerance—by presenting each pattern’s problem statement, concrete solution, trade‑offs, and technical considerations, all illustrated with real‑world examples from projects like Kafka and Cassandra.

Consensus · design patterns · distributed systems

0 likes · 18 min read

FunTester

Dec 9, 2024 · Operations

How to Prevent Fault Propagation in Microservices: Best Practices for Resilience

This article outlines practical strategies such as service isolation, circuit breaking, rate limiting, dependency governance, and chaos engineering to keep microservice systems highly available and resilient, reducing outage impact and operational costs.

Microservices · chaos engineering · circuit breaker

0 likes · 12 min read

DevOps Cloud Academy

Dec 2, 2024 · Artificial Intelligence

Key Kubernetes Features that Benefit AI Inference Workloads

This article explains how Kubernetes’ native scalability, resource optimization, performance tuning, portability, and fault‑tolerance features align with the demands of AI inference, helping organizations run large ML models efficiently, cost‑effectively, and reliably across diverse environments.

AI inference · Portability · fault tolerance

0 likes · 15 min read

Sanyou's Java Diary

Nov 25, 2024 · Cloud Native

Designing Resilient Stateful Distributed Systems: From Theory to Microservice Architecture

This article explores the fundamentals of distributed systems, compares stateful and stateless services, examines monolithic, SOA, and microservice models, and provides practical guidance on access layers, fault tolerance, service discovery, scaling, and data storage for building robust cloud‑native architectures.

Cloud Native · Microservices · fault tolerance

0 likes · 29 min read

Zhuanzhuan Tech

Nov 20, 2024 · Backend Development

Design and Implementation of a High‑Performance Message Notification System

This article presents a comprehensive design of a high‑performance, fault‑tolerant message notification system, covering service partitioning, system architecture, idempotent processing, dynamic error detection, thread‑pool management, retry mechanisms, and stability measures such as traffic‑spike handling, resource isolation, third‑party protection, monitoring, and active‑active deployment.

Java · Message Notification · backend-architecture

0 likes · 16 min read

Tencent Cloud Developer

Oct 22, 2024 · Industry Insights

Designing Stateful Distributed Systems: Core Principles and Architecture Patterns

This article analyzes the motivations, benefits, and challenges of building stateful distributed systems, compares monolithic, SOA, and microservice models, and provides detailed guidance on access layers, service discovery, fault tolerance, scaling, and data storage for cloud‑native architectures.

Cloud Native · Microservices · distributed systems

0 likes · 29 min read

JavaEdge

Oct 21, 2024 · Operations

Why Move Beyond Microservices? Unlocking Resilience with Unitized Architecture

This article explores the advantages of unitized architecture over traditional microservices, detailing how its modular design, dedicated routing layer, and tailored observability practices enhance system resilience, fault‑tolerance, and operational insight for large‑scale distributed applications.

Resilience · distributed systems · fault tolerance

0 likes · 17 min read

Baidu Geek Talk

Oct 9, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Architecture Redefines AI Compute Efficiency

This article analyzes Baidu's Baige 4.0 AI infrastructure, detailing its four‑layer architecture, XMAN 5.0 hardware, HPN network, BCCL communication library, and AIAK inference upgrades, and explains how these innovations address large‑model training and inference challenges while boosting performance, utilization, and cost efficiency.

AI Infrastructure · GPU Acceleration · High-performance computing

0 likes · 16 min read

IT Services Circle

Oct 4, 2024 · Databases

Understanding Redis Split‑Brain: Causes, Data Loss, and Prevention Strategies

This article explains Redis split‑brain behavior, describing its definition, causes such as network failures and Sentinel elections, the resulting data loss during master‑slave switches, and practical prevention measures including quorum configuration, timeout tuning, network monitoring, proxy layers, and the min‑slaves‑to‑write and min‑slaves‑max‑lag settings.

High Availability · Master‑Slave · Sentinel

0 likes · 7 min read

FunTester

Sep 19, 2024 · Fundamentals

Software Antifragility: Rethinking Error Handling and Reliability

This paper introduces the concept of software antifragility, drawing on Taleb’s theory to argue that embracing errors through fault tolerance, automatic runtime repair, and fault injection can transform software systems into self‑improving, more robust entities, and discusses implications for development processes and product reliability.

Antifragility · chaos engineering · fault tolerance

0 likes · 13 min read

Top Architect

Aug 15, 2024 · Backend Development

Handling Interface‑Level Failures: Degradation, Circuit Breaking, Rate Limiting, and Queuing

The article explains how interface‑level faults—where the system stays up but business performance degrades—can be mitigated through four core techniques (degradation, circuit breaking, rate limiting, and queuing), detailing their principles, implementation patterns, and practical trade‑offs for backend services.

backend · circuit breaker · degradation

0 likes · 20 min read

dbaplus Community

Aug 13, 2024 · Artificial Intelligence

Why Kubernetes Is the Ideal Platform for AI Inference: 5 Key Benefits

Kubernetes aligns perfectly with AI inference demands by offering built‑in scalability, resource and performance optimization, seamless portability across clouds, and robust fault‑tolerance, making it a cost‑effective, high‑availability foundation for deploying large‑scale machine‑learning models.

AI inference · fault tolerance · kubernetes

0 likes · 10 min read

MaGe Linux Operations

Aug 9, 2024 · Operations

Mastering Elasticsearch Data Sync and Cluster Architecture: Strategies & Best Practices

This article explains how to keep MySQL and Elasticsearch data in sync using synchronous calls, asynchronous notifications, or binlog listeners, and dives deep into Elasticsearch cluster design, node roles, distributed storage, query phases, split‑brain handling, and fault‑tolerance mechanisms.

Cluster Architecture · Data synchronization · Distributed Query

0 likes · 8 min read

Architects' Tech Alliance

Jul 30, 2024 · Artificial Intelligence

Unlocking 10K‑GPU LLM Training: Inside MegaScale’s 55% MFU Breakthrough

This article translates and analyzes the MegaScale paper, which describes a system co‑developed by ByteDance and Peking University that enables efficient, stable training of massive language models on clusters of more than 10,000 GPUs, achieving 55.2% MFU and a 1.34× speedup over Megatron‑LM.

GPU scaling · LLM training · MegaScale

0 likes · 15 min read

Top Architecture Tech Stack

Jul 16, 2024 · Cloud Native

Designing Fault‑Tolerant Microservices Architecture: Patterns and Practices

The article explains how to build reliable microservices by isolating failures, applying graceful degradation, change‑management, health checks, self‑healing, fallback caching, retry strategies, rate limiting, fast‑fail principles, circuit breakers, and failure‑testing to ensure high availability in distributed cloud‑native systems.

Cloud Native · Microservices · Operations

0 likes · 14 min read

Su San Talks Tech

Jul 6, 2024 · Backend Development

Mastering High Availability: 10 Essential Design Techniques for Scalable Systems

This article explains ten core techniques—system splitting, decoupling, asynchrony, retry, compensation, backup, multi‑active strategies, isolation, rate limiting, circuit breaking, and degradation—that together enable robust, high‑availability architectures for modern backend services.

High Availability · System Design · distributed systems

0 likes · 12 min read

Ctrip Technology

Jun 20, 2024 · Backend Development

Design and Architecture of Ctrip Service Registration Center

The article explains Ctrip's service registration center architecture, including its two‑layer Data and Session design, multi‑sharding, fault‑tolerance mechanisms, Redis‑based cluster discovery, design trade‑offs such as proxy versus Smart SDK, hashing strategy, and operational considerations for burst traffic and future scaling.

Redis discovery · Service Registry · distributed systems

0 likes · 16 min read

Ops Development & AI Practice

Jun 5, 2024 · Fundamentals

How Paxos Guarantees Consistency in Distributed Systems – A Deep Dive

This article explains the Paxos consensus algorithm, detailing its roles, three-phase execution process, key properties such as consistency, availability and fault tolerance, and showcases its practical applications in distributed databases, file systems, and coordination services.

Consensus Algorithm · Paxos · consistency

0 likes · 7 min read

Mike Chen's Internet Architecture

May 31, 2024 · Backend Development

Mastering Microservice Splitting: 6 Essential Design Principles

This article outlines six fundamental microservice splitting principles—including single responsibility, appropriate granularity, interface segregation, product impact avoidance, scalability, and fault tolerance—to help architects design maintainable, decoupled, and resilient services.

Microservices · fault tolerance · interface segregation

0 likes · 5 min read

Alibaba Cloud Big Data AI Platform

May 24, 2024 · Artificial Intelligence

How DeepRec Extension Boosts Distributed Sparse Model Training with Elasticity and Fault Tolerance

DeepRec Extension enhances large‑scale sparse model training by adding automatic elastic training, resource‑aware scheduling, real‑time monitoring, and efficient fault‑tolerance mechanisms, enabling lower cost, higher throughput, and more reliable distributed training for AI workloads.

AI Infrastructure · DeepRec · elastic training

0 likes · 13 min read

Tongcheng Travel Technology Center

Apr 17, 2024 · Backend Development

In-Depth Analysis of Apache RocketMQ Architecture, Operation Principles, and High‑Throughput Mechanisms

This article provides a comprehensive overview of Apache RocketMQ, detailing its core components, producer and consumer workflows, storage strategies, master‑slave synchronization, Raft‑based majority‑write and leader‑election mechanisms, and best‑practice recommendations for high‑throughput, fault‑tolerant messaging systems.

Backend Development · Message Queue · Raft

0 likes · 22 min read

Huolala Tech

Apr 11, 2024 · Operations

How DataMesh Achieves 99.999% SLA with Architecture and High‑Availability Tactics

This article explains how DataMesh, a sidecar‑deployed Redis proxy, uses a layered architecture, risk analysis, sub‑second recovery mechanisms, large‑scale deployment strategies, and fault‑transfer capabilities to consistently meet a five‑nines (99.999%) service level agreement.

Cache Middleware · DataMesh · SLA

0 likes · 12 min read

Java Architect Essentials

Apr 8, 2024 · Operations

Improving System Availability: Fault Prevention, Real‑time Detection, and Rapid Recovery

The article examines how a payment platform improves its 24/7 availability by preventing failures, detecting incidents in real time, and implementing rapid recovery measures such as dynamic routing, resource limits, monitoring, logging, and service degradation, while sharing practical Q&A insights.

High Availability · Monitoring · Operations

0 likes · 25 min read

Architects' Tech Alliance

Apr 6, 2024 · Artificial Intelligence

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

GPU clusters · LLM training · MegaScale

0 likes · 15 min read

Architect

Apr 4, 2024 · Backend Development

Mastering High Availability: 9 Essential Design Techniques for Scalable Systems

The article walks through nine practical techniques—system splitting, decoupling, asynchronous processing, retry, compensation, backup, multi‑active deployment, rate limiting, circuit breaking, and degradation—explaining why each is needed, how they are implemented in real‑world microservice architectures, and what trade‑offs to consider.

High Availability · Microservices · System Design

0 likes · 13 min read

Architecture & Thinking

Mar 5, 2024 · Databases

How Database Middleware Solves High‑Traffic Challenges: Connection Pools, Sharding, and More

This article examines how database middleware tackles the demanding needs of large‑scale internet services by providing centralized connection‑pool management, transparent read‑write splitting, diverse load‑balancing algorithms, sharding support, automatic failover, security controls, comprehensive monitoring, and flexible backup‑recovery mechanisms.

Connection Pool · Monitoring · Sharding

0 likes · 9 min read

Linux Cloud Computing Practice

Mar 4, 2024 · Operations

Building a High‑Performance, Highly Available Membership System with ES, Redis & MySQL

To ensure the massive, multi‑platform membership service remains fast and reliable, this article details a multi‑center architecture using Elasticsearch for unified member data, Redis caching, and MySQL partitioning, along with traffic isolation, fault‑tolerant syncing, and fine‑grained flow‑control and degradation strategies.

Redis · fault tolerance · mysql

0 likes · 23 min read

Architect's Guide

Mar 2, 2024 · Fundamentals

RabbitMQ vs Kafka: Core Differences and When to Use Each

This article compares RabbitMQ and Apache Kafka across architecture, message ordering, routing, timing, retention, fault handling, scalability, and consumer complexity, and provides guidance on which platform suits specific use‑cases such as flexible routing, strict ordering, long‑term retention, or high throughput.

Message Ordering · Message Queue · RabbitMQ

0 likes · 19 min read

Architecture & Thinking

Dec 25, 2023 · Databases

How to Detect, Analyze, and Prevent Redis Hot Keys to Avoid Outages

This article explains what Redis hot keys are, the scenarios that generate them, their risks, and provides practical monitoring methods and mitigation strategies—including cache pre‑warming, distributed caching, rate limiting, and secondary caches—to keep production systems stable.

Hot Key · Monitoring · fault tolerance

0 likes · 11 min read

ITPUB

Dec 5, 2023 · Cloud Native

Prevent Massive K8s Outages: Scale, Redundancy, and Embrace Restarts

The article analyzes the November 27 Didi outage caused by an aggressive Kubernetes upgrade, then presents four engineering principles—controlling cluster size, eliminating single points of failure, treating restarts as normal, and decoupling data and control planes—to build more resilient cloud‑native systems.

Cloud Native · cluster upgrade · fault tolerance

0 likes · 13 min read

Spring Full-Stack Practical Cases

Dec 1, 2023 · Backend Development

Resilience4j Essentials: Circuit Breaker, TimeLimiter, Bulkhead & RateLimiter

This article introduces Resilience4j, a lightweight fault‑tolerance library for Spring Boot, explaining its core decorators—CircuitBreaker, TimeLimiter, Bulkhead, and RateLimiter—along with configuration examples, annotation usage, fallback handling, and practical test code to improve system stability and resilience.

Java · Resilience4j · Spring Boot

0 likes · 16 min read

Open Source Linux

Nov 23, 2023 · Operations

Mastering RAID Fault Tolerance: Consistency, Hot Spare, Rebuild & More

This article explains RAID fault‑tolerance mechanisms, including the redundancy levels of RAID 1, 5, 6, 10, 50, and 60, and covers consistency checks, hot‑spare and emergency backup, data reconstruction, read/write policies, power‑loss protection, striping, mirroring, foreign configurations, energy saving, and JBOD, providing a comprehensive guide for storage administrators.

Data Protection · RAID · Storage Management

0 likes · 15 min read

Open Source Linux

Nov 21, 2023 · Fundamentals

Understanding RAID Levels: Choose the Right Storage Solution for Performance and Reliability

RAID combines multiple physical disks into virtual drives. This article walks through RAID 0, 1, 1ADM, 5, 6, 10, 10ADM, 1E, 50, and 60, showing how each level balances performance, fault tolerance, and capacity, with detailed processing flows, storage calculations, and best‑practice recommendations for deployment.

RAID · data redundancy · fault tolerance

0 likes · 20 min read

Sanyou's Java Diary

Nov 20, 2023 · Operations

Mastering High Availability: 10 Essential Design Techniques for Scalable Systems

This article outlines ten practical techniques—including system splitting, decoupling, asynchronous processing, retry strategies, compensation, backup, multi‑active deployment, isolation, rate limiting, circuit breaking, and degradation—to help engineers design highly available, resilient architectures for large‑scale internet applications.

Microservices · System Design · fault tolerance

0 likes · 14 min read

Architects' Tech Alliance

Nov 6, 2023 · Fundamentals

Comprehensive Guide to RAID Levels: Architecture, Fault Tolerance, Performance, and Capacity

This article provides a comprehensive overview of RAID technology, explaining disk groups, virtual disks, detailed characteristics of RAID 0, 1, 1ADM, 5, 6, 10, 10ADM, 1E, 50, and 60, and compares their fault tolerance, I/O performance, and storage capacity considerations.

RAID · capacity · data redundancy

0 likes · 26 min read

Architects' Tech Alliance

Nov 5, 2023 · Fundamentals

Understanding RAID Fault Tolerance, Consistency Checks, Hot Spare, Rebuild, and Data Protection Features

This article explains RAID fault‑tolerance mechanisms, consistency verification, hot‑spare and emergency backup, rebuild processes, virtual‑disk read/write policies, power‑loss protection, disk striping, mirroring, foreign configurations, power‑saving and pass‑through features, providing a comprehensive overview of modern storage system capabilities.

RAID · disk striping · fault tolerance

0 likes · 16 min read

Alibaba Cloud Native

Oct 13, 2023 · Cloud Native

Why Microservice Governance Matters and How OpenSergo Tackles Its Challenges

The article explains the stability challenges of modern microservice architectures, outlines the three governance domains (development/testing, change, runtime), and introduces OpenSergo’s open, cloud‑native specifications, control‑plane, and data‑plane solutions for traffic routing, gray‑release, and fault‑tolerance.

OpenSergo · fault tolerance · gray-release

0 likes · 18 min read

dbaplus Community

Oct 7, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down high‑availability system design into six critical layers (architecture, development standards, application services, storage, product safeguards, and operations), offering concrete practices such as capacity planning, fault‑tolerant patterns, monitoring, and incident‑response strategies to achieve four‑nines (99.99%) uptime.

Operations · System Design · capacity planning

0 likes · 26 min read

MaGe Linux Operations

Aug 29, 2023 · Operations

How to Effectively Monitor and Recover a Kafka Cluster

This guide explains essential Kafka monitoring techniques, third‑party tools, custom scripts, key metrics, and practical strategies for high availability, fault detection, rapid recovery, and ongoing testing to keep Kafka clusters stable and performant.

Operations · distributed-systems · fault tolerance

0 likes · 7 min read

JD Retail Technology

Aug 14, 2023 · Backend Development

Implementing a Lightweight Distributed Scheduling Solution to Replace TBSchedule

To improve stability and reduce costs during high‑traffic events, we replaced the Zookeeper‑dependent TBSchedule framework with a lightweight, Redis‑based distributed scheduler that decentralizes task execution, uses thread pools instead of timers, and supports dynamic scaling and seamless degradation for reliable order processing.

Distributed Scheduling · Microservices · Redis

0 likes · 4 min read

JD Cloud Developers

Aug 9, 2023 · Backend Development

Mastering Hystrix: Implementing Circuit Breakers in Spring Cloud Microservices

This article explains why circuit breakers are essential in microservice architectures, introduces Netflix's Hystrix library, details its design principles, shows step‑by‑step demos for Ribbon and Feign integration, and covers dashboards, Turbine, isolation strategies, request merging, caching, and related Spring Boot SPI mechanisms.

Hystrix · Java · Microservices

0 likes · 29 min read

Architect

Aug 4, 2023 · Fundamentals

What Exactly Is Software Architecture? A Deep Dive into Systems, Modules, and Design Principles

The article systematically defines software architecture, distinguishes systems, subsystems, modules, and components, compares frameworks with architectures, explores TOGAF and RUP classifications, traces the evolution from monoliths to micro‑services, and presents concrete design principles and common pitfalls for building scalable, maintainable systems.

Microservices · System Design · TOGAF

0 likes · 25 min read

Architects Research Society

Jul 13, 2023 · Operations

Five Patterns to Make Your Microservice Fault‑Tolerant

This article explains essential fault‑tolerance patterns for microservices—including timeouts, retries, circuit breakers, distributed deadlines, and rate limiting—detailing their basic forms, drawbacks, and practical implementation strategies to improve reliability and prevent cascading failures.

Microservices · circuit breaker · fault tolerance

0 likes · 12 min read

Ops Development Stories

Jun 6, 2023 · Operations

When Fancy PPTs Meet Real Outages: Lessons from a Major E‑commerce Crash

The article examines Vipshop's massive March 2023 outage caused by an IDC cooling failure, critiques superficial PPT‑driven reliability claims, and offers practical SRE insights on fault drills, true multi‑active architectures, and how ops teams can gain influence despite budget constraints.

Operations · SRE · fault tolerance

0 likes · 7 min read
