Tagged articles

fault-recovery

20 articles · Page 1 of 1

Dec 20, 2025 · Operations

How RocketMQ Achieves High‑Availability Storage and Fast Fault Recovery

This article explains how RocketMQ keeps message storage durable, consistent, and highly available through fixed‑length append‑only files, efficient index rebuilding, checkpoint tracking, and configurable master‑slave replication with synchronous and asynchronous HA modes. It also covers detailed recovery steps, performance trade‑offs, and practical operational guidelines for robust fault tolerance.

Operations · RocketMQ · fault-recovery

0 likes · 10 min read

Aikesheng Open Source Community

Nov 25, 2025 · Databases

How to Diagnose and Recover MySQL InnoDB Cluster Failures: Real‑World Scenarios

This article walks MySQL DBAs through common MySQL InnoDB Cluster fault scenarios—node restarts, crashes, network partitions, and full‑cluster reboots—providing step‑by‑step commands, status outputs, recovery actions, and impact analysis to ensure high availability and data safety.

Database operations · High Availability · InnoDB Cluster

0 likes · 26 min read

AI Large Model Application Practice

Jun 3, 2025 · Backend Development

Scaling Human‑in‑the‑Loop Agents to Distributed Environments with Robust Fault Recovery

This article explains how to extend a single‑process Human‑in‑the‑Loop (HITL) agent to a distributed, multi‑user API service using FastAPI, detailing session management, interrupt handling, client and server fault‑recovery strategies, and providing concrete code snippets and architectural diagrams.

LangGraph · distributed systems · fault-recovery

0 likes · 16 min read

360 Zhihui Cloud Developer

Sep 19, 2024 · Operations

How TAI Platform Optimizes Large‑Model Scheduling and Fault Recovery on Kubernetes

This article explains how the TAI platform leverages Kubernetes and Volcano to tackle fault, efficiency, and usability challenges in large‑model training and inference, detailing custom resources, automated fault detection, and advanced scheduling strategies that boost resource utilization and performance.

Scheduling · Volcano · fault-recovery

0 likes · 9 min read

Sanyou's Java Diary

Sep 7, 2023 · Operations

How to Keep Kafka Stable: Proven Practices for Prevention, Monitoring, and Recovery

This comprehensive guide explains how to ensure Kafka stability by applying proactive prevention, continuous runtime monitoring, and effective fault‑resolution strategies, covering producer and consumer tuning, cluster configuration, performance optimization, alerting, and idempotent consumption to prevent message loss and service disruption.

Performance Tuning · Stability · fault-recovery

0 likes · 30 min read

Architect's Guide

Mar 14, 2023 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

This article outlines a call‑center outage scenario, explains how operators diagnose and resolve the issue, and presents a comprehensive set of fault‑handling methods, monitoring enhancements, and emergency‑plan recommendations aimed at faster recovery and eventual self‑healing of services.

Incident Management · call center · fault-recovery

0 likes · 12 min read

Ops Development Stories

Jun 16, 2022 · Operations

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

This article outlines a comprehensive approach to handling call‑center incidents, covering fault boundary definition, emergency recovery actions, rapid root‑cause localization, enhanced monitoring strategies, clear alerting, proactive automation, and the creation of concise, regularly exercised emergency response plans.

Incident Management · Monitoring · Operations

0 likes · 14 min read

Top Architect

Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

Incident Management · Monitoring · Operations

0 likes · 12 min read

Architecture Digest

Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

Incident Management · Monitoring · Operations

0 likes · 13 min read

Open Source Linux

Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Automation · Incident Management · Monitoring

0 likes · 13 min read

Java Interview Crash Guide

Mar 1, 2022 · Operations

How to Accelerate Call Center Incident Recovery with Proactive Monitoring

This article outlines a comprehensive approach to handling call‑center system failures, covering rapid fault identification, emergency recovery steps, enhanced monitoring visualisation, and the creation of sustainable, automated incident‑response plans to improve overall operational resilience.

Automation · Incident Management · call center

0 likes · 13 min read

dbaplus Community

Jan 29, 2022 · Operations

Accelerating Call Center Incident Recovery: Practical Fault Handling and Monitoring Strategies

This article walks through a real call‑center outage scenario, outlines step‑by‑step fault identification, emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent, automated event handling to help operations teams resolve incidents faster and more reliably.

Operations · call center · emergency plan

0 likes · 14 min read

Big Data Technology Architecture

Aug 12, 2021 · Databases

Understanding HBase HLog and Fault Recovery Mechanisms

This article explains HBase's write path using Memstore and HLog, details the lifecycle of HLog including construction, rolling, expiration, and deletion, and thoroughly analyzes the three fault‑recovery models—Log Splitting, Distributed Log Splitting, and Distributed Log Replay—highlighting their processes, advantages, and configuration nuances.

HBase · HLog · Log Splitting

0 likes · 14 min read

Efficient Ops

Jul 19, 2021 · Operations

Mastering Call Center Incident Management: Fast Fault Recovery and Proactive Monitoring

Learn practical strategies to accelerate call‑center fault recovery, from rapid root‑cause identification and emergency actions to enhanced monitoring, self‑healing goals, and comprehensive emergency plans that empower ops teams to resolve incidents efficiently and prevent future outages.

call center · emergency plan · fault-recovery

0 likes · 13 min read

Alibaba Cloud Developer

May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations · SRE · fault-recovery

0 likes · 10 min read

ITPUB

Mar 25, 2021 · Backend Development

Why a TCC Framework Must Own the Spring TransactionManager

This article examines the challenges of building a generic TCC distributed‑transaction framework on Spring, explaining why every TCC service must participate in RM‑local transactions, why the framework should intercept the Spring TransactionManager, how to handle fault recovery, idempotency of Confirm/Cancel, and the pitfalls of relying on Cancel for rollback, concluding with practical recommendations.

Spring · TCC · TransactionManager

0 likes · 18 min read

Top Architect

Sep 12, 2020 · Backend Development

Key Considerations for Building a Generic TCC Distributed Transaction Framework

This article explains the essential design principles of a TCC (Try‑Confirm‑Cancel) distributed transaction framework, covering the necessity of RM local transactions, integration with Spring's TransactionManager, fault‑recovery mechanisms, idempotency guarantees, and handling of parallel Try/Confirm/Cancel operations.

Spring · TCC · TransactionManager

0 likes · 22 min read

Java Backend Technology

Sep 9, 2019 · Backend Development

Mastering TCC Distributed Transactions: Key Design Principles and Pitfalls

This article explores the complexities of building a generic TCC distributed transaction framework, emphasizing the need for RM-local transaction integration, Spring TransactionManager takeover, fault‑recovery mechanisms, idempotency guarantees, and proper handling of Try/Confirm/Cancel phases to ensure global consistency.

Spring · TCC · Transaction Management

0 likes · 18 min read

Alibaba Cloud Infrastructure

Dec 15, 2017 · Operations

Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven

The article describes Alibaba's end‑to‑end automated fault recovery system for its massive network, covering extensive data collection, Spark‑based event processing, flexible alerting with Siddhi, alert convergence using PageRank, and scripted recovery actions to achieve high availability during the Double Eleven traffic surge.

Automation · Big Data · Operations

0 likes · 9 min read

Efficient Ops

Jul 28, 2015 · Operations

How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch

This article explains how Tencent Game's BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, leveraging big‑data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.

Automation · Big Data · fault-recovery

0 likes · 19 min read
