Tagged articles
20 articles
Ray's Galactic Tech
Dec 20, 2025 · Operations

How RocketMQ Achieves High‑Availability Storage and Fast Fault Recovery

RocketMQ ensures durable, consistent, and highly available message storage through fixed‑length append‑only files, efficient index rebuilding, checkpoint tracking, and configurable master‑slave replication with both synchronous and asynchronous HA modes. The article also covers detailed recovery steps, performance trade‑offs, and practical operational guidelines for robust fault tolerance.

Operations · RocketMQ · fault-recovery
0 likes · 10 min read
AI Large Model Application Practice
Jun 3, 2025 · Backend Development

Scaling Human‑in‑the‑Loop Agents to Distributed Environments with Robust Fault Recovery

This article explains how to extend a single‑process Human‑in‑the‑Loop (HITL) agent to a distributed, multi‑user API service using FastAPI, detailing session management, interrupt handling, client and server fault‑recovery strategies, and providing concrete code snippets and architectural diagrams.

Distributed Systems · Human-in-the-Loop · LangGraph
0 likes · 16 min read
Sanyou's Java Diary
Sep 7, 2023 · Operations

How to Keep Kafka Stable: Proven Practices for Prevention, Monitoring, and Recovery

This comprehensive guide explains how to ensure Kafka stability by applying proactive prevention, continuous runtime monitoring, and effective fault‑resolution strategies, covering producer and consumer tuning, cluster configuration, performance optimization, alerting, and idempotent consumption to prevent message loss and service disruption.

Kafka · fault-recovery · performance tuning
0 likes · 30 min read
Architect's Guide
Mar 14, 2023 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

This article outlines a call‑center outage scenario, explains how operators diagnose and resolve the issue, and presents a comprehensive set of fault‑handling methods, monitoring enhancements, and emergency‑plan recommendations aimed at faster recovery and eventual self‑healing of services.

call center · fault-recovery · incident management
0 likes · 12 min read
Ops Development Stories
Jun 16, 2022 · Operations

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

This article outlines a comprehensive approach to handling call‑center incidents, covering fault boundary definition, emergency recovery actions, rapid root‑cause localization, enhanced monitoring strategies, clear alerting, proactive automation, and the creation of concise, regularly exercised emergency response plans.

Operations · call center · fault-recovery
0 likes · 14 min read
Top Architect
Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

Operations · call center · emergency plan
0 likes · 12 min read
Architecture Digest
Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

Operations · call center · emergency procedures
0 likes · 13 min read
Open Source Linux
Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Operations · automation · call center
0 likes · 13 min read
dbaplus Community
Jan 29, 2022 · Operations

Accelerating Call Center Incident Recovery: Practical Fault Handling and Monitoring Strategies

This article walks through a real call‑center outage scenario, outlines step‑by‑step fault identification, emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent, automated event handling to help operations teams resolve incidents faster and more reliably.

Operations · call center · emergency plan
0 likes · 14 min read
Big Data Technology Architecture
Aug 12, 2021 · Databases

Understanding HBase HLog and Fault Recovery Mechanisms

This article explains HBase's write path using Memstore and HLog, details the lifecycle of HLog including construction, rolling, expiration, and deletion, and thoroughly analyzes the three fault‑recovery models—Log Splitting, Distributed Log Splitting, and Distributed Log Replay—highlighting their processes, advantages, and configuration nuances.

Distributed Systems · HBase · HLog
0 likes · 14 min read
Alibaba Cloud Developer
May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations · SRE · Team Roles
0 likes · 10 min read
ITPUB
Mar 25, 2021 · Backend Development

Why a TCC Framework Must Own the Spring TransactionManager

This article examines the challenges of building a generic TCC distributed‑transaction framework on Spring, explaining why every TCC service must participate in RM‑local transactions, why the framework should intercept the Spring TransactionManager, how to handle fault recovery, idempotency of Confirm/Cancel, and the pitfalls of relying on Cancel for rollback, concluding with practical recommendations.

Distributed Transactions · Idempotency · TransactionManager
0 likes · 18 min read
Top Architect
Sep 12, 2020 · Backend Development

Key Considerations for Building a Generic TCC Distributed Transaction Framework

This article explains the essential design principles of a TCC (Try‑Confirm‑Cancel) distributed transaction framework, covering the necessity of RM local transactions, integration with Spring's TransactionManager, fault‑recovery mechanisms, idempotency guarantees, and handling of parallel Try/Confirm/Cancel operations.

Distributed Transactions · TransactionManager · fault-recovery
0 likes · 22 min read
Java Backend Technology
Sep 9, 2019 · Backend Development

Mastering TCC Distributed Transactions: Key Design Principles and Pitfalls

This article explores the complexities of building a generic TCC distributed transaction framework, emphasizing the need for RM-local transaction integration, Spring TransactionManager takeover, fault‑recovery mechanisms, idempotency guarantees, and proper handling of Try/Confirm/Cancel phases to ensure global consistency.

fault-recovery · spring · tcc
0 likes · 18 min read
Alibaba Cloud Infrastructure
Dec 15, 2017 · Operations

Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven

The article describes Alibaba's end‑to‑end automated fault recovery system for its massive network, covering extensive data collection, Spark‑based event processing, flexible alerting with Siddhi, alert convergence using PageRank, and scripted recovery actions to achieve high availability during the Double Eleven traffic surge.

Big Data · Network Monitoring · Operations
0 likes · 9 min read
Efficient Ops
Jul 28, 2015 · Operations

How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch

This article explains how Tencent Game's BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, leveraging big‑data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.

Big Data · automation · fault-recovery
0 likes · 19 min read