Tag

fault recovery

360 Zhihui Cloud Developer
Sep 19, 2024 · Operations

How TAI Platform Optimizes Large‑Model Scheduling and Fault Recovery on Kubernetes

This article explains how the TAI platform leverages Kubernetes and Volcano to tackle fault‑handling, efficiency, and usability challenges in large‑model training and inference, detailing custom resources, automated fault detection, and advanced scheduling strategies that boost resource utilization and performance.

AI infrastructure · Kubernetes · Large Models
0 likes · 9 min read

Sanyou's Java Diary
Sep 7, 2023 · Operations

How to Keep Kafka Stable: Proven Practices for Prevention, Monitoring, and Recovery

This comprehensive guide explains how to ensure Kafka stability by applying proactive prevention, continuous runtime monitoring, and effective fault‑resolution strategies, covering producer and consumer tuning, cluster configuration, performance optimization, alerting, and idempotent consumption to prevent message loss and service disruption.

Kafka · Performance Tuning · best practices
0 likes · 30 min read

Architect's Guide
Mar 14, 2023 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

This article outlines a call‑center outage scenario, explains how operators diagnose and resolve the issue, and presents a comprehensive set of fault‑handling methods, monitoring enhancements, and emergency‑plan recommendations aimed at faster recovery and eventual self‑healing of services.

call center · fault recovery · incident management
0 likes · 12 min read

Top Architect
Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

call center · emergency plan · fault recovery
0 likes · 12 min read

Architecture Digest
Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

call center · emergency procedures · fault recovery
0 likes · 13 min read

Big Data Technology Architecture
Aug 12, 2021 · Databases

Understanding HBase HLog and Fault Recovery Mechanisms

This article explains HBase's write path using Memstore and HLog, details the lifecycle of HLog including construction, rolling, expiration, and deletion, and thoroughly analyzes the three fault‑recovery models—Log Splitting, Distributed Log Splitting, and Distributed Log Replay—highlighting their processes, advantages, and configuration nuances.

Big Data · Distributed Systems · HBase
0 likes · 14 min read

Efficient Ops
Jul 19, 2021 · Operations

Mastering Call Center Incident Management: Fast Fault Recovery and Proactive Monitoring

Learn practical strategies to accelerate call‑center fault recovery, from rapid root‑cause identification and emergency actions to enhanced monitoring, self‑healing goals, and comprehensive emergency plans that empower ops teams to resolve incidents efficiently and prevent future outages.

call center · emergency plan · fault recovery
0 likes · 13 min read

Top Architect
Sep 12, 2020 · Backend Development

Key Considerations for Building a Generic TCC Distributed Transaction Framework

This article explains the essential design principles of a TCC (Try‑Confirm‑Cancel) distributed transaction framework, covering the necessity of RM local transactions, integration with Spring's TransactionManager, fault‑recovery mechanisms, idempotency guarantees, and handling of parallel Try/Confirm/Cancel operations.

Idempotency · Spring · TransactionManager
0 likes · 22 min read

Alibaba Cloud Infrastructure
Dec 15, 2017 · Operations

Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven

The article describes Alibaba's end‑to‑end automated fault recovery system for its massive network, covering extensive data collection, Spark‑based event processing, flexible alerting with Siddhi, alert convergence using PageRank, and scripted recovery actions to achieve high availability during the Double Eleven traffic surge.

Big Data · automation · fault recovery
0 likes · 9 min read

Efficient Ops
Jul 28, 2015 · Operations

How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch

This article explains how Tencent Game's BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, leveraging big‑data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.

Big Data · DevOps · automation
0 likes · 19 min read