Tagged articles

fault-analysis

12 articles · Page 1 of 1

Dec 6, 2025 · Cloud Native

How Graph Queries Transform Cloud‑Native Observability and Fault Diagnosis

In modern cloud‑native systems, treating each service, container, or middleware component as an isolated entity hides the essential connections between them. This article explains how graph‑based data models and query languages such as graph‑match and Cypher enable fault‑impact analysis, topology insights, and performance‑optimized troubleshooting.

Cypher · Observability · fault-analysis

0 likes · 28 min read


Tech Architecture Stories

Aug 8, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.

GRIA · PDCA · Reliability

0 likes · 22 min read


vivo Internet Technology

Jul 19, 2023 · Databases

Analysis of Service Avalanche Caused by Misconfigured Jedis Parameters During Redis Cluster Master‑Slave Switch

A service‑wide avalanche occurred when a Redis 3.x master‑slave failover coincided with Jedis’ default 2‑second connection timeout and six retry attempts, causing latencies of up to 60 seconds; setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to two limited latency to about one second and prevented cascading failures.

Connection Retry · Jedis · Performance

0 likes · 13 min read


Architecture Digest

May 25, 2022 · Big Data

Kafka Cluster Deployment Architecture, Fault Analysis, and Default Partitioner Behavior

This article explains the design of a multi‑tenant Kafka cluster, the business onboarding process, detailed fault symptoms and monitoring metrics, analyzes the root cause of a topic‑wide traffic drop, and examines the default partitioner’s rules to propose mitigation recommendations.

Partitioner · big-data · fault-analysis

0 likes · 11 min read


vivo Internet Technology

May 18, 2022 · Backend Development

Kafka Cluster Fault Analysis: Root Cause and Cascading Failure Mechanism

A Kafka cluster at vivo suffered a total traffic drop across a resource group when a broker’s disk failed, because the default producer partitioner still hashed keys to the failed partition, exhausting client buffers and blocking all healthy partitions, prompting recommendations to avoid keys or use custom partitioners.

Kafka · Performance Optimization · Troubleshooting

0 likes · 9 min read


Open Source Linux

Dec 2, 2021 · Operations

Master PC Power-On and Fault Diagnosis: A Complete Technician's Guide

This comprehensive guide details the definitions, common symptoms, likely hardware components, and step‑by‑step diagnostic procedures for power‑on, startup, shutdown, disk, display, installation, application, network, internet, peripheral, audio‑video, and compatibility faults in personal computers, illustrated with real‑world case studies and practical troubleshooting tips.

Hardware Troubleshooting · IT Operations · PC repair

0 likes · 71 min read


Baidu Intelligent Testing

Aug 5, 2021 · Operations

Baidu Search Stability Issue Analysis: Automated Fault Detection and Resolution Techniques

This article details Baidu Search's high‑availability engineering, describing eight major challenges in fault analysis and the corresponding innovations—index mirroring, streaming analysis, comprehensive label sets, feature engineering, query reconstruction, intelligent ranking, timeline analysis, and chaos engineering—that together enable near‑real‑time, 99% accurate detection and mitigation of search service failures.

Big Data · Reliability · Search

0 likes · 13 min read


Baidu Geek Talk

Jul 5, 2021 · Operations

Automated and Intelligent Analysis of Baidu Search Stability Issues

The team automated Baidu Search fault diagnosis by building a side‑index for instant log lookup, streaming incremental analysis, exhaustive rule templates, feature‑engineering pipelines, query‑scene reconstruction, entropy‑based ranking, per‑second timeline views, and chaos‑engineered fault injection, achieving near‑99% accuracy and second‑level, module‑granular stability tracing.

Observability · Search Stability · chaos engineering

0 likes · 15 min read


dbaplus Community

Mar 23, 2021 · Operations

Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage

A production RocketMQ cluster suffered 10 minutes of message‑send timeouts after a broker’s memory failure caused a server reboot. The post analyzes route registration, the name‑server removal logic, client‑side timeout handling, and the root cause of the name‑server “dead‑lock”, and proposes deployment and code‑level fixes to prevent recurrence.

Operations · RocketMQ · best-practices

0 likes · 9 min read


iQIYI Technical Product Team

Nov 13, 2020 · Operations

Building and Optimizing a Consul‑Based Service Registry for iQIYI's Microservice Platform

iQIYI’s Consul‑based service registry, tightly integrated with its QAE container platform and API gateway, suffered a multi‑DC outage caused by network jitter and a metrics‑library lock‑contention bug, which was resolved by upgrading Go, go‑metrics, and Raft, adding extensive monitoring, redundant DC registration, and dedicated per‑gateway Consul clusters to ensure continued stability and scalability.

Consul · Microservices · Operations

0 likes · 17 min read


Tencent Cloud Developer

May 16, 2019 · Operations

TDSQL Intelligent Operation Platform – Bianque Architecture and Practice

Bianque, TDSQL’s intelligent operation platform, automatically collects and indexes database metrics, applies a knowledge‑base‑driven analysis engine to diagnose availability, performance, and reliability issues, and issues risk warnings and optimization recommendations, dramatically cutting DBA effort and support tickets across Tencent’s cloud services.

Automation · Database operations · Intelligent Diagnosis

0 likes · 17 min read


Efficient Ops

May 17, 2016 · Operations

When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents—a network outage caused by an improperly configured portfast on a trunk link and an NFS failure that crippled an API service—then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.

ITIL · Incident Management · NFS

0 likes · 11 min read
