Tagged articles
12 articles
Alibaba Cloud Native
Dec 6, 2025 · Cloud Native

How Graph Queries Transform Cloud‑Native Observability and Fault Diagnosis

In modern cloud‑native systems, treating each service, container, or middleware instance as an isolated entity hides the connections between components. This article explains how graph‑based data models and query languages such as graph‑match and Cypher support fault‑impact analysis, topology insights, and performance‑optimized troubleshooting.

Cypher · fault-analysis · graph query
0 likes · 28 min read
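As a rough illustration of the fault‑impact analysis mentioned in the summary above, the sketch below walks a small dependency graph in plain Python instead of Cypher or graph‑match. The service names and the `CALLS` map are hypothetical and are not taken from the article; the function only shows the idea of following call edges backwards from a failed component.

```python
from collections import deque

# Hypothetical topology: each caller maps to the components it depends on.
CALLS = {
    "gateway": ["orders", "users"],
    "orders": ["mysql", "redis"],
    "users": ["redis"],
    "reports": ["mysql"],
}


def impacted_by(failed, calls):
    """Return every component that directly or transitively depends on `failed`."""
    # Reverse the edges so we can walk from a dependency to its callers.
    callers = {}
    for src, dsts in calls.items():
        for dst in dsts:
            callers.setdefault(dst, []).append(src)

    # Breadth-first search over the reversed edges.
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen
```

Calling `impacted_by("redis", CALLS)` on this toy graph returns the three services that reach `redis` directly or through another service. A graph query language expresses the same traversal declaratively, which is the approach the article discusses.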
Tech Architecture Stories
Aug 8, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.

GRIA · PDCA · Reliability
0 likes · 22 min read
vivo Internet Technology
Jul 19, 2023 · Databases

Analysis of Service Avalanche Caused by Misconfigured Jedis Parameters During Redis Cluster Master‑Slave Switch

A service‑wide avalanche occurred when a Redis 3.x master‑slave failover coincided with Jedis’ default 2‑second connection timeout and six retry attempts, causing latencies of up to 60 seconds. Setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to two limited latency to about one second and prevented cascading failures.

Cluster · Connection Retry · Jedis
0 likes · 13 min read
vivo Internet Technology
May 18, 2022 · Backend Development

Kafka Cluster Fault Analysis: Root Cause and Cascading Failure Mechanism

A Kafka cluster at vivo suffered a total traffic drop across a resource group when a broker’s disk failed. The default producer partitioner kept hashing keyed messages to the failed partition, which exhausted client buffers and blocked sends to all healthy partitions. The article recommends sending messages without keys or using a custom partitioner.

Distributed Systems · Kafka · fault-analysis
0 likes · 9 min read
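The mechanism in the summary above, where keyed messages keep landing on a failed partition, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kafka client code: `zlib.crc32` stands in for the murmur2 hash used by the Java client, the keyless branch is a simplified rotation (real keyless handling varies by client version), and the function names are hypothetical.

```python
import zlib


def keyed_partition(key, num_partitions):
    """Map a key to a partition by hash modulo the total partition count."""
    # Stand-in hash for illustration only; the Java client uses murmur2.
    return zlib.crc32(key) % num_partitions


def pick_partition(key, num_partitions, available, counter):
    """Choose a partition for a record.

    key: bytes or None; available: list of partitions that currently have a leader;
    counter: a running record count used to rotate keyless records.
    """
    if key is not None:
        # Keyed records: the partition is fixed by the hash, so availability
        # is never consulted and a failed partition keeps receiving records.
        return keyed_partition(key, num_partitions)
    # Keyless records: rotate over the partitions that are still available.
    return available[counter % len(available)]
```

In this sketch a keyed record maps to the same partition whether or not that partition appears in `available`, while a keyless record is always routed to one of the healthy partitions. A custom partitioner could apply the same availability check to keyed records, at the cost of per‑key ordering.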
Open Source Linux
Dec 2, 2021 · Operations

Master PC Power-On and Fault Diagnosis: A Complete Technician's Guide

This comprehensive guide details the definitions, common symptoms, likely hardware components, and step‑by‑step diagnostic procedures for power‑on, startup, shutdown, disk, display, installation, application, network, internet, peripheral, audio‑video, and compatibility faults in personal computers, illustrated with real‑world case studies and practical troubleshooting tips.

Hardware Troubleshooting · IT Operations · PC repair
0 likes · 71 min read
Baidu Intelligent Testing
Aug 5, 2021 · Operations

Baidu Search Stability Issue Analysis: Automated Fault Detection and Resolution Techniques

This article details Baidu Search's high‑availability engineering, describing eight major challenges in fault analysis and the corresponding innovations—index mirroring, streaming analysis, comprehensive label sets, feature engineering, query reconstruction, intelligent ranking, timeline analysis, and chaos engineering—that together enable near‑real‑time, 99% accurate detection and mitigation of search service failures.

Big Data · Reliability · Search
0 likes · 13 min read
Baidu Geek Talk
Jul 5, 2021 · Operations

Automated and Intelligent Analysis of Baidu Search Stability Issues

The team automated Baidu Search fault diagnosis by building a side‑index for instant log lookup, streaming incremental analysis, exhaustive rule templates, feature‑engineering pipelines, query‑scene reconstruction, entropy‑based ranking, per‑second timeline views, and chaos‑engineered fault injection, achieving near‑99% accuracy and second‑level, module‑granular stability tracing.

Search Stability · chaos engineering · fault-analysis
0 likes · 15 min read
dbaplus Community
Mar 23, 2021 · Operations

Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage

A production RocketMQ cluster suffered 10 minutes of message‑send timeouts after a broker’s memory failure caused a server reboot. The post analyzes route registration, the name‑server removal logic, client‑side timeout handling, and the root cause of the name‑server “dead‑lock”, then proposes deployment and code‑level fixes to prevent recurrence.

Messaging · Operations · RocketMQ
0 likes · 9 min read
iQIYI Technical Product Team
Nov 13, 2020 · Operations

Building and Optimizing a Consul‑Based Service Registry for iQIYI's Microservice Platform

iQIYI’s Consul‑based service registry, tightly integrated with its QAE container platform and API gateway, suffered a multi‑DC outage caused by network jitter and a metrics‑library lock‑contention bug, which was resolved by upgrading Go, go‑metrics, and Raft, adding extensive monitoring, redundant DC registration, and dedicated per‑gateway Consul clusters to ensure continued stability and scalability.

Consul · Microservices · Operations
0 likes · 17 min read
Tencent Cloud Developer
May 16, 2019 · Operations

TDSQL Intelligent Operation Platform – Bianque Architecture and Practice

Bianque, TDSQL’s intelligent operation platform, automatically collects and indexes database metrics and applies a knowledge‑base‑driven analysis engine to diagnose availability, performance, and reliability issues. It generates risk warnings and optimization recommendations, dramatically cutting DBA effort and support tickets across Tencent’s cloud services.

Database operations · Intelligent Diagnosis · Performance Monitoring
0 likes · 17 min read
Efficient Ops
May 17, 2016 · Operations

When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents: a network outage caused by portfast being improperly configured on a trunk link, and an NFS failure that crippled an API service. It then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.

ITIL · NFS · Operations
0 likes · 11 min read