Tagged articles
20 articles
Linux Cloud Computing Practice
Nov 8, 2025 · Operations

40+ Common Linux Ops Faults and How to Diagnose Them

Linux system administrators often encounter diverse failures, and this guide compiles over 40 distinct fault types—including system, network, hardware, and software issues—offering practical troubleshooting steps to help engineers quickly diagnose and resolve problems while building a solid knowledge base.

Fault Diagnosis, linux, troubleshooting
2 min read
Nightwalker Tech
Aug 28, 2025 · Operations

How to Diagnose and Fix E‑commerce Order Failures with Observability, APM, and Distributed Tracing

This article explains the hierarchical relationship between APM, distributed tracing, and observability, walks through a real Double‑11 e‑commerce incident, and demonstrates how a well‑designed observability stack can pinpoint the root cause, apply emergency fixes, and restore system performance within minutes.

APM, Distributed Tracing, Fault Diagnosis
16 min read
Alibaba Cloud Observability
Dec 30, 2024 · Operations

Alibaba Cloud’s Mint Tracing Framework and FAMOS Diagnosis Earn Top‑Conference Spot

Alibaba Cloud’s recent research breakthroughs—Mint, a cost‑efficient tracing framework that captures all request flows while drastically cutting storage and network overhead, and FAMOS, a multi‑modal fault‑diagnosis method for microservice systems—have been accepted to the prestigious ASPLOS and ICSE conferences, marking the first top‑conference publications in observability for the company.

Fault Diagnosis, Microservices, Observability
6 min read
Aikesheng Open Source Community
Nov 12, 2024 · Artificial Intelligence

ChatDBA: An AI‑Powered Database Fault Diagnosis Assistant Using Large Language Models

ChatDBA is a conversational AI system built by Shanghai Aikesheng that employs large language models and Retrieval‑Augmented Generation to help database administrators diagnose faults, learn domain knowledge, and generate or optimize SQL, with a redesigned architecture that addresses early‑stage shortcomings and outlines future enhancements.

ChatDBA, Fault Diagnosis, Knowledge Base
10 min read
DataFunSummit
Nov 8, 2024 · Artificial Intelligence

ChatDBA: An AI‑Powered Database Fault Diagnosis Assistant Using Retrieval‑Augmented Generation

ChatDBA, developed by Shanghai Aikesheng, is an AI-driven database operation assistant that leverages large language models and Retrieval‑Augmented Generation to provide fault diagnosis, knowledge learning, SQL generation and optimization, addressing challenges such as vague outputs, complex troubleshooting logic, and memory management through a structured architecture and multi‑modal retrieval strategies.

AI, Fault Diagnosis, RAG
10 min read
Baidu Geek Talk
Mar 6, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Large‑Model Training with Real‑Time Observability and Fault Diagnosis

The article explains why collective communication is critical for distributed large‑model training, outlines the new requirements for system reliability, and introduces Baidu’s Collective Communication Library (BCCL), detailing its enhanced observability, fault‑diagnosis, stability, and performance optimizations that raise effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure, Distributed Training, Fault Diagnosis
11 min read
Baidu Intelligent Cloud Tech Hub
Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure, Distributed Training, Fault Diagnosis
11 min read
Java Captain
Jan 15, 2024 · Operations

Java Distributed Tracing: Concepts, Principles, Implementation, and Application Scenarios

This article explains the concept of distributed tracing, outlines its underlying principles in Java, details step‑by‑step implementation using popular SDKs, and describes common application scenarios such as performance monitoring, fault diagnosis, complex event handling, traffic analysis, and system optimization.

Distributed Tracing, Fault Diagnosis, Java
5 min read
Ops Development Stories
Nov 20, 2023 · Operations

How eBPF Powers Next‑Gen Observability and Fault Diagnosis in Kubernetes

At KubeCon China 2023, experts Liu Kai and Dong Shandong presented a three‑part deep dive into Kubernetes observability challenges, demonstrating how eBPF enables comprehensive data collection across all stack layers, seamless integration, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and fault‑tree methods.

Cloud Native, Fault Diagnosis, Kubernetes
20 min read
Alibaba Cloud Native
Nov 18, 2023 · Cloud Native

How eBPF Powers Next‑Gen Observability and Root‑Cause Analysis in Kubernetes

This talk explains the three major observability challenges in Kubernetes, demonstrates how eBPF enables comprehensive, low‑overhead data collection across all stack layers, and outlines a practical workflow that combines architecture awareness, application‑level metrics, and fault‑tree analysis to achieve automated root‑cause diagnosis.

Fault Diagnosis, Kubernetes, eBPF
21 min read
Efficient Ops
Jun 7, 2023 · Artificial Intelligence

How Guangdong Mobile Scaled AIOps: From Manual Ops to Intelligent Automation

This article details the evolution of Guangdong Mobile's IT systems and operations, explains its four‑domain architecture, chronicles the AIOps adoption timeline, showcases intelligent anomaly detection, change assessment, fault diagnosis, and operation robots, and shares practical rollout methods and a future outlook for AI‑driven IT operations.

Automation, Fault Diagnosis, IT Operations
19 min read
Ctrip Technology
Feb 2, 2023 · Databases

MySQL to OceanBase Migration: Evaluation Tools, Migration Process, Monitoring, and Automated Fault Diagnosis

This article details Ctrip's experience migrating MySQL workloads to the distributed OceanBase database, covering the design of an assessment tool, a one‑click migration workflow, comprehensive monitoring dashboards, automated fault‑diagnosis pipelines, encountered compatibility issues, and future roadmap for the platform.

Database Monitoring, Fault Diagnosis, MySQL Migration
17 min read
DataFunSummit
Dec 14, 2021 · Big Data

Data Map: Background, Definition, and Youzan’s Practical Implementation

This article introduces the concept of a data map, explains its background and goals, describes Youzan’s end‑to‑end data‑map practice—including full data lineage, search, management, link analysis, impact estimation, and optimization—and concludes with a summary and future outlook.

Big Data, Data Governance, Data Lineage
16 min read
dbaplus Community
Aug 13, 2019 · Big Data

How Xianyu Built a Sub‑3‑Second Real‑Time Data Pipeline for Rapid Fault Diagnosis

Xianyu’s production environment grew complex, prompting the creation of a high‑performance, sub‑3‑second real‑time data processing pipeline that ingests logs and metrics, uses Alibaba’s Logtail, LogHub, and Blink (enhanced Flink) for collection, transport, pre‑processing, computation, and persistent graph‑based fault analysis.

Fault Diagnosis, Performance Optimization, blink
13 min read
Architects' Tech Alliance
Mar 23, 2019 · Operations

Common Causes of Fiber‑Optic Cable Interruptions and Repair Guidelines

The article explains the March 23 Shanghai fiber‑optic outage that disrupted several Tencent apps and outlines eight typical reasons for fiber‑cable faults—such as construction cuts, vehicle accidents, fires, pole collisions, theft, animal damage, aging, and natural disasters—along with practical on‑site repair procedures.

Fault Diagnosis, OTDR testing, cable repair
8 min read
Efficient Ops
Jul 31, 2018 · Operations

How Ctrip Boosted Efficiency with AIOps: Real-World AI Operations Cases

This article explores Ctrip's adoption of AIOps—AI‑driven IT operations—detailing its concepts, typical use cases such as anomaly detection, intelligent fault diagnosis, and resource‑utilization improvement, and demonstrating how machine‑learning models like ARMA, FFT, and SVM have transformed operational efficiency, availability, and cost.

Fault Diagnosis, Resource Optimization, aiops
15 min read
360 Zhihui Cloud Developer
Nov 14, 2017 · Operations

Unlocking Scalable Network Automation: Lessons from 360’s Ops Strategy

This article explores how rapid growth in network devices drives the need for comprehensive automation—covering script‑based tasks, zero‑touch provisioning, orchestration with OpenStack, device selection criteria, fault diagnosis, and monitoring—to keep operations ahead of business demands.

Fault Diagnosis, Network Monitoring, OpenStack integration
10 min read
Big Data and Microservices
Apr 1, 2016 · Operations

How to Build a Business‑Transaction‑Centric IT Operations Monitoring System

This article outlines a comprehensive approach for designing an IT operations monitoring platform that focuses on real‑time business transaction metrics, automatic topology discovery, event‑transaction correlation, deep component diagnostics, and unified data processing to improve availability, performance, and fault‑resolution speed in large‑scale data centers.

Automation, Business Transaction, Fault Diagnosis
15 min read