
45 Must‑Read Operations Articles: Monitoring, Incident Recovery, Disaster Backup & Intelligent Ops

This curated collection gathers 45 essential operations articles covering monitoring alerts, fault recovery, disaster‑backup strategies, intelligent operations, tool selection, and additional expert insights, each linked to its original source for deeper technical reading.

dbaplus Community

Monitoring Alerts

Why I Use ES for Redis Monitoring Instead of Prometheus or Zabbix? – Li Meng –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788187&idx=1&sn=9363cc47966b4464e82f84904ac0b4b8

Bank Monitoring Alarm System Performance Boosted 50× Using Open‑Source Components – Pang Yaping –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650795501&idx=1&sn=7d2ea32c7f9c99d9408fb944d341bccf

Why Prometheus Can Replace Zabbix as the Ultimate Monitoring Tool – Chen Xiaoyu –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650781498&idx=1&sn=18cee3ae108faa53700bd1837734c3c5

Guangda Bank Monitoring Platform Practice, Including Tool and Architecture Choices – Pang Yaping –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650789404&idx=1&sn=7f588c50269a5c4567f39b25424fff52

Choosing Between Zabbix and Prometheus: A Comprehensive Guide – Shi Peng, Cai Xianghua, Liu Yu –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650796364&idx=1&sn=b27ec041a895d902b5b53566d659bd11

Enterprise Monitoring Platform Design and Implementation Based on Prometheus – Liu Hengtang –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650791880&idx=1&sn=4c14c21f58f16e3e22f29ee181cfe77b

From Zabbix to Prometheus: Tongcheng Yilong Database Monitoring System Practice – Yan Xiaoyu –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650782138&idx=1&sn=7f4e6fba5910f6cd28d40498d9ed79a9

Ele.me Monitoring System: Evolution Through Architectural Simplification – Huang Jie –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650783266&idx=1&sn=7ff441d483d16c541d7aabdac589f4a3

Solving ELK Pain Points: Almost All Issues Covered – alonghub –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776086&idx=1&sn=ff707f7dfcce0bf33eb3786da3d8c5d9

Handling 900 TB Daily Real‑Time Monitoring Traffic with CAT – Liang Jinhua –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788925&idx=1&sn=199c321518cc6fef0994ad63a86a13f0

Misusing Prometheus: The Newbie’s Sword – Xu Yason –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788382&idx=1&sn=6f9e72b4c8bff557552304fcfdd398ee

Which Monitoring System Is Stronger? A Comparison of Ele.me and Meituan Dianping – Li Gang –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650782814&idx=1&sn=99dfe98e4d2add296d4d35e34f546b19

Why Do Other Companies’ Full‑Link Automated Monitoring Platforms Feel So Hassle‑Free? – chunlian/xiaojun –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650793159&idx=1&sn=ddfcdf491dd993e84cf4d1c60201b70e

Professional HDFS Monitoring Implementation Thoughts – Application R&D Dept –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650774961&idx=1&sn=c36611985c7cf8c9a361f4a8b25b9454

Fault Recovery / Disaster Backup

Guidelines for Maintaining Thousands of MySQL Instances and Building a Disaster‑Recovery System – Liu Shuhao –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650781638&idx=2&sn=af78070ad26230f6ec3d40cd0a6bfcbe

China Unicom Big‑Data 5000‑Node Cluster Fault Self‑Healing Practice – Yu Che –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650780266&idx=1&sn=64e98d7d49f5700c23558bc1d010564b

What to Do After Accidentally Running rm -fr /* ? – Xiao Lin coding –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788775&idx=1&sn=ec736cb88157a6f896df80633669d551

Major Incident: IO Issue Crashed 20 Machines Simultaneously – Er Ma Reading –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650792934&idx=1&sn=e13a3f750bfd1c9d94771301eb534529

Memory Leak Caused by Development – How Ops Can Diagnose Without Being Blamed – Zhuan Bian Shu –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650774777&idx=1&sn=daae4b64e622eb951a565302e4931c32

Deep Dive: Bank Core System Disaster‑Recovery Architecture (With Cases) – Xiao Dai Deba –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650790739&idx=1&sn=20d5609d2da38979bbc2718315fd53e2

Recovering Data After Deleting Production Data Without Backups or Tables – Xu Yitao –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788024&idx=1&sn=e2789b7f19279fffc2545ab47557001c

A Complete Fault‑Drill Guide – Yuan Fen –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776684&idx=1&sn=7e48c894d87545c8261ac12887ea09a8

Intelligent Operations

Intelligent Ops Research and Application in Financial Core Domains – Chen Linbo –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650784303&idx=1&sn=7e84aa848e86c6382fb0551cf69c361d

Core Anomaly Detection Algorithm I Actually Use – Kong Zaihua –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650786977&idx=1&sn=4ae06903eab2452d7fc51e1e5f592d7e

JD Logistics Intelligent Ops System Built on Open‑Source APM – Fu Zhengquan –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650778824&idx=1&sn=d086c96c980caeb1d4e1a27a01a7746e

History and Core Technologies of Intelligent Operations – Shanghai Stock Exchange Tech Service –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650778347&idx=1&sn=18a1e38b521df2bfe2bcb143f1f1e91e

Tool Selection

Deep Dive into Monitoring Systems: Common Combinations and Mainstream Tool Choices – Cui Hao –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650791042&idx=1&sn=f842b1587aced1863c19dd685e8c071f

Five Open‑Source Log Analysis Tools I Recommend – Sam Bocetta –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650777842&idx=1&sn=981aeb2b92c90180f4521b9bc5706f20

2019 Top Ten DevOps Tools – How Many Have You Used? – Cui Jingwen –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650775489&idx=1&sn=af5ace80cc2038042e703b93ad779d40

Which Monitoring Component Suits You Best? – Miss Sister’s Dog –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650777583&idx=1&sn=db0c900ecc2b61108515887a0296544c

Top Open‑Source CI/CD Tools – Xie Li (translation) –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650792661&idx=1&sn=df3e936c28cbe028a3dbbdacb50b2131

More Good Operations Articles

Performance Optimization Strategies When System Load Drives You Crazy – Liu Diwei –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650773734&idx=1&sn=2950cff3cdd1043db825be1ee265c0d8

DevOps Is Not Just Putting Ops and Development Together – Liu Hua –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776266&idx=1&sn=fd4d8fd772ac8784126a003f02c3f273

Linux Ops Handbook: 150 Most Common Commands – alonghub –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776431&idx=1&sn=a89289a456d4cd2d925bf297829fda93

A Bank's ELK Log Platform Ingesting 15 TB Daily (Apollo+ES Source Modification) – Minsheng Bank Big Data Team –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650791421&idx=1&sn=5c844efc91d079a5df671f41d6b0c900

SRE Team Building and Role Division at Alibaba – Zhu Jian –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650784052&idx=1&sn=4cc7b2d2402c8a67d665fc9c336c7265

Deployment Speed Increased Six‑Fold: Zhihu’s 0‑to‑1 Deployment System Evolution – Iven Hsu –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650777562&idx=1&sn=33319fcc14e0cb29f91fedff80aa1127

Essential Skills for Junior, Mid‑Level, and Senior Ops – Li Zhenliang –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776595&idx=1&sn=78d35e283dca0c372a31f624f0298c98

Building CMDB for Legacy Bank Systems – Artisan Ops –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650775656&idx=1&sn=f59c725c1b28cadc106cf5ba2227e4bc

Three Problems You Must Solve Before Claiming High System Availability – Wang Yejian –

http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650780692&idx=1&sn=17134cd938a0be2d5ee4fe072fdc2f6a

DAMS China Data Intelligent Management Summit (2020)

Tencent – "Tencent Game Big Data Asset Management Practice: Metadata Management and Data Governance"

JD – "JD EB‑Level Global Big Data Platform Construction and Governance"

Alibaba – "Large‑Scale Container Cloud Infrastructure Architecture, Management, and Ops"

Industrial and Commercial Bank of China – "Exploring and Practicing DevOps Transformation"

China UnionPay – "Distributed Database: From Self‑Development to Evolution"

Minsheng Bank – "Open‑Source MySQL Application Practice at Minsheng Bank"

Ping An Bank – "Hybrid CMDB and Operations Middle‑Platform Practice"

China Unicom – "Design, Development, and Operation of a Big Data Asset Management Platform"

AWS – "Building Cloud Data Analysis Architecture Based on Data Lake"

ByteDance – "Data Governance Practice at ByteDance"

Suning – "Suning Large‑Scale Intelligent Alert Convergence and Root‑Cause Practice"

Didi – "Trillion‑Level Kafka Message Queue Practice at Didi"

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
