45 Must‑Read Operations Articles: Monitoring, Incident Recovery, Disaster Backup & Intelligent Ops
This curated collection gathers 45 essential operations articles covering monitoring alerts, fault recovery, disaster‑backup strategies, intelligent operations, tool selection, and additional expert insights, each linked to its original source for deeper technical reading.
Monitoring Alerts
Why I Use ES for Redis Monitoring Instead of Prometheus or Zabbix? – Li Meng –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788187&idx=1&sn=9363cc47966b4464e82f84904ac0b4b8
Bank Monitoring Alarm System Performance Boosted 50× Using Open‑Source Components – Pang Yaping –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650795501&idx=1&sn=7d2ea32c7f9c99d9408fb944d341bccf
Why Prometheus Can Replace Zabbix as the Ultimate Monitoring Tool – Chen Xiaoyu –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650781498&idx=1&sn=18cee3ae108faa53700bd1837734c3c5
Guangda Bank Monitoring Platform Practice, Including Tool and Architecture Choices – Pang Yaping –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650789404&idx=1&sn=7f588c50269a5c4567f39b25424fff52
Choosing Between Zabbix and Prometheus: A Comprehensive Guide – Shi Peng, Cai Xianghua, Liu Yu –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650796364&idx=1&sn=b27ec041a895d902b5b53566d659bd11
Enterprise Monitoring Platform Design and Implementation Based on Prometheus – Liu Hengtang –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650791880&idx=1&sn=4c14c21f58f16e3e22f29ee181cfe77b
From Zabbix to Prometheus: Tongcheng Yilong Database Monitoring System Practice – Yan Xiaoyu –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650782138&idx=1&sn=7f4e6fba5910f6cd28d40498d9ed79a9
Ele.me Monitoring System: Evolution Through Architectural Simplification – Huang Jie –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650783266&idx=1&sn=7ff441d483d16c541d7aabdac589f4a3
Solving ELK Pain Points: Almost All Issues Covered – alonghub –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776086&idx=1&sn=ff707f7dfcce0bf33eb3786da3d8c5d9
Handling 900 TB Daily Real‑Time Monitoring Traffic with CAT – Liang Jinhua –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788925&idx=1&sn=199c321518cc6fef0994ad63a86a13f0
Misusing Prometheus: The Newbie’s Sword – Xu Yason –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788382&idx=1&sn=6f9e72b4c8bff557552304fcfdd398ee
Which Monitoring System Is Stronger? A Comparison of Ele.me and Meituan Dianping – Li Gang –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650782814&idx=1&sn=99dfe98e4d2add296d4d35e34f546b19
Why Do Other Companies’ Full‑Link Automated Monitoring Platforms Feel So Hassle‑Free? – chunlian/xiaojun –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650793159&idx=1&sn=ddfcdf491dd993e84cf4d1c60201b70e
Professional HDFS Monitoring Implementation Thoughts – Application R&D Dept –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650774961&idx=1&sn=c36611985c7cf8c9a361f4a8b25b9454
Fault Recovery / Disaster Backup
Guidelines for Maintaining Thousands of MySQL Instances and Building a Disaster‑Recovery System – Liu Shuhao –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650781638&idx=2&sn=af78070ad26230f6ec3d40cd0a6bfcbe
China Unicom Big‑Data 5000‑Node Cluster Fault Self‑Healing Practice – Yu Che –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650780266&idx=1&sn=64e98d7d49f5700c23558bc1d010564b
What to Do After Accidentally Running rm -fr /* ? – Xiao Lin coding –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788775&idx=1&sn=ec736cb88157a6f896df80633669d551
Major Incident: IO Issue Crashed 20 Machines Simultaneously – Er Ma Reading –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650792934&idx=1&sn=e13a3f750bfd1c9d94771301eb534529
Memory Leak Caused by Development – How Ops Can Diagnose Without Being Blamed – Zhuan Bian Shu –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650774777&idx=1&sn=daae4b64e622eb951a565302e4931c32
Deep Dive: Bank Core System Disaster‑Recovery Architecture (With Cases) – Xiao Dai Deba –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650790739&idx=1&sn=20d5609d2da38979bbc2718315fd53e2
Recovering Data After Deleting Production Data Without Backups or Tables – Xu Yitao –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650788024&idx=1&sn=e2789b7f19279fffc2545ab47557001c
A Complete Fault‑Drill Guide – Yuan Fen –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776684&idx=1&sn=7e48c894d87545c8261ac12887ea09a8
Intelligent Operations
Intelligent Ops Research and Application in Financial Core Domains – Chen Linbo –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650784303&idx=1&sn=7e84aa848e86c6382fb0551cf69c361d
Core Anomaly Detection Algorithm I Actually Use – Kong Zaihua –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650786977&idx=1&sn=4ae06903eab2452d7fc51e1e5f592d7e
JD Logistics Intelligent Ops System Built on Open‑Source APM – Fu Zhengquan –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650778824&idx=1&sn=d086c96c980caeb1d4e1a27a01a7746e
History and Core Technologies of Intelligent Operations – Shanghai Stock Exchange Tech Service –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650778347&idx=1&sn=18a1e38b521df2bfe2bcb143f1f1e91e
Tool Selection
Deep Dive into Monitoring Systems: Common Combinations and Mainstream Tool Choices – Cui Hao –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650791042&idx=1&sn=f842b1587aced1863c19dd685e8c071f
Five Open‑Source Log Analysis Tools I Recommend – Sam Bocetta –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650777842&idx=1&sn=981aeb2b92c90180f4521b9bc5706f20
2019 Top Ten DevOps Tools – How Many Have You Used? – Cui Jingwen –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650775489&idx=1&sn=af5ace80cc2038042e703b93ad779d40
Which Monitoring Component Suits You Best? – Miss Sister’s Dog –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650777583&idx=1&sn=db0c900ecc2b61108515887a0296544c
Top Open‑Source CI/CD Tools – Xie Li (translation) –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650792661&idx=1&sn=df3e936c28cbe028a3dbbdacb50b2131
More Good Operations Articles
Performance Optimization Strategies When System Load Drives You Crazy – Liu Diwei –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650773734&idx=1&sn=2950cff3cdd1043db825be1ee265c0d8
DevOps Is Not Just Putting Ops and Development Together – Liu Hua –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776266&idx=1&sn=fd4d8fd772ac8784126a003f02c3f273
Linux Ops Handbook: 150 Most Common Commands – alonghub –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776431&idx=1&sn=a89289a456d4cd2d925bf297829fda93
Bank Daily Ingestion of 15 TB ELK Log Platform (Apollo+ES Source Modification) – Minsheng Bank Big Data Team –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650791421&idx=1&sn=5c844efc91d079a5df671f41d6b0c900
SRE Team Building and Role Division at Alibaba – Zhu Jian –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650784052&idx=1&sn=4cc7b2d2402c8a67d665fc9c336c7265
Deployment Speed Increased Six‑Fold: Zhihu’s 0‑to‑1 Deployment System Evolution – Iven Hsu –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650777562&idx=1&sn=33319fcc14e0cb29f91fedff80aa1127
Essential Skills for Junior, Mid‑Level, and Senior Ops – Li Zhenliang –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650776595&idx=1&sn=78d35e283dca0c372a31f624f0298c98
Building CMDB for Legacy Bank Systems – Artisan Ops –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650775656&idx=1&sn=f59c725c1b28cadc106cf5ba2227e4bc
Three Problems You Must Solve Before Claiming High System Availability – Wang Yejian –
http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650780692&idx=1&sn=17134cd938a0be2d5ee4fe072fdc2f6a
DAMS China Data Intelligent Management Summit (2020)
Tencent – "Tencent Game Big Data Asset Management Practice: Metadata Management and Data Governance"
JD – "JD EB‑Level Global Big Data Platform Construction and Governance"
Alibaba – "Large‑Scale Container Cloud Infrastructure Architecture, Management, and Ops"
Industrial and Commercial Bank of China – "Exploring and Practicing DevOps Transformation"
China UnionPay – "Distributed Database from Self‑Developed Evolution"
Minsheng Bank – "Open‑Source MySQL Application Practice at Minsheng Bank"
Ping An Bank – "Hybrid CMDB and Operations Middle‑Platform Practice"
China Unicom – "Design, Development, and Operation of a Big Data Asset Management Platform"
AWS – "Building Cloud Data Analysis Architecture Based on Data Lake"
ByteDance – "Data Governance Practice at ByteDance"
Suning – "Suning Large‑Scale Intelligent Alert Convergence and Root‑Cause Practice"
Didi – "Trillion‑Level Kafka Message Queue Practice at Didi"
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
