Scaling Ops: From Hundreds to Thousands of Servers – Lessons from AdMaster
This article shares AdMaster's five‑year operations journey, detailing how the team scaled monitoring from under 200 machines to over a thousand, the evolution of their monitoring stack, the design of a custom distributed system, and practical Q&A insights for large‑scale infrastructure management.
AdMaster is a leading independent third‑party marketing big‑data solution provider in China, serving over 80% of the Fortune 100 brands across industries such as FMCG, IT, and automotive.
In his talk "From a Few to Thousands of Servers," Operations Director Gu Kai covers five years of rapid growth: from dozens of servers to thousands, handling over 5 TB of daily data, more than 100 billion requests, and over 1 million QPS.
Three Growth Stages
Stage 1: Fewer than 200 machines
Requirements were simple: ease of use, stable operation, and alerting via email and SMS. Open-source options such as Nagios, Cacti, Zabbix, and Ganglia were available; the team chose Nagios and Cacti because they were quick to set up.
Stage 2: 200‑1000 machines
Complexity increased, prompting unified basic monitoring (CPU, memory, disk) and comprehensive business monitoring. Alerts were categorized by severity and delivered via email, WeChat, SMS, and phone calls. The team deepened Nagios usage, writing custom scripts and plugins, but alarm volume exploded to thousands of emails per day.
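As a rough illustration of severity-based delivery, the sketch below maps each alert level to a set of channels. The level names, the escalation policy, and the `route` helper are assumptions made for this example; the talk does not describe AdMaster's actual mapping.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the talk only says alerts were categorized by severity.
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    DISASTER = 4


# Assumed escalation policy: each level keeps the channels of the level below
# and adds a more intrusive one.
CHANNELS_BY_SEVERITY = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat"],
    Severity.CRITICAL: ["email", "wechat", "sms"],
    Severity.DISASTER: ["email", "wechat", "sms", "phone"],
}


def route(alert):
    """Return (channel, message) pairs for one alert dict.

    The alert is expected to carry 'severity', 'host', and 'summary' keys.
    Actual delivery (SMTP, SMS gateway, and so on) is out of scope here.
    """
    severity = Severity[alert["severity"].upper()]
    message = f"[{severity.name}] {alert['host']}: {alert['summary']}"
    return [(channel, message) for channel in CHANNELS_BY_SEVERITY[severity]]
```

Keeping the policy in one table makes it easy to review which incidents are allowed to trigger a phone call, which matters once email volume reaches thousands per day.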
As the fleet approached 1000 machines, Nagios could no longer meet the team's performance and visualization needs, forcing a choice between heavily customizing Nagios and building a new system.
Stage 3: Over 1000 machines
A self‑built monitoring platform was created to replicate and improve upon Nagios features, simplify alerts, separate alarm processing from display, and enable distributed deployment with failover via intelligent DNS.
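The failover decision behind a DNS switch can be sketched as follows. This is only an assumed model, not the platform's implementation: the `NodeHealth` class, the node names, and the three-failure threshold are hypothetical, and the step that actually updates the DNS record is left out because the talk does not say which DNS service was used.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Tracks consecutive failed health checks for one monitoring node."""
    name: str
    consecutive_failures: int = 0

    def record(self, ok: bool) -> None:
        # A single successful check resets the counter.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    def is_up(self, threshold: int = 3) -> bool:
        # Requiring several failures in a row avoids switching on one lost probe.
        return self.consecutive_failures < threshold


def pick_target(candidates, threshold=3):
    """Return the name of the first healthy node in preference order, or None."""
    for node in candidates:
        if node.is_up(threshold):
            return node.name
    return None
```

A caller would list the preferred node first, for example the local alarm node of a data center followed by the central node, and point the DNS record at whatever `pick_target` returns.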
Key design points:
Feature parity with Nagios: replicate all existing functions before replacing it.
Alert de-duplication: reduce daily alarms from 3000+ to under 300.
Separate alarm and display: local alarms trigger immediately, while a central node aggregates visualizations.
Distributed architecture: each data center has a local alarm node and a central node; DNS switches automatically on failure.
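One common way to cut alarm volume is to suppress repeats of the same alert within a time window. The sketch below shows that technique under stated assumptions: the (host, check) key, the 10-minute window, and the `AlertDeduplicator` class are illustrative choices, and the talk does not specify how the platform actually de-duplicates.

```python
class AlertDeduplicator:
    """Suppress repeated alerts for the same (host, check) within a time window."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        # Maps (host, check) -> (state, timestamp) of the last alert that was sent.
        self._last_sent = {}

    def should_send(self, host, check, state, now):
        """Return True if this alert should be delivered.

        A state change (for example CRITICAL -> OK) is always delivered;
        a repeat of the same state is delivered only after the window,
        measured from the last delivered alert, has elapsed.
        """
        key = (host, check)
        previous = self._last_sent.get(key)
        if previous is not None:
            previous_state, previous_time = previous
            if previous_state == state and now - previous_time < self.window_seconds:
                return False
        self._last_sent[key] = (state, now)
        return True
```

Timestamps are passed in as plain seconds rather than read from the clock, which keeps the logic easy to test; a caller would typically pass `time.time()`.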
The platform also integrates asset management, ticketing, cloud management, and visual dashboards, making operational data understandable for product, sales, and leadership.
Several visualizations illustrate nationwide traffic monitoring and switch performance, drawing on monitoring nodes from Cloudwise's Jiankongbao ("Monitoring Treasure") service for broad data collection.
Q&A Highlights
Is the underlying system still Nagios? No, it is fully custom‑built, inspired by Nagios concepts.
Database monitoring? Existing scripts are used; no dedicated DB monitoring.
Business monitoring? Implemented via data aggregation and textual summaries for stakeholders.
Data handling? Asynchronous processing powers large‑screen dashboards.
Resource impact? Ongoing optimizations keep overhead manageable.
Smart DNS? Third‑party solutions supplemented with internal tools.
MySQL cluster? Master‑slave setup with separate alarm and display paths to ensure real‑time alerts.
Private cloud tooling? KVM‑based automation, initially using gopstack and OpenStack before simplifying.
Physical server specs? Minimum dual 6‑core CPUs with 64 GB RAM.
Visualization purpose? To educate non‑technical teams and streamline ticket workflows.
Network issues like SFP oscillation? Detected via custom logs and correlated with business impact.
Overall, the talk emphasizes that regardless of scale, systematic monitoring, alert de‑duplication, and clear visualization are essential for reliable operations.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.