Scaling Ops: From Hundreds to Thousands of Servers – Lessons from AdMaster

This article shares AdMaster's five‑year operations journey, detailing how the team scaled monitoring from under 200 machines to over a thousand, the evolution of their monitoring stack, the design of a custom distributed system, and practical Q&A insights for large‑scale infrastructure management.

MaGe Linux Operations

Jun 26, 2019

AdMaster is a leading independent third‑party marketing big‑data solution provider in China, serving over 80% of the Fortune 100 brands across industries such as FMCG, IT, and automotive.

In his talk "From a Few to Thousands of Servers," Operations Director Gu Kai covers five years of rapid growth: from dozens of servers to thousands, with the platform now handling over 5 TB of data per day, more than 100 billion requests, and over 1 million QPS.

Three Growth Stages

Stage 1: Fewer than 200 machines

Requirements were simple: the tooling had to be easy to use, run stably, and send alerts by email and SMS. Open-source options at the time included Nagios, Cacti, Zabbix, and Ganglia; the team chose Nagios and Cacti because they were quick to set up.

Stage 2: 200‑1000 machines

Complexity increased, prompting unified basic monitoring (CPU, memory, disk) and comprehensive business monitoring. Alerts were categorized by severity and delivered via email, WeChat, SMS, and phone calls. The team deepened Nagios usage, writing custom scripts and plugins, but alarm volume exploded to thousands of emails per day.
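Severity-based delivery of this kind can be sketched as a simple lookup table. This is a minimal illustration, not AdMaster's implementation: the talk names the four channels but not the severity levels or which level uses which channel, so the `Severity` names and the `CHANNELS_BY_SEVERITY` mapping below are assumptions.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the talk only says alerts were "categorized by severity".
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    DISASTER = 4


# Assumed mapping: the more severe the alert, the more intrusive the channel.
CHANNELS_BY_SEVERITY = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat"],
    Severity.CRITICAL: ["email", "wechat", "sms"],
    Severity.DISASTER: ["email", "wechat", "sms", "phone"],
}


def route_alert(severity: Severity) -> list:
    """Return the delivery channels for an alert of the given severity."""
    return list(CHANNELS_BY_SEVERITY[severity])
```

Keeping the mapping in data rather than in branching code means the escalation policy can be changed without touching the alerting logic.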

As the fleet approached 1000 machines, Nagios could no longer meet the team's performance and visualization needs, forcing a choice between heavily customizing Nagios and building a new system.

Stage 3: Over 1000 machines

A self‑built monitoring platform was created to replicate and improve upon Nagios features, simplify alerts, separate alarm processing from display, and enable distributed deployment with failover via intelligent DNS.
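The failover idea can be illustrated with a small selection function: probe the alarm nodes in priority order and point the DNS record at the first healthy one. This is only a sketch under assumptions. The host names, the `pick_active_node` helper, and the health table are hypothetical, and the actual DNS update is left out because the talk does not describe it.

```python
from typing import Callable, Optional, Sequence


def pick_active_node(
    nodes: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Example with a fake health table standing in for real probes.
health = {
    "alarm.dc1.example.internal": False,  # primary is down
    "alarm.dc2.example.internal": True,   # standby is up
}
priority = ["alarm.dc1.example.internal", "alarm.dc2.example.internal"]
active = pick_active_node(priority, lambda node: health.get(node, False))
```

In a real deployment the result would feed whatever mechanism updates the DNS record, and the probe would need timeouts and several consecutive failures before switching, to avoid flapping between nodes.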

Key design points:

Feature parity with Nagios: replicate all existing functions before replacing.

Alert de‑duplication: reduce daily alarms from 3000+ to under 300.

Separate alarm and display: local alarms trigger immediately, while a central node aggregates visualizations.

Distributed architecture: each data center has a local alarm node and a central node; DNS switches automatically on failure.
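One common way to get the kind of alarm reduction described above is to suppress repeats of the same alert within a time window. The talk does not say how AdMaster de-duplicated alerts, so the sketch below is an assumption throughout: the `AlertDeduplicator` class, the `(host, check)` key, and the 600-second window are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDeduplicator:
    """Suppress repeats of the same (host, check) alert inside a time window."""

    window_seconds: float = 600.0  # assumed value, not from the talk
    _last_sent: dict = field(default_factory=dict)

    def should_send(self, host: str, check: str, now: float) -> bool:
        key = (host, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self._last_sent[key] = now
        return True


# Four raw events as (host, check, timestamp in seconds).
dedup = AlertDeduplicator(window_seconds=600)
events = [
    ("web-01", "disk", 0),
    ("web-01", "disk", 60),   # repeat within the window, suppressed
    ("web-02", "disk", 90),   # different host, sent
    ("web-01", "disk", 700),  # window has expired, sent again
]
sent = [event for event in events if dedup.should_send(*event)]
```

Passing the timestamp in explicitly keeps the logic deterministic and easy to test; a production version would also need to expire old keys so the table does not grow without bound.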

The platform also integrates asset management, ticketing, cloud management, and visual dashboards, making operational data understandable for product, sales, and leadership.

Several visualizations illustrate nationwide traffic monitoring and switch performance, drawing on probe nodes from Cloudwise's Jiankongbao ("Monitoring Treasure") service for comprehensive data collection.

Q&A Highlights

Is the underlying system still Nagios? No, it is fully custom‑built, inspired by Nagios concepts.

Database monitoring? Existing scripts are used; no dedicated DB monitoring.

Business monitoring? Implemented via data aggregation and textual summaries for stakeholders.

Data handling? Asynchronous processing powers large‑screen dashboards.

Resource impact? Ongoing optimizations keep overhead manageable.

Smart DNS? Third‑party solutions supplemented with internal tools.

MySQL cluster? Master‑slave setup with separate alarm and display paths to ensure real‑time alerts.

Private cloud tooling? KVM‑based automation, initially using gopstack and OpenStack before simplifying.

Physical server specs? Minimum dual 6‑core CPUs with 64 GB RAM.

Visualization purpose? To educate non‑technical teams and streamline ticket workflows.

Network issues like SFP oscillation? Detected via custom logs and correlated with business impact.

Overall, the talk emphasizes that regardless of scale, systematic monitoring, alert de‑duplication, and clear visualization are essential for reliable operations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
