From Hundreds to Thousands: Scaling Operations and Building a Custom Monitoring System
This article recounts AdMaster's five‑year journey from a few dozen servers to thousands, detailing the evolution of their monitoring infrastructure, the challenges faced at each scale stage, and the design of a self‑built, distributed monitoring platform that delivers real‑time alerts, visualized data, and business‑level insights.
AdMaster, a leading independent third‑party marketing big‑data solution provider in China, serves over 80% of the world’s top 100 brands across industries such as FMCG, IT, and automotive.
Operations Director Gu Kai shares his experience of scaling the company’s infrastructure from dozens to thousands of servers over five years, handling daily data growth exceeding 5 TB, more than 100 billion requests, and over 1 million QPS.
The operations team built its own platforms for asset management, ticketing, monitoring, domain management, and both public and private cloud management, making operational data transparent and visual.
Phase 1: Fewer than 200 Machines
Requirements were simple: ease of use, stable operation, and alerting via email and SMS. Open‑source tools such as Nagios, Cacti, Zabbix, and Ganglia were considered; Nagios and Cacti were chosen because the team already knew them and they made switch monitoring convenient.
Phase 2: 200–1000 Machines
Complexity grew, prompting three main actions:
Standardized basic monitoring (CPU, memory, disk) on every machine.
Implemented comprehensive business‑level monitoring to cover all processes and reduce duplicate alerts.
Established multi‑level notification channels (email, WeChat, SMS, phone) to ensure no missed alerts, with critical alerts using persistent call‑backs.
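The multi-level notification idea can be sketched roughly as follows. This is a minimal illustration, not AdMaster's implementation: the severity names, the mapping from severity to channels, the `send` and `acknowledged` callbacks, and the retry limit are all assumptions made for the example. The article only states that email, WeChat, SMS, and phone were used and that critical alerts were followed up with repeated calls.

```python
from dataclasses import dataclass

# Channels named in the article, ordered here from least to most intrusive.
CHANNELS = ["email", "wechat", "sms", "phone"]

# Assumed mapping: how many channels (from the front of CHANNELS) a severity uses.
SEVERITY_DEPTH = {"info": 1, "warning": 2, "error": 3, "critical": 4}


@dataclass
class Alert:
    host: str
    check: str
    severity: str


def channels_for(alert):
    """Return the channels an alert should go out on; unknown severities use all."""
    depth = SEVERITY_DEPTH.get(alert.severity, len(CHANNELS))
    return CHANNELS[:depth]


def notify(alert, send, acknowledged, max_phone_retries=3):
    """Send the alert on each channel, then keep phoning for critical alerts.

    `send(channel, alert)` and `acknowledged(alert)` are hypothetical callbacks
    supplied by the caller. Returns the list of channels used, in order.
    """
    attempts = []
    for channel in channels_for(alert):
        send(channel, alert)
        attempts.append(channel)

    if alert.severity == "critical":
        retries = 0
        # Repeat the phone call until someone acknowledges or retries run out.
        while not acknowledged(alert) and retries < max_phone_retries:
            send("phone", alert)
            attempts.append("phone")
            retries += 1
    return attempts
```

Passing the delivery and acknowledgement logic in as callbacks keeps the escalation policy separate from the actual gateways, so it can be exercised without sending real messages.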
Deep customizations of Nagios were made, but as alarm volume exploded (thousands of emails per day), the team faced a decision: further extend Nagios or develop a new system.
Phase 3: Over 1000 Machines
At this scale the team built a custom monitoring system. It replicated and improved on the Nagios features already in use, simplified alerting (cutting daily alarms from more than 3,000 to under 300), separated alerting from display, and ran on a distributed architecture in which regional nodes and a central node fail over via intelligent DNS.
Key capabilities include:
Full feature parity with Nagios, plus optimizations.
Alert de‑duplication and prioritization.
Separation of alarm processing from visualization.
Distributed deployment to avoid single points of failure.
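One common way to get alert de-duplication and prioritization is a suppression window combined with a priority queue. The sketch below is an assumption about how such a component could look, not a description of AdMaster's system: the `AlertQueue` class, the `(host, check)` de-duplication key, the 300-second window, and the severity ranking are all hypothetical.

```python
import heapq
import itertools

# Assumed ranking: lower number is handled first.
PRIORITY = {"critical": 0, "warning": 1, "info": 2}


class AlertQueue:
    """Drops repeats of the same (host, check) within a window, orders the rest."""

    def __init__(self, suppress_seconds=300):
        self.suppress_seconds = suppress_seconds
        self._last_seen = {}                # (host, check) -> time of last accepted alert
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: FIFO within one priority

    def submit(self, host, check, severity, now):
        """Queue the alert and return True, or return False if it is a duplicate."""
        key = (host, check)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.suppress_seconds:
            return False
        self._last_seen[key] = now
        rank = PRIORITY.get(severity, len(PRIORITY))  # unknown severity sorts last
        heapq.heappush(self._heap, (rank, next(self._counter), host, check, severity))
        return True

    def pop(self):
        """Return the most urgent queued alert as (host, check, severity), or None."""
        if not self._heap:
            return None
        _, _, host, check, severity = heapq.heappop(self._heap)
        return host, check, severity
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the suppression logic deterministic and easy to test.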
Visual dashboards now present data in an understandable way for product, sales, and executive teams.
Examples of visualized data include nationwide access tracking, switch traffic analysis, and business‑level health indicators.
21CTO
