Designing a 100k-Server Monitoring System: Architecture and Key Lessons
This article shares the architecture, design principles, challenges, and performance optimizations behind a monitoring system for roughly 100,000 servers, covering data collection agents, distributed pipelines, real-time alerts, high throughput, multi-platform support, and practical lessons learned.
1. Monitoring System Architecture
The system places an agent on each server to collect metrics, forwards data through a distributed pipeline, aggregates it, and stores results in a database while also defining numerous alert rules. Simple alerts trigger on thresholds like CPU >90%, while complex alerts consider patterns such as multiple spikes within a short window.
Data is stored in a distributed Linux‑based file database with a web front‑end for visualization.
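The two alert styles described above can be sketched as follows. This is a minimal illustration, not the system's actual rule engine: the names `threshold_alert` and `SpikeWindowAlert`, and the defaults of three spikes within 60 seconds, are assumptions made for the example.

```python
from collections import deque


def threshold_alert(cpu_percent, limit=90.0):
    """Simple rule: fire as soon as a single sample exceeds the limit."""
    return cpu_percent > limit


class SpikeWindowAlert:
    """Complex rule (hypothetical): fire when at least `min_spikes` samples
    exceeded `limit` within the last `window_seconds` seconds."""

    def __init__(self, limit=90.0, min_spikes=3, window_seconds=60.0):
        self.limit = limit
        self.min_spikes = min_spikes
        self.window_seconds = window_seconds
        self._spikes = deque()  # timestamps of recent spikes, oldest first

    def observe(self, timestamp, value):
        """Record one sample and return True if the rule fires."""
        if value > self.limit:
            self._spikes.append(timestamp)
        # Drop spikes that have fallen out of the time window.
        while self._spikes and timestamp - self._spikes[0] > self.window_seconds:
            self._spikes.popleft()
        return len(self._spikes) >= self.min_spikes


if __name__ == "__main__":
    rule = SpikeWindowAlert()
    for ts, cpu in [(0, 95), (5, 50), (10, 96), (15, 40), (20, 97)]:
        print(ts, cpu, threshold_alert(cpu), rule.observe(ts, cpu))
```

The threshold rule reacts to every individual sample, while the windowed rule stays quiet until several spikes cluster together, which is one way to avoid alerting on a single transient peak.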
2. Design Philosophy
Each module has a single, precisely defined responsibility, which keeps the system scalable and flexible enough to work across firewalls and varied network environments.
3. Core Challenges
Data Volume – At eLong the system generates about 160 GB per day, while at 360 it reaches 500 GB per day. Each server reports roughly 200 monitoring items every 5 seconds, which works out to about 40 data points per second per server.
Real‑time Processing – Alerts must be delivered within 15 seconds, requiring near‑zero latency and high availability; any data loss could suppress critical alarms.
High Throughput – The system handles heavy write loads and random reads for on‑demand analytics and charting.
Multi‑Platform Support – Environments include Linux, Windows, and FreeBSD, with Windows accounting for about half of the servers.
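The collection rate in the Data Volume item follows from simple arithmetic. The sketch below reproduces it and extrapolates to a whole fleet; the fleet size of 100,000 servers is taken from the article title and is an assumption here, since the text does not say how many servers each deployment actually covered.

```python
ITEMS_PER_SERVER = 200        # monitoring items per server (from the text)
COLLECT_INTERVAL_S = 5        # collection interval in seconds (from the text)
SERVERS = 100_000             # assumed fleet size, taken from the title

# Data points produced by one server each second.
points_per_server_per_s = ITEMS_PER_SERVER / COLLECT_INTERVAL_S

# Data points produced by the whole (assumed) fleet each second.
fleet_points_per_s = points_per_server_per_s * SERVERS

# Data points produced by one server over a full day.
points_per_server_per_day = points_per_server_per_s * 86_400

if __name__ == "__main__":
    print(points_per_server_per_s)     # 40.0
    print(fleet_points_per_s)          # 4000000.0
    print(points_per_server_per_day)   # 3456000.0
```

Under that assumption the pipeline has to absorb on the order of four million data points per second, which is why the later sections focus on batching, compression, and compact serialization.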
4. Solutions Implemented
Data Storage – Use HBase with a custom protocol to reduce overhead; JSON remains the primary serialization format for compatibility.
Real‑time Guarantees – Multi‑threaded, asynchronous, non‑blocking design with long‑lived connections.
High Availability – No single points of failure; agents can fail‑over to the nearest healthy node (“lazy intelligent routing”). Data is acknowledged at each stage to prevent loss.
Throughput – Use caching where it helps, although most high-concurrency paths are handled without it.
Cross-Platform Development – The system was initially built in C++ because Go was not an option at the time, and was later migrated to Go for better Windows support.
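The High Availability item above combines two ideas: fail over to another node when the current one is unreachable, and keep data until the receiver acknowledges it. The sketch below is one possible reading of that behavior, not the original implementation. `Node` and `Agent` are hypothetical classes, the node list is assumed to be ordered nearest first, and "lazy" is interpreted here as switching nodes only when a send fails.

```python
class Node:
    """Hypothetical pipeline node; `healthy` simulates reachability."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def send(self, batch):
        """Store the batch and return True as the acknowledgement."""
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        self.received.append(batch)
        return True


class Agent:
    """Hypothetical agent that retains batches until they are acknowledged."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # assumed to be ordered nearest first
        self.current = 0          # index of the node used last
        self.pending = []         # batches not yet acknowledged

    def submit(self, batch):
        self.pending.append(batch)
        self.flush()

    def flush(self):
        """Send pending batches in order; stop if no node is reachable."""
        while self.pending:
            if not self._send_with_failover(self.pending[0]):
                return  # keep the data and retry on the next flush
            self.pending.pop(0)  # drop only after an acknowledgement

    def _send_with_failover(self, batch):
        # Start with the current node and move on only when it fails.
        for offset in range(len(self.nodes)):
            idx = (self.current + offset) % len(self.nodes)
            try:
                if self.nodes[idx].send(batch):
                    self.current = idx
                    return True
            except ConnectionError:
                continue
        return False
```

A batch leaves `pending` only after some node has acknowledged it, so an outage of every node delays delivery rather than losing data, at the cost of memory on the agent while the backlog grows.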
5. Performance Optimization Techniques
zlib streaming compression for efficient data transfer.
Sliding windows in the pipeline to forward data in batches and reduce latency.
Protocol redesign using Protobuf for compact, fast serialization.
Data merging strategies to minimize duplicate processing.
Function‑filter optimizations guided by profiling to address CPU bottlenecks.
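To illustrate the first technique in the list above, here is a minimal sketch of zlib streaming compression over a long-lived connection, using Python's standard `zlib` module. The newline-delimited JSON framing and the function names `compress_stream` and `decompress_stream` are assumptions for the example; the article does not describe the actual wire format.

```python
import json
import zlib


def compress_stream(batches, level=6):
    """Yield one compressed chunk per batch, plus a final end-of-stream chunk.

    A single compressor is reused for the whole stream, so repeated field
    names and host names in later batches compress against earlier ones.
    """
    compressor = zlib.compressobj(level)
    for batch in batches:
        payload = (json.dumps(batch) + "\n").encode("utf-8")
        # Z_SYNC_FLUSH emits everything buffered so far, so the receiver can
        # decode this batch immediately without waiting for the stream to end.
        yield compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    yield compressor.flush()  # finish the stream


def decompress_stream(chunks):
    """Yield decoded batches as soon as each complete line is available."""
    decompressor = zlib.decompressobj()
    buffer = b""
    for chunk in chunks:
        buffer += decompressor.decompress(chunk)
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield json.loads(line)


if __name__ == "__main__":
    batches = [{"host": "web-1", "cpu": 91.5, "seq": i} for i in range(50)]
    chunks = list(compress_stream(batches))
    raw = sum(len(json.dumps(b)) + 1 for b in batches)
    print(raw, sum(len(c) for c in chunks))
    assert list(decompress_stream(chunks)) == batches
```

Compared with compressing each batch independently, a shared stream trades a small amount of per-connection state for better ratios on repetitive metric data, which fits the long-lived connections mentioned in section 4.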
6. Reflections on the Journey
Complexity breeds hidden pitfalls; simplifying each module leads to a more reliable distributed pipeline. By iterating on design, focusing on minimalism, and continuously profiling, the team achieved a robust monitoring system capable of handling massive scale and stringent latency requirements.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
