Designing a Scalable 100k-Server Monitoring System: Architecture and Lessons Learned
The article outlines the architecture, design principles, challenges, and performance optimizations of a large‑scale server monitoring system built for handling hundreds of gigabytes of data per day with high availability, low latency alerts, and multi‑platform support.
The talk, originally presented at the into100 Salon, shares the complete architecture and practical insights from developing a monitoring system capable of handling up to 100,000 servers.
1. Monitoring System Architecture
Each monitored server runs an Agent that collects metrics and forwards them through a distributed pipeline resembling building blocks. Collected data is split into two streams: one stored in a database for visualization and troubleshooting, and another used to evaluate alarm rules.
Alarm rules range from simple thresholds (e.g., CPU > 90%) to complex conditions (e.g., two alerts within 60 seconds). Alerts are routed to various back‑ends such as WeChat, SMS, etc.
2. Design Philosophy
Each module performs a single, well‑defined task and is highly refined.
The system must be scalable and flexible, able to expand horizontally across firewalls and diverse network environments.
Code reuse is emphasized; most modules are under 100 lines of C code, with a shared multithreaded network library.
3. Core Challenges
Data Volume: At eLong, daily data reaches ~160 GB; at 360 GB, with ~200 metrics per server collected every 5 seconds, resulting in over 40 k data points per second.
Real‑time Processing: Alerts must be delivered within 15 seconds, requiring near‑zero latency and high availability; loss of any data point can cause missed alerts.
High Throughput: Write‑heavy workload with random reads for dashboards demands sustained high throughput.
Multi‑Platform Support: The system runs on Windows, Linux, and FreeBSD across various data‑center firewalls.
4. Solutions Implemented
Use of HBase with a custom lightweight protocol to reduce overhead.
Adoption of multithreaded, asynchronous, non‑blocking I/O and long‑living connections for low latency.
Design of a “lazy intelligent routing” mechanism: if an agent loses its primary connection, it automatically fails over to the next best reachable node without a single point of failure.
Strict end‑to‑end data integrity guarantees to avoid any loss during network partitions.
High‑throughput handling via minimal caching; the system relies on fast writes and efficient batch processing.
Initial implementation in C/C++ due to lack of Go at the time; later migration to Go simplified cross‑platform concerns.
5. Performance Optimizations
zlib stream compression for bandwidth reduction.
Pipeline sliding‑window buffering to batch forward data and reduce round‑trip latency.
Protocol refactoring to Protobuf for compact serialization.
Data merging strategies to minimize duplicate transmissions.
Function‑filter optimizations guided by profiling to eliminate CPU hotspots.
6. Lessons Learned
Complexity breeds bugs; keeping each module simple and focused leads to higher reliability. Over‑engineering, especially in high‑constraint environments, caused performance regressions, so the team iteratively stripped unnecessary layers, ending with a lightweight distributed pipeline that runs efficiently on standard Linux systems.
Future work includes further scaling to millions of servers and refining the alerting logic.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
