Designing a 100k-Server Monitoring System: Architecture and Key Lessons

This article shares the architecture, design principles, challenges, and performance optimizations behind a monitoring system for roughly 100,000 servers, covering data collection agents, distributed pipelines, real-time alerts, high throughput, multi-platform support, and practical lessons learned.

21CTO

Dec 30, 2015

1. Monitoring System Architecture

The system places an agent on each server to collect metrics, forwards the data through a distributed pipeline, aggregates it, and stores the results in a database. Numerous alert rules are defined on top of this data. Simple alerts trigger on thresholds such as CPU > 90%, while complex alerts look for patterns such as multiple spikes within a short window.
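The two kinds of alert rule can be sketched as follows. This is a minimal illustration, not the system's actual rule engine: the names `threshold_alert` and `SpikeWindowAlert`, and the parameters `min_spikes` and `window_seconds`, are assumptions chosen for the example.

```python
from collections import deque


def threshold_alert(cpu_percent, limit=90.0):
    """Simple rule: fire when a single sample exceeds the limit."""
    return cpu_percent > limit


class SpikeWindowAlert:
    """Complex rule: fire when at least `min_spikes` samples exceed `limit`
    within the last `window_seconds`. All parameters are illustrative."""

    def __init__(self, limit=90.0, min_spikes=3, window_seconds=60.0):
        self.limit = limit
        self.min_spikes = min_spikes
        self.window_seconds = window_seconds
        self._spikes = deque()  # timestamps of recent spikes, oldest first

    def observe(self, timestamp, cpu_percent):
        """Record one sample and return True if the rule fires."""
        if cpu_percent > self.limit:
            self._spikes.append(timestamp)
        # Drop spikes that have fallen out of the time window.
        while self._spikes and timestamp - self._spikes[0] > self.window_seconds:
            self._spikes.popleft()
        return len(self._spikes) >= self.min_spikes
```

The threshold rule is stateless, whereas the pattern rule has to keep a small amount of per-metric history, which is one reason complex alerts cost more to evaluate at scale.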

Data is stored in a distributed Linux‑based file database with a web front‑end for visualization.

2. Design Philosophy

Each module performs a single responsibility with precision, enabling scalability and flexibility across firewalls and varied network environments.

3. Core Challenges

Data Volume – At eLong the system generates about 160 GB per day, and at 360 about 500 GB per day. Each server reports roughly 200 monitoring items every 5 seconds, which works out to about 40 data points per second per server.

Real‑time Processing – Alerts must be delivered within 15 seconds, requiring near‑zero latency and high availability; any data loss could suppress critical alarms.

High Throughput – The system handles heavy write loads and random reads for on‑demand analytics and charting.

Multi‑Platform Support – Environments include Linux, Windows, and FreeBSD, with Windows accounting for about half of the servers.
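The data-volume figures above can be checked with some quick arithmetic. The fleet size of 100,000 is an assumption taken from the title rather than a per-company figure, and gigabytes are assumed to be decimal; the daily volume and the fleet size are treated separately because the article does not say they describe the same deployment.

```python
ITEMS_PER_SERVER = 200      # from the article
COLLECT_INTERVAL_S = 5      # from the article
FLEET_SIZE = 100_000        # assumption: taken from the title
DAILY_BYTES = 160 * 10**9   # eLong figure, assuming decimal gigabytes

# Per-server and fleet-wide ingest rates.
points_per_server_per_s = ITEMS_PER_SERVER / COLLECT_INTERVAL_S
fleet_points_per_s = points_per_server_per_s * FLEET_SIZE

# Average sustained write rate implied by the daily volume.
avg_write_mb_per_s = round(DAILY_BYTES / 86_400 / 10**6, 2)

print(points_per_server_per_s)  # 40.0
print(fleet_points_per_s)       # 4000000.0
print(avg_write_mb_per_s)       # 1.85
```

The average write rate looks modest, but it is an average over a full day; the harder constraint is the number of individual data points arriving per second.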

4. Solutions Implemented

Data Storage – Use HBase with a custom protocol to reduce overhead; JSON remains the primary serialization format for compatibility.

Real‑time Guarantees – Multi‑threaded, asynchronous, non‑blocking design with long‑lived connections.

High Availability – No single points of failure; agents can fail‑over to the nearest healthy node (“lazy intelligent routing”). Data is acknowledged at each stage to prevent loss.

Throughput – Use caching where it helps, although most high-concurrency paths are handled without it.

Cross-Platform Development – Initially built in C++ because Go was not yet available to the team, then later migrated to Go for better Windows support.
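The high-availability item above combines two ideas: fail over to the nearest healthy node, and keep data until it has been acknowledged. A minimal sketch of that combination follows. The `Node` and `Agent` classes are hypothetical stand-ins, and a real agent would send over the network rather than call a method.

```python
class Node:
    """Hypothetical stand-in for a pipeline node reachable by the agent."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def send(self, batch):
        """Return True (an acknowledgement) only if the batch was accepted."""
        if not self.healthy:
            return False
        self.received.append(batch)
        return True


class Agent:
    """Keeps every batch until some node acknowledges it."""

    def __init__(self, nodes):
        self.nodes = nodes   # ordered nearest-first (assumed ordering)
        self.pending = []    # batches not yet acknowledged

    def submit(self, batch):
        self.pending.append(batch)
        self.flush()

    def flush(self):
        """Offer each pending batch to nodes in order; stop at the first ack.
        Batches that nobody acknowledges stay pending for the next flush."""
        still_pending = []
        for batch in self.pending:
            if not any(node.send(batch) for node in self.nodes):
                still_pending.append(batch)
        self.pending = still_pending
```

Because `any` stops at the first acknowledgement, a batch is delivered to exactly one node, and the agent only forgets it once that acknowledgement arrives.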

5. Performance Optimization Techniques

zlib streaming compression for efficient data transfer.

Sliding windows in the pipeline to forward data in batches and reduce latency.

Protocol redesign using Protobuf for compact, fast serialization.

Data merging strategies to minimize duplicate processing.

Function‑filter optimizations guided by profiling to address CPU bottlenecks.
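Of the techniques above, zlib streaming compression is the easiest to show in isolation. The sketch below uses Python's standard `zlib` module with one compressor per long-lived connection; the helper names and the newline-delimited JSON framing are assumptions for illustration, not the system's actual wire format.

```python
import json
import zlib


def make_stream():
    """Create one compressor/decompressor pair per long-lived connection."""
    return zlib.compressobj(), zlib.decompressobj()


def send_batch(compressor, records):
    """Compress one batch of records.

    Z_SYNC_FLUSH makes the batch decodable immediately while keeping the
    compression history, so later batches can reference earlier ones."""
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)


def receive_batch(decompressor, chunk):
    """Decode one batch produced by send_batch on the same stream."""
    text = decompressor.decompress(chunk).decode("utf-8")
    return [json.loads(line) for line in text.split("\n")]
```

Monitoring data repeats the same host names and metric names in every batch, so sharing compression state across batches on one connection tends to compress better than compressing each batch independently.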

6. Reflections on the Journey

Complexity breeds hidden pitfalls; simplifying each module leads to a more reliable distributed pipeline. By iterating on design, focusing on minimalism, and continuously profiling, the team achieved a robust monitoring system capable of handling massive scale and stringent latency requirements.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
