
How We Built a Scalable, High‑Availability Monitoring Platform with Service Trees

This article details the challenges of traditional monitoring systems and the design and implementation of a custom high‑availability monitoring platform built around a Go‑based service tree, Raft‑backed storage, InfluxDB for time‑series data, and a modular architecture that supports Windows agents, third‑party reporting, and future AI‑driven enhancements.


1. Problems with Traditional Monitoring Systems

Common tools like Zabbix, Nagios, Open‑Falcon, and Prometheus served us initially, but as our server count grew to around 3,000 we hit storage and query bottlenecks, lacked convenient reporting APIs/SDKs, were constrained by flat group structures, and struggled to meet custom monitoring needs such as cluster‑level monitoring.

Splitting Zabbix instances only mitigated load without solving the fundamental query latency, and further development was hindered by Zabbix’s C‑based backend.

2. Requirements for a Self‑Built Monitoring System

2.1 Service Tree

Service Tree: A hierarchical structure that organizes services as nodes, enabling cluster‑level governance rather than per‑machine management.

Open‑Falcon: An open‑source distributed monitoring system that supports service‑tree management.

We needed a custom service tree because Open‑Falcon did not expose a service‑tree API at the time.

The service tree, implemented in Go for performance and ease of use, serves as the foundation for both monitoring and deployment systems.

2.2 High Availability

The service‑registry component is critical, so we use Raft for consensus and BoltDB for persistent storage. An LRU cache layer improves read performance.

<code>type Cache struct {
    mu        sync.RWMutex                        // guards all fields below
    count     int                                 // number of cached entries
    evictList *list.List                          // LRU order; front = most recently used
    items     map[string]map[string]*list.Element // two-level index: namespace -> key -> list element
    size      uint64                              // current memory footprint in bytes
    maxSize   uint64                              // eviction threshold (e.g. 50 MB)
    enable    bool                                // the cache can be switched off entirely
    logger    *log.Logger
}
</code>

Memory usage can be configured; we run the service tree with a 50 MB cache.
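A minimal sketch of how such a memory‑bounded LRU cache could work around the struct above (method names, the `entry` type, and the eviction policy are illustrative assumptions, not the platform's actual implementation; the `logger` field is omitted for brevity):

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is what we store on the eviction list; names are illustrative.
type entry struct {
	ns, key string // e.g. service-tree node and resource key
	value   []byte
	size    uint64
}

type Cache struct {
	mu        sync.RWMutex
	count     int
	evictList *list.List
	items     map[string]map[string]*list.Element
	size      uint64
	maxSize   uint64
	enable    bool
}

func NewCache(maxSize uint64) *Cache {
	return &Cache{
		evictList: list.New(),
		items:     make(map[string]map[string]*list.Element),
		maxSize:   maxSize,
		enable:    true,
	}
}

// Add inserts a value and evicts least-recently-used entries once the
// configured memory budget is exceeded.
func (c *Cache) Add(ns, key string, value []byte) {
	if !c.enable {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	e := &entry{ns: ns, key: key, value: value, size: uint64(len(value))}
	el := c.evictList.PushFront(e)
	if c.items[ns] == nil {
		c.items[ns] = make(map[string]*list.Element)
	}
	c.items[ns][key] = el
	c.count++
	c.size += e.size
	for c.size > c.maxSize {
		oldest := c.evictList.Back()
		if oldest == nil {
			break
		}
		old := oldest.Value.(*entry)
		c.evictList.Remove(oldest)
		delete(c.items[old.ns], old.key)
		c.count--
		c.size -= old.size
	}
}

// Get returns a cached value and marks it as recently used.
func (c *Cache) Get(ns, key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[ns][key]; ok {
		c.evictList.MoveToFront(el)
		return el.Value.(*entry).value, true
	}
	return nil, false
}

func main() {
	c := NewCache(16) // tiny budget to demonstrate eviction
	c.Add("tree", "a", []byte("12345678"))
	c.Add("tree", "b", []byte("12345678"))
	c.Add("tree", "c", []byte("12345678")) // pushes the oldest entry ("a") out
	_, okA := c.Get("tree", "a")
	_, okC := c.Get("tree", "c")
	fmt.Println(okA, okC) // → false true
}
```

Bounding by byte size rather than entry count is what makes a fixed budget like 50 MB enforceable regardless of how large individual values are.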

2.3 Extensibility

With three Raft instances (one leader, two followers), write operations are automatically forwarded to the leader, abstracting leader selection from users. TCP ports for cluster communication are multiplexed with Raft sync ports, simplifying configuration.
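The forwarding behavior can be sketched as follows; the `Store` interface, `fakeNode`, and the `forward` stub are hypothetical stand‑ins for the registry's actual Raft node and cluster RPC, shown only to illustrate how leader selection stays hidden from callers:

```go
package main

import (
	"errors"
	"fmt"
)

// Store abstracts a Raft-backed registry node; names are illustrative.
type Store interface {
	IsLeader() bool
	LeaderAddr() string
	Apply(cmd []byte) error // commit through the local Raft log (leader only)
}

// forward stands in for the RPC to another node; in the real system this
// would reuse the multiplexed cluster TCP port.
var forward = func(addr string, cmd []byte) error {
	fmt.Printf("forwarding write to leader at %s\n", addr)
	return nil
}

// Write hides leader selection: followers transparently relay the
// command to the current leader instead of rejecting it.
func Write(s Store, cmd []byte) error {
	if s.IsLeader() {
		return s.Apply(cmd)
	}
	addr := s.LeaderAddr()
	if addr == "" {
		return errors.New("no leader elected yet")
	}
	return forward(addr, cmd)
}

// fakeNode is a follower used to demonstrate the flow.
type fakeNode struct{ leader string }

func (f fakeNode) IsLeader() bool     { return false }
func (f fakeNode) LeaderAddr() string { return f.leader }
func (f fakeNode) Apply([]byte) error { return nil }

func main() {
	_ = Write(fakeNode{leader: "10.0.0.1:9000"}, []byte(`{"op":"register"}`))
}
```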

Resources such as machines, alerts, and permissions are treated uniformly, making future extensions straightforward.

2.4 Performance

Traditional Zabbix stores time‑series data in a relational database, which is inefficient at monitoring scale. We therefore separate configuration data (small, frequently read) from monitoring data (large, with hot/cold access patterns) and store the latter in a dedicated time‑series database such as OpenTSDB, InfluxDB, or Prometheus.

2.5 Business Data Reporting

Agents expose an interface for business‑level metrics, allowing them to be ingested as regular data points, reducing maintenance overhead. Standard service metrics are collected via plugins, similar to Zabbix templates.

3. Building the Monitoring System

Overall architecture:

3.1 Data Flow

Servers pull collection strategies from the service tree, agents gather metrics, and a message queue buffers data per IDC before a router writes it to InfluxDB. The router also shards writes for high availability and uses multiple InfluxDB instances to avoid single points of failure.
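One simple way the router's sharding step can be sketched: hash the series key and map it onto a fixed backend list, so the same series always lands on the same InfluxDB instance and reads know where to look. The hash choice and key format are assumptions; the real router's scheme may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks an InfluxDB backend for a series key (e.g. metric+host)
// by hashing it onto the backend list.
func shardFor(seriesKey string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(seriesKey))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"influx-a:8086", "influx-b:8086", "influx-c:8086"}
	// Deterministic: the same series always maps to the same backend.
	fmt.Println(shardFor("cpu.idle,host=web-01", backends))
	fmt.Println(shardFor("cpu.idle,host=web-01", backends))
}
```

For multi‑write, the router would send each point to its primary shard and at least one replica (for example, the next backend in the list), which is what removes the single point of failure mentioned above.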

Alerts are handled by Kapacitor, with custom enhancements for no‑data alerting (firing when an expected metric stops reporting).

3.2 Module Functions

Agent: Collects resource metrics.

Registry: Manages service‑tree, collection, and alert policies.

MQ: Buffers data and provides fault tolerance.

Router: Entry point for backend reads/writes, handling sharding and multi‑write.

InfluxDB: Time‑series storage.

Alarm: Dispatches alerts, supports silencing and convergence.

3.3 Why InfluxDB

InfluxDB’s TSM engine offers excellent write performance and compression. In our deployment, roughly 2,000 servers reporting at 10‑second intervals generate about 10⁹ points per day, with write peaks around 5 × 10⁴ points per second, and 100 days of data occupies only ~400 GB — so 10‑second reporting is well within capacity.
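A quick sanity check on those figures (the ~60 metrics per host is an assumed average, not stated above; it is chosen so the totals line up with the ~10⁹ points/day cited):

```go
package main

import "fmt"

func main() {
	const (
		servers        = 2000
		metricsPerHost = 60 // assumed average per-host metric count
		intervalSec    = 10 // scrape/report interval
	)
	writesPerSec := servers * metricsPerHost / intervalSec
	pointsPerDay := writesPerSec * 86400
	fmt.Printf("%d writes/s steady state, %.2e points/day\n",
		writesPerSec, float64(pointsPerDay))
	// Steady state is ~1.2e4 points/s; the 5e4/s figure above reflects
	// headroom for peaks and batching.
}
```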

4. Highlights

Native Windows Agent: Added extensive support for Windows services such as Exchange.

Third‑Party Metric Reporting: Simplifies integration for developers.

Rich Plugin Library: Leverages community contributions for extensive monitoring capabilities.

Optimized Visualization: Fast long‑range queries, native Grafana support.

Automatic Registration: Hosts auto‑register to appropriate service‑tree nodes.

Lightweight Agent: Memory <30 MB, CPU <1 %.

Hierarchical Alerts: Shows duration, status, and escalation paths.

Flexible Machine Management: View online status, set maintenance mode, and silence alerts per machine.

5. Outlook

By leveraging open‑source time‑series databases, we have resolved most write/read bottlenecks. The service tree simplifies cluster management, and the agent’s Unix domain socket enables non‑blocking business metric reporting, turning the monitoring platform into an open message bus.

Future monitoring will become more intelligent, integrating AIOps concepts: analyzing raw monitoring data with machine learning to automatically detect anomalies, reducing reliance on manually configured alert rules.

Machine learning and AI will drive the next wave of operational transformation, a direction we are actively pursuing.

Tags: monitoring, high availability, ops, InfluxDB, AIOps, service tree
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
