How WiFi Master Key Built a Million-User Monitoring Platform: Architecture and Best Practices

This article describes how WiFi Master Key (WiFi 万能钥匙) designed and implemented the Roma monitoring platform to handle billions of daily requests. It covers the background challenges, architectural principles, component design, data collection, transmission, storage, alerting, and future directions for large-scale observability.

Efficient Ops

Introduction

WiFi Master Key shares its experience building a monitoring platform that supports millions of daily users. The project, named Roma, follows three guiding ideas: incremental improvement, multiple data-collection paths, and story-driven evolution.

Background

Rapid growth in active users brought traffic spikes, repeated architectural scale-outs, and performance bottlenecks. The move to SOA, microservices, and API gateways multiplied the number of services and machines, which made faults slower to detect, log volumes harder to manage, and inter-service calls difficult to trace.

The goal is to build a one-stop, integrated monitoring platform that raises the fault detection rate, shortens resolution time, and reduces user complaints.

Architecture Design

Principles

Minimize performance impact on business systems.

Low intrusion: easy integration with little or no code changes.

No internal dependencies that could cause cascading failures.

Unit‑level deployment supporting multi‑data‑center scenarios.

Centralized data processing, analysis, and storage.

Component Overview

The Roma system consists of several components:

roma-client: Java client library for data collection (real-time and minute-level).

roma-agent: Data channel deployed on each physical machine.

roma-transport: Pre-processes, forwards, and stores data; deployed per data center.

roma-server: Configuration and heartbeat channel; deployed per data center.

roma-master: Cluster management master node; deployed only in the primary data center.

roma-analyser: Real-time data analysis and consumption; deployed per data center.

roma-replicator: Cross-data-center data synchronization; primary data center only.

roma-storage: Stores data in HBase, OpenTSDB, etc.; primary data center only.

roma-monitor: Handles alerts based on thresholds, trends, and clusters.

roma-alarm: Sends alert notifications via SMS, email, etc.

roma-task: Scheduled tasks for state changes and data cleanup.

roma-web: Front-end console and management UI.

Data Flow

Data is collected by the client, sent through the agent, processed by transport, stored, and finally visualized in the web console. Pre‑aggregation at each layer reduces network traffic and storage costs.
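To make the pre-aggregation idea concrete, here is a minimal sketch of a minute-level aggregator. It is not Roma's actual code: the class name `MinuteAggregator`, the bucket key format, and the choice of count, sum, and max as the summary fields are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical minute-level pre-aggregator; names are illustrative, not Roma's API. */
final class MinuteAggregator {

    /** Running summary of one metric within one minute bucket. */
    static final class Summary {
        long count;
        long sum;
        long max = Long.MIN_VALUE;

        void add(long value) {
            count++;
            sum += value;
            max = Math.max(max, value);
        }
    }

    // Key: "<metric>@<minute start in epoch millis>" (assumed format), sorted for stable iteration.
    // Not thread-safe; a real client would guard this or use per-thread buffers.
    private final Map<String, Summary> buckets = new TreeMap<>();

    /** Folds one raw sample into the bucket for the minute it falls in. */
    void record(String metric, long timestampMillis, long value) {
        long minuteStart = timestampMillis - Math.floorMod(timestampMillis, 60_000L);
        buckets.computeIfAbsent(metric + "@" + minuteStart, k -> new Summary()).add(value);
    }

    /** Returns the aggregated buckets and resets state, as a sender might do once per flush. */
    Map<String, Summary> drain() {
        Map<String, Summary> out = new TreeMap<>(buckets);
        buckets.clear();
        return out;
    }
}
```

With this shape, any number of samples for one metric in one minute leave the process as a single summary record, which is where the savings in network traffic and storage would come from.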

Configuration Distribution

Components communicate along the chain Client → Agent → Server → Master over TCP, using both short-lived and long-lived connections. A configuration change made in the web UI is distributed to the agents. Both push and pull modes are supported, so distribution still works when the network between data centers is unstable.
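One common way to combine push and pull is to tag each configuration snapshot with a version and let the receiver ignore anything that is not newer. The sketch below assumes that scheme; the article does not say how Roma versions its configuration, and `AgentConfigHolder`, `Config`, and the `Supplier` standing in for the server call are hypothetical names.

```java
import java.util.function.Supplier;

/** Hypothetical agent-side config holder; the version scheme is an assumption. */
final class AgentConfigHolder {

    /** Immutable config snapshot tagged with a monotonically increasing version. */
    static final class Config {
        final long version;
        final String body;

        Config(long version, String body) {
            this.version = version;
            this.body = body;
        }
    }

    private Config current = new Config(0, "");

    /** Push path: the server sends a snapshot; stale or duplicate versions are ignored. */
    synchronized boolean onPush(Config pushed) {
        if (pushed.version <= current.version) {
            return false;
        }
        current = pushed;
        return true;
    }

    /** Pull path: called on a timer, so a push lost on a flaky link is repaired at the next poll. */
    boolean pull(Supplier<Config> server) {
        Config latest = server.get(); // fetched outside the lock; stands in for a network call
        return onPush(latest);
    }

    synchronized Config current() {
        return current;
    }
}
```

Because both paths go through the same version check, a push and a pull that race each other cannot roll the agent back to an older configuration.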

Data Collection

Several collection methods were compared in terms of engineering effort and cost. In-process monitoring uses long-lived TCP connections between the client and the agent, while the agents also run scripts to gather system metrics.

Data Transmission

The transmission layer uses a TLV (type-length-value) protocol whose payloads can be binary, JSON, or XML.
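The article does not give Roma's wire layout, so the following is only a sketch of what a TLV frame could look like. The 1-byte type, the 4-byte big-endian length, and the type codes for binary, JSON, and XML are assumptions, as is the class name `TlvCodec`.

```java
import java.nio.ByteBuffer;

/** Hypothetical TLV codec; field widths and type codes are assumptions, not Roma's format. */
final class TlvCodec {
    static final byte TYPE_BINARY = 1;
    static final byte TYPE_JSON = 2;
    static final byte TYPE_XML = 3;

    /** One decoded frame: the payload type and its raw bytes. */
    static final class Frame {
        final byte type;
        final byte[] value;

        Frame(byte type, byte[] value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Layout: 1-byte type, 4-byte length (ByteBuffer is big-endian by default), then the value. */
    static byte[] encode(byte type, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + value.length);
        buf.put(type).putInt(value.length).put(value);
        return buf.array();
    }

    /**
     * Decodes one frame starting at the buffer's position.
     * Returns null and leaves the position unchanged if the frame is not complete yet.
     */
    static Frame decode(ByteBuffer buf) {
        if (buf.remaining() < 5) {
            return null;
        }
        buf.mark();
        byte type = buf.get();
        int length = buf.getInt();
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        if (buf.remaining() < length) {
            buf.reset(); // wait for more bytes from the TCP stream
            return null;
        }
        byte[] value = new byte[length];
        buf.get(value);
        return new Frame(type, value);
    }
}
```

The length prefix is what lets a receiver on a long-lived TCP connection split the byte stream back into messages, and the type byte lets the same channel carry binary, JSON, and XML payloads without guessing.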

Data Synchronization

Cross‑data‑center synchronization relies on Kafka with a customized uReplicator solution, chosen over MirrorMaker for lower latency and dynamic topic management.

Data Analysis

Challenges include data expiration policies and traceability strategies, which are addressed with dedicated analysis pipelines.

Data Storage

Storage back‑ends include HBase, OpenTSDB, and Elasticsearch. Key practices involve cluster partitioning by product line, Linux/TCP performance tuning, and batch writes to reduce RPC overhead.
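The batch-write practice can be sketched as a small buffer that hands data to the storage client only when enough items have accumulated. This is an illustration, not Roma's implementation: `BatchWriter` is a hypothetical name, and the `Consumer` sink stands in for whatever HBase, OpenTSDB, or Elasticsearch call performs the actual write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batching buffer; the sink stands in for a storage client call. */
final class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    BatchWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and flushes once the batch is full, so one RPC carries many items. */
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; a timer would normally call this to bound write latency. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch); // invoked under the lock for simplicity in this sketch
    }
}
```

With a batch size of 3, seven writes produce two full batches immediately and one batch of a single item on the next explicit flush, so the number of round trips drops from seven to three.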

Alert Processing

Alerts are handled in real‑time and near‑real‑time, driven by data or scheduled tasks, with deduplication and convergence to avoid alert storms. Future work aims at AI‑driven (AIOps) intelligent alerting.
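A simple form of deduplication is to suppress repeats of the same alert within a time window. The sketch below assumes that approach; the article does not describe Roma's convergence rules, and `AlertDeduplicator`, the alert key format, and the window length are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical alert deduplicator; key format and window length are assumptions. */
final class AlertDeduplicator {
    private final long windowMillis;
    // Last time each alert key was actually sent, in epoch millis.
    private final Map<String, Long> lastSent = new HashMap<>();

    AlertDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true if the alert should be sent now, or false if the same key
     * was already sent less than windowMillis ago.
     */
    synchronized boolean shouldSend(String alertKey, long nowMillis) {
        Long last = lastSent.get(alertKey);
        if (last != null && nowMillis - last < windowMillis) {
            return false; // repeat inside the window: suppressed
        }
        lastSent.put(alertKey, nowMillis);
        return true;
    }
}
```

A key such as `"order-service:cpu_high"` (an invented example) would then trigger one notification per window instead of one per evaluation cycle. Grouping related keys, for instance by host or by service, would be the next step toward convergence.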

Best Practices

Trace‑Level Monitoring

Implementation follows concepts from Google Dapper and Alibaba EagleEye, covering context propagation, asynchronous calls, and log handling.
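Context propagation and asynchronous calls are the parts of tracing that most often go wrong, so a small sketch may help. It assumes a thread-local context with a shared trace ID and a dotted, hierarchical span ID; the article does not specify Roma's ID scheme, and `TraceContext` and its methods are hypothetical names.

```java
import java.util.UUID;

/** Hypothetical trace context; the ID scheme is an assumption, not Roma's format. */
final class TraceContext {
    private static final ThreadLocal<TraceContext> CURRENT = new ThreadLocal<>();

    final String traceId; // shared by every span in one request
    final String spanId;  // hierarchical position, e.g. "0", "0.1", "0.1.2"
    private int childSeq = 0;

    private TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    /** Starts a new trace at the entry point of a request and binds it to this thread. */
    static TraceContext startRoot() {
        TraceContext ctx = new TraceContext(UUID.randomUUID().toString(), "0");
        CURRENT.set(ctx);
        return ctx;
    }

    /** Creates the context to send with the next downstream call. */
    synchronized TraceContext newChild() {
        childSeq++;
        return new TraceContext(traceId, spanId + "." + childSeq);
    }

    static TraceContext current() {
        return CURRENT.get();
    }

    static void clear() {
        CURRENT.remove();
    }

    /** Captures the caller's context so a task run on another thread still sees it. */
    static Runnable wrap(Runnable task) {
        TraceContext captured = CURRENT.get();
        return () -> {
            TraceContext previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker thread's own state so pooled threads do not leak contexts.
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }
}
```

Submitting `TraceContext.wrap(task)` to an executor instead of `task` is what keeps the trace ID attached across a thread hop. Writing `traceId` and `spanId` into each log line is what later lets log search reassemble a call chain.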

Feature Demonstrations

Examples include trace‑based call‑graph queries, JVM metric dashboards, exception stack tracing, and unified log search with configurable log paths and parsing rules.

Future Outlook

Deep integration with internal project management, release, performance testing, and issue‑tracking systems.

Support for container‑native monitoring as microservice deployments move to Kubernetes.

Intelligent monitoring (AIOps) to improve alert timeliness and accuracy.

Conclusion

Roma is a full‑stack, end‑to‑end monitoring platform covering external, internal, and inter‑service metrics. It enables rapid fault diagnosis, performance bottleneck identification, architectural analysis, dependency mapping, and capacity planning for large‑scale microservice environments.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
