How WiFi Master Key Built a Monitoring Platform for Millions of Users: Architecture and Best Practices
This article describes how WiFi Master Key (WiFi 万能钥匙) designed and implemented the Roma monitoring platform to handle billions of daily requests. It covers the background challenges, architectural principles, component design, data collection, transmission, storage, and alerting, and closes with future directions for large‑scale observability.
Introduction
WiFi Master Key shares its experience of building a monitoring platform capable of supporting millions of daily users. The project, named Roma, follows three guiding ideas: incremental improvement, multiple data‑collection paths, and a story‑driven evolution.
Background
Rapid growth in active users led to traffic spikes, architectural scaling, and performance bottlenecks. The shift to SOA, microservices, and API gateways increased the number of services and machines, creating challenges such as delayed fault detection, massive log volumes, and difficulty tracing inter‑service calls.
The goal is to build a one‑stop, integrated monitoring platform that improves fault detection rate, shortens resolution time, and reduces user complaints.
Architecture Design
Principles
Minimize performance impact on business systems.
Low intrusion: easy integration with little or no code changes.
No internal dependencies that could cause cascading failures.
Unit‑level deployment supporting multi‑data‑center scenarios.
Centralized data processing, analysis, and storage.
Component Overview
The Roma system consists of several components:
roma-client : Java client library for data collection (real‑time and minute‑level).
roma-agent : Data channel deployed on each physical machine.
roma-transport : Pre‑processes, forwards, and stores data; deployed per data center.
roma-server : Configuration and heartbeat channel; per data center.
roma-master : Cluster management master node; deployed only in the primary data center.
roma-analyser : Real‑time data analysis and consumption; per data center.
roma-replicator : Cross‑data‑center data synchronization; primary data center only.
roma-storage : Stores data in HBase, OpenTSDB, etc.; primary data center only.
roma-monitor : Handles alerts based on thresholds, trends, and clusters.
roma-alarm : Sends alert notifications via SMS, email, etc.
roma-task : Scheduled tasks for state changes and data cleanup.
roma-web : Front‑end console and management UI.
Data Flow
Data is collected by the client, sent through the agent, processed by transport, stored, and finally visualized in the web console. Pre‑aggregation at each layer reduces network traffic and storage costs.
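As a rough illustration of the pre‑aggregation idea, the sketch below collapses raw samples into per‑minute summaries before they would be forwarded. The function name, the sample tuple layout, and the summary fields are assumptions made for this example, not Roma's actual data model.

```python
from collections import defaultdict

def pre_aggregate(samples):
    """Collapse raw (metric, epoch_seconds, value) samples into per-minute summaries."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for metric, ts, value in samples:
        minute = ts - ts % 60  # start of the minute the sample falls in
        bucket = buckets[(metric, minute)]
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["max"] = max(bucket["max"], value)
    return dict(buckets)

# Three raw samples (hypothetical metric name) become two per-minute records.
raw = [
    ("api.latency_ms", 120, 10.0),
    ("api.latency_ms", 150, 30.0),
    ("api.latency_ms", 185, 20.0),
]
summary = pre_aggregate(raw)
```

Forwarding one summary per metric per minute instead of every raw sample is what keeps traffic and storage roughly proportional to the number of metrics rather than the number of requests.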
Configuration Distribution
Components communicate along the chain Client → Agent → Server → Master over TCP, using both short‑lived and long‑lived connections. A configuration change made in the web UI triggers distribution to the agents. Distribution supports both push and pull modes to cope with unstable inter‑data‑center networks.
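One way to combine push and pull is to version every configuration and let agents poll for anything they missed. The sketch below is an in‑process model of that idea only; the class names, the `reachable` flag, and the version counter are hypothetical and do not describe Roma's wire protocol.

```python
class ConfigServer:
    """Holds the latest configuration and a monotonically increasing version."""

    def __init__(self):
        self.version = 0
        self.config = {}
        self.subscribers = []

    def publish(self, config):
        # Push mode: deliver the new version to every agent we can reach.
        self.version += 1
        self.config = dict(config)
        for agent in self.subscribers:
            if agent.reachable:
                agent.apply(self.version, self.config)

    def fetch(self, known_version):
        # Pull mode: return newer config only if the caller is behind.
        if known_version < self.version:
            return self.version, dict(self.config)
        return None


class Agent:
    def __init__(self, server):
        self.server = server
        self.version = 0
        self.config = {}
        self.reachable = True  # stands in for network health
        server.subscribers.append(self)

    def apply(self, version, config):
        if version > self.version:  # ignore stale or duplicate deliveries
            self.version, self.config = version, dict(config)

    def poll(self):
        update = self.server.fetch(self.version)
        if update is not None:
            self.apply(*update)


server = ConfigServer()
agent_a, agent_b = Agent(server), Agent(server)
agent_b.reachable = False                 # simulate a broken cross-DC link
server.publish({"sample_rate": 0.1})      # push reaches agent_a only
missed_version = agent_b.version          # still 0 at this point
agent_b.reachable = True
agent_b.poll()                            # pull catches agent_b up
```

The version check in `apply` makes delivery idempotent, so a push and a later pull of the same version cannot conflict.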
Data Collection
Several collection methods were compared, weighing the engineering effort and cost of each. In‑process monitoring uses long‑lived TCP connections between client and agent, while agents also run scripts to gather system metrics.
Data Transmission
The transmission layer uses a TLV protocol supporting binary, JSON, and XML formats.
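A TLV (type‑length‑value) frame can be sketched as follows. The header layout (1‑byte type, 4‑byte big‑endian length) and the type codes are assumptions chosen for illustration; the article does not specify Roma's actual frame format.

```python
import json
import struct

# Hypothetical type codes for the three payload formats.
TYPE_BINARY, TYPE_JSON, TYPE_XML = 1, 2, 3
HEADER = struct.Struct("!BI")  # 1-byte type, 4-byte big-endian length

def encode(type_code, payload):
    """Prefix the payload bytes with its type and length."""
    return HEADER.pack(type_code, len(payload)) + payload

def decode_stream(buffer):
    """Return (complete frames, leftover bytes) from a byte stream."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        type_code, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # frame is incomplete; wait for more bytes
        frames.append((type_code, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]

stream = (
    encode(TYPE_JSON, json.dumps({"cpu": 0.42}).encode("utf-8"))
    + encode(TYPE_BINARY, b"\x00\x01")
)
# A trailing partial header is returned as leftover, not parsed.
frames, rest = decode_stream(stream + b"\x03")
```

Because the length precedes the value, a receiver on a long‑lived TCP connection can split the byte stream into frames without inspecting the payload format.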
Data Synchronization
Cross‑data‑center synchronization relies on Kafka with a customized uReplicator solution, chosen over MirrorMaker for lower latency and dynamic topic management.
Data Analysis
Challenges include data expiration policies and traceability strategies, which are addressed with dedicated analysis pipelines.
Data Storage
Storage back‑ends include HBase, OpenTSDB, and Elasticsearch. Key practices involve cluster partitioning by product line, Linux/TCP performance tuning, and batch writes to reduce RPC overhead.
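The batch‑write practice can be illustrated with a small buffer that hands points to the storage client in groups. `BatchWriter`, `max_batch`, and the `sink` callable are hypothetical names; `sink` stands in for whatever bulk‑write call the real HBase or OpenTSDB client exposes.

```python
class BatchWriter:
    """Buffers points and passes them to `sink` in batches."""

    def __init__(self, sink, max_batch=500):
        self.sink = sink            # callable that receives a list of points
        self.max_batch = max_batch
        self.buffer = []

    def write(self, point):
        self.buffer.append(point)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.sink(batch)        # one call (one RPC) per batch


calls = []                           # records each batch the sink receives
writer = BatchWriter(calls.append, max_batch=3)
for i in range(7):
    writer.write(("cpu.load", i))
writer.flush()                       # drain the final partial batch
```

Here seven points result in three sink calls instead of seven. A production version would presumably also flush on a timer so that a quiet stream does not hold data indefinitely.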
Alert Processing
Alerts are handled in real‑time and near‑real‑time, driven by data or scheduled tasks, with deduplication and convergence to avoid alert storms. Future work aims at AI‑driven (AIOps) intelligent alerting.
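A minimal sketch of deduplication and convergence, assuming a fixed suppression window per alert key. The class name, the key format, and the window length are illustrative assumptions rather than Roma's actual rules.

```python
class AlertConverger:
    """Sends at most one notification per key per window and counts the rest."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}    # key -> timestamp of the last notification
        self.suppressed = {}   # key -> alerts suppressed since that notification

    def offer(self, key, now):
        """Return a message to send, or None if the alert is suppressed."""
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return None
        repeats = self.suppressed.pop(key, 0)
        self.last_sent[key] = now
        suffix = f" (+{repeats} suppressed)" if repeats else ""
        return f"ALERT {key}{suffix}"


converger = AlertConverger(window_seconds=60)
key = "svc-a:high_latency"           # hypothetical "service:rule" key
first = converger.offer(key, now=0)
second = converger.offer(key, now=10)
third = converger.offer(key, now=20)
fourth = converger.offer(key, now=70)
```

Passing the timestamp explicitly keeps the logic deterministic and easy to test; a real pipeline would take it from the event or the scheduler.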
Best Practices
Trace‑Level Monitoring
Implementation follows concepts from Google Dapper and Alibaba EagleEye, covering context propagation, asynchronous calls, and log handling.
Feature Demonstrations
Examples include trace‑based call‑graph queries, JVM metric dashboards, exception stack tracing, and unified log search with configurable log paths and parsing rules.
Future Outlook
Deep integration with internal project management, release, performance testing, and issue‑tracking systems.
Support for container‑native monitoring as microservice deployments move to Kubernetes.
Intelligent monitoring (AIOps) to improve alert timeliness and accuracy.
Conclusion
Roma is a full‑stack, end‑to‑end monitoring platform covering external, internal, and inter‑service metrics. It enables rapid fault diagnosis, performance bottleneck identification, architectural analysis, dependency mapping, and capacity planning for large‑scale microservice environments.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.