Design and Architecture of a Cloud‑Native Monitoring Platform for Business Systems
The document outlines the background, vision, current status, technical research, value, product and technical architecture, and functional design of a cloud‑native monitoring platform that integrates SkyWalking and Prometheus to provide comprehensive APM, resource utilization, alerting, and rapid fault localization for business and technical middle‑platform services.
1. Background
The monitoring platform targets business systems and technical middle‑platforms to monitor and alert on hardware and software components. As cloud‑based applications proliferate and IT architecture becomes more complex, a global view of application health, resource usage, intelligent analysis, alerting, emergency response, and self‑healing is required.
Monitoring objects: hosts, containers, distributed storage, SDN networks, distributed systems, middleware.
Monitoring angles: server monitoring, container monitoring, network monitoring, storage monitoring, application performance monitoring (APM), middleware monitoring.
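Failure review cases, each with its root cause analysis and the monitoring solution it calls for:
Failure review case 1: File service exception (occasional attachment upload failure)
Root cause analysis: Root cause not located; a server reboot restored service. A long-running Windows server anomaly is suspected.
Monitoring solution: (1) IP/domain health check; (2) high-frequency access monitoring
Failure review case 2: uploadimage file service domain network failure
Root cause analysis: Fault at the data center's network carrier
Monitoring solution: Network monitoring
Failure review case 3: IM launch failure
Root cause analysis: Excessive KEYS * queries drove CPU load to 100%
Monitoring solution: (1) Middleware monitoring: Redis, Kafka, ZK, etc.; (2) Redis slow-query monitoring
Failure review case 4: Business call procurement exception
Root cause analysis: A large number of 504/499 responses and many service exceptions
Monitoring solution: Interface status monitoring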
2. Vision
3. Current Status and Technical Research
3.1 Monitoring Status
APM: In development
Host & Network Monitoring:
3.2 References
3.3 Pain Points
Lack of a full‑business panoramic view (topology)
Inability to analyze service call relationships (call topology + call tree)
Difficulty quickly locating business issues
Inflexible configuration of business monitoring alerts
Agent‑less integration of application performance monitoring
Redundant monitoring tools (Zabbix, Prometheus, Telegraf, InfluxDB, Grafana) and an unclear overall architecture, which make customization hard
4. Value
4.1 Business Value
Establish a comprehensive audit and inspection mechanism to quickly locate problems, assess impact, and reduce downtime.
Map full‑link application and service dependencies to aid business correlation analysis.
Identify performance bottlenecks early, reducing stress‑test costs.
Enable business‑level granular monitoring for operational insight.
4.2 IT Operations Value
Unify monitoring across the cloud‑building ecosystem, simplifying architecture maintenance.
Codify fault analysis, investigation, and handling processes so that operations become automated and intelligent.
Standardize alerts for timely fault detection and reduced business impact.
Assess whether the architecture is sound and how well resources are utilized, in order to lower hardware costs.
4.3 Business Operations Value
Provide an operational panoramic view to support decision‑making.
5. Product Architecture
6. Technical Architecture
The platform is built by customizing and extending SkyWalking and Prometheus to fit the YunZhu business architecture.
SkyWalking: Rich multi‑language agent support; native Elasticsearch storage for better data analysis and aggregation.
Prometheus: Written in Go, which makes custom development easier; good support for cloud environments.
7. Functional Design
The APM core consists of three parts:
Agent: Probe that collects and sends data to the collector.
Collector: Aggregates monitoring data, performs processing, and stores results.
Web: Visualization platform presenting persisted monitoring data.
7.1 Full‑Link Monitoring
The agent must trace cross‑process/thread calls, capturing Trace, Span, Tags, Logs, etc.
Features include global topology (request count, TPS, resource utilization), request scatter points, latency and status statistics, method‑level monitoring, single‑request call topology, and service health indicators.
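As a rough illustration of the trace data behind these views, the sketch below models spans and derives a single-request call tree plus service-to-service call edges for a topology view. The Span fields, the function names, and the sample services are assumptions made for this example; they are not SkyWalking's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    duration_ms: int
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[str] = field(default_factory=list)


def build_call_tree(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans by parent id so the tree can be walked from the root."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return children


def render(children: Dict[Optional[str], List[Span]],
           parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return the call tree as indented lines, depth first."""
    lines: List[str] = []
    for span in children.get(parent_id, []):
        status = span.tags.get("status", "ok")
        lines.append(f"{'  ' * depth}{span.service}:{span.operation} "
                     f"{span.duration_ms}ms [{status}]")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


def service_edges(spans: List[Span]) -> Dict[Tuple[str, str], int]:
    """Count caller -> callee edges between different services."""
    by_id = {s.span_id: s for s in spans}
    edges: Dict[Tuple[str, str], int] = {}
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is not None and parent.service != s.service:
            key = (parent.service, s.service)
            edges[key] = edges.get(key, 0) + 1
    return edges


# Made-up sample trace: a gateway request that fans out to three services.
SAMPLE_SPANS = [
    Span("t1", "a", None, "gateway", "GET /order", 120),
    Span("t1", "b", "a", "order-service", "createOrder", 90),
    Span("t1", "c", "b", "redis", "GET", 5),
    Span("t1", "d", "b", "procurement", "POST /purchase", 60, {"status": "504"}),
]

if __name__ == "__main__":
    print("\n".join(render(build_call_tree(SAMPLE_SPANS))))
    print(service_edges(SAMPLE_SPANS))
```

Aggregating the edge counts over many traces gives the request counts on a global topology, while the rendered tree corresponds to the single-request call topology.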
7.2 Rapid Issue Localization
Keyword‑based full‑text search (application, class, method, business field, trace ID) quickly surfaces faulty traces, instances, exceptions, latency, and performance trends, reducing diagnosis time to minutes.
Full‑text search supports various attributes; monitoring alerts link directly to related traces.
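The idea can be sketched with a small in-memory search over trace records. This is only an illustration: a real deployment would presumably query the storage backend (for example Elasticsearch) rather than scan a list, and the record layout, field names, and sample data here are hypothetical.

```python
from typing import Dict, List

# Attributes the keyword is matched against (assumed names for this sketch).
SEARCHABLE_FIELDS = ("application", "class_name", "method",
                     "business_field", "trace_id")


def search_traces(records: List[Dict[str, object]],
                  keyword: str) -> List[Dict[str, object]]:
    """Case-insensitive substring match across the searchable attributes.

    Failed traces are returned first, then the slowest ones, so the most
    suspicious requests surface at the top of the result list.
    """
    needle = keyword.lower()
    hits = [
        r for r in records
        if any(needle in str(r.get(f, "")).lower() for f in SEARCHABLE_FIELDS)
    ]
    return sorted(
        hits,
        key=lambda r: (not r.get("error", False), -int(r.get("latency_ms", 0))),
    )


# Made-up trace summaries for demonstration.
TRACES = [
    {"trace_id": "t-1001", "application": "order-service",
     "class_name": "OrderController", "method": "createOrder",
     "business_field": "orderNo=A17", "latency_ms": 120, "error": False},
    {"trace_id": "t-1002", "application": "file-service",
     "class_name": "UploadHandler", "method": "uploadImage",
     "business_field": "fileId=9", "latency_ms": 3400, "error": True},
    {"trace_id": "t-1003", "application": "order-service",
     "class_name": "OrderController", "method": "cancelOrder",
     "business_field": "orderNo=A18", "latency_ms": 800, "error": True},
]

if __name__ == "__main__":
    for hit in search_traces(TRACES, "order"):
        print(hit["trace_id"], hit["method"], hit["latency_ms"], hit["error"])
```

An alert that carries a trace ID can reuse the same lookup, which is one way to link a notification directly to the related trace.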
7.3 Metric Management
By establishing a KPI system, inspection mechanisms, and business‑scenario focus, the platform achieves end‑to‑end traceability and a full‑business monitoring dashboard, turning black‑box operations into a controllable, predictable "glass house".
Monitoring point = Object + Metric (e.g., MRO request success rate = MRO + request success rate). Supports CMDB‑driven metric definitions, streaming calculations, third‑party performance data ingestion (e.g., MQ control platform), and hierarchical control of objects, metrics, and monitoring points.
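A minimal sketch of the "monitoring point = object + metric" model follows, with a running counter standing in for a streaming calculation. The class names, the "kind" and "unit" fields, and the counter are assumptions for illustration; the source does not describe the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonitoredObject:
    name: str   # e.g. a business system registered in the CMDB
    kind: str   # e.g. "business-system", "middleware" (assumed categories)


@dataclass(frozen=True)
class Metric:
    name: str
    unit: str


@dataclass(frozen=True)
class MonitoringPoint:
    """A monitoring point is simply an object paired with a metric."""
    obj: MonitoredObject
    metric: Metric

    @property
    def name(self) -> str:
        return f"{self.obj.name} {self.metric.name}"


class SuccessRateCounter:
    """Incrementally computes a success rate as request results stream in."""

    def __init__(self) -> None:
        self.total = 0
        self.succeeded = 0

    def observe(self, ok: bool) -> None:
        self.total += 1
        if ok:
            self.succeeded += 1

    def value(self) -> Optional[float]:
        # No data yet: report None instead of a misleading 0% or 100%.
        if self.total == 0:
            return None
        return self.succeeded / self.total


if __name__ == "__main__":
    point = MonitoringPoint(
        MonitoredObject("MRO", "business-system"),
        Metric("request success rate", "ratio"),
    )
    counter = SuccessRateCounter()
    for ok in (True, True, False, True):
        counter.observe(ok)
    print(point.name, counter.value())
```

Keeping objects and metrics as separate definitions is what allows them to be controlled hierarchically: the same metric can be attached to many objects, and one object can carry many metrics.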
7.4 JVM Performance Monitoring
To aid production fault analysis, the platform monitors JVM memory, CPU, and thread usage, providing dump capabilities for rapid diagnosis.
Features include thread monitoring (including deadlock detection), CPU usage collection, and memory/GC monitoring.
7.5 Monitoring Alerts
The platform establishes standardized alert models, integrates with an alert center, and dispatches notifications.
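One possible shape for such a standardized alert model is sketched below: a rule names a monitoring point, a comparison, a threshold, and a severity, and an evaluator fires once after a configurable number of consecutive breaches. All names here are hypothetical, and the notify callback is only a stand-in for the alert center integration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Optional

_COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class AlertRule:
    point: str            # monitoring point name, e.g. "MRO request success rate"
    comparison: str       # one of the keys in _COMPARISONS
    threshold: float
    severity: str         # e.g. "warning", "critical" (assumed levels)
    consecutive: int = 1  # breaches in a row required before firing


@dataclass(frozen=True)
class Alert:
    point: str
    severity: str
    value: float
    message: str
    trace_id: Optional[str] = None  # lets the notification link to a trace


class AlertEvaluator:
    """Evaluates one rule against a stream of metric values."""

    def __init__(self, rule: AlertRule, notify: Callable[[Alert], None]) -> None:
        self.rule = rule
        self.notify = notify
        self._breaches = 0

    def observe(self, value: float, trace_id: Optional[str] = None) -> Optional[Alert]:
        breached = _COMPARISONS[self.rule.comparison](value, self.rule.threshold)
        if not breached:
            self._breaches = 0  # recovery resets the streak
            return None
        self._breaches += 1
        # Fire exactly once per streak, when the required count is reached.
        if self._breaches != self.rule.consecutive:
            return None
        alert = Alert(
            point=self.rule.point,
            severity=self.rule.severity,
            value=value,
            message=(f"{self.rule.point} {self.rule.comparison} "
                     f"{self.rule.threshold} (current value {value})"),
            trace_id=trace_id,
        )
        self.notify(alert)
        return alert


if __name__ == "__main__":
    rule = AlertRule("MRO request success rate", "<", 0.95, "critical", consecutive=2)
    evaluator = AlertEvaluator(rule, notify=lambda a: print("ALERT:", a.message))
    for v in (0.99, 0.90, 0.80, 0.70):
        evaluator.observe(v, trace_id="t-1003")
```

Requiring consecutive breaches and firing once per streak is a simple way to reduce alert noise; other policies, such as repeating the notification at intervals, would fit the same model.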
8. Iteration Plan
Author: Zhang Kai (Zhang Xiao Qiu)
Reviewer: Wu Youqiang (TechDian)
Editor: Wu Youqiang (TechDian)
YunZhu Net Technology Team
Technical practice sharing from the YunZhu Net Technology Team