Design and Architecture of a Cloud‑Native Monitoring Platform for Business Systems

The document outlines the background, vision, current status, technical research, value, product and technical architecture, and functional design of a cloud‑native monitoring platform that integrates SkyWalking and Prometheus to provide comprehensive APM, resource utilization, alerting, and rapid fault localization for business and technical middle‑platform services.

YunZhu Net Technology Team

1. Background

The monitoring platform targets business systems and technical middle‑platforms to monitor and alert on hardware and software components. As cloud‑based applications proliferate and IT architecture becomes more complex, a global view of application health, resource usage, intelligent analysis, alerting, emergency response, and self‑healing is required.

Monitoring objects: hosts, containers, distributed storage, SDN networks, distributed systems, middleware.

Monitoring angles: server monitoring, container monitoring, network monitoring, storage monitoring, application performance monitoring (APM), middleware monitoring.

| Failure Review Case | Root Cause Analysis | Monitoring Solution |
| --- | --- | --- |
| File service exception (occasional attachment upload failure) | Root cause not located; a server reboot restored service; an anomaly caused by the Windows server running for a long time is suspected | 1. IP/domain health check; 2. High‑frequency access monitoring |
| Network failure on the uploadimage file service domain | Fault at the data‑center network carrier | Network monitoring |
| IM launch failure | Excessive KEYS * queries drove CPU load to 100% | 1. Middleware monitoring: Redis, Kafka, ZK, etc.; 2. Redis slow‑query monitoring |
| Procurement call exception in business systems | Large number of 504/499 responses, many service exceptions | Interface status monitoring |
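The last failure case (a burst of 504/499 responses) hints at what an interface status check might compute. The following is a minimal Python sketch, assuming access-log records are available as (endpoint, status) pairs. The function names, the 5% threshold, and the sample endpoints are hypothetical and are not taken from the platform.

```python
from collections import defaultdict

# Status codes treated as failures in this sketch:
# 504 (gateway timeout) and 499 (client closed request, an nginx convention).
FAILURE_CODES = {499, 504}


def error_ratio_by_endpoint(records):
    """Return {endpoint: share of requests whose status is in FAILURE_CODES}."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for endpoint, status in records:
        total[endpoint] += 1
        if status in FAILURE_CODES:
            failed[endpoint] += 1
    return {ep: failed[ep] / total[ep] for ep in total}


def endpoints_over_threshold(records, threshold=0.05):
    """List endpoints whose failure ratio exceeds the (assumed) threshold."""
    ratios = error_ratio_by_endpoint(records)
    return sorted(ep for ep, ratio in ratios.items() if ratio > threshold)


# Hypothetical sample data for illustration only.
sample = [
    ("/procurement/order", 200),
    ("/procurement/order", 504),
    ("/procurement/order", 499),
    ("/procurement/order", 200),
    ("/file/upload", 200),
    ("/file/upload", 200),
]
flagged = endpoints_over_threshold(sample)
```

In practice the ratio would be computed over a sliding time window rather than over a fixed list, so that a short burst is not diluted by older healthy traffic.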

2. Vision

3. Current Status and Technical Research

3.1 Monitoring Status

APM: In development

Host & Network Monitoring:

3.2 References

3.3 Pain Points

Lack of a full‑business panoramic view (topology)

Inability to analyze service call relationships (call topology + call tree)

Difficulty quickly locating business issues

Inflexible configuration of business monitoring alerts

Agent‑less integration of application performance monitoring

Redundant monitoring tools with an unclear overall architecture (Zabbix, Prometheus, Telegraf, InfluxDB, Grafana), which makes customization difficult

4. Value

4.1 Business Value

Establish a comprehensive audit and inspection mechanism to quickly locate problems, assess impact, and reduce downtime.

Map full‑link application and service dependencies to aid business correlation analysis.

Identify performance bottlenecks early, reducing stress‑test costs.

Enable business‑level granular monitoring for operational insight.

4.2 Operations and Maintenance (O&M) Value

Unify monitoring across the YunZhu ecosystem, simplifying architecture maintenance.

Codify fault analysis, investigation, and handling processes so that operations become automated and intelligent.

Standardize alerts for timely fault detection and reduced business impact.

Assess whether the architecture is sound and how well resources are utilized, in order to lower hardware costs.

4.3 Operational Value

Provide an operational panoramic view to support decision‑making.

5. Product Architecture

6. Technical Architecture

Custom development on top of SkyWalking and Prometheus, adapted to the YunZhu business architecture.

SkyWalking: Rich multi‑language agent support; native Elasticsearch storage for better data analysis and aggregation.

Prometheus: Written in Go, which lowers the difficulty of custom development; better support for cloud environments.
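To make the Prometheus side concrete, here is a minimal Python sketch of rendering a business metric in the Prometheus text exposition format, which is what a scrape target returns. The metric name `business_request_success_ratio`, the `object` label, and the helper names are hypothetical; a real integration would more likely use an official client library than hand-rolled formatting.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric family.

    samples: list of (labels_dict, numeric_value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric for illustration only.
body = render_gauge(
    "business_request_success_ratio",
    "Share of successful requests per business object.",
    [({"object": "MRO"}, 0.75)],
)
```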

7. Functional Design

The APM core consists of three parts:

Agent: Probe that collects and sends data to the collector.

Collector: Aggregates monitoring data, performs processing, and stores results.

Web: Visualization platform presenting persisted monitoring data.
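The division of labor between the three parts can be sketched as follows. This is an illustrative in-memory Python model, not SkyWalking's actual agent or collector API; the class names, the `Sample` shape, and the `order-service` name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Sample:
    """One measurement reported by an agent (hypothetical shape)."""
    service: str
    latency_ms: float


@dataclass
class Collector:
    """Aggregates samples per service and keeps them in memory."""
    store: dict = field(default_factory=dict)

    def receive(self, sample: Sample) -> None:
        self.store.setdefault(sample.service, []).append(sample.latency_ms)

    def summary(self) -> dict:
        # Processing step: reduce raw samples to a count and an average latency.
        return {
            svc: {"count": len(vals), "avg_ms": mean(vals)}
            for svc, vals in self.store.items()
        }


class Agent:
    """Probe that forwards measurements to a collector."""

    def __init__(self, service: str, collector: Collector):
        self.service = service
        self.collector = collector

    def report(self, latency_ms: float) -> None:
        self.collector.receive(Sample(self.service, latency_ms))


def render(summary: dict) -> list:
    """Stand-in for the web layer: format stored data for display."""
    return [
        f"{svc}: {stats['count']} requests, avg {stats['avg_ms']:.1f} ms"
        for svc, stats in sorted(summary.items())
    ]


collector = Collector()
agent = Agent("order-service", collector)
for latency in (10.0, 20.0, 30.0):
    agent.report(latency)
lines = render(collector.summary())
```

A real deployment would replace the direct method call in `Agent.report` with a network transport and the in-memory dictionary with persistent storage.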

7.1 Full‑Link Monitoring

The agent must trace calls across processes and threads, capturing Trace, Span, Tags, Logs, and related data.

Features include global topology (request count, TPS, resource utilization), request scatter points, latency and status statistics, method‑level monitoring, single‑request call topology, and service health indicators.
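As an illustration of the trace data involved, the sketch below models a span and rebuilds the single-request call tree from parent links. The field names follow common tracing terminology rather than SkyWalking's internal segment format, and all span values are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None marks the root span of the trace
    operation: str
    duration_ms: float
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)


def build_call_tree(spans):
    """Group the spans of one trace into {parent_id: [child spans]}."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return children


def render_tree(children, parent_id=None, depth=0):
    """Return the call tree as indented lines, depth-first from the root."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f"{span.operation} ({span.duration_ms} ms)")
        lines.extend(render_tree(children, span.span_id, depth + 1))
    return lines


# Hypothetical spans belonging to one request.
spans = [
    Span("t1", "a", None, "GET /order", 120.0, tags={"http.status": 200}),
    Span("t1", "b", "a", "OrderService.create", 80.0),
    Span("t1", "c", "b", "SELECT orders", 30.0, tags={"db.type": "mysql"}),
    Span("t1", "d", "a", "InventoryService.reserve", 25.0),
]
tree_lines = render_tree(build_call_tree(spans))
```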

7.2 Rapid Issue Localization

Keyword‑based full‑text search (application, class, method, business field, trace ID) quickly surfaces faulty traces, instances, exceptions, latency, and performance trends, reducing diagnosis time to minutes.

Full‑text search supports various attributes; monitoring alerts link directly to related traces.
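One way to picture keyword search over trace attributes is a small inverted index, sketched below in Python. This is an assumption-laden toy: the platform stores data in Elasticsearch, which provides full-text search itself, and the `TraceIndex` class, attribute names, and trace IDs here are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TraceIndex:
    """Inverted index: token -> set of trace IDs containing that token."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, trace_id, attributes):
        # Index the trace ID itself plus every attribute value.
        for value in [trace_id, *attributes.values()]:
            for token in tokenize(str(value)):
                self.postings[token].add(trace_id)

    def search(self, query):
        """Return trace IDs that contain every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = TraceIndex()
index.add("trace-001", {"application": "order-service",
                        "method": "createOrder",
                        "exception": "TimeoutException"})
index.add("trace-002", {"application": "order-service",
                        "method": "queryOrder"})
index.add("trace-003", {"application": "file-service",
                        "method": "upload",
                        "exception": "TimeoutException"})
hits = index.search("order TimeoutException")
```

Linking an alert to its traces then amounts to storing the trace ID on the alert and resolving it through the same index.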

7.3 Metric Management

By establishing a KPI system, inspection mechanisms, and business‑scenario focus, the platform achieves end‑to‑end traceability and a full‑business monitoring dashboard, turning black‑box operations into a controllable, predictable "glass house".

Monitoring point = Object + Metric (e.g., MRO request success rate = MRO + request success rate). Supports CMDB‑driven metric definitions, streaming calculations, third‑party performance data ingestion (e.g., MQ control platform), and hierarchical control of objects, metrics, and monitoring points.
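The "Object + Metric" composition and a streaming calculation can be sketched as below. The class names and the `unit` field are hypothetical; only the MRO request success rate example comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredObject:
    name: str            # e.g. "MRO"


@dataclass(frozen=True)
class Metric:
    name: str            # e.g. "request success rate"
    unit: str            # assumed field, e.g. "ratio"


@dataclass(frozen=True)
class MonitoringPoint:
    """Monitoring point = Object + Metric."""
    obj: MonitoredObject
    metric: Metric

    @property
    def label(self):
        return f"{self.obj.name} {self.metric.name}"


class SuccessRate:
    """Streaming calculation: counters are updated per event, no history is kept."""

    def __init__(self):
        self.total = 0
        self.succeeded = 0

    def observe(self, ok):
        self.total += 1
        if ok:
            self.succeeded += 1

    @property
    def value(self):
        # None signals "no data yet" instead of dividing by zero.
        return self.succeeded / self.total if self.total else None


point = MonitoringPoint(MonitoredObject("MRO"), Metric("request success rate", "ratio"))
rate = SuccessRate()
for ok in (True, True, True, False):
    rate.observe(ok)
```

Keeping objects and metrics as separate definitions is what allows the same metric to be attached to many objects, and each layer to be controlled independently.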

7.4 JVM Performance Monitoring

To aid production fault analysis, the platform monitors JVM memory, CPU, and thread usage, providing dump capabilities for rapid diagnosis.

Features include thread monitoring (including deadlock detection), CPU usage collection, and memory/GC monitoring.
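Deadlock detection reduces to finding a cycle in the "thread waits for lock held by thread" graph. On the JVM this information is exposed by `ThreadMXBean.findDeadlockedThreads()`; the Python sketch below only illustrates the underlying idea on already-collected data, and the thread and lock names are invented.

```python
def find_deadlocked(waiting_for, held_by):
    """Return the sorted names of threads that are part of a wait cycle.

    waiting_for: {thread: lock the thread is blocked on}
    held_by:     {lock: thread that currently owns the lock}
    """
    # Edge: blocked thread -> owner of the lock it is waiting for.
    next_thread = {
        thread: held_by[lock]
        for thread, lock in waiting_for.items()
        if lock in held_by
    }
    deadlocked = set()
    for start in next_thread:
        path = []
        seen = set()
        current = start
        # Follow the wait chain until it ends or revisits a thread.
        while current in next_thread and current not in seen:
            seen.add(current)
            path.append(current)
            current = next_thread[current]
        if current in seen:
            # Everything from the first visit of `current` onward forms the cycle.
            deadlocked.update(path[path.index(current):])
    return sorted(deadlocked)


# worker-1 and worker-2 wait on each other; worker-3 is blocked behind them
# but is not itself part of the cycle.
waiting_for = {"worker-1": "lockB", "worker-2": "lockA", "worker-3": "lockA"}
held_by = {"lockA": "worker-1", "lockB": "worker-2"}
cycle = find_deadlocked(waiting_for, held_by)
```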

7.5 Monitoring Alerts

The platform establishes a standardized alert model, integrates with an alert center, and dispatches notifications.
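A standardized alert model could look roughly like the sketch below: rules are data, evaluation is uniform, and an alert center routes alerts by severity. All class names, severities, thresholds, and monitoring-point labels are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass

COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass(frozen=True)
class AlertRule:
    monitoring_point: str
    comparator: str      # one of the keys of COMPARATORS
    threshold: float
    severity: str        # e.g. "warning" or "critical"


@dataclass(frozen=True)
class Alert:
    monitoring_point: str
    severity: str
    message: str


def evaluate(rules, current_values):
    """Return one Alert for every rule whose condition currently holds."""
    alerts = []
    for rule in rules:
        value = current_values.get(rule.monitoring_point)
        if value is None:
            continue  # no data for this monitoring point
        if COMPARATORS[rule.comparator](value, rule.threshold):
            message = (f"{rule.monitoring_point} is {value}, "
                       f"rule: {rule.comparator} {rule.threshold}")
            alerts.append(Alert(rule.monitoring_point, rule.severity, message))
    return alerts


class AlertCenter:
    """Routes alerts to the notification callbacks registered per severity."""

    def __init__(self):
        self.channels = {}

    def register(self, severity, notify):
        self.channels.setdefault(severity, []).append(notify)

    def dispatch(self, alert):
        for notify in self.channels.get(alert.severity, []):
            notify(alert)


rules = [
    AlertRule("MRO request success rate", "<", 0.95, "critical"),
    AlertRule("order-service p99 latency ms", ">", 500, "warning"),
]
values = {"MRO request success rate": 0.75, "order-service p99 latency ms": 120}

sent = []                       # stands in for a real notification channel
center = AlertCenter()
center.register("critical", sent.append)
alerts = evaluate(rules, values)
for alert in alerts:
    center.dispatch(alert)
```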

8. Iteration Plan

Author: Zhang Kai (Zhang Xiao Qiu)

Reviewer: Wu Youqiang (TechDian)

Editor: Wu Youqiang (TechDian)
