How to Build a Multi‑Layer Cloud‑Native Monitoring System for Telecom Operations
This article describes a telecom operator's move to cloud‑native operations: the challenges of scaling monitoring, the design of a five‑level visual monitoring framework, the integration of open‑source tools such as Ganglia and Nagios alongside a JVMTI‑based agent, and the concrete implementation steps and results.
Background
Telecom business‑support systems are being migrated to cloud environments. Cloudification improves efficiency and reduces cost, but it also creates new operational requirements. The number of managed machines and processes grows tens of times over, so traditional point‑based monitoring and script‑driven deployment cannot scale. Data needed for analysis is scattered across clusters, and IT‑as‑a‑service demands faster, cross‑domain incident handling.
Solution Overview
A visual operations platform was built by combining open‑source and commercial components, adapting them to carrier‑specific needs, and deploying them in production. The platform provides a unified, real‑time view from end‑user experience down to hardware‑level faults.
Monitoring – Five Hierarchical Levels
Level 1: User‑Experience Monitoring
Plugins injected into the browser capture user actions, response times and error codes along the full request chain. The data are aggregated and displayed in a console where each visit is classified as satisfied, tolerating or frustrated (the three Apdex categories), and operators can drill into the exact step that caused dissatisfaction.
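The three-way classification of visits can be sketched as an Apdex-style calculation. This is a minimal illustration, not the operator's implementation: the `Visit` record, the 0.5-second target threshold and the rule that any error counts as the worst category are assumptions made for the example. The bucket boundaries (T and 4T) and the score formula follow the usual Apdex definition.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    step: str                # hypothetical name of a step in the request chain
    response_seconds: float  # measured response time for that step
    error: bool = False      # True if an error code was captured


def classify(visit, threshold=0.5):
    """Return 'satisfied', 'tolerating' or 'frustrated' for one visit.

    threshold is the target response time T (an assumed value here).
    """
    if visit.error or visit.response_seconds > 4 * threshold:
        return "frustrated"
    if visit.response_seconds > threshold:
        return "tolerating"
    return "satisfied"


def apdex(visits, threshold=0.5):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not visits:
        return None
    counts = {"satisfied": 0, "tolerating": 0, "frustrated": 0}
    for v in visits:
        counts[classify(v, threshold)] += 1
    return (counts["satisfied"] + counts["tolerating"] / 2) / len(visits)
```

Keeping `classify` separate from `apdex` means the same function can label individual visits for drill-down while the score is computed over the aggregate.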
Level 2: End‑to‑End Application Monitoring
Probes are deployed on every node; traffic is collected by a TAP+ exchange (MongoDB + Spark) that filters, decodes and correlates transaction data. Real‑time dashboards show transaction volume, success rate and latency for major channels (NGCRM, self‑service, mobile workbench, etc.).
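The per-channel figures shown on such dashboards (volume, success rate, latency) come down to a simple aggregation over correlated transaction records. The sketch below shows that aggregation in plain Python only; the production pipeline runs on Spark, and the record layout `(channel, success, latency_ms)` is an assumption made for illustration.

```python
from collections import defaultdict


def summarize(transactions):
    """Aggregate (channel, success, latency_ms) records per channel."""
    stats = defaultdict(lambda: {"count": 0, "ok": 0, "latency_total": 0.0})
    for channel, success, latency_ms in transactions:
        s = stats[channel]
        s["count"] += 1
        s["ok"] += 1 if success else 0
        s["latency_total"] += latency_ms
    # Derive the dashboard figures from the running totals.
    return {
        channel: {
            "volume": s["count"],
            "success_rate": s["ok"] / s["count"],
            "avg_latency_ms": s["latency_total"] / s["count"],
        }
        for channel, s in stats.items()
    }
```

Because only counts and sums are kept per channel, the same logic can be applied incrementally to each streaming batch and merged afterwards.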
Level 3: Code‑Level Diagnosis
A lightweight JVMTI agent is injected into the JVM to record stack traces, method execution times and input parameters, so hot spots and root causes can be pinpointed to the exact line of code. Example: a slowdown in a marketing platform was traced to an ArrayList; replacing it with a hash‑based collection eliminated the bottleneck.
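The article does not show the offending code, so the following is only an analogous sketch in Python of why swapping a list for a hash-based collection helps. Membership tests on a list scan it element by element, while a set uses hashing and answers in constant time on average. The function and parameter names are hypothetical.

```python
def eligible_slow(customer_ids, campaign_ids):
    """campaign_ids is a list: every `in` test scans it, O(n) per lookup."""
    return [c for c in customer_ids if c in campaign_ids]


def eligible_fast(customer_ids, campaign_ids):
    """Build a hash-based set once; each lookup is O(1) on average."""
    campaign_set = set(campaign_ids)
    return [c for c in customer_ids if c in campaign_set]
```

Both functions return the same result; only the cost of each lookup changes, which is the kind of difference a method-level timing trace makes visible.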
Level 4: Cluster‑Scale Platform Performance Monitoring
Ganglia (v3.7.1‑2) and Nagios (v4.1.1) are customized to collect metrics from more than 500 nodes across big‑data, CRM and BDS clusters. Monitored metrics include CPU, memory, network, I/O, Hadoop/HBase performance and custom business indicators.
Level 5: Platform Fault Monitoring
A hardware‑ and OS‑level fault platform based on SNMP and international MIB standards aggregates alerts from heterogeneous devices (x86 servers, VMs, network gear). Fault data are correlated with business information via the BOMC system to present a unified infrastructure health view.
Implementation Architecture
Distributed agents (gmond, NRPE) collect raw metrics on each host.
Central aggregators (gmetad, TAP+ exchange) normalize and store data.
Web front‑ends (Ganglia‑web, custom dashboards) visualise the metrics.
Alerting engines (Nagios, custom alerter) generate threshold‑based alerts and trigger SMS/e‑mail notifications.
All components are containerised and can be rolled out in three to five automated steps, enabling rapid scaling and flexible customization.
Key Technical Details
JVMTI agent: built on the Java Virtual Machine Tool Interface, captures method entry/exit, execution time and parameters; deployed as a plug‑in that communicates with a proxy process.
TAP+ pipeline: uses MongoDB for durable storage of raw packets, Spark Streaming for real‑time decoding, and a rule engine to extract business‑level metrics.
Ganglia modules: gmond on each node, gmetad on a central server, and ganglia‑web for UI.
Nagios modules: nagios daemon, plugins, and NRPE for remote checks; status codes 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
SNMP fault collector: follows international MIBs, aggregates alerts, and forwards them to BOMC for business‑context enrichment.
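The Nagios status codes listed above are simply the exit codes of the check plugin, so a custom check can be written in any language. The sketch below is a minimal disk-usage check in Python, assuming warning and critical thresholds of 80% and 90%; those thresholds, the function names and the output wording are illustrative and not taken from the operator's configuration.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code."""
    if used_pct is None or not 0 <= used_pct <= 100:
        return UNKNOWN
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_path(path="/"):
    """Return (status_code, one-line plugin output) for a filesystem path."""
    try:
        usage = shutil.disk_usage(path)
        used_pct = 100.0 * usage.used / usage.total
    except (OSError, ZeroDivisionError):
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}"
    code = evaluate(used_pct)
    # Text after the pipe is performance data that Nagios can graph.
    return code, (f"DISK {LABELS[code]} - {path} {used_pct:.1f}% used"
                  f" | used_pct={used_pct:.1f}%")


def main(argv):
    """Print the plugin output and return the exit code."""
    code, message = check_path(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code
```

To use it as a plugin, the script would end with `sys.exit(main(sys.argv))` (after `import sys`), so that the process exit code carries the status back to Nagios or NRPE.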
Deployment Highlights
Deploy gmond on every monitored host; add a data_source entry for each cluster to the gmetad configuration.
Install NRPE and required Nagios plugins on each host; define check commands (e.g., check_disk, check_cpu).
Launch the TAP+ agents on cloud servers; ensure the MongoDB replica set and Spark Streaming jobs are running.
Install the JVMTI sensor on JVM‑based services; enable parameter capture as needed.
Configure SNMP traps on network and server devices; point them to the fault collector.
Observed Benefits
Unified, real‑time visibility from user perception to platform health.
Fast fault localisation through code‑level tracing and end‑to‑end transaction mapping.
Scalable monitoring of thousands of nodes with low overhead.
Automated threshold tuning via simulation of historical incidents.
Improved service‑quality metrics such as Apdex, success rate and response latency.
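The article does not describe how thresholds are tuned against historical incidents, so the following is only one possible sketch of the idea. It replays labelled historical samples against a set of candidate thresholds and picks the one that catches the most incidents with the fewest false alarms. The sample layout `(metric_value, was_incident)` and the selection rule are assumptions.

```python
def tune_threshold(samples, candidates):
    """Pick an alert threshold by replaying historical samples.

    samples: list of (metric_value, was_incident) pairs from history.
    candidates: thresholds to try; an alert fires when value >= threshold.
    Prefers more incidents caught, then fewer false alarms, then the
    higher threshold.
    """
    best = None
    for t in candidates:
        caught = sum(1 for value, incident in samples
                     if incident and value >= t)
        false_alarms = sum(1 for value, incident in samples
                           if not incident and value >= t)
        key = (-caught, false_alarms, -t)
        if best is None or key < best[0]:
            best = (key, t)
    return best[1] if best else None
```

For example, with incidents recorded at values 95 and 88 and normal readings at 85, 70 and 60, the candidate 86 is chosen over 80 (one false alarm) and 90 (one missed incident).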
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
