Designing a Multi‑Layered Monitoring System for Modern IT Operations

This article outlines a comprehensive, layered monitoring architecture for enterprise IT operations, detailing the construction of a centralized platform, responsibilities across infrastructure, server, service, and user‑experience layers, event aggregation, visualization, data integration standards, alert thresholds, and continuous optimization practices.

Efficient Ops

Nov 19, 2019

With rapid advances in computing and the rise of cloud platforms, traditional enterprises must evolve their operations from static practices to dynamic, data‑driven monitoring that supports cloud, big‑data analysis, and emerging AI‑based operations.

Safety and production assurance remain critical; monitoring is the core of the "monitor‑manage‑control" model for traditional enterprises.

Monitoring System Layering

Overview

Enterprises have accumulated diverse monitoring tools across infrastructure, hardware, software, and security domains. To handle this variety, the following principles are applied:

Build a centralized monitoring platform that provides real‑time visibility, drives event‑driven control, and supplies data for operational analytics.

Retain existing tools because they often contain deep, customized instrumentation that remains valuable.

Domain teams own their monitoring while the platform team provides the underlying technology.

Integrate tools through standardization and consolidation.

These principles lead to a systematic, layered monitoring architecture.

Layering Approach

The following professional‑domain‑based layers are typical:

Infrastructure Layer: carrier lines, data‑center facilities, and network devices, monitored for status, performance, quality, capacity, architecture, and traffic analysis.

System Server Layer: availability of servers and storage.

System & Network Service Layer: usage of operating systems, system software, and network software.

Application Service Layer: application availability, business status, performance, and transaction volume.

Customer Experience Layer: access speed and functional correctness from the end user's perspective.
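One way to make the layering concrete is to register every monitored object type against exactly one layer, so that ownership and dashboards can be derived from it. The sketch below is illustrative only: the object-type names are hypothetical and would come from an enterprise's own inventory.

```python
from enum import Enum


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"
    SERVER = "system server"
    SYSTEM_SERVICE = "system and network service"
    APPLICATION = "application service"
    CUSTOMER_EXPERIENCE = "customer experience"


# Hypothetical object types; a real registry would be loaded from a CMDB.
OBJECT_LAYER = {
    "carrier_line": Layer.INFRASTRUCTURE,
    "network_device": Layer.INFRASTRUCTURE,
    "physical_server": Layer.SERVER,
    "storage_array": Layer.SERVER,
    "operating_system": Layer.SYSTEM_SERVICE,
    "database": Layer.SYSTEM_SERVICE,
    "application": Layer.APPLICATION,
    "synthetic_probe": Layer.CUSTOMER_EXPERIENCE,
}


def layer_of(object_type: str) -> Layer:
    """Return the layer that owns an object type, or fail loudly if unclassified."""
    try:
        return OBJECT_LAYER[object_type]
    except KeyError:
        raise ValueError(f"unclassified object type: {object_type!r}") from None
```

Failing on unknown types, rather than defaulting to some layer, keeps gaps in coverage visible.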

Responsibilities per Layer

Infrastructure

Status monitoring (power, cooling, hardware health).

Performance monitoring (CPU, memory, session counts, port traffic).

Network monitoring (packet loss, latency).

Capacity monitoring (load, bandwidth utilization).

Hardware vendors often provide their own health checks; ask them to push the resulting events directly to the monitoring platform.
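Vendor-pushed events usually arrive in vendor-specific shapes, so the platform needs a small normalization step. The following is a minimal sketch under stated assumptions: the payload field names (vendor, device_id, component, level, text, epoch_seconds) and the severity mapping are hypothetical, not any vendor's actual format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    source: str
    resource: str
    check: str
    severity: str  # "notification", "warning", or "alarm"
    message: str
    timestamp: datetime


# Hypothetical mapping from vendor levels to platform severities.
VENDOR_SEVERITY = {
    "info": "notification",
    "minor": "warning",
    "major": "alarm",
    "critical": "alarm",
}


def normalize_vendor_event(payload: dict) -> Event:
    """Convert one vendor payload (assumed field names) into the common Event."""
    return Event(
        source=payload["vendor"],
        resource=payload["device_id"],
        check=payload["component"],
        # Unknown levels are kept as warnings instead of being dropped.
        severity=VENDOR_SEVERITY.get(payload["level"].lower(), "warning"),
        message=payload.get("text", ""),
        timestamp=datetime.fromtimestamp(payload["epoch_seconds"], tz=timezone.utc),
    )
```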

Server Layer

Storage health (read/write errors, timeouts, disk failures).

Server components (memory, NIC, power, fans, RAID status).

Virtual machines (vCenter, etc.).

Containers (Docker, etc.).

For storage and physical devices, rely on vendor‑pushed events; for containers, consider open‑source tools or custom development.

System Service Layer

This layer monitors operating systems, middleware, databases, and other distributed components, covering CPU, memory, disk I/O, network I/O, connections, processes, and latency metrics. The data feeds load‑balancing and auto‑scaling decisions.
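As an illustration of how such metrics can drive scaling, the sketch below applies a simple proportional rule to per-instance CPU readings. The rule, the 60% target, and the instance limits are assumptions chosen for the example, not something the article prescribes.

```python
import math


def desired_instances(cpu_percent_per_instance, target_cpu_percent=60.0,
                      min_instances=1, max_instances=10):
    """Scale the instance count so that average CPU moves toward the target."""
    if not cpu_percent_per_instance:
        return min_instances
    current = len(cpu_percent_per_instance)
    average = sum(cpu_percent_per_instance) / current
    wanted = math.ceil(current * average / target_cpu_percent)
    # Clamp to the allowed range so one bad reading cannot scale without bound.
    return max(min_instances, min(max_instances, wanted))
```

For example, three instances at 90% CPU with a 60% target yield a request for five instances.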

Application Service Layer

Service availability (processes, ports, health).

Business status (whether services are operational).

Performance (transaction volume, success/failure rates, response times).

Transaction tracing (instrumented logs, ESB flows).
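The performance items above can be derived from raw transaction records. This is a minimal sketch that assumes each record carries a success flag and a response time in milliseconds; the record shape and the choice of the 95th percentile are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    succeeded: bool
    response_ms: float


def summarize(transactions):
    """Return volume, success rate, and nearest-rank p95 response time."""
    total = len(transactions)
    if total == 0:
        return {"volume": 0, "success_rate": None, "p95_ms": None}
    successes = sum(1 for t in transactions if t.succeeded)
    ordered = sorted(t.response_ms for t in transactions)
    # Nearest-rank percentile: rank = ceil(0.95 * total), in integer arithmetic.
    rank = (95 * total + 99) // 100
    return {
        "volume": total,
        "success_rate": successes / total,
        "p95_ms": ordered[rank - 1],
    }
```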

Customer Experience Layer

This layer includes synthetic user tests and speed measurements that verify both access performance and business‑logic correctness.
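A synthetic check can be sketched as a timed request plus a content assertion. In the example below, the fetch callable is injected so the check can be exercised without a network; the http_fetcher helper, the expected text, and the time budget are all hypothetical choices.

```python
import time
import urllib.request


def http_fetcher(url, timeout=5.0):
    """Build a fetch callable for a URL (the URL is supplied by the caller)."""
    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    return fetch


def run_synthetic_check(fetch, expected_text, max_seconds):
    """Verify status, business content, and speed of one synthetic request."""
    started = time.perf_counter()
    try:
        status, body = fetch()
    except Exception as exc:  # network errors and HTTP errors both count as failures
        return {"ok": False, "reason": f"request failed: {exc}",
                "seconds": time.perf_counter() - started}
    elapsed = time.perf_counter() - started
    if status != 200:
        reason = f"unexpected status {status}"
    elif expected_text not in body:
        reason = "expected content missing"
    elif elapsed > max_seconds:
        reason = "too slow"
    else:
        reason = ""
    return {"ok": reason == "", "reason": reason, "seconds": elapsed}
```

Checking the page content as well as the status code is what separates a functional check from a simple availability ping.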

Monitoring Integration

Layered monitoring improves coverage but introduces management overhead. Integration focuses on three aspects:

Event aggregation.

Unified visualization.

Data consolidation.

Event Aggregation

Events from many tools must be collected, de‑duplicated, and correlated. Key requirements include:

Aggregating events across layers and domains.

Converging duplicate alerts.

Prioritizing events with severity levels.

Analyzing relationships (vertical and horizontal) to build fault trees.

Processing events at high throughput and exposing external APIs for data collection.
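The convergence and vertical-correlation requirements above can be sketched in a few lines. This is a simplified illustration, not a full fault-tree engine: events are assumed to be (resource, check, severity) tuples, and depends_on is a hypothetical map from each resource to the single resource directly beneath it.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"notification": 0, "warning": 1, "alarm": 2}


@dataclass
class Alert:
    resource: str
    check: str
    severity: str
    count: int = 1


def converge(raw_events):
    """Merge events sharing (resource, check); keep the highest severity."""
    merged = {}
    for resource, check, severity in raw_events:
        alert = merged.get((resource, check))
        if alert is None:
            merged[(resource, check)] = Alert(resource, check, severity)
        else:
            alert.count += 1
            if SEVERITY_RANK[severity] > SEVERITY_RANK[alert.severity]:
                alert.severity = severity
    # Most severe first; ties keep arrival order because the sort is stable.
    return sorted(merged.values(), key=lambda a: -SEVERITY_RANK[a.severity])


def root_causes(alerts, depends_on):
    """Keep only alerts with no alerting resource anywhere beneath them."""
    alerting = {a.resource for a in alerts}
    roots = []
    for alert in alerts:
        seen = {alert.resource}
        parent = depends_on.get(alert.resource)
        while parent is not None and parent not in seen and parent not in alerting:
            seen.add(parent)
            parent = depends_on.get(parent)
        # No alerting dependency found (or a dependency cycle): treat as a root.
        if parent is None or parent in seen:
            roots.append(alert)
    return roots
```

With an application depending on a virtual machine that depends on a switch, alerts on all three collapse to the switch alert as the likely root cause.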

Unified Visualization

A single view should support role‑based dashboards, multiple device formats (web, mobile, large screens), and subscription‑based displays for specific operational scenarios.
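Role-based and subscription-based views amount to filtering one shared event stream per audience. The sketch below assumes events are dictionaries with layer and severity keys; the role names and subscription settings are hypothetical.

```python
SEVERITY_RANK = {"notification": 0, "warning": 1, "alarm": 2}

# Hypothetical subscriptions; layers=None means "all layers".
SUBSCRIPTIONS = {
    "network_engineer": {"layers": {"infrastructure"}, "min_severity": "warning"},
    "app_owner": {"layers": {"application service", "customer experience"},
                  "min_severity": "notification"},
    "duty_manager": {"layers": None, "min_severity": "alarm"},
}


def dashboard_view(role, events):
    """Return the events a role has subscribed to, in their original order."""
    subscription = SUBSCRIPTIONS[role]
    floor = SEVERITY_RANK[subscription["min_severity"]]
    layers = subscription["layers"]
    return [
        event for event in events
        if (layers is None or event["layer"] in layers)
        and SEVERITY_RANK[event["severity"]] >= floor
    ]
```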

Data Integration Standards

Integration covers packet, log, and database‑transaction streams.

Packet decoding: capture traffic out of band (bypass capture, e.g., BPC) so that monitoring is fed without affecting services.

Log structure: adopt a platform such as Splunk or ELK, or standardize log output (e.g., with log4j) and forward it asynchronously.

Database transactions: create dedicated operation tables, or instrument JDBC to record execution details.
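To illustrate the "standardize log output and forward asynchronously" option, here is a minimal sketch using Python's standard logging module in place of log4j. The JSON field names and the trace_id attribute are assumptions, and the in-memory stream stands in for whatever forwarder a real deployment would use.

```python
import io
import json
import logging
import logging.handlers
import queue


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),  # hypothetical field
        })


def build_async_logger(name, destination):
    """Attach a queue so the application thread never waits on the destination."""
    destination.setFormatter(JsonFormatter())
    log_queue = queue.Queue(-1)
    listener = logging.handlers.QueueListener(log_queue, destination)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener.start()
    return logger, listener


# Demo: an in-memory stream stands in for the real log forwarder.
buffer = io.StringIO()
logger, listener = build_async_logger("payments", logging.StreamHandler(buffer))
logger.info("transfer accepted", extra={"trace_id": "t-1001"})
listener.stop()  # drains the queue before returning
record = json.loads(buffer.getvalue())
```

Because every line is a self-describing JSON object, a log platform can index the fields without per-application parsing rules.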

Monitoring Metrics

Metrics are defined per layer, with weights and graded thresholds so that real problems are not missed while alert noise is reduced.

Metric Classification

Infrastructure: environment, network, security devices.

Server: virtualization, storage, physical servers.

System software: OS, databases, middleware.

Application services: availability and transaction metrics.

Customer experience: response time and functional checks.

Weight and Thresholds

Metrics are assigned weights (primary vs. secondary) and graded thresholds (notification, warning, alarm). Dynamic baselines replace static thresholds, learning from historical data and adjusting for business cycles.
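A dynamic baseline with graded thresholds can be sketched with nothing more than a mean and a standard deviation per time bucket. Everything specific here is an assumption for illustration: the 2, 3, and 4 sigma grade boundaries, the hour-of-day bucketing as a stand-in for business cycles, and the rule that secondary metrics are capped at the warning grade.

```python
from statistics import mean, stdev

GRADES = ["normal", "notification", "warning", "alarm"]


def grade(value, history, primary=True, sigmas=(2.0, 3.0, 4.0), min_spread=1e-9):
    """Grade a value by how many standard deviations it sits from its baseline."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    center = mean(history)
    spread = max(stdev(history), min_spread)  # avoid dividing by zero
    deviation = abs(value - center) / spread
    level = sum(1 for boundary in sigmas if deviation >= boundary)
    if not primary:
        level = min(level, 2)  # assumed policy: secondary metrics never alarm
    return GRADES[level]


def grade_for_hour(value, hour, history_by_hour, **options):
    """Compare against history from the same hour to respect daily cycles."""
    return grade(value, history_by_hour[hour], **options)
```

The same reading of 500 transactions per minute can be normal at 09:00 and an alarm at 03:00, which a single static threshold cannot express.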

Continuous Optimization

Optimization follows a staged approach:

Reduce alert volume (adjust thresholds, eliminate noisy metrics).

Decrease false‑positive rate (classify alerts, improve baseline accuracy).

Increase fault coverage (ensure that at least 80% of incidents are detected by monitoring).

Accelerate incident resolution (close alerts within one hour).

A dedicated monitoring‑optimization team should analyze event trends, refine configurations, and drive improvements.
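The four stages can be tracked with a small periodic report. The sketch below assumes alert records carry an actionable flag and a time-to-close in minutes, and incident records carry a detected_by_monitoring flag; these field names are hypothetical.

```python
def optimization_report(alerts, incidents):
    """Compute the indicators behind the four optimization stages."""
    def ratio(part, whole):
        return part / whole if whole else None

    false_positives = sum(1 for a in alerts if not a["actionable"])
    closed_in_hour = sum(1 for a in alerts if a["minutes_to_close"] <= 60)
    detected = sum(1 for i in incidents if i["detected_by_monitoring"])
    return {
        "alert_volume": len(alerts),
        "false_positive_rate": ratio(false_positives, len(alerts)),
        "fault_coverage": ratio(detected, len(incidents)),
        "closed_within_one_hour": ratio(closed_in_hour, len(alerts)),
    }
```

Comparing these figures period over period shows whether threshold and baseline changes are actually paying off.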

Source: Adapted from the public account "运维之路" by author Peng Huasheng.

