
How to Design an Enterprise‑Grade Monitoring & Alerting System from Scratch

This article introduces the fundamental concepts, methods, types, goals, and product attributes of enterprise monitoring and alerting, explains the perspective differences between users and builders, and outlines a comprehensive monitoring system architecture for large‑scale operations.

Efficient Ops

Sep 3, 2017

Overview

This is the first article in a series on monitoring and alerting products. It serves as an index, outlining the basic knowledge needed for monitoring product design, and covers three things:

The core content of the next three articles.

Key concepts that need to be introduced in advance.

An explanation of a three‑dimensional monitoring system.

The author is a product manager for the ZhiYun monitoring and alerting platform, and the subsequent design and implementation discussions are based on that platform.

Finding the Original Motivation

When working on QQ operations, the author used a monitoring and alerting platform daily to view business dashboards, check anomalies, confirm and handle alerts.

Operations engineers use monitoring platforms more often than automation platforms, because monitoring runs 24/7 and can generate thousands of alerts per day.

Once alerts exceed a few dozen per day, they stop being actionable. That daily experience gave the author a user's perspective on monitoring.

After transitioning to product management, the author now looks at monitoring from a builder perspective.

The main difference between the two perspectives is that users focus on individual functional points, while builders aim to abstract common scenarios and design features that serve the majority of users.

“When any failure occurs, every other link might be at fault, but monitoring is always at fault!” – George the Scapegoat

The author hopes the series will enable exchange of ideas about how an enterprise‑level monitoring product is planned, designed, and delivered.

Foundations of Monitoring

This section clarifies basic concepts.

What is monitoring?

What are the monitoring methods?

What types of monitoring exist?

What are the goals of monitoring?

What is the essence of monitoring?

What levels does monitoring address?

How to understand the product attributes of monitoring?

Definition

Monitoring is the use of technical means to discover service anomalies and continuously improve business availability and user experience.

Methods

Active: In‑process instrumentation reports metrics directly; precise, fast, and flexible, but may require code changes.

Passive: External probing or log analysis without instrumentation.

Bypass: Monitoring unrelated to program logic, such as public sentiment analysis.

All three methods are suitable for different scenarios; for example, domain monitoring can use external probing.
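The difference between the active and passive methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how the ZhiYun platform does it: `ActiveReporter`, `handle_request`, the metric names, and the access-log format matched by `LOG_LINE` are all hypothetical.

```python
import re
from collections import Counter


class ActiveReporter:
    """Active method: the service records metrics itself, in-process."""

    def __init__(self):
        self.counters = Counter()

    def report(self, metric, value=1):
        self.counters[metric] += value


def handle_request(reporter, ok):
    # Hypothetical request handler; the report() calls are the
    # "code changes" that active monitoring may require.
    reporter.report("requests_total")
    if not ok:
        reporter.report("requests_failed")


# Passive method: derive the same counters from access logs, without
# touching the service code. The log format is an assumption.
LOG_LINE = re.compile(r'"(?:GET|POST) \S+ HTTP/1\.[01]" (\d{3})')


def passive_counts_from_log(lines):
    counters = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not request records
        counters["requests_total"] += 1
        if match.group(1).startswith("5"):
            counters["requests_failed"] += 1
    return counters
```

Both paths produce the same two counters. The active path knows about every request the moment it happens, while the passive path only sees what the log format exposes.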

Types

Basic monitoring: IaaS layer (servers, systems, networks).

Server‑side monitoring: Backend services.

Client‑side monitoring: Apps (e.g., QQ, WeChat).

Web monitoring: Websites and domain probing.

User‑side monitoring: Public sentiment and reputation.

Goals

A good monitoring system should achieve three objectives:

Coverage (Full): Broad monitoring objects and points.

Speed (Fast): High performance and data processing capability.

Accuracy (Accurate): Intelligent analysis, convergence, and precise alerting.
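As one possible reading of "convergence" in the accuracy goal, the sketch below collapses repeated alerts for the same object into a single alert per time window. The article does not specify a convergence algorithm, so the rule used here (one alert per host and metric within `window_seconds`) and the alert fields `host`, `metric`, and `ts` are assumptions for illustration.

```python
def converge(alerts, window_seconds=300):
    """Emit one alert per (host, metric) per window and count the repeats.

    Each alert is a dict with hypothetical keys: host, metric, ts (seconds).
    The window starts at the first emitted alert for a given key.
    """
    last_emitted = {}
    emitted = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["host"], alert["metric"])
        previous = last_emitted.get(key)
        if previous is not None and alert["ts"] - previous["ts"] < window_seconds:
            previous["repeats"] += 1  # suppressed, but still counted
            continue
        record = {**alert, "repeats": 0}
        last_emitted[key] = record
        emitted.append(record)
    return emitted
```

Even a simple rule like this can cut a stream of repeated alerts down to a handful of actionable ones, while the `repeats` count keeps the suppressed volume visible.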

Essence

In DevOps, operations, development, and testing should share a unified view; monitoring focuses on three core metrics: request volume, success rate, and latency, which together reflect service reliability and user experience.
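To make the three core metrics concrete, here is a minimal sketch that computes them from a batch of request records. The `Request` type and the choice of average latency are assumptions; a production system might prefer percentiles instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # whether the request succeeded
    latency_ms: float   # time taken to serve it, in milliseconds


def core_metrics(requests):
    """Return request volume, success rate, and average latency."""
    volume = len(requests)
    if volume == 0:
        # No traffic: rate and latency are undefined rather than zero.
        return {"request_volume": 0, "success_rate": None, "avg_latency_ms": None}
    successes = sum(1 for r in requests if r.ok)
    total_latency = sum(r.latency_ms for r in requests)
    return {
        "request_volume": volume,
        "success_rate": successes / volume,
        "avg_latency_ms": total_latency / volume,
    }
```

For example, four requests with one failure and latencies of 100, 200, 300, and 400 ms give a volume of 4, a success rate of 0.75, and an average latency of 250 ms.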

Purpose

The ultimate purpose of monitoring and alerting is to continuously optimize business service quality and build a quality‑centric system.

Product Attributes

Monitoring and alerting is a data‑centric product that follows a data pipeline: Data Production → Data Enrichment → Data Consumption. This pipeline drives various user stories and scenarios.

Examples of data production include server OS metrics (CPU, memory). Data consumption includes visual dashboards and threshold‑based alerts. Further consumption can involve statistical analysis of alert trends.
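The pipeline can be sketched end to end as three small functions. Everything named here is hypothetical: the `INVENTORY` mapping stands in for whatever asset data enriches the raw samples, and the hosts, metric name, and threshold are made up for illustration.

```python
from collections import Counter

# Hypothetical host-to-business mapping used for enrichment.
INVENTORY = {"web-01": "checkout", "web-02": "checkout", "db-01": "billing"}


def produce():
    """Data production: raw OS metrics as reported by each server."""
    return [
        {"host": "web-01", "metric": "cpu_percent", "value": 95.0},
        {"host": "web-02", "metric": "cpu_percent", "value": 40.0},
        {"host": "db-01", "metric": "cpu_percent", "value": 91.0},
    ]


def enrich(samples, inventory):
    """Data enrichment: attach the owning business to each sample."""
    return [{**s, "business": inventory.get(s["host"], "unknown")} for s in samples]


def threshold_alerts(samples, metric, threshold):
    """Data consumption: a threshold-based alert rule."""
    return [s for s in samples if s["metric"] == metric and s["value"] > threshold]


def alert_counts_by_business(alerts):
    """Further consumption: simple statistics over the alerts."""
    return Counter(a["business"] for a in alerts)


alerts = threshold_alerts(enrich(produce(), INVENTORY), "cpu_percent", 90.0)
```

Because enrichment happens before consumption, the alert rule and the statistics can both speak in business terms rather than in host names.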

Monitoring System Architecture

A large organization needs multiple monitoring systems that interact to form an overall monitoring ecosystem. The internal ZhiYun monitoring system comprises several subsystems that avoid data silos.

Typically, three essential monitoring subsystems are required (infrastructure, service, and user‑experience layers).

Summary

The article provides a macro view of monitoring and alerting, laying out basic concepts, methods, types, goals, and system architecture, preparing readers for deeper practical discussions in upcoming articles.

Preview of Upcoming Articles

Design and implementation of IaaS‑level monitoring (servers, networks, traffic analysis).

How to design a CMDB for an enterprise monitoring product in the cloud era.

How platform‑level monitoring can support diverse cloud components.

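The author will also try to write this series in an open, crowdsourcing‑style way. The author's WeChat ID will be attached at the end of each article, and readers are welcome to share the pain points and concrete scenarios they encounter when using monitoring and alerting products day to day. This feedback will be evaluated together. A scenario will be adopted if it is typical and widely shared, or if it is niche but represents a particular type of business. Later articles will credit the reader who provided each adopted scenario and include the author's proposed solution for it, for everyone to discuss.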

