Tagged articles
263 articles
Page 3 of 3
360 Quality & Efficiency
Feb 28, 2020 · Operations

External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article details 360's external network quality monitoring system, explaining its background, real‑time detection features, CDN‑based source host selection, three‑layer architecture, data collection and storage pipelines, fault‑diagnosis strategies, and visualization approaches for rapid network fault localization.

Alerting · CDN · Network Monitoring
0 likes · 10 min read
Programmer DD
Feb 15, 2020 · Operations

Understanding Prometheus: Architecture, Data Model, and Alerting Explained

This article provides a comprehensive overview of Prometheus, covering its open‑source monitoring architecture, multi‑dimensional data model, query language, storage mechanisms, service discovery, alerting workflow with Alertmanager, and visualization using Grafana, all illustrated with key diagrams and configuration examples.

Alerting · Grafana · Metrics
0 likes · 9 min read
360 Tech Engineering
Feb 6, 2020 · Operations

External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article describes how 360 implements an external network quality monitoring system that uses CDN nodes as source hosts to perform minute‑level, end‑to‑end ping measurements, stores results in time‑series and other databases, analyzes them to detect VIP, data‑center or ISP faults, and generates visualized alerts and reports for operations teams.

Alerting · CDN · Network Monitoring
0 likes · 8 min read
21CTO
Dec 3, 2019 · Operations

Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains why many system alerts are poorly designed, describes the true purpose of alerts as actionable notifications, distinguishes business rule monitoring from reliability monitoring, and presents practical metrics, strategies, and simple anomaly‑detection algorithms to create high‑quality, actionable alerts for reliable operations.

Alerting · Metrics · Operations
0 likes · 23 min read
Efficient Ops
Nov 19, 2019 · Operations

Designing a Multi‑Layered Monitoring System for Modern IT Operations

This article outlines a comprehensive, layered monitoring architecture for enterprise IT operations, detailing the construction of a centralized platform, responsibilities across infrastructure, server, service, and user‑experience layers, event aggregation, visualization, data integration standards, alert thresholds, and continuous optimization practices.

Alerting · Event Management · System Architecture
0 likes · 34 min read
dbaplus Community
Oct 28, 2019 · Operations

Avoid Common Prometheus Pitfalls: Best Practices for Reliable Monitoring

This article shares practical Prometheus best‑practice tips, covering the accuracy‑reliability trade‑off, self‑monitoring setups, avoiding NFS storage, pruning high‑cardinality metrics, handling rate‑function traps, alert‑graph mismatches, group_interval effects, and the overarching goal of stable, cost‑effective observability.

Alerting · Operations · Prometheus
0 likes · 9 min read
Programmer DD
Sep 20, 2019 · Operations

Master Prometheus: Key Features, Architecture, and Query Essentials

This article introduces Prometheus, an open‑source cloud‑native monitoring and alerting system, covering its main characteristics, core components, architecture diagram, typical use cases, query language syntax, built‑in functions, time‑series types, and practical tips for reliable operation.

Alerting · Operations · PromQL
0 likes · 9 min read
58 Tech
Jul 23, 2019 · Operations

Design and Implementation of an Open Alarm Platform for Monitoring Systems

The Open Alarm Platform provides a flexible data model, modular architecture, and robust stability features to enable various business lines to integrate their custom monitoring systems via APIs, offering alert convergence, merging, multi‑channel delivery, and comprehensive management while reducing development and maintenance costs.

Alerting · Operations · Scalability
0 likes · 9 min read
360 Zhihui Cloud Developer
Jul 18, 2019 · Operations

Why Bosun Beats Alertmanager and Kapacitor for Container Alerting

This article compares three container alerting frameworks—Alertmanager, Kapacitor, and Bosun—explains why Bosun was chosen for its flexible HTTP API rule deployment and low learning curve, and provides step‑by‑step configuration, rule definition, notification, and templating examples for integrating Bosun with Prometheus.

Alerting · Bosun · Configuration
0 likes · 9 min read
HomeTech
Jun 27, 2019 · Operations

Design and Implementation of a Distributed Monitoring System at Autohome

The article describes Autohome's evolution from a Zabbix‑based monitoring setup to a custom, distributed monitoring platform, detailing its architectural components, design goals, implementation choices, product features, and future roadmap for fault localization and dynamic alerting.

Alerting · Distributed Systems · Open-Falcon
0 likes · 6 min read
Architecture Digest
Jun 25, 2019 · Operations

Design and Implementation of a Unified Monitoring and Alert System for MaFengWo Large Transportation Business

This article describes the motivation, architecture, key components, rule engine, alert actions, and practical lessons learned while building a unified monitoring and alarm system for MaFengWo's large‑scale transportation platform, highlighting data collection, Elasticsearch storage, scheduling, and future enhancements.

Alerting · Elasticsearch · architecture
0 likes · 13 min read
Programmer DD
Jun 7, 2019 · Operations

Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains the fundamental flaws of typical alert systems, distinguishes between business rule and reliability monitoring, outlines essential metrics and strategies for effective alerts, and presents simple yet powerful anomaly‑detection algorithms to ensure alerts are actionable and reduce noise.

Alerting · Operations · Reliability
0 likes · 21 min read
360 Quality & Efficiency
May 23, 2019 · Operations

Online Monitoring: Principles, Scope, Types, Implementation and Value Assessment

This article explores the essential concepts of online monitoring, including effective monitoring items, objectives, scope, system and business monitoring types, stakeholder considerations, implementation steps, tool choices, alert strategies, and how to evaluate the overall value of monitoring initiatives.

Alerting · business monitoring · system-monitoring
0 likes · 12 min read
Architects' Tech Alliance
May 13, 2019 · Operations

Comprehensive Guide to System Monitoring: Objectives, Methods, Tools, Processes, and Best Practices

This article provides a thorough overview of system monitoring, covering its objectives, practical methods, core concepts, a comparison of popular open‑source and commercial tools, detailed monitoring processes (using Zabbix as an example), key metrics, alerting strategies, interview tips, and a summary of how organizations extend monitoring solutions.

Alerting · Zabbix · monitoring
0 likes · 17 min read
Efficient Ops
Jan 23, 2019 · Operations

Designing an Operations Monitoring Platform: Tools & Best Practices

This article explores the essential concepts for selecting and building an operations monitoring platform, reviewing popular tools such as Cacti, Nagios, Zabbix, Ganglia, Centreon, Prometheus, and Grafana, and outlines a six‑layer architecture and practical strategies for scaling, alerting, and high‑availability in diverse environments.

Alerting · DevOps · Infrastructure
0 likes · 19 min read
Ctrip Technology
Dec 26, 2018 · Operations

Evolution of Ctrip's Hickwall Monitoring and Alerting Platform: Architecture, InfluxDB Cluster, Data Aggregation, and Stream Alerting

This article details the architectural evolution of Ctrip's Hickwall monitoring and alerting platform, describing the transition from an Elasticsearch‑based first generation to an InfluxDB‑driven second generation, the design of the Incluster storage layer, data aggregation strategies, and the implementation of high‑performance stream‑based alerting.

Alerting · InfluxDB · architecture
0 likes · 12 min read
58 Tech
Dec 26, 2018 · Operations

Overview of the 58 Intelligent Monitoring System and Its Multi‑Dimensional Architecture

The 58 Intelligent Monitoring System provides a flexible, 24/7, multi‑dimensional monitoring solution that covers network, server, system, application and business layers, incorporates AI‑driven prediction, anomaly detection, alarm merging, root‑cause analysis and self‑healing, and offers both PC and WeChat interfaces for operators.

Alerting · Operations · System Architecture
0 likes · 16 min read
dbaplus Community
Nov 28, 2018 · Databases

Boost MySQL Performance: How to Use the PHP Rebuilt Percona PT‑kill with Email & WeChat Alerts

This guide introduces a PHP‑based reimplementation of Percona’s PT‑kill tool that not only terminates long‑running MySQL queries but also adds email and WeChat notifications, explains installation prerequisites, detailed command‑line options, example usages, configuration steps, and how to customize alerts and logging.

Alerting · Database Monitoring · PHP
0 likes · 8 min read
58 Tech
Nov 12, 2018 · Operations

Key Takeaways from the 58 Group Technical Salon on Monitoring Platforms

The article summarizes the 58 Group technical salon where experts from Momo and 58 shared practical experiences on monitoring platform architectures, coverage, alarm configurations, convergence techniques, custom dimensions, multi‑view dashboards, and future directions for intelligent and automated monitoring across the company.

Alerting · DevOps · Operations
0 likes · 9 min read
JD Tech
Oct 29, 2018 · Operations

SGM Service Governance Monitoring Platform: Design, Features, and Use Cases

The article introduces SGM, a comprehensive service governance and monitoring solution that addresses scaling, dependency complexity, and operational challenges by providing automated topology, real‑time tracing, capacity planning, root‑cause analysis, and extensive monitoring features such as performance metrics, JVM stats, call‑chain visualization, business dashboards, and intelligent alerting.

Alerting · Operations · call chain
0 likes · 13 min read
MaGe Linux Operations
Aug 16, 2018 · Operations

Unlock Zabbix Monitoring: Complete Setup, Custom Alerts & Distributed Management

Zabbix offers a web‑based, enterprise‑grade solution for distributed system and network monitoring; this guide walks Linux ops engineers through why monitoring matters, key availability metrics, what to monitor, step‑by‑step installation, web UI configuration, custom checks, alerting, visualization, template sharing, full‑network scaling, auto‑discovery, proxy deployment, and SNMP integration.

Alerting · Distributed Systems · Linux
0 likes · 23 min read
360 Quality & Efficiency
Jul 23, 2018 · Operations

Introduction to X Monitoring System: Architecture, Modules, and Implementation Details

The article presents a detailed overview of the internally developed X Monitoring system, covering its architecture, configuration, reporting and monitoring modules, the use of Redis, Qbus, Elasticsearch and MySQL, as well as both server‑side (API/DB) and agent‑side (PHP) monitoring features, data collection commands, alert thresholds, and overall operational benefits.

Alerting · PHP · System Architecture
0 likes · 5 min read
ITPUB
Jan 18, 2018 · Operations

How to Build Real‑Time User Login Dashboards with MySQL Binlog & Logtail

This guide walks through enabling MySQL binlog, installing Logtail, configuring data collection, indexing, previewing logs, writing custom SQL queries for user login analysis, constructing real‑time dashboards, setting abnormal‑login alerts, and backing up data to OSS for long‑term storage.

Alerting · Binlog · Dashboard
0 likes · 10 min read
Efficient Ops
Jan 16, 2018 · Operations

How Tencent Secures Game Operations: Real Cases, Challenges, and Data‑Driven Solutions

This article shares a comprehensive overview of game operation security at Tencent, covering personal background, real‑world incident cases, the inherent challenges of large‑scale game services, past monitoring efforts, and a new data‑driven alerting framework that dramatically reduces false alarms while protecting game economies.

Alerting · Big Data · Game Security
0 likes · 25 min read
MaGe Linux Operations
Oct 17, 2017 · Operations

Step-by-Step Guide: Build a Zabbix Monitoring System from Scratch

This article walks you through the complete process of setting up Zabbix on a Linux server—including preparing the environment, installing LAMP, configuring the Zabbix server and agent, creating databases, defining templates, items, triggers, graphs, and custom script alerts—to achieve real‑time network traffic monitoring and automated notifications.

Alerting · Network Traffic · monitoring
0 likes · 9 min read
Efficient Ops
Sep 3, 2017 · Operations

How to Design an Enterprise‑Grade Monitoring & Alerting System from Scratch

This article introduces the fundamental concepts, methods, types, goals, and product attributes of enterprise monitoring and alerting, explains the perspective differences between users and builders, and outlines a comprehensive monitoring system architecture for large‑scale operations.

Alerting · Enterprise · Operations
0 likes · 14 min read
ITPUB
Jun 15, 2017 · Backend Development

How to Send Logs and Alerts to WeChat Using wechat_sender

This guide explains how to install, configure, and use the wechat_sender tool—built on wxpy and Tornado—to forward logs, alerts, and scheduled messages from scripts or web services directly to personal or group WeChat chats.

Alerting · WeChat · automation
0 likes · 5 min read
MaGe Linux Operations
May 2, 2017 · Operations

What Is Zabbix? A Deep Dive into Its Features, Architecture, and Deployment

Zabbix is an open‑source, web‑based enterprise monitoring platform that tracks Windows/Linux hosts, network devices, and hardware/software metrics, provides alerting, visualizes data via a customizable PHP web UI, and comprises components such as server, agents, proxies, Java gateway, and API, with flexible templates, discovery, and storage options.

Alerting · IT Operations · Infrastructure
0 likes · 6 min read
Efficient Ops
Apr 12, 2017 · Operations

Mastering Enterprise Monitoring: From Basics to Advanced Toolchains

This comprehensive guide explains why monitoring is vital for operations, outlines clear objectives and methods, compares popular open‑source and commercial tools, details a Zabbix‑based workflow, and covers hardware, system, application, network, security, API, performance, and business metrics with practical alerting strategies.

Alerting · Operations · Zabbix
0 likes · 21 min read
Baidu Intelligent Testing
Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Alerting · Metrics · Operations
0 likes · 12 min read
Java Backend Technology
Mar 5, 2017 · Operations

How to Use ElastAlert with WeChat: Python, Shell, and Java Plugins

This guide explains why WeChat is popular in China and introduces three ElastAlert plugins (shell, Python, Java) that send alerts via WeChat, compares alert channels, outlines prerequisites, and provides step‑by‑step installation and usage instructions with code examples and screenshots.

Alerting · ElastAlert · Python
0 likes · 4 min read
Meituan Technology Team
Feb 24, 2017 · Operations

Improvements and Architecture of Mt-Falcon Monitoring System

Mt‑Falcon, Meituan’s re‑engineered successor to Zabbix, introduces a modular architecture—Agent, Transfer, HBS, Judge, Graph, Alarm, Portal—and extensive refactorings that boost memory efficiency, asynchronous data handling, multi‑condition alerts, and API exposure, enabling over one million QPS, 200 million metrics, and robust, scalable monitoring across the company.

Alerting · architecture · monitoring
0 likes · 24 min read
Efficient Ops
Nov 20, 2016 · Operations

Why Most Log‑Analysis Features Are Overrated and What Really Matters

The article critiques popular but unnecessary log‑analysis features—such as sub‑second alerts, endless pagination, flashy maps, full SQL support, bulk downloads, and live tail—arguing that focusing on practical alert content, efficient querying, and proper architecture yields far more value for IT operations.

Alerting · DSL · Data visualization
0 likes · 10 min read
Nightwalker Tech
Nov 9, 2016 · Operations

Best Practices for Service Monitoring and Alerting in E‑commerce Systems

The discussion outlines essential service‑monitoring techniques for reliably detecting and responding to incidents in e‑commerce environments: health checks, JVM metrics, period‑over‑period (ring‑ratio) analysis of traffic and payments, client‑side exception tracking, third‑party CDN monitoring, alert thresholds, instrumentation via AOP or SDKs, and tooling such as Datadog, Zabbix, and the Elastic stack.

Alerting · e‑commerce · incident response
0 likes · 10 min read
360 Zhihui Cloud Developer
Oct 19, 2016 · Operations

Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

Alerting · Infrastructure · Open-Falcon
0 likes · 7 min read
Ctrip Technology
Aug 12, 2016 · Big Data

Ctrip's Real-Time Data Platform: Architecture, Practices, and Lessons Learned

This article details Ctrip's journey building a unified real-time data platform—covering business motivations, architectural requirements, technology choices like Kafka and Storm, implementation of Avro schemas, monitoring, alerting, operational lessons, and future explorations such as Streaming CQL and JStorm.

Alerting · Big Data · Kafka
0 likes · 15 min read
dbaplus Community
May 11, 2016 · Operations

Inside Twitter’s Massive Monitoring Stack: Architecture, Metrics, and Lessons Learned

Twitter’s internal monitoring team built a full‑stack observability platform that handles billions of metric writes per minute, supports distributed tracing, log aggregation, visual dashboards, and alerting across data centers and public clouds, and shares the architecture, components, and key lessons learned.

Alerting · Distributed Tracing · Metrics
0 likes · 18 min read
21CTO
Mar 22, 2016 · Operations

Build a Scalable Unified Monitoring & Alert Platform with Ganglia & Centreon

This article explains how to design and implement a unified operations monitoring and alerting platform by combining Ganglia for data collection with Centreon for alerting, covering architecture layers, module functions, integration steps, and practical Q&A for large‑scale deployments.

Alerting · Centreon · Ganglia
0 likes · 20 min read
Efficient Ops
Mar 21, 2016 · Operations

How to Build a High‑Performance Unified Monitoring & Alerting Platform

This article outlines a comprehensive design for a high‑performance, unified operations monitoring platform, detailing a six‑layer architecture, the roles of data collection (using Ganglia), data extraction, and alerting modules (with Centreon), and provides practical integration tips, deployment diagrams, and Q&A for large‑scale environments.

Alerting · Centreon · Ganglia
0 likes · 24 min read
Java High-Performance Architecture
Mar 16, 2016 · Operations

How Vipshop’s Three‑Tier Monitoring System Keeps Services Running Smoothly

Vipshop’s three‑tier monitoring system—covering system, application (Mercury), and business layers—collects and analyzes logs from distributed components, providing real‑time metrics, slow‑call detection, error tracing, and configurable alerts to help engineers quickly pinpoint and resolve performance issues.

APM · Alerting · Distributed Systems
0 likes · 4 min read
Efficient Ops
May 21, 2015 · Operations

Open-Falcon: Scalable Open-Source Monitoring System for Modern Operations

Open‑Falcon, an open‑source, enterprise‑grade monitoring solution from Xiaomi’s operations team, offers zero‑configuration data collection, high‑throughput horizontal scaling, flexible alerting, efficient historical queries, and a user‑friendly dashboard, with detailed documentation, quick installation steps, and a highly available architecture.

Alerting · Dashboard · Scalable
0 likes · 6 min read
MaGe Linux Operations
Oct 9, 2014 · Operations

Integrate WeChat Alerts into Nagios Using Node.js

This guide walks you through registering a WeChat public account, setting up a Node.js simulator for login and friend retrieval, installing Node.js, configuring Nagios commands and contacts, and finally enabling WeChat notifications for monitoring alerts.

Alerting · Nagios · Node.js
0 likes · 6 min read