Tagged articles
263 articles
IT Services Circle
Jan 9, 2023 · Operations

Python Server Resource Monitoring and Alerting Scripts

This article presents Python scripts for server‑side and client‑side resource monitoring, automatically checking CPU, memory, disk usage and network traffic, storing alerts in MySQL and optionally sending notifications via email or Enterprise WeChat, with deployment instructions and cron scheduling.

Alerting · Python · Server Monitoring
0 likes · 19 min read
Architecture Digest
Jan 8, 2023 · Operations

Design and Evolution of Vivo Server Monitoring System

This article systematically presents the business background, basic monitoring workflow, usage guidelines, OpenTSDB fundamentals, code precision issues, vmonitor collector architecture, old and new system designs, core alerting metrics, demo illustrations, and a comparison with mainstream monitoring solutions, offering insights for technology selection.

Alerting · OpenTSDB · Server
0 likes · 18 min read
Open Source Linux
Dec 8, 2022 · Operations

Master Prometheus: From Metrics Collection to Alerting and Visualization

Prometheus is an open‑source monitoring solution that covers metric exposition, scraping, storage, querying, visualization, and alerting, and this guide walks through its architecture, configuration, custom exporters, PromQL queries, Grafana integration, and alert management, providing a comprehensive introduction for developers and ops engineers.

Alerting · Exporter · Grafana
0 likes · 22 min read
macrozheng
Nov 19, 2022 · Operations

Unlocking Prometheus: Visual Guide to Architecture, Metrics, and Alerts

This article visually explains Prometheus’s architecture, core features, metric collection methods, exporters, PromQL query language, and alerting workflow, helping readers understand how to monitor cloud‑native systems effectively while noting its strengths and limitations.

Alerting · Exporters · Metrics
0 likes · 8 min read
Open Source Linux
Nov 7, 2022 · Cloud Native

Unlock Scalable Cloud‑Native Alerting with Grafana Mimir: Architecture & Setup

This article explains the current state of cloud‑native alerting, introduces Grafana Mimir as a horizontally scalable, multi‑tenant storage for Prometheus, details its architecture and components, and provides step‑by‑step guidance for installing, configuring, and operating Mimir in Kubernetes environments.

Alerting · Cloud Native · Kubernetes
0 likes · 24 min read
Open Source Linux
Oct 30, 2022 · Operations

Unlock Kubernetes Insights: Master Event Types, Monitoring, and Alerting

This guide explains what Kubernetes events are, how to list and filter them, categorizes common event types, and shows practical ways to collect, store, and alert on events using native commands and open‑source tools, helping teams reduce alert fatigue and improve cluster observability.

Alerting · Events · Kubernetes
0 likes · 11 min read
Efficient Ops
Oct 13, 2022 · Operations

Essential Guide to Effective Monitoring in Operations: Goals, Methods, and Tools

This article outlines the essential components of operational monitoring, covering monitoring objectives, methods, core processes, key tools, metrics for hardware, system, application, network, and business layers, as well as alerting, handling, and best practices for building a comprehensive, reliable monitoring solution.

Alerting · Metrics · system reliability
0 likes · 7 min read
DataFunSummit
Oct 6, 2022 · Big Data

JD Big Data Log Lifecycle and Alerting Best Practices

This article presents a comprehensive overview of JD's big‑data log lifecycle, covering background, platform capabilities, log collection methods, processing functions, storage strategies, query mechanisms, DSL extensions, data delivery, and alerting techniques to help engineers build efficient and reliable log management solutions.

Alerting · ELK · Filebeat
0 likes · 14 min read
Aikesheng Open Source Community
Sep 27, 2022 · Operations

Refactoring Alertmanager: Reducing Noise, Improving Escalation, Suppression, and Silence Management

This article shares practical experiences and solutions for improving an Alertmanager‑based alert system, addressing problems such as noisy alerts, lack of escalation, missing recovery notifications, suppression limitations, and cumbersome silence management by redesigning architecture, adding custom scripts, and extending database support.

Alerting · Alertmanager · Operations
0 likes · 19 min read
NetEase Yanxuan Technology Product Team
Sep 13, 2022 · Operations

How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices

This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.

Alerting · Full‑Link Tracing · Microservices
0 likes · 23 min read
Bilibili Tech
Aug 12, 2022 · Operations

SLO Implementation and Alerting Strategies – Bilibili SRE Practices

The article outlines Bilibili’s refined SLO framework—categorizing services into four business tiers, selecting availability, latency, and freshness SLIs, setting concrete SLO targets, and employing multi‑window error‑budget and consumption‑rate alerting strategies to improve stability and provide comprehensive quality dashboards.

Alerting · Metrics · SLO
0 likes · 18 min read
Open Source Linux
Aug 11, 2022 · Operations

Master Zabbix: From Installation to Advanced Monitoring and Alerting

This comprehensive guide explains why monitoring is essential, describes reliability metrics, walks through Zabbix installation, web UI configuration, custom monitoring, trigger creation, alert integration, distributed monitoring, SNMP support, and large‑scale server monitoring using scripts, APIs, and auto‑discovery.

Alerting · Proxy · SNMP
0 likes · 24 min read
MaGe Linux Operations
Jul 12, 2022 · Operations

Master Zabbix: From Installation to Advanced Custom Monitoring and Alerts

This guide walks through Zabbix monitoring fundamentals, covering why monitoring matters, installing Zabbix, configuring servers and agents, creating custom checks, setting up alerts with OneAlert, visualizing data, leveraging auto‑discovery, distributed proxies, and SNMP integration to comprehensively monitor large server fleets.

Alerting · Zabbix · automation
0 likes · 28 min read
Selected Java Interview Questions
Jul 6, 2022 · Operations

Grafana 9.0 New Features and Improvements Overview

Grafana 9.0 introduces a suite of usability enhancements—including a visual Prometheus query builder, a visual Loki LogQL generator, improved Explore‑to‑dashboard workflow, revamped heatmap panel, command palette, panel search, trace panel, navigation upgrades, and alerting refinements—aimed at simplifying observability, data visualization, and operational efficiency.

Alerting · Dashboard · Grafana
0 likes · 7 min read
Architecture Digest
Jul 2, 2022 · Operations

Design and Evolution of Vivo Server‑Side Monitoring System

This article systematically outlines the design, components, data flow, and evolution of Vivo’s server‑side monitoring system, covering data collection, transmission, storage with OpenTSDB, visualization, alerting mechanisms, and comparisons with other monitoring solutions.

Alerting · OpenTSDB · Operations
0 likes · 19 min read
21CTO
Jun 28, 2022 · Operations

Master Prometheus: From Metrics Collection to Alerts and Grafana Visualization

This comprehensive guide walks you through Prometheus fundamentals, including metric exposure, scraping, storage, querying with PromQL, custom exporter creation in Go, dynamic configuration reloading, and visualizing data with Grafana, while also covering alerting with Alertmanager and best practices for accurate histogram bucket design.

Alerting · Grafana · Metrics
0 likes · 20 min read
IT Architects Alliance
Jun 27, 2022 · Operations

Comprehensive Guide to Prometheus: Metrics Collection, Storage, Querying, Alerting and Visualization

This article provides a detailed overview of Prometheus, covering its architecture, metric exposure, scraping models, storage format, metric types, custom exporter implementation in Go, PromQL query language, built‑in functions, Grafana integration, and alerting with Alertmanager, offering practical code examples throughout.

Alerting · Go · Grafana
0 likes · 20 min read
Tencent Cloud Developer
May 30, 2022 · Cloud Native

An Introduction to Prometheus: Metrics Collection, Storage, Querying, Visualization and Alerting

Prometheus is an open‑source monitoring system that scrapes metrics from services or exporters, stores them in a time‑series database, lets users query with PromQL, visualizes data via its web UI or Grafana, and sends alerts through Alertmanager, supporting custom Go metrics, various discovery methods, and four metric types.

Alerting · Go · Grafana
0 likes · 21 min read
Programmer DD
Apr 11, 2022 · Backend Development

Unlock Dynamic Thread Pool Management with Hippo4J: Features, Modes, and Benefits

This article introduces Hippo4J, a Java dynamic thread‑pool solution inspired by Meituan's design, detailing its web‑based parameter tuning, monitoring, alerting capabilities, two deployment modes (lightweight with config‑center and standalone server), and the operational advantages it brings to developers and operators.

Alerting · Dynamic Thread Pool · Hippo4J
0 likes · 5 min read
YunZhu Net Technology Team
Feb 24, 2022 · Big Data

Design and Implementation of a Comprehensive Monitoring System for a Big Data Platform

This article describes the end‑to‑end design, metric hierarchy, data collection methods, visualization dashboards, and alerting mechanisms used to build a robust monitoring system for a large‑scale big‑data platform, covering physical hosts, Hadoop components, business services, and data layers with tools such as Telegraf, Prometheus, and Grafana.

Alerting · Grafana · Prometheus
0 likes · 14 min read
DaTaobao Tech
Feb 21, 2022 · Frontend Development

Focused Gray Release Monitoring and Alert Configuration for Frontend Quality

To raise front‑end quality, the team implements gray‑release monitoring that triggers log analysis at a 5 % rollout, automatically generates reports within ten minutes, and uses dynamic thresholds and noise‑reduction tactics to detect errors early, enabling rapid rollback or expansion and markedly improving stability and release efficiency.

Alerting · Metrics · frontend
0 likes · 9 min read
Beike Product & Technology
Feb 18, 2022 · Operations

KeMonitor Alert Platform: Systematic Alert Governance and Practices

The article presents a comprehensive case study of KeMonitor, a one‑stop monitoring and alert platform built by Beike (贝壳找房) to unify fragmented alerts, define lifecycle‑based governance, standardize alert metadata, implement graded subscription, on‑call escalation, silencing, self‑healing, and post‑mortem analysis, thereby improving incident response efficiency and reducing alert fatigue.

Alerting · SOP · incident response
0 likes · 17 min read
Ctrip Technology
Feb 17, 2022 · Operations

Evolution and Architecture of the Hickwall Enterprise Monitoring Platform

The article details the background, challenges, multi‑year evolution, current architecture, and future roadmap of Hickwall, Ctrip's enterprise‑grade monitoring and observability platform, covering metrics, logs, traces, high‑cardinality handling, cloud‑native integration, alert governance, and storage engine migrations.

Alerting · Operations · TSDB
0 likes · 15 min read
vivo Internet Technology
Feb 16, 2022 · Operations

Vivo Server Monitoring System Architecture and Evolution: A Comprehensive Technical Guide

Vivo’s vmonitor system replaces its legacy RabbitMQ‑based pipeline with an HTTP‑driven collector and gateway, stores minute‑level JVM, system, and business metrics in a customized OpenTSDB on HBase, adds precise floating‑point handling and null‑aware aggregation, buffers data in Redis, and provides multi‑dimensional alerts comparable to Zabbix, Open‑Falcon, and Prometheus.

Alerting · Distributed Monitoring · JVM Monitoring
0 likes · 18 min read
Ops Development Stories
Jan 21, 2022 · Operations

How to Combine ELK and Zabbix for Real‑Time Log Alerting

This guide explains how to integrate ELK's Logstash with Zabbix using the logstash‑output‑zabbix plugin, covering installation, configuration of Logstash pipelines, Zabbix template and trigger setup, and testing the end‑to‑end alerting workflow.

Alerting · ELK · Log Monitoring
0 likes · 17 min read
Zhuanzhuan Tech
Jan 5, 2022 · Operations

Design and Implementation of a Multi‑Dimensional Monitoring Platform Based on Prometheus and M3DB

This article details the background, research, architecture, performance testing, and deployment of a comprehensive monitoring system that leverages Prometheus, Grafana, and M3DB to provide flexible metric collection, automatic dashboard generation, and a custom alerting service for large‑scale business services.

Alerting · Metrics · Time Series
0 likes · 16 min read
Liulishuo Tech Team
Dec 30, 2021 · Operations

Design and Implementation of an Alert Scheduling System (GoAlert) and Notification Center

This article explains why alerts and on‑call scheduling are needed, outlines the core principles of an alert scheduling system, describes the architecture evolution from PagerDuty to GoAlert and Notice‑Center, and details the implementation, code snippets, and future outlook for a comprehensive operations monitoring solution.

Alerting · Notification System · goalert
0 likes · 14 min read
Programmer DD
Dec 12, 2021 · Operations

How Netflix’s Telltale Transforms Monitoring for 100+ Services

This article explains Netflix’s home‑grown monitoring system Telltale, detailing its design, multi‑dimensional health‑assessment model, intelligent alerting, integration with Slack, deployment monitoring, and continuous optimization that together keep over a hundred production applications running smoothly.

Alerting · Microservices · Netflix
0 likes · 13 min read
Youzan Coder
Dec 8, 2021 · Big Data

How to Build a Real‑Time Data Quality Monitoring System with Flink

This article outlines a comprehensive approach to monitoring and ensuring the accuracy and timeliness of real‑time data streams, detailing background challenges, solution design, implementation steps using Flink and automated testing, alert handling procedures, and future improvement plans.

Alerting · Data Quality · Flink
0 likes · 10 min read
dbaplus Community
Nov 22, 2021 · Databases

Transforming MySQL Monitoring: From Nagios to Kafka‑Powered Alerts

Qunar’s DBA team overhauled their MySQL monitoring and alert system—originally built on Nagios and NRPE—by integrating a Kafka‑based pipeline, a custom alarm service, and MySQL‑stored alert templates, achieving flexible thresholds, granular silencing, high‑availability processing, and early‑stage intelligent management of alerts, slow queries, and disk space.

Alerting · DBA · Kafka
0 likes · 14 min read
Baidu Geek Talk
Nov 1, 2021 · Frontend Development

How Baidu’s Ad Hosting Team Built a Scalable Front‑End Exception Monitoring System

This article shares Baidu’s ad‑hosting team experience in designing, collecting, alerting, investigating, and remediating front‑end exceptions—covering generic and business‑specific error tracking, data protocols, monitoring strategies, alert tuning, and practical governance to improve user experience and ad performance.

Alerting · Baidu · Error Handling
0 likes · 25 min read
Alibaba Cloud Developer
Oct 15, 2021 · Operations

How Unified Observability Transforms Quality Management in Cloud‑Native Environments

This article explores the challenges of quality monitoring in cloud‑native DevOps pipelines, outlines pain points of massive heterogeneous logs and alerts, and presents a unified observability platform that enables data consolidation, AI‑driven intelligent inspection, and smart alert management to improve system reliability.

AI · Alerting · Data Unification
0 likes · 17 min read
Baidu Intelligent Testing
Sep 16, 2021 · Operations

Baidu Game Microservice Monitoring Practice: System Design and Evolution

This article describes Baidu's game microservice monitoring practice, detailing the initial challenges, system design, risk control, intelligent monitoring, multi‑dimensional visualization, smart alerting, and efficient fault localization, illustrating how a systematic approach improves detection speed, coverage, and issue resolution for large‑scale online games.

Alerting · Game Development · monitoring
0 likes · 12 min read
Baidu Geek Talk
Jul 14, 2021 · Operations

How Baidu Built a Robust Microservice Monitoring System for Game Services

This article details Baidu's comprehensive microservice monitoring practice for its game platform, covering the initial fragmented setup, systematic redesign across risk control, intelligent monitoring, smart alerting, and rapid fault localization, and presents the resulting monitoring architecture, visualizations, and future improvement goals.

Alerting · Baidu · Microservices
0 likes · 14 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI
0 likes · 7 min read
Architecture Digest
Jun 22, 2021 · Operations

Netflix’s Telltale: An Intelligent Monitoring and Alerting System for Application Health

The article details Netflix’s internally built Telltale monitoring platform, explaining its motivation, key features such as multi‑dimensional health assessment, smart alerting, event management, deployment monitoring, and continuous optimization, and how it improves operational efficiency for over a hundred production services.

Alerting · Netflix · Telltale
0 likes · 12 min read
Full-Stack Internet Architecture
Jun 19, 2021 · Operations

Solving Monitoring Pain Points: Unified Framework, Alert Prioritization, and Classification

The article discusses common monitoring challenges such as fragmented tooling and noisy alerts, and proposes solutions including consolidating to a single monitoring framework, prioritizing runtime exceptions, and classifying business alerts with codes and trace information to improve incident response.

Alerting · best-practices · incident management
0 likes · 6 min read
58 Tech
Jun 11, 2021 · Frontend Development

Beidou Frontend Monitoring System: Architecture, Challenges, and Solutions

The article details the design, architecture, and operational challenges of the Beidou frontend monitoring platform at 58 Group, covering SDK management, behavior trace logging, front‑back link integration, performance optimizations, minute‑level alerting, and permission management.

Alerting · architecture · frontend
0 likes · 22 min read
Youzan Coder
Jun 9, 2021 · Mobile Development

Mobile SkyNet Platform: Architecture, Log Collection, Storage, and Alerting Design

The Mobile SkyNet platform adds a dedicated mobile monitoring layer to SaaS services, using Zanlogger for error, warning, and info logs, Kafka‑HBase pipelines for high‑throughput storage, WeChat‑based alerting, and an MPaaS console for issue visualization, reducing mobile‑side incidents by about twenty percent.

Alerting · Backend Integration · Log Monitoring
0 likes · 11 min read
Efficient Ops
Jun 6, 2021 · Databases

How We Built a Scalable Database Monitoring System for Real‑Time Alerts

This article details the design and implementation of a comprehensive database monitoring platform that automatically adapts to cluster changes, aggregates host and DB metrics, offers flexible alert templates and strategies, stores data in InfluxDB, and provides customizable dashboards for real‑time insight and incident response.

Alerting · Database Monitoring · InfluxDB
0 likes · 12 min read
TAL Education Technology
May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

Alerting · Grafana · Operations
0 likes · 12 min read
Liulishuo Tech Team
May 26, 2021 · Operations

Custom Prometheus Monitoring Architecture and GitOps Practices at Liulishuo

This article details Liulishuo's customized Prometheus monitoring architecture, including data backup to Aliyun SLS, ECS service discovery, advanced alerting with PagerDuty and GoAlert, GitOps-driven config management, cloud resource exporters, SLA monitoring, and future plans for storage and alert pipelines.

Alerting · cloud-native · monitoring
0 likes · 9 min read
dbaplus Community
May 18, 2021 · Operations

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

Alerting · Metrics · Operations
0 likes · 25 min read
Big Data Technology & Architecture
Apr 26, 2021 · Operations

Comprehensive Guide to Prometheus: Installation, Configuration, PromQL, Exporters, Grafana, and Alerting

This article provides a complete tutorial on Prometheus, covering its origins, core features, installation methods (binary and Docker), configuration file structure, PromQL basics, HTTP API usage, Grafana integration, various exporters for metrics collection, and alerting with Alertmanager, all within a cloud‑native monitoring context.

Alerting · Exporters · Grafana
0 likes · 32 min read
ITFLY8 Architecture Home
Apr 23, 2021 · Operations

How JD’s Open Platform Guarantees Reliable Message Delivery with Dynamic BMQ Design

This article explains how JD’s Open Platform’s Business Message Queue (BMQ) architecture, dynamic channels, retry and downgrade mechanisms, and real‑time monitoring ensure reliable, low‑risk message delivery across thousands of merchants while simplifying integration and scaling for future growth.

Alerting · Dynamic Configuration · JD Open Platform
0 likes · 10 min read
iQIYI Technical Product Team
Mar 12, 2021 · Operations

Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform

To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus and CAT, selected CAT, deployed separate mainland and overseas clusters, added configurable access, health‑check and integrated alert channels, enabling five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.

Alerting · CAT · DevOps
0 likes · 12 min read
Top Architect
Mar 6, 2021 · Operations

Spring Boot Monitoring with Prometheus and Grafana: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on setting up Spring Boot application monitoring using Prometheus and Grafana, covering project creation, dependency configuration, security setup, Prometheus server installation, Grafana dashboard creation, email alerting configuration, and testing the end‑to‑end alert workflow.

Alerting · Backend · Spring Boot
0 likes · 10 min read
Programmer DD
Jan 15, 2021 · Operations

Why Does Prometheus Sometimes Fail to Trigger Alerts?

This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the for parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.

Alerting · Grafana · Ops
0 likes · 7 min read
Ops Development Stories
Jan 7, 2021 · Operations

Master Blackbox Exporter: Install, Configure, and Alert with Prometheus

This guide walks through the concepts of white‑box vs black‑box monitoring, explains Prometheus Blackbox Exporter capabilities, shows step‑by‑step installation, Kubernetes configuration, probe definitions for HTTP, TCP, ICMP and SSL, and provides ready‑to‑use alert rules and Grafana dashboard integration.

Alerting · Blackbox Exporter · Kubernetes
0 likes · 11 min read
Youzan Coder
Dec 30, 2020 · Operations

ERROR Log Governance and Monitoring Alerting Practice at Youzan

Youzan’s log‑governance guide uses a car‑dashboard analogy to show why precise ERROR logs and sensible alerts matter, defines INFO/WARN/ERROR levels, sets daily reduction targets, leverages top‑error analysis and water‑level monitoring, and ultimately cut daily ERROR entries from thousands to about one hundred while catching issues before incidents.

Alerting · Error Handling · Log Management
0 likes · 9 min read
Architecture Digest
Dec 13, 2020 · Operations

Netflix’s Telltale: Simplifying Application Monitoring and Intelligent Alerting

The article describes Netflix’s internally built monitoring system Telltale, explaining its motivations, core features such as unified data views, multi‑dimensional health assessment, intelligent alerting, Slack integration, deployment monitoring, and continuous optimization to reduce on‑call fatigue and improve service reliability.

Alerting · Microservices · Netflix
0 likes · 12 min read
High Availability Architecture
Nov 26, 2020 · Operations

Implementing Unified Monitoring Dashboards and Rich‑Text Alerts with Grafana FlowCharting and ImageRender at Meitu

This article explains Meitu's monitoring architecture and presents two practical, low‑effort implementations: a Grafana FlowCharting unified dashboard and a Grafana ImageRender + WeChat Work rich‑text alert solution. It details step‑by‑step procedures, required tools, and sample code to help SRE teams quickly adopt them.

Alerting · Dashboard · FlowCharting
0 likes · 22 min read
Programmer DD
Jul 30, 2020 · Cloud Native

Master Prometheus: Practical Tips, Exporter Strategies, and Scaling Challenges

This comprehensive guide explores Prometheus monitoring fundamentals, key design principles, exporter selection for Kubernetes, advanced configuration tricks, capacity planning, high‑cardinality pitfalls, HA architectures, and integration with Grafana, Alertmanager, and Thanos to help you build reliable cloud‑native observability pipelines.

Alerting · Exporter · Grafana
0 likes · 36 min read
dbaplus Community
Jul 20, 2020 · Operations

How to Build Reliable Monitoring for Low‑Frequency Financial Services

After two years transitioning from e‑commerce to finance, the team shares practical monitoring strategies for low‑frequency financial services, contrasting e‑commerce traffic‑based methods with finance‑specific challenges, and detailing point‑based metrics, hourly success‑rate alerts, aspect‑oriented exception handling, white‑list filtering, and Sentinel‑based circuit breaking.

Alerting · Aspect Oriented Programming · Circuit Breaking
0 likes · 16 min read
Full-Stack Internet Architecture
Jul 12, 2020 · Operations

Monitoring Practices for Low‑Frequency Financial Services: Lessons from E‑commerce and Reliable Alerting Techniques

This article shares practical monitoring strategies for financial services with low‑frequency operations. It contrasts them with e‑commerce monitoring methods, outlines the challenges specific to financial monitoring, and presents reliable solutions such as success‑rate alerts, aspect‑oriented exception handling with whitelists, and circuit breaking with degradation using Sentinel.

Alerting · Aspect Oriented Programming · Financial Services
0 likes · 14 min read
HaoDF Tech Team
Jul 8, 2020 · Operations

How We Rebuilt Our Monitoring System into a Scalable Alert Service

After two months of intensive development, the team launched a new monitoring and alerting platform that transforms a legacy system into a service‑oriented solution, addressing pain points such as inflexible escalation, noisy alerts, and poor ownership while introducing phone alerts, automated escalation, Prometheus integration, and a unified rule engine.

Alerting · DevOps · Prometheus
0 likes · 16 min read
Liangxu Linux
Jun 13, 2020 · Operations

Mastering Monitoring: From Basics to Advanced Zabbix Practices

This comprehensive guide explains why monitoring is essential for operations, outlines monitoring goals and methods, reviews a wide range of open‑source tools, details a Zabbix‑based workflow, enumerates key metrics across hardware, system, application, network, security and business layers, and offers practical alerting and interview tips.

Alerting · Operations · Zabbix
0 likes · 21 min read
iQIYI Technical Product Team
Jun 12, 2020 · Operations

Microservice Monitoring Practices at iQIYI: Architecture, Metrics, and Automation

iQIYI’s micro‑service monitoring combines low‑cost automatic instrumentation, declarative method metrics, and push‑gateway data into a unified multi‑dimensional schema, visualized centrally in Grafana and managed with standardized alert rules, demonstrating that simple integration, centralized dashboards, and early‑stage governance enable rapid anomaly detection and effective incident response.

Alerting · Metrics · Prometheus
0 likes · 14 min read
dbaplus Community
May 18, 2020 · Databases

Deploy and Use the Open‑Source MongoDB Visual Monitoring Tool (mongo_monitor)

This guide explains how to set up the mongo_monitor tool—a PHP‑based graphical monitor for MongoDB—by installing required PHP extensions, configuring a MySQL schema, adding MongoDB credentials, customizing email and WeChat alerts, scheduling data collection via cron, and accessing the web dashboard.

Alerting · Database Tools · Deployment
0 likes · 9 min read
Efficient Ops
May 11, 2020 · Operations

How Nightingale Transforms Monitoring for Scalable Stability

This article introduces Didi's open‑source monitoring system Nightingale, detailing its design, architecture, key improvements over Open‑Falcon, and how its flexible alerting and data handling capabilities support the full lifecycle of stability engineering in large‑scale operations.

Alerting · DevOps · Time Series
0 likes · 23 min read
MaGe Linux Operations
May 10, 2020 · Databases

How to Build a Complete MySQL Monitoring Dashboard with Prometheus and Grafana

This guide walks through deploying mysqld_exporter, configuring Prometheus and Grafana, and monitoring essential MySQL metrics such as replication health, query throughput, slow‑query counts, connection usage, and InnoDB buffer‑pool statistics, while also showing how to set up alert rules for proactive database operations.

Alerting · Exporters · Grafana
0 likes · 15 min read
Liangxu Linux
Apr 29, 2020 · Operations

How to Build a Complete Monitoring System: Goals, Methods, Tools & Best Practices

This guide explains why monitoring is essential for the entire operations lifecycle, outlines key monitoring objectives, describes practical methods and workflows, reviews a range of open‑source tools (including Zabbix, MRTG, Ganglia, Nagios, Smokeping, OpenTSDB), and details metric categories such as hardware, system, application, network, log, security, API, performance and business monitoring.

Alerting · Metrics · Zabbix
0 likes · 22 min read
dbaplus Community
Apr 25, 2020 · Operations

Master Blackbox Exporter: Install, Configure, and Monitor with Prometheus

This guide explains the concepts of white‑box and black‑box monitoring, introduces Prometheus Blackbox Exporter, walks through installation, systemd setup, and detailed Prometheus configurations for HTTP, TCP, ICMP, POST and SSL checks, shows Grafana dashboard integration, and provides alert rule examples for reliable service health monitoring.

Alerting · Blackbox Exporter · Grafana
0 likes · 13 min read
Efficient Ops
Mar 22, 2020 · Operations

Why Nightingale Is Shaping the Future of Enterprise Monitoring

Nightingale, an open‑source enterprise monitoring platform from Didi, combines cloud‑native design, high availability, flexible plugins, and a powerful object‑tree navigation to meet the monitoring needs of both small clusters and massive deployments, while extending and improving upon Open‑Falcon.

Alerting · Operations · architecture
0 likes · 10 min read