Topic: monitoring

Collection size: 1767 articles
Page 10 of 89
Efficient Ops
May 30, 2023 · Operations

Mastering Fault Self-Healing: Automate Disk Alerts and Scale Operations

Discover how to transform nightly disk‑space alerts into automated self‑healing workflows, covering prerequisite standards, multi‑dimensional monitoring, CMDB integration, script‑based remediation, and multi‑channel notifications to scale operations across thousands of servers without manual intervention.

CMDB · DevOps · Monitoring
0 likes · 10 min read
Efficient Ops
May 12, 2023 · Operations

Designing an Intelligent Performance Testing Platform: From Vision to Implementation

This article describes how a bank’s IT team transformed its performance testing by defining intelligent platform capabilities, designing a modular architecture, and implementing features such as automated risk identification, smart test case generation, data synthesis, multi‑protocol support, chaos injection, and automated result analysis using JMeter, Prometheus, and custom plugins.

Chaos Engineering · JMeter · Monitoring
0 likes · 11 min read
Efficient Ops
Apr 23, 2023 · Operations

Compare Cacti, Nagios, Zabbix, Prometheus, Grafana, Nightingale, Open-Falcon

This article reviews several popular open‑source monitoring tools—Cacti, Nagios, Zabbix, Prometheus, Grafana, Nightingale, and Open‑Falcon—detailing their core features, data collection methods, visualization capabilities, and typical use cases for IT operations.

Cacti · Grafana · Monitoring
0 likes · 7 min read
Efficient Ops
Aug 31, 2022 · Operations

How to Build Scalable Fault Self‑Healing for Modern Operations

This article explains why traditional manual responses to alerts are insufficient, outlines the concept of fault self‑healing, and provides a step‑by‑step guide on establishing standards, monitoring dimensions, a unified CMDB, automation tools, and notification channels to achieve automated recovery at scale.

CMDB · Monitoring · automation
0 likes · 9 min read
Efficient Ops
Aug 17, 2022 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to build a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, full‑link tracing, and ELK for observability and performance troubleshooting.

Monitoring · Observability · Prometheus
0 likes · 13 min read
Efficient Ops
Aug 8, 2022 · Operations

Master Essential Linux Ops: xargs, Background Jobs, Process Monitoring & More

This guide walks you through practical Linux operations—from using xargs for efficient file handling and running commands in the background, to monitoring high‑memory and high‑CPU processes, viewing multiple logs with multitail, continuous ping logging, checking TCP states, identifying top IPs on port 80, and leveraging SSH for port forwarding.

Linux · Monitoring · multitail
0 likes · 10 min read
Efficient Ops
Nov 24, 2021 · Operations

Why Switch to Loki? Step‑by‑Step Installation and Grafana Visualization

This guide explains why Loki is a lightweight alternative to EFK/ELK, walks through installing Loki and Promtail binaries, configuring them with YAML files, and visualizing logs in Grafana using LogQL, providing a complete end‑to‑end log management solution.

Grafana · Loki · Monitoring
0 likes · 6 min read
Efficient Ops
Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Cloud Native · Monitoring · auto scaling
0 likes · 17 min read
Efficient Ops
Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Anomaly Detection · Monitoring · large-scale systems
0 likes · 4 min read
Efficient Ops
May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

AIOps · Fault Prediction · Monitoring
0 likes · 18 min read
Efficient Ops
Oct 16, 2018 · Operations

How Tencent Built an AI‑Powered Network Fault Detection System in Minutes

In this talk, Tencent’s infrastructure lead explains how their team created an AI‑driven, three‑minute fault detection and recovery pipeline—combining high‑precision Meshping monitoring, multi‑KPI analytics, and automated Moveout isolation—to dramatically shorten network outage resolution from hours to minutes.

AIOps · Monitoring · Network Operations
0 likes · 18 min read
Efficient Ops
May 21, 2018 · Databases

Designing Scalable MySQL Cloud DBaaS: Architecture, Availability, and Future Plans

This article summarizes the design and evolution of a MySQL cloud DBaaS platform, covering MySQL 8.0 features, the need for DBaaS, multi‑generation architecture, service and data availability strategies, monitoring, DTS design, and the upcoming roadmap for broader database support and hybrid cloud deployment.

DBaaS · DTS · Database Design
0 likes · 13 min read
Efficient Ops
Dec 5, 2017 · Operations

How Alibaba’s Sunfire Achieves Second‑Level Monitoring at Trillion‑Transaction Scale

This article explains how Alibaba’s Sunfire monitoring platform processes terabytes of logs per minute, uses a pull‑based architecture with Brain‑Reduce‑Map roles, tackles scalability and reliability challenges, and outlines future directions such as MQL standardization and intelligent baselines.

Monitoring · Real-time · large scale
0 likes · 17 min read
Linux Ops Smart Journey
Jun 6, 2025 · Operations

How to Build a Complete Longhorn Monitoring System with Prometheus & Grafana

This guide explains how to monitor Longhorn storage in Kubernetes by collecting metrics with Prometheus, configuring scraping, verifying data collection, and visualizing everything in Grafana, enabling proactive performance tuning and reliable operations.

Cloud Native · Grafana · Kubernetes
0 likes · 6 min read
Linux Ops Smart Journey
Apr 20, 2025 · Operations

Visualize Kubernetes Events: Store in Elasticsearch and Dashboard with Grafana

This guide explains how to store Kubernetes event data in Elasticsearch, configure Logstash and Ruby filters for timestamp correction, and create a Grafana dashboard to visualize and analyze cluster events for improved monitoring and troubleshooting.

Elasticsearch · Grafana · K8s Events
0 likes · 4 min read
Linux Ops Smart Journey
Apr 16, 2025 · Operations

How to Build a Robust Elasticsearch Monitoring System with Prometheus & Grafana

Learn step‑by‑step how to deploy the Elasticsearch‑exporter via Helm, configure Prometheus to scrape its metrics, and visualize them in Grafana, enabling comprehensive monitoring of Elasticsearch clusters for performance, health, and early issue detection in Kubernetes environments.

Elasticsearch · Exporter · Grafana
0 likes · 7 min read
Linux Ops Smart Journey
Jan 7, 2025 · Operations

Enable Nacos Metrics in Prometheus and Visualize with Grafana

This guide shows how to enable Nacos metrics, configure Prometheus to scrape them, and visualize the data with a Grafana dashboard, providing a centralized view across different departments for enterprise monitoring and decision‑making.

Grafana · Kubernetes · Metrics
0 likes · 4 min read
macrozheng
Jun 9, 2025 · Backend Development

Mastering Redis Hotspot Keys: Detection, Risks, and Solutions

This article explains what Redis hotspot keys are, the performance and stability issues they cause, common causes, how to monitor and identify them, and practical mitigation strategies such as cluster scaling, key sharding, and multi‑level caching.

Monitoring · Performance · Redis
0 likes · 10 min read
macrozheng
Nov 8, 2022 · Operations

Choosing the Right Open‑Source Monitoring System: Zabbix, Open‑Falcon, Prometheus

This article provides a systematic overview of monitoring fundamentals, compares three popular open‑source monitoring solutions—Zabbix, Open‑Falcon, and Prometheus—and offers practical guidance for selecting the most suitable system based on scale, features, and operational needs.

Monitoring · Open-Falcon · Prometheus
0 likes · 21 min read
Raymond Ops
Jun 4, 2025 · Operations

Mastering SFTP: Complete Planning, Configuration, and High‑Availability Guide

This guide walks you through SFTP server planning, user naming conventions, directory structures, SSH configuration, account creation, permission setup, client usage, log auditing, rotation, connection limits, monitoring, and high‑availability deployment across multiple servers, providing ready‑to‑run commands and scripts.

ACL · High Availability · Linux
0 likes · 14 min read