monitoring | BestHub

Collection size

1767 articles

Efficient Ops

May 30, 2023 · Operations

Mastering Fault Self-Healing: Automate Disk Alerts and Scale Operations

Discover how to transform nightly disk‑space alerts into automated self‑healing workflows, covering prerequisite standards, multi‑dimensional monitoring, CMDB integration, script‑based remediation, and multi‑channel notifications to scale operations across thousands of servers without manual intervention.

CMDB · DevOps · Monitoring

0 likes · 10 min read

Efficient Ops

May 12, 2023 · Operations

Designing an Intelligent Performance Testing Platform: From Vision to Implementation

This article describes how a bank’s IT team transformed its performance testing by defining intelligent platform capabilities, designing a modular architecture, and implementing features such as automated risk identification, smart test case generation, data synthesis, multi‑protocol support, chaos injection, and automated result analysis using JMeter, Prometheus, and custom plugins.

Chaos Engineering · JMeter · Monitoring

0 likes · 11 min read

Efficient Ops

Apr 23, 2023 · Operations

Compare Cacti, Nagios, Zabbix, Prometheus, Grafana, Nightingale, Open-Falcon

This article reviews several popular open‑source monitoring tools—Cacti, Nagios, Zabbix, Prometheus, Grafana, Nightingale, and Open‑Falcon—detailing their core features, data collection methods, visualization capabilities, and typical use cases for IT operations.

Cacti · Grafana · Monitoring

0 likes · 7 min read

Efficient Ops

Aug 31, 2022 · Operations

How to Build Scalable Fault Self‑Healing for Modern Operations

This article explains why traditional manual responses to alerts are insufficient, outlines the concept of fault self‑healing, and provides a step‑by‑step guide on establishing standards, monitoring dimensions, a unified CMDB, automation tools, and notification channels to achieve automated recovery at scale.

CMDB · Monitoring · automation

0 likes · 9 min read

Efficient Ops

Aug 17, 2022 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to build a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, full‑link tracing, and ELK for observability and performance troubleshooting.

Monitoring · Observability · Prometheus

0 likes · 13 min read

Efficient Ops

Aug 8, 2022 · Operations

Master Essential Linux Ops: xargs, Background Jobs, Process Monitoring & More

This guide walks you through practical Linux operations—from using xargs for efficient file handling and running commands in the background, to monitoring high‑memory and high‑CPU processes, viewing multiple logs with multitail, continuous ping logging, checking TCP states, identifying top IPs on port 80, and leveraging SSH for port forwarding.

Linux · Monitoring · multitail

0 likes · 10 min read

Efficient Ops

Nov 24, 2021 · Operations

Why Switch to Loki? Step‑by‑Step Installation and Grafana Visualization

This guide explains why Loki is a lightweight alternative to EFK/ELK, walks through installing Loki and Promtail binaries, configuring them with YAML files, and visualizing logs in Grafana using LogQL, providing a complete end‑to‑end log management solution.

Grafana · Loki · Monitoring

0 likes · 6 min read

Efficient Ops

Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Cloud Native · Monitoring · auto scaling

0 likes · 17 min read

Efficient Ops

Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Anomaly Detection · Monitoring · large-scale systems

0 likes · 4 min read

Efficient Ops

May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

AIOps · Fault Prediction · Monitoring

0 likes · 18 min read

Efficient Ops

Oct 16, 2018 · Operations

How Tencent Built an AI‑Powered Network Fault Detection System in Minutes

In this talk, Tencent’s infrastructure lead explains how their team created an AI‑driven, three‑minute fault detection and recovery pipeline—combining high‑precision Meshping monitoring, multi‑KPI analytics, and automated Moveout isolation—to dramatically shorten network outage resolution from hours to minutes.

AIOps · Monitoring · Network Operations

0 likes · 18 min read

Efficient Ops

May 21, 2018 · Databases

Designing Scalable MySQL Cloud DBaaS: Architecture, Availability, and Future Plans

This article summarizes the design and evolution of a MySQL cloud DBaaS platform, covering MySQL 8.0 features, the need for DBaaS, multi‑generation architecture, service and data availability strategies, monitoring, DTS design, and the roadmap for broader database support and hybrid cloud deployment.

DBaaS · DTS · Database Design

0 likes · 13 min read

Efficient Ops

Dec 5, 2017 · Operations

How Alibaba’s Sunfire Achieves Second‑Level Monitoring at Trillion‑Transaction Scale

This article explains how Alibaba’s Sunfire monitoring platform processes terabytes of logs per minute, uses a pull‑based architecture with Brain‑Reduce‑Map roles, tackles scalability and reliability challenges, and outlines future directions such as MQL standardization and intelligent baselines.

Monitoring · Real-time · large scale

0 likes · 17 min read

Linux Ops Smart Journey

Jun 6, 2025 · Operations

How to Build a Complete Longhorn Monitoring System with Prometheus & Grafana

This guide explains how to monitor Longhorn storage in Kubernetes by collecting metrics with Prometheus, configuring scraping, verifying data collection, and visualizing everything in Grafana, enabling proactive performance tuning and reliable operations.

Cloud Native · Grafana · Kubernetes

0 likes · 6 min read

Linux Ops Smart Journey

Apr 20, 2025 · Operations

Visualize Kubernetes Events: Store in Elasticsearch and Dashboard with Grafana

This guide explains how to store Kubernetes event data in Elasticsearch, configure Logstash and Ruby filters for timestamp correction, and create a Grafana dashboard to visualize and analyze cluster events for improved monitoring and troubleshooting.

Elasticsearch · Grafana · K8s Events

0 likes · 4 min read

Linux Ops Smart Journey

Apr 16, 2025 · Operations

How to Build a Robust Elasticsearch Monitoring System with Prometheus & Grafana

Learn step‑by‑step how to deploy the Elasticsearch‑exporter via Helm, configure Prometheus to scrape its metrics, and visualize them in Grafana, enabling comprehensive monitoring of Elasticsearch clusters for performance, health, and early issue detection in Kubernetes environments.

Elasticsearch · Exporter · Grafana

0 likes · 7 min read

Linux Ops Smart Journey

Jan 7, 2025 · Operations

Enable Nacos Metrics in Prometheus and Visualize with Grafana

This guide shows how to enable Nacos metrics, configure Prometheus to scrape them, and visualize the data with a Grafana dashboard, providing a centralized view across different departments for enterprise monitoring and decision‑making.

Grafana · Kubernetes · Metrics

0 likes · 4 min read

macrozheng

Jun 9, 2025 · Backend Development

Mastering Redis Hotspot Keys: Detection, Risks, and Solutions

This article explains what Redis hotspot keys are, the performance and stability issues they cause, common causes, how to monitor and identify them, and practical mitigation strategies such as cluster scaling, key sharding, and multi‑level caching.

Monitoring · Performance · Redis

0 likes · 10 min read

macrozheng

Nov 8, 2022 · Operations

Choosing the Right Open‑Source Monitoring System: Zabbix, Open‑Falcon, Prometheus

This article provides a systematic overview of monitoring fundamentals, compares three popular open‑source monitoring solutions—Zabbix, Open‑Falcon, and Prometheus—and offers practical guidance for selecting the most suitable system based on scale, features, and operational needs.

Monitoring · Open-Falcon · Prometheus

0 likes · 21 min read

Raymond Ops

Jun 4, 2025 · Operations

Mastering SFTP: Complete Planning, Configuration, and High‑Availability Guide

This guide walks you through SFTP server planning, user naming conventions, directory structures, SSH configuration, account creation, permission setup, client usage, log auditing, rotation, connection limits, monitoring, and high‑availability deployment across multiple servers, providing ready‑to‑run commands and scripts.

ACL · High Availability · Linux

0 likes · 14 min read
