Topic: Monitoring
Collection size: 1711 articles
Efficient Ops
Aug 18, 2024 · Operations

Essential Bash Scripts for Linux Ops: Monitoring, Deployment & Automation

A curated collection of ready‑to‑use Bash scripts that help you monitor MySQL replication, track directory changes, batch‑create users, detect website issues, execute remote commands, deploy LNMP stacks, check server resource usage, identify high‑load processes, and automate Java or PHP project releases.

DevOps · Linux · Monitoring
12 min read
Efficient Ops
Jul 28, 2024 · Operations

Building a Resilient, High‑Performance Website: Domains, CDN, Security & Ops

This guide outlines a step‑by‑step strategy for creating a highly available, secure, and scalable website. It covers buying and protecting multiple domains, configuring DNS and CDN, setting up image and database servers, and implementing monitoring, redundancy, high‑concurrency testing, and disaster‑recovery plans.

CDN · Monitoring · Operations
13 min read
Efficient Ops
Mar 18, 2024 · Operations

How to Implement Fault Self‑Healing for Scalable Operations

This article explains why low‑disk alerts demand automation, outlines the concept of fault self‑healing versus manual response, and provides practical guidelines—including standards, monitoring dimensions, CMDB integration, script execution tools, and notification channels—to build a reliable self‑healing system for large‑scale environments.

CMDB · DevOps · Monitoring
10 min read
Efficient Ops
Mar 13, 2024 · Operations

What Does an Operations Engineer Do? Skills, Tools, and Career Path

This article explains the role of an operations engineer, covering daily responsibilities, essential knowledge such as Linux and networking, common monitoring tools, and emerging career paths like DevOps, AIOps, and SRE. It helps newcomers understand how to start and grow in the field.

DevOps · Linux · Monitoring
6 min read
Efficient Ops
Sep 26, 2023 · Operations

Mastering Zabbix: From Installation to Advanced Monitoring and Automation

This comprehensive guide walks you through Zabbix monitoring concepts, reliability calculations, installation methods, web UI configuration, host and template management, custom monitoring, alert integration with OneAlert, Grafana visualization, distributed monitoring, SNMP support, and practical scripts for large‑scale server environments.

Alerting · Grafana · Monitoring
28 min read
Efficient Ops
Aug 31, 2022 · Operations

How to Build Scalable Fault Self‑Healing for Modern Operations

This article explains why traditional manual responses to alerts are insufficient, outlines the concept of fault self‑healing, and provides a step‑by‑step guide on establishing standards, monitoring dimensions, a unified CMDB, automation tools, and notification channels to achieve automated recovery at scale.

CMDB · Monitoring · Operations
9 min read
Efficient Ops
Aug 17, 2022 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to build a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, full‑link tracing, and ELK for observability and performance troubleshooting.

Monitoring · Prometheus · USE method
13 min read
Efficient Ops
Nov 24, 2021 · Operations

Why Switch to Loki? Step‑by‑Step Installation and Grafana Visualization

This guide explains why Loki is a lightweight alternative to EFK/ELK, walks through installing Loki and Promtail binaries, configuring them with YAML files, and visualizing logs in Grafana using LogQL, providing a complete end‑to‑end log management solution.

Grafana · Loki · Monitoring
6 min read
Efficient Ops
Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Cloud Native · Monitoring · Operations
17 min read
Efficient Ops
Oct 16, 2018 · Operations

How Tencent Built an AI‑Powered Network Fault Detection System in Minutes

In this talk, Tencent’s infrastructure lead explains how their team created an AI‑driven, three‑minute fault detection and recovery pipeline—combining high‑precision Meshping monitoring, multi‑KPI analytics, and automated Moveout isolation—to dramatically shorten network outage resolution from hours to minutes.

AIOps · Monitoring · Network Operations
18 min read
Efficient Ops
May 21, 2018 · Databases

Designing Scalable MySQL Cloud DBaaS: Architecture, Availability, and Future Plans

This article summarizes the design and evolution of a MySQL cloud DBaaS platform, covering MySQL 8.0 features, the need for DBaaS, multi‑generation architecture, service and data availability strategies, monitoring, DTS design, and the upcoming roadmap for broader database support and hybrid cloud deployment.

DBaaS · DTS · Database Design
13 min read
Efficient Ops
Dec 5, 2017 · Operations

How Alibaba’s Sunfire Achieves Second‑Level Monitoring at Trillion‑Transaction Scale

This article explains how Alibaba’s Sunfire monitoring platform processes terabytes of logs per minute, uses a pull‑based architecture with Brain‑Reduce‑Map roles, tackles scalability and reliability challenges, and outlines future directions such as MQL standardization and intelligent baselines.

Monitoring · Operations · Real-time
17 min read
Linux Ops Smart Journey
Jun 6, 2025 · Operations

How to Build a Complete Longhorn Monitoring System with Prometheus & Grafana

This guide explains how to monitor Longhorn storage in Kubernetes by collecting metrics with Prometheus, configuring scraping, verifying data collection, and visualizing everything in Grafana, enabling proactive performance tuning and reliable operations.

Cloud Native · Grafana · Kubernetes
6 min read
Linux Ops Smart Journey
Apr 20, 2025 · Operations

Visualize Kubernetes Events: Store in Elasticsearch and Dashboard with Grafana

This guide explains how to store Kubernetes event data in Elasticsearch, configure Logstash and Ruby filters for timestamp correction, and create a Grafana dashboard to visualize and analyze cluster events for improved monitoring and troubleshooting.

Elasticsearch · Grafana · K8s Events
4 min read
Linux Ops Smart Journey
Apr 16, 2025 · Operations

How to Build a Robust Elasticsearch Monitoring System with Prometheus & Grafana

Learn step‑by‑step how to deploy the Elasticsearch‑exporter via Helm, configure Prometheus to scrape its metrics, and visualize them in Grafana, enabling comprehensive monitoring of Elasticsearch clusters for performance, health, and early issue detection in Kubernetes environments.

Elasticsearch · Exporter · Grafana
7 min read
Linux Ops Smart Journey
Jan 7, 2025 · Operations

Enable Nacos Metrics in Prometheus and Visualize with Grafana

This guide shows how to enable Nacos metrics, configure Prometheus to scrape them, and visualize the data with a Grafana dashboard, providing a centralized view across different departments for enterprise monitoring and decision‑making.

Grafana · Kubernetes · Monitoring
4 min read
macrozheng
Jun 9, 2025 · Backend Development

Mastering Redis Hotspot Keys: Detection, Risks, and Solutions

This article explains what Redis hotspot keys are, the performance and stability issues they cause, common causes, how to monitor and identify them, and practical mitigation strategies such as cluster scaling, key sharding, and multi‑level caching.

Caching · Monitoring · Redis
10 min read
macrozheng
Nov 8, 2022 · Operations

Choosing the Right Open‑Source Monitoring System: Zabbix, Open‑Falcon, Prometheus

This article provides a systematic overview of monitoring fundamentals, compares three popular open‑source monitoring solutions—Zabbix, Open‑Falcon, and Prometheus—and offers practical guidance for selecting the most suitable system based on scale, features, and operational needs.

Monitoring · Open Source · Open-Falcon
21 min read
WeiLi Technology Team
Jun 28, 2024 · Big Data

How to Build a Robust Big Data Monitoring and Alerting System

This article explains why high‑availability design and comprehensive monitoring are essential for modern big‑data platforms, outlines a layered architecture, and provides practical guidance on health checks, alerting, and data‑quality monitoring across storage, compute, scheduling, and service layers.

Alerting · Big Data · Flink
14 min read
Xianyu Technology
Jun 17, 2020 · Backend Development

Lottery System Risk Management and SDK Integration

Xianyu mitigated lottery‑related financial loss by centralizing rights management, decoupling UI from business logic, and providing a unified SDK with simple draw APIs. It also added real‑time log backflow plus comprehensive accounting and monitoring, cutting configuration time by over 50% and eliminating UI‑only risk.

Monitoring · SDK · backend
10 min read