Tagged articles

2179 articles

May 5, 2023 · Operations

How to Build a Flexible Kubernetes Monitoring System for Big Data with kube‑prometheus

This article explains how to design and implement a lightweight, flexible monitoring solution for big‑data components running on Kubernetes using kube‑prometheus, covering metric exposure methods, scrape configurations, alert rule design, exporter deployment, and practical examples with code snippets.

Alertmanager · Big Data · Kubernetes

19 min read

Open Source Linux

May 5, 2023 · Operations

Essential Ops Lessons from 3.5 Years of Real-World Crises

Drawing from three and a half years of operations work, this article shares hard‑earned best practices on testing, backups, security, monitoring, performance tuning, and the right mindset to avoid costly mistakes such as data loss, service outages, and security breaches.

Backup · monitoring · performance-tuning

12 min read

政采云技术

Apr 29, 2023 · Cloud Native

Understanding Observability: Challenges, Principles, and OpenTelemetry Architecture

The article explains how growing system complexity drives the need for observability, outlines the three pillars of logs, traces, and metrics, compares traditional stability stacks with modern observability, and details OpenTelemetry's design, advantages, and implementation considerations for cloud‑native environments.

Microservices · OpenTelemetry · monitoring

16 min read

Liangxu Linux

Apr 26, 2023 · Operations

Essential Linux Ops Practices to Prevent Disasters

Drawing from years of sysadmin experience, this guide lists concrete Linux operational habits, such as rigorous backups, cautious use of rm -rf, single‑person changes, SSH hardening, firewall rules, monitoring, and disciplined performance tuning, to help teams avoid costly production failures.

Linux · monitoring · performance tuning

12 min read

DeWu Technology

Apr 26, 2023 · Operations

Stability and Alerting Practices for E‑commerce Order Submission Service

The article details how a high‑throughput e‑commerce checkout pipeline achieves stability by combining fine‑grained metrics, custom trace logs, version‑based data validation, and targeted alert rules that detect latency spikes, error‑code surges, and downstream service failures, enabling rapid incident localization and reliable order processing.

Alerting · e‑commerce · monitoring

12 min read

Zhuanzhuan Tech

Apr 26, 2023 · Backend Development

Design and Implementation of an Automated Payment Channel Management System

This article describes the design, technology choices, architecture, and implementation details of an automated payment channel management system that uses Redis‑based time‑series storage, custom circuit‑breaker logic, and monitoring to achieve fast fault detection, accurate alerting, and future automated failover.

Backend · circuit breaker · fault tolerance

10 min read

Ops Development Stories

Apr 25, 2023 · Operations

Simplify Monitoring with Categraf: All‑in‑One Agent for Metrics, Logs, and Traces

Categraf is an all‑in‑one, Go‑based monitoring agent that consolidates metric, log, and trace collection, offering remote_write support, lightweight deployment, and extensive plugin configurations to replace multiple exporters in Prometheus‑based observability stacks.

Categraf · Prometheus · agent

14 min read

Qunar Tech Salon

Apr 24, 2023 · Operations

Design and Evolution of Qunar's Watcher Enterprise Monitoring Platform

The article details the background, architecture, core features, alert governance, trace integration, and cloud‑native evolution of Watcher, Qunar's internally built, highly scalable monitoring platform that unifies application‑level metrics, alerting, and observability across thousands of services and containers.

Alerting · DevOps · cloud-native

19 min read

DevOps Operations Practice

Apr 21, 2023 · Operations

Monitoring MySQL with Prometheus and Grafana: Installation, Configuration, and Alerting Guide

This tutorial explains how to install the MySQL Exporter, configure Prometheus to scrape MySQL metrics, set up Grafana dashboards for visualization, and define alerting rules, providing a complete end‑to‑end solution for monitoring MySQL databases in production environments.

Alerting · Exporter · Grafana

5 min read

vivo Internet Technology

Apr 19, 2023 · Backend Development

Investigation of Midnight Interface Timeout in Vivo E‑commerce Activity System

The article details how a midnight interface timeout in Vivo’s e‑commerce activity system was traced to a logging bottleneck: a synchronous Log4j call blocked all threads while a cron‑driven log‑rotation script copied a 2.6 GB file, and the issue was resolved by switching to asynchronous logging with a non‑blocking appender.

Backend · Tomcat · logging

17 min read

DeWu Technology

Apr 19, 2023 · Backend Development

Web Project Code Refactoring: Practices, Challenges, and Solutions

The article details a year‑long refactoring case study of a fast‑iteration web project migrated from Python to Java/Go, describing inherited performance bugs, a prioritized migration plan, monitoring integration, concrete optimizations such as query reduction and cache redesign, and the resulting stability and latency gains, while outlining required developer skills and best‑practice recommendations.

Code Refactoring · DevOps · Go

19 min read

Efficient Ops

Apr 18, 2023 · Databases

Mastering MongoDB Clusters: Setup, Monitoring, Migration, and Optimization

This comprehensive guide explains MongoDB cluster architecture, component roles, common use cases, monitoring commands, essential maintenance operations, data migration steps, troubleshooting of typical production issues, and practical optimization recommendations for high‑performance deployments.

Backup · Cluster · MongoDB

20 min read

Java Architect Essentials

Apr 16, 2023 · Databases

Alibaba Druid Connection Pool in Spring Boot: Concepts, Configuration, Monitoring and Customization

This article introduces Alibaba's Druid database connection pool, explains its core features and filters, shows how to add the Maven starter, configure properties and filters in Spring Boot, demonstrates the built‑in monitoring pages, slow‑SQL logging, Spring AOP integration, and provides methods to remove the default advertisement and retrieve monitoring data via code.

Configuration · Connection Pool · Druid

16 min read

MaGe Linux Operations

Apr 16, 2023 · Operations

How Netflix’s Telltale Transforms Application Monitoring and Alerting

The article details Netflix’s self‑built Telltale monitoring system, explaining how it consolidates data sources, reduces alert fatigue, provides intelligent alerts, and continuously optimizes application health assessment for over 100 production services, ultimately improving operational efficiency and reliability.

Alerting · Netflix · Operations

11 min read

Su San Talks Tech

Apr 14, 2023 · Operations

How to Master Grafana Monitoring on Alibaba Cloud: A Step‑by‑Step Guide

This guide explains why Grafana is essential for modern monitoring, outlines its key features, and provides a step‑by‑step tutorial for creating and configuring Grafana dashboards on Alibaba Cloud, including service provisioning, data source integration, and panel customization.

Alibaba Cloud · Dashboard · Grafana

9 min read

Aotu Lab

Apr 13, 2023 · Frontend Development

How to Triple Your Web App’s Speed: A Front‑End Performance Optimization Playbook

This article walks through a comprehensive front‑end performance optimization process—starting from diagnosing issues with Lighthouse, identifying bottlenecks such as large bundle size and uncompressed assets, applying code splitting, lazy loading, image optimization, CSP, SEO tweaks, and finally setting up continuous monitoring with a custom platform—to achieve a 279% improvement in Lighthouse performance scores and near‑three‑fold speed gains.

Lighthouse · SEO · frontend

11 min read

Ops Development Stories

Apr 13, 2023 · Operations

How to Deploy N9e: A Step‑by‑Step Guide to Unified Observability

This article walks through the challenges of observability for small‑to‑medium companies and provides a detailed, hands‑on guide to installing, configuring, and using the N9e monitoring platform—including architecture options, component setup, and adding data sources—so readers can achieve integrated alerting, metrics, logs, and tracing in a single pane.

N9e · Operations · monitoring

13 min read

Efficient Ops

Apr 12, 2023 · Operations

Building Highly Available Prometheus Monitoring with Thanos: A Practical Guide

This article explains why native Prometheus HA solutions fall short for large, multi‑region clusters and shows how to use Thanos components—including sidecar, query, store gateway, and compactor—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive integration with existing Prometheus deployments.

Kubernetes · Prometheus · Thanos

22 min read

dbaplus Community

Apr 10, 2023 · Operations

Can Ops Roles Disappear? Exploring Self‑Service Platforms, COE Experts, and SaaS in Modern Monitoring

The article examines whether traditional operations positions can become obsolete by analyzing a self‑service platform + COE + Business Partner model, detailing essential monitoring tools, the role of COE specialists, SaaS alternatives, and practical career pathways for newcomers, mid‑level, and senior engineers.

COE · Operations · SaaS

8 min read

ITPUB

Apr 10, 2023 · Operations

How Bytecode Enhancement Enables Zero‑Intrusion Monitoring for Microservices

This article, based on a SACC 2022 talk by Huolala architect Cao Wei, explains the principles of bytecode‑enhancement, its practical implementation for large‑scale microservice monitoring, compares enhancement frameworks, shares best‑practice patterns, and explores broader applications such as service‑mesh sidecars.

Backend · Instrumentation · Java Agent

18 min read

Architecture Digest

Apr 10, 2023 · Operations

Comparison of Common Log Management Tools: Filebeat, Graylog, LogDNA, ELK, Loki, Datadog, Logstash, Fluentd, and Splunk

This article provides a detailed comparison of nine popular log management solutions—Filebeat, Graylog, LogDNA, ELK Stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their core features, pricing models, advantages, and drawbacks to help readers choose the right tool for centralized logging.

ELK · Log Management · cloud

13 min read

MaGe Linux Operations

Apr 8, 2023 · Operations

Master Linux System Monitoring: htop, top, CPU, Memory, and Process Insights

This guide explains how to use Linux tools such as htop, top, uptime, sar, strace, free, pidstat, lsof, and Docker commands to monitor CPU, load average, memory, processes, network throughput, socket states, and database connections, helping you diagnose performance issues and map container PIDs to host PIDs.

CPU · monitoring · network

14 min read

Aikesheng Open Source Community

Apr 4, 2023 · Databases

Understanding OMS Components, Migration Workflow, and Performance Optimization

This article provides a detailed analysis of the OMS community edition 3.3.1 architecture, explains each internal process, describes the migration workflow and monitoring metrics, and offers practical optimization tips to improve data migration performance.

Data Migration · Metrics · OMS

20 min read

DataFunTalk

Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow, covering workflow‑level failures, Spark engine anomalies, resource usage, and offering one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark

13 min read

Refining Core Development Skills

Apr 4, 2023 · Cloud Native

Understanding Container CPU Utilization: Accurate Measurement Methods and the Missing Nice/IRQ/SoftIRQ Metrics

This article explains how to correctly obtain CPU utilization inside containers, compares host and container metrics, describes the use of lxcfs and cgroup files (including cgroup V1/V2) for accurate measurement, and clarifies why container statistics omit nice, irq, and softirq fields.

Cloud Native · Container · cgroup

16 min read

DevOps Operations Practice

Mar 31, 2023 · Operations

Monitoring Nginx with Prometheus: Configuration and Visualization Guide

This tutorial shows how to enable Nginx's stub_status module, install the Nginx Prometheus Exporter, configure Prometheus to scrape Nginx metrics, and visualize the data in Grafana, providing a complete end‑to‑end monitoring solution.

Configuration · Exporter · Nginx

4 min read

Open Source Linux

Mar 31, 2023 · Operations

Boost Your Ops Efficiency: 5 Python Scripts Every Engineer Should Know

This article explains how Python can automate common operations tasks—remote command execution, log parsing, system monitoring with alerts, batch software deployment, and backup/recovery—providing code examples and practical tips to improve efficiency and reduce manual errors.

Deployment · Ops · Python

9 min read

MaGe Linux Operations

Mar 28, 2023 · Operations

Essential Ops Lessons: Prevent Data Disasters and Boost Server Reliability

Drawing from three and a half years of sysadmin experience, this guide shares practical rules for safe online operations, data protection, security hardening, monitoring, performance tuning, and the right mindset to avoid costly outages and maintain stable, secure services.

Backup · Operations · monitoring

12 min read

DevOps Cloud Academy

Mar 25, 2023 · Operations

Essential Skills for DevOps Engineers

The article outlines the key competencies DevOps engineers must master—including cloud computing, automation, containerization, CI/CD, monitoring, logging, and infrastructure-as-code—to accelerate, stabilize, and scale software delivery in modern development environments.

DevOps · Infrastructure as Code · automation

6 min read

MaGe Linux Operations

Mar 24, 2023 · Operations

How to Reduce False Alarms in Distributed Systems with Interval Detection

This article explains the challenges of monitoring highly distributed applications, why static alert thresholds often fail, and how interval detection using algorithms like Local Outlier Factor can improve alert accuracy while reducing noise across tools such as Grafana, Zabbix, and Open‑Falcon.

Alerting · Operations · interval detection

16 min read

MaGe Linux Operations

Mar 24, 2023 · Operations

Why Most Monitoring Strategies Fail and How the CAR Framework Fixes Them

This article explains why typical monitoring approaches miss the mark, outlines four root causes of persistent incidents, and introduces the CAR framework—Customer, Application, Resource—to build user‑centric observability that reduces noise, restores trust, and improves reliability.

CAR framework · Operations · incident management

11 min read

ITPUB

Mar 24, 2023 · Cloud Native

Why Open‑Falcon Stalled and How Cloud‑Native Monitoring Is Evolving

This article reviews the evolution of monitoring in the cloud‑native era, analyzes Open‑Falcon’s architecture, strengths, and shortcomings, explains why its development hit a bottleneck, and outlines the design principles and features of the Nightingale monitoring system as a modern, open‑source alternative.

Microservices · Open-Falcon · architecture

15 min read

dbaplus Community

Mar 23, 2023 · Operations

How Qunar Scaled Container Monitoring with VictoriaMetrics: Lessons from Replacing Prometheus

This article details Qunar's migration from Prometheus to VictoriaMetrics for large‑scale container monitoring, covering the shortcomings of Prometheus at massive data volumes, the architectural choices made, performance improvements achieved, and future optimization plans.

Kubernetes · Prometheus · Time Series

13 min read

dbaplus Community

Mar 20, 2023 · Operations

How Xianyu’s Messaging Team Built a Zero‑Incident System with Gray Releases, Monitoring, and Automated Regression

The article details how Xianyu’s messaging team systematically improved system stability by classifying risks, implementing gray‑release traffic, establishing dedicated monitoring and alerting dashboards, integrating automated regression into CI/CD, and managing strong‑weak dependencies, ultimately reducing online incidents to near zero.

Operations · automated regression · dependency management

10 min read

Architecture Digest

Mar 18, 2023 · Operations

Understanding Log Importance and Operations in Distributed Architecture

This article explains what logs are, why they are crucial in large‑scale distributed systems, outlines the requirements for effective log operations, reviews common tooling such as ELK, Prometheus and tracing solutions, provides a Go example for batch log retrieval, and shares best‑practice guidelines to achieve observability.

APM · monitoring

19 min read

NetEase Smart Enterprise Tech+

Mar 15, 2023 · Operations

How Yidun Automates Performance Testing to Overcome Real‑World Pain Points

This article explains performance testing fundamentals, why it matters, the specific challenges Yidun faced such as complex execution, human‑dependent monitoring, data isolation, and cost loss, and describes their automated, gradient‑based testing platform with quantified monitoring and future visualisation plans.

Data Isolation · Operations · Performance Testing

8 min read

dbaplus Community

Mar 14, 2023 · Backend Development

How to Detect and Solve Java Application Performance Bottlenecks: A Practical Guide

This article walks through the evolution of a system’s performance concerns, defines speed and pressure dimensions, explains how to calculate RT, QPS and concurrency, compares QPS with TPS, and provides step‑by‑step methods using tools like Arthas, JMeter and JVM diagnostics to identify and fix CPU, memory and pressure issues before applying layered optimization strategies.

Arthas · JMeter · Profiling

12 min read

Tencent Cloud Developer

Mar 13, 2023 · Cloud Computing

Design Principles for High‑Availability System Architecture

The article outlines a comprehensive high‑availability architecture framework across six layers—development standards, application services, storage, product fallback, operations deployment, and emergency response—detailing design principles such as stateless services, elastic scaling, redundant storage, robust monitoring, gray releases, and chaos engineering to ensure resilient, continuously available systems.

Deployment · Scalability · System Architecture

25 min read

Open Source Linux

Mar 9, 2023 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins for Modern Ops?

An in‑depth comparison of Prometheus and Zabbix examines their histories, architectures, data storage, scalability, and container support, highlighting Prometheus’s cloud‑native pull model and Go‑based performance versus Zabbix’s mature, relational‑database approach, to help teams choose the right monitoring solution.

Prometheus · Time Series Database · Zabbix

8 min read

Top Architect

Mar 8, 2023 · Databases

Deep Dive into Prometheus V2 Storage Engine and Query Process

This article explains the internal storage layout, on‑disk and in‑memory data structures, and the query execution flow of Prometheus V2, illustrating how blocks, chunks, WAL, indexes and postings are organized and accessed to serve time‑series queries efficiently.

Go · Prometheus · Storage Engine

15 min read

AntTech

Mar 7, 2023 · Cloud Native

Introduction to HoloInsight: A Cloud‑Native Lightweight Observability Platform

HoloInsight is an open‑source, cloud‑native observability platform derived from Ant Group's AntMonitor, offering integrated log‑based monitoring, business metric analysis, and AI‑driven AIOps capabilities while providing a lightweight, modular architecture and extensive extensibility for modern software stacks.

aiops · cloud-native · log analysis

13 min read

IT Services Circle

Mar 1, 2023 · Backend Development

Root Cause Analysis and Resolution of OutOfMemoryError in a Java Backend Service

This article details a comprehensive investigation of a Java backend service suffering from severe OutOfMemoryError due to an unbounded userId list in a count query, describing monitoring findings, heap dump analysis, and practical mitigation steps including request limiting and JVM tuning.

JVM · OutOfMemoryError · database

7 min read

Architect

Feb 27, 2023 · Databases

Understanding Prometheus V2 Storage Engine and Query Process

This article explains the architecture of Prometheus V2, detailing its on‑disk block layout, chunk and index formats, the inverted index mechanism, and how queries locate and retrieve time‑series data, while also covering in‑memory structures and practical usage patterns.

CloudNative · Prometheus · StorageEngine

14 min read

DeWu Technology

Feb 27, 2023 · Operations

Message Push Monitoring and SLA Practices

The team implemented SLA‑based, node‑level monitoring for mobile push messages—splitting the workflow, measuring latency, blocking volume, and success rates, isolating metrics with Spring AOP, and tracking third‑party vendors—resulting in clear latency standards, doubled peak throughput, faster issue resolution, and improved overall reliability.

Message Push · Operations · SLA

11 min read

ITPUB

Feb 24, 2023 · Databases

How Ctrip Migrated MySQL to OceanBase: Tools, Process, and Lessons Learned

Ctrip evaluated and extended OceanBase Migration Assessment tools, built a one‑click migration workflow, implemented comprehensive monitoring and automatic fault‑diagnosis pipelines, and addressed compatibility challenges such as .NET charset issues and Druid parser errors, ultimately achieving a smooth MySQL‑to‑OceanBase transition.

OceanBase · Performance Diagnosis · database migration

18 min read

Architecture Digest

Feb 24, 2023 · Operations

Understanding Prometheus Alerting: When Alerts Fire and Why They May Not

This article explains the principles behind Prometheus alerts, when they trigger, why they sometimes stay silent, and how Alertmanager’s routing tree and notification pipeline work together to manage alert noise, grouping, silencing, and deduplication.

Alerting · Alertmanager · Golang

18 min read

Baidu Geek Talk

Feb 20, 2023 · Operations

Deep Dive into Logging Operations and Observability in Distributed Systems

The article examines logging’s critical role in distributed systems, detailing its purpose, severity levels, and value for debugging, performance, security, and auditing, while highlighting challenges of inconsistent formats and traceability, and reviewing observability pillars, ELK and tracing tools, and practical implementation best practices.

APM · ELK · Prometheus

19 min read

21CTO

Feb 16, 2023 · Operations

Which Log Management Tool Is Right for You? A Comprehensive Comparison of 9 Solutions

This article provides a detailed comparison of nine popular log management tools—including Filebeat, Graylog, LogDNA, ELK, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing, advantages, and disadvantages to guide readers in selecting the most suitable solution for their needs.

ELK · Log Management · Operations

16 min read

Xianyu Technology

Feb 16, 2023 · Operations

Stability Governance of Xianyu Messaging System

Since launching a systematic stability‑governance program in August 2022, Xianyu’s messaging system has employed gray releases, dedicated monitoring, daily automated regression, dependency reviews and drills, resulting in near‑zero online incidents within six months and demonstrating that continuous, context‑specific measures and vigilant change management are essential for reliable C2C transactions.

Messaging · automation · dependency management

7 min read

macrozheng

Feb 11, 2023 · Operations

Deploy and Use Uptime Kuma: A Simple, Beautiful Open‑Source Monitoring Tool

This article introduces Uptime Kuma, an open‑source monitoring tool, and provides step‑by‑step instructions for installing it via Docker or manually, configuring monitors, and using its dashboard, highlighting its simplicity, visual appeal, and support for multiple services and notification methods.

Docker · Self-hosted · Uptime Kuma

3 min read

MaGe Linux Operations

Feb 10, 2023 · Cloud Native

How to Detect and Prevent OOM and CPU Throttling in Kubernetes Pods

This article explains why Kubernetes pods encounter out‑of‑memory errors and CPU throttling, how limits and requests influence resource allocation, and provides practical monitoring techniques using Prometheus and cAdvisor to proactively identify and mitigate these issues before they impact performance or cause pod eviction.

CPU throttling · OOM · cAdvisor

9 min read

Selected Java Interview Questions

Feb 10, 2023 · Backend Development

Integrating Spring Boot with Micrometer, Prometheus, and Grafana for Monitoring and Docker Deployment

This article explains how to combine Spring Boot with Micrometer, Prometheus, and Grafana for metrics collection and visualization, and provides Maven dependencies, configuration snippets, and Docker commands to deploy a fully monitored backend service using Docker containers.

Docker · Grafana · Micrometer

6 min read

Alibaba Cloud Native

Feb 8, 2023 · Cloud Native

Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark

This article benchmarks Alibaba Cloud Prometheus against the open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns, revealing that while the open‑source version remains stable under light load, its CPU and memory usage grow non‑linearly with high cardinality, whereas Alibaba's managed service delivers higher compatibility, better query performance, and more predictable scaling.

Cloud Native · Metrics · Prometheus

30 min read

Wukong Talks Architecture

Feb 8, 2023 · Operations

Comparison of Popular Log Management Solutions: Filebeat, Graylog, LogDNA, ELK, Loki, Datadog, Logstash, Fluentd, and Splunk

This article provides a comprehensive comparison of nine widely used log management tools, detailing their core features, pricing models, advantages, and drawbacks to help readers make informed decisions when selecting a logging solution for their infrastructure.

DevOps · ELK · Log Management

16 min read

dbaplus Community

Feb 6, 2023 · Operations

How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services

This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.

Operations · System Architecture · aiops

18 min read

ByteFE

Feb 5, 2023 · Frontend Development

Front‑End Development Insights: Monitoring, Low‑Code, TypeScript, Trends, and Performance Optimizations

This collection presents a range of front‑end development insights, including ByteDance’s monitoring practices, low‑code product considerations, DeepKit’s TypeScript runtime capabilities, emerging 2023 trends, web development forecasts, the need for diverse JavaScript frameworks, markdown‑to‑PPT tools, and Vue 3 table performance optimizations.

Vue3 · low-code · monitoring

0 likes · 5 min read

Alibaba Cloud Native

Feb 3, 2023 · Operations

How eBPF Enables Zero‑Intrusion Monitoring for Multi‑Language Serverless Apps

This article explains how eBPF technology provides a unified, zero‑intrusion monitoring solution for Serverless applications across any language, detailing its architecture, workflow, and the advantages it brings to cloud‑native operations such as low cost, high performance, and multi‑protocol support.

Cloud Native · Serverless · eBPF

0 likes · 9 min read

政采云技术

Feb 2, 2023 · Operations

Distributed Tracing Overview and SkyWalking Architecture

This article explains the fundamentals of distributed tracing, introduces the Dapper and OpenTracing models, and details SkyWalking's data collection, cross‑process propagation, bytecode enhancement, architecture components, monitoring, alerting, and performance characteristics for microservice environments.

Distributed Tracing · Microservices · OpenTracing

0 likes · 10 min read

HelloTech

Jan 31, 2023 · Operations

Stability Assurance Practices for Large‑Scale Promotional Events

The article outlines a comprehensive stability‑assurance framework for large‑scale promotional events. It covers planning, capacity and pressure‑test rehearsals, strict change freezes, internal gray releases, coordinated on‑call response, thorough link and capacity analysis, monitoring, emergency procedures, cross‑team collaboration, external partner coordination, and post‑event review to ensure resilient system performance.

Large-Scale Events · Performance Testing · capacity planning

0 likes · 17 min read

Efficient Ops

Jan 30, 2023 · Operations

Master Redis Monitoring: Key Metrics, Commands, and Performance Testing

This guide explains essential Redis monitoring metrics, the tools and commands for collecting performance, memory, activity, persistence, and error data, and shows how to use INFO, slowlog, and redis-benchmark to assess and improve database operations.

Metrics · Ops · database

0 likes · 6 min read

Architect

Jan 30, 2023 · Backend Development

Unified Exception Monitoring and Reporting with ASM and JavaAgent

This article explains how to use Java bytecode instrumentation with ASM and a JavaAgent to automatically monitor, capture, and report exceptions across a backend system, covering exception fundamentals, best‑practice handling, and practical implementation steps.

ASM · Exception Handling · JavaAgent

0 likes · 14 min read

Architect

Jan 29, 2023 · Backend Development

DynamicTp: A Dynamic ThreadPoolExecutor Extension for Real‑time Monitoring and Configuration

DynamicTp is a Java library that extends ThreadPoolExecutor to enable dynamic, configuration‑center‑driven adjustment of thread‑pool parameters, real‑time monitoring, alert notifications, and metric collection, providing a lightweight, zero‑intrusion solution for microservice architectures.

Dynamic Configuration · SpringBoot · ThreadPool

0 likes · 9 min read

IT Architects Alliance

Jan 27, 2023 · Backend Development

Comprehensive Guide to Building a Backend Technology Stack for Startup Companies

This article provides a detailed overview of the essential backend technology stack for startups, covering language choices, components, processes, systems, and cloud services, and offers practical recommendations for selecting databases, messaging, monitoring, CI/CD, and deployment tools to build a robust, scalable infrastructure.

Backend · Technology Stack · database

0 likes · 28 min read

NetEase Cloud Music Tech Team

Jan 17, 2023 · Big Data

How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance

This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.

Big Data · Resource Optimization · baseline governance

0 likes · 11 min read

dbaplus Community

Jan 16, 2023 · Operations

Beyond Success‑Ratio: How User‑Uptime Reveals Real Product Availability

The article reviews traditional availability metrics such as Success‑Ratio, Error‑Budget, MTTR/MTTF, SLA/SLO, and highlights their limitations, then introduces Google’s User‑Uptime and Windowed User‑Uptime metrics, explains their definitions, challenges, experimental results, and why they provide a more user‑centric view of service reliability.

Availability · Metrics · SRE

0 likes · 27 min read

Efficient Ops

Jan 16, 2023 · Operations

How China Mobile’s Centralized AIOps Platform Achieved Top‑Tier Evaluation

This article presents an interview with China Mobile Information about its centralized AIOps platform, covering the recent excellent‑level assessment by the China Academy of Information and Communications Technology, the system's key modules, future plans, and the broader significance of AI‑driven IT operations.

Artificial Intelligence · IT Operations · Root Cause Analysis

0 likes · 11 min read

DataFunSummit

Jan 16, 2023 · Big Data

Building an O2O Industry Data Platform: From Monitoring to Diagnosis

This article shares practical insights on constructing an O2O industry data platform, detailing user classification, business pain points, and a three‑step strategy—monitoring, analysis, and diagnosis—to extract core metrics, implement tailored reporting, conduct operational and pricing analyses, and drive data‑driven product improvements.

Analysis · Business Intelligence · Data Platform

0 likes · 15 min read

Code Ape Tech Column

Jan 14, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Pros and Cons

This article provides a detailed comparison of nine popular log management solutions—including Filebeat, Graylog, LogDNA, the ELK stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing models, advantages, and disadvantages to help readers choose the right tool for their needs.

ELK · Log Management · monitoring

0 likes · 16 min read

Ziru Technology

Jan 12, 2023 · Operations

Why Alertmanager Config Keeps Getting Overwritten in TiDB Clusters and How to Fix It

This guide explains why the Alertmanager configuration file in a TiDB cluster is repeatedly overwritten during reloads, analyzes error logs and TiUP documentation, and provides step‑by‑step instructions to edit the topology, set a custom config file, reload the service, and verify the fix.

Alertmanager · Cluster Management · Configuration

0 likes · 8 min read

Efficient Ops

Jan 11, 2023 · Operations

How to Effectively Monitor and Operate a DevOps System: From Metrics to NOC/MSP

This article explains how to maintain a DevOps environment by implementing comprehensive monitoring, handling fault detection and performance metrics, automating alerts in a continuously changing cloud landscape, and integrating NOC and MSP practices for 24/7 reliability and efficient incident response.

DevOps · MSP · NOC

0 likes · 17 min read

MaGe Linux Operations

Jan 11, 2023 · Databases

Master MySQL Monitoring: Built‑in Commands for Fast, Low‑Impact Insights

This article explains how to use MySQL's native SHOW commands and performance_schema tables to collect comprehensive monitoring metrics—including connections, buffer cache, locks, SQL activity, throughput, and slow queries—while minimizing overhead in a single‑node environment.

Slow Query Log · monitoring · mysqldumpslow

0 likes · 10 min read

Data Thinking Notes

Jan 10, 2023 · Big Data

How Bilibili Built a Scalable Data Quality Platform for Billions of Events

This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.

Big Data · Data Quality · Root Cause Analysis

0 likes · 21 min read

Alibaba Cloud Developer

Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Operations · capacity planning · incident response

0 likes · 25 min read

Efficient Ops

Jan 9, 2023 · Operations

Boost Ops Efficiency: 5 Python Scripts Every Sysadmin Should Use

This article explains how Python can automate common operations tasks—remote command execution, log parsing, system monitoring with alerts, bulk software deployment, and backup/restore—providing code examples for each and highlighting additional tools that help sysadmins improve efficiency and reduce errors.

Backup · Deployment · Python

0 likes · 9 min read

NetEase Yanxuan Technology Product Team

Jan 9, 2023 · Operations

Loggie: A High-Performance Log Collection Agent System Design and Implementation

Loggie is a cloud-native, Go-based log-collection agent that replaces Filebeat and Flume by using a micro-kernel producer-consumer architecture with hot-swappable pipelines, achieving 2 GB/s read speeds, 1.6‑2.6× higher throughput while using only a quarter of the CPU, and providing built-in observability, reliability, and latency monitoring for large-scale enterprise deployments.

Go · Operations · log agent

0 likes · 16 min read

DevOps Operations Practice

Jan 8, 2023 · Operations

Zabbix vs Prometheus: A Detailed Comparison of Monitoring Systems

This article provides a comprehensive comparison between Zabbix and Prometheus, covering their architecture, data collection, storage, querying, visualization, and alerting capabilities to help enterprises choose the most suitable monitoring solution for their needs.

Alerting · Cloud Native · Comparison

0 likes · 8 min read

Architecture Digest

Jan 8, 2023 · Operations

Design and Evolution of Vivo Server Monitoring System

This article systematically presents the business background, basic monitoring workflow, usage guidelines, OpenTSDB fundamentals, code precision issues, vmonitor collector architecture, old and new system designs, core alerting metrics, demo illustrations, and a comparison with mainstream monitoring solutions, offering insights for technology selection.

Alerting · OpenTSDB · Server

0 likes · 18 min read

Alibaba Cloud Native

Jan 5, 2023 · Operations

Build Real‑Time MySQL Monitoring & Alerting with Prometheus on Alibaba Cloud

This guide explains why MySQL monitoring is critical, defines five key metric dimensions, shows how to collect them with Prometheus and the MySQL Exporter, provides ready‑to‑use alert rules, and walks through the full setup and dashboard creation on Alibaba Cloud.

Alerting · Alibaba Cloud · Cloud Native

0 likes · 7 min read

vivo Internet Technology

Jan 4, 2023 · Artificial Intelligence

Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis

The article describes a root‑cause localization algorithm implemented in vivo’s monitoring platform. It automatically analyzes latency spikes by splitting service timelines, computing variance, clustering the results with K‑means, and recursively tracing downstream services. It achieves over 85% accuracy for dependency failures, though results still require human verification, and the article outlines future AI‑driven enhancements.

Fault Localization · K-Means · Root Cause Analysis

0 likes · 13 min read

Architecture and Beyond

Jan 1, 2023 · Operations

How to Build Reliable Enterprise SaaS: Stability, Risk Modeling, and Governance

This article defines enterprise‑grade SaaS, contrasts it with consumer products, and presents a comprehensive framework for product, data, and system stability—including isolation requirements, SLA metrics, risk modeling, mitigation plans, and continuous review—to help SaaS teams deliver dependable services.

Operations · Reliability · SaaS

0 likes · 23 min read

Architecture Digest

Dec 30, 2022 · Operations

Vivo Monitoring Platform: Architecture, Evolution, and Future Directions

The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AIOps‑enabled system that ensures service availability across massive infrastructure.

Reliability · aiops · cloud-native

0 likes · 16 min read

vivo Internet Technology

Dec 28, 2022 · Operations

Monitoring Service System Construction and Exploration Practice

The article outlines vivo’s evolution from simple Zabbix monitoring to a self‑built, unified monitoring platform. The platform now covers infrastructure, containers, databases, and user experience at massive scale, and integrates AIOps, cloud‑native observability, and unified alerting to ensure end‑to‑end service reliability, with intelligent, one‑stop monitoring as its future direction.

Vivo · aiops · architecture

0 likes · 28 min read

Top Architect

Dec 26, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Networking, Fault Handling, Monitoring, Governance and Deployment

This article provides a thorough overview of backend development, covering system design principles such as high cohesion and low coupling, architectural patterns for high concurrency and availability, network communication techniques, common fault scenarios, monitoring and alerting strategies, service governance practices, and deployment workflows.

Backend · Deployment · Scalability

0 likes · 30 min read

MaGe Linux Operations

Dec 23, 2022 · Operations

How to Build an Enterprise‑Grade Observability System for Reliable SRE

This article explains how enterprises can design and implement a comprehensive observability platform—covering metrics, logs, tracing, fault response, post‑mortems, testing, capacity planning, and automation—to improve system reliability and user experience.

SRE · automation · capacity planning

0 likes · 16 min read

Yunxuetang Frontend Team

Dec 23, 2022 · Frontend Development

Build the Ultimate Front-End Monitoring System & Master Advanced JS Tricks

This roundup introduces a series of front‑end engineering articles covering comprehensive monitoring architecture, CSS‑based click throttling, drag‑and‑drop implementation, 30 essential JavaScript concepts, and a simple responsive data‑dashboard solution, aiming to deepen developers’ skills and showcase resume‑worthy expertise.

JavaScript · UI · Web Development

0 likes · 3 min read

Sohu Tech Products

Dec 21, 2022 · Frontend Development

Design and Implementation of a Front‑End Monitoring Platform and SDK

This article presents a comprehensive guide to building a front‑end monitoring system—including pain points, error‑reconstruction techniques, data collection methods, performance metrics, user‑behavior tracking, and a modular SDK architecture—illustrated with detailed code examples for Vue, React, XHR, fetch, and cross‑origin handling.

SDK · error tracking · frontend

0 likes · 32 min read

Architecture Digest

Dec 21, 2022 · Operations

Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design from development standards, capacity planning, application services, storage, product strategies, operations deployment, to incident response, presenting key concepts, architectural patterns, and practical guidelines for building resilient services.

Deployment · Operations · System Design

0 likes · 27 min read

Huawei Cloud Developer Alliance

Dec 20, 2022 · Operations

How Huawei Cloud SRE Scaled Monitoring with openGemini: A Real‑World Performance Case Study

Facing hundreds of terabytes of daily monitoring data, Huawei Cloud SRE replaced HBase with the open‑source time‑series database openGemini, conducting extensive write and query performance tests that demonstrated linear scaling, superior query speed, and significant reductions in storage, CPU, and memory usage.

Performance Testing · cloud operations · monitoring

0 likes · 8 min read

Efficient Ops

Dec 19, 2022 · Operations

How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

CDN · Operations · SRE

0 likes · 21 min read

Efficient Ops

Dec 18, 2022 · Operations

Mastering Application Monitoring with Prometheus: Practical Tips and Best Practices

This article explains how to design effective Prometheus metrics, choose appropriate vectors, labels, buckets, and naming conventions, and offers Grafana usage tricks to help engineers monitor online services, batch jobs, and offline processing systems with clear, actionable insights.

Grafana · Metrics · Operations

0 likes · 9 min read

Liangxu Linux

Dec 10, 2022 · Operations

Master Linux Performance: Load, CPU Context Switches, Memory & Swap Optimization

This guide explains Linux performance fundamentals—including throughput, latency, average load, CPU context switching, memory management, swap behavior, and the essential monitoring tools such as vmstat, pidstat, perf, and strace—while providing concrete command examples and troubleshooting steps.

CPU · Linux · monitoring

0 likes · 44 min read

Laravel Tech Community

Dec 6, 2022 · Backend Development

Using nginx-gui for Visual Nginx Configuration, Monitoring, and Management

This guide introduces the open‑source nginx‑gui tool, explains its configuration and performance‑monitoring features, and provides step‑by‑step instructions for downloading, setting up, and running the GUI on Linux, including code snippets and screenshots.

Backend · monitoring

0 likes · 4 min read

Zhuanzhuan Tech

Dec 6, 2022 · Databases

Migrating MySQL Monitoring from Zabbix to Prometheus Using mysqld_exporter: Multi‑Instance Setup and Troubleshooting

This article explains how to replace Zabbix with Prometheus for MySQL monitoring by configuring mysqld_exporter to collect metrics from multiple MySQL instances, details the required user accounts, shows common errors, and provides step‑by‑step solutions including building a newer exporter, adjusting configuration files, and using auth_module for password management.

Configuration · Exporter · Multi-Instance

0 likes · 14 min read

Laravel Tech Community

Dec 5, 2022 · Databases

Using MySQL Built‑in Commands for Comprehensive Database Monitoring

This article explains how to collect extensive MySQL performance metrics—including connections, buffer cache, locks, SQL status, statement counts, throughput, server configuration, and slow‑query logs—using only MySQL's native SHOW commands and the performance_schema, providing practical code snippets and optimization tips.

Performance Schema · database · monitoring

0 likes · 10 min read

DeWu Technology

Dec 5, 2022 · Operations

Evolution of Application Monitoring at DeWu: From CAT to OpenTelemetry

After rebuilding its transaction system in 2020, DeWu progressed from the basic CAT monitoring tool to OpenTracing with Prometheus, and finally adopted OpenTelemetry to unify metrics, traces, and logs via a custom vmagent‑Kafka‑Flink pipeline, dynamic sampling, and extensible javaagents, positioning the platform for a performance‑analysis‑driven future.

CAT · Microservices · OpenTelemetry

0 likes · 18 min read

ITPUB

Dec 4, 2022 · Cloud Native

How Qunar Scaled Container Monitoring with VictoriaMetrics: A Cloud‑Native Case Study

This article details Qunar's migration from a Prometheus‑based monitoring stack to VictoriaMetrics, describing the limitations they faced, the architectural redesign using vmagent, vmcluster, and vmalert, and the resulting performance improvements and operational benefits for large‑scale Kubernetes environments.

Cloud Native · Kubernetes · Prometheus

0 likes · 14 min read

Tencent Cloud Developer

Dec 2, 2022 · Big Data

Design and Implementation of a Hundred‑Billion‑Scale Real‑Time Monitoring System

The paper presents the design and deployment of a hundred‑billion‑scale real‑time monitoring platform that meets stringent data‑collection, analysis, storage, alerting, and visualization requirements. It compares Oceanus + Elastic Stack against a Zabbix‑Prometheus‑Grafana stack, selects the former, and details the performance and cost optimizations that enable massive, low‑latency monitoring while maintaining high availability.

Elasticsearch · Flink · Oceanus

0 likes · 20 min read

MaGe Linux Operations

Dec 1, 2022 · Operations

Mastering Zabbix: Install and Configure the Open‑Source Monitoring System

This article provides a comprehensive overview of Zabbix, an open‑source, web‑based monitoring solution, detailing its architecture, key features, monitoring principles, core components, default ports, and step‑by‑step instructions for deploying both the server and agent on Linux systems.

Linux · Zabbix · agent configuration

0 likes · 11 min read

Liangxu Linux

Nov 30, 2022 · Operations

Essential Ops Checklist: Prevent Data Loss, Secure Servers, and Optimize Performance

This guide shares practical operations best practices, covering safe online procedures, data protection, security hardening, daily monitoring, performance tuning, and the right mindset to avoid costly mistakes and keep production environments stable and secure.

Backup · Operations · Sysadmin

0 likes · 11 min read

Alibaba Cloud Native

Nov 30, 2022 · Operations

How to Observe RocketMQ Message Lifecycle with OpenTelemetry Metrics

This article explains how RocketMQ's message lifecycle can be fully observed using OpenTelemetry‑based metrics, covering producer, broker, and consumer stages, and shows practical monitoring, alerting, and troubleshooting practices for cloud‑native deployments.

Cloud Native · Metrics · OpenTelemetry

0 likes · 12 min read
