Tagged articles
2179 articles
Open Source Linux
May 5, 2023 · Operations

Essential Ops Lessons from 3.5 Years of Real-World Crises

Drawing from three and a half years of operations work, this article shares hard‑earned best practices on testing, backups, security, monitoring, performance tuning, and the right mindset to avoid costly mistakes such as data loss, service outages, and security breaches.

Backup · monitoring · performance-tuning
0 likes · 12 min read
政采云技术
Apr 29, 2023 · Cloud Native

Understanding Observability: Challenges, Principles, and OpenTelemetry Architecture

The article explains how growing system complexity drives the need for observability, outlines the three pillars of logs, traces, and metrics, compares traditional stability stacks with modern observability, and details OpenTelemetry's design, advantages, and implementation considerations for cloud‑native environments.

Microservices · OpenTelemetry · monitoring
0 likes · 16 min read
Liangxu Linux
Apr 26, 2023 · Operations

Essential Linux Ops Practices to Prevent Disasters

Drawing from years of sysadmin experience, this guide lists concrete Linux operational habits—such as rigorous backups, cautious use of rm‑rf, single‑person changes, SSH hardening, firewall rules, monitoring, and disciplined performance tuning—to help teams avoid costly production failures.

Linux · monitoring · performance tuning
0 likes · 12 min read
DeWu Technology
Apr 26, 2023 · Operations

Stability and Alerting Practices for E‑commerce Order Submission Service

The article details how a high‑throughput e‑commerce checkout pipeline achieves stability by combining fine‑grained metrics, custom trace logs, version‑based data validation, and targeted alert rules that detect latency spikes, error‑code surges, and downstream service failures, enabling rapid incident localization and reliable order processing.

Alerting · e‑commerce · monitoring
0 likes · 12 min read
Zhuanzhuan Tech
Apr 26, 2023 · Backend Development

Design and Implementation of an Automated Payment Channel Management System

This article describes the design, technology choices, architecture, and implementation details of an automated payment channel management system that uses Redis‑based time‑series storage, custom circuit‑breaker logic, and monitoring to achieve fast fault detection, accurate alerting, and future automated failover.

Backend · circuit breaker · fault tolerance
0 likes · 10 min read
Qunar Tech Salon
Apr 24, 2023 · Operations

Design and Evolution of Qunar's Watcher Enterprise Monitoring Platform

The article details the background, architecture, core features, alert governance, trace integration, and cloud‑native evolution of Watcher, Qunar's internally built, highly scalable monitoring platform that unifies application‑level metrics, alerting, and observability across thousands of services and containers.

Alerting · DevOps · cloud-native
0 likes · 19 min read
vivo Internet Technology
Apr 19, 2023 · Backend Development

Investigation of Midnight Interface Timeout in Vivo E‑commerce Activity System

The article details how a midnight interface timeout in Vivo’s e‑commerce activity system was traced to a logging bottleneck: a synchronous Log4j call blocked all threads while a cron‑driven log‑rotation script copied a 2.6 GB file, and the issue was resolved by switching to asynchronous logging with a non‑blocking appender.

Backend · Tomcat · logging
0 likes · 17 min read
DeWu Technology
Apr 19, 2023 · Backend Development

Web Project Code Refactoring: Practices, Challenges, and Solutions

The article details a year‑long refactoring case study of a fast‑iteration web project migrated from Python to Java/Go, describing inherited performance bugs, a prioritized migration plan, monitoring integration, concrete optimizations such as query reduction and cache redesign, and the resulting stability and latency gains, while outlining required developer skills and best‑practice recommendations.

Code Refactoring · DevOps · Go
0 likes · 19 min read
Efficient Ops
Apr 18, 2023 · Databases

Mastering MongoDB Clusters: Setup, Monitoring, Migration, and Optimization

This comprehensive guide explains MongoDB cluster architecture, component roles, common use cases, monitoring commands, essential maintenance operations, data migration steps, troubleshooting of typical production issues, and practical optimization recommendations for high‑performance deployments.

Backup · Cluster · MongoDB
0 likes · 20 min read
Java Architect Essentials
Apr 16, 2023 · Databases

Alibaba Druid Connection Pool in Spring Boot: Concepts, Configuration, Monitoring and Customization

This article introduces Alibaba's Druid database connection pool, explains its core features and filters, shows how to add the Maven starter, configure properties and filters in Spring Boot, demonstrates the built‑in monitoring pages, slow‑SQL logging, Spring AOP integration, and provides methods to remove the default advertisement and retrieve monitoring data via code.

Configuration · Connection Pool · Druid
0 likes · 16 min read
MaGe Linux Operations
Apr 16, 2023 · Operations

How Netflix’s Telltale Transforms Application Monitoring and Alerting

The article details Netflix’s self‑built Telltale monitoring system, explaining how it consolidates data sources, reduces alert fatigue, provides intelligent alerts, and continuously optimizes application health assessment for over 100 production services, ultimately improving operational efficiency and reliability.

Alerting · Netflix · Operations
0 likes · 11 min read
Aotu Lab
Apr 13, 2023 · Frontend Development

How to Triple Your Web App’s Speed: A Front‑End Performance Optimization Playbook

This article walks through a comprehensive front‑end performance optimization process—starting from diagnosing issues with Lighthouse, identifying bottlenecks such as large bundle size and uncompressed assets, applying code splitting, lazy loading, image optimization, CSP, SEO tweaks, and finally setting up continuous monitoring with a custom platform—to achieve a 279% improvement in Lighthouse performance scores and near‑three‑fold speed gains.

Lighthouse · SEO · frontend
0 likes · 11 min read
Ops Development Stories
Apr 13, 2023 · Operations

How to Deploy N9e: A Step‑by‑Step Guide to Unified Observability

This article walks through the challenges of observability for small‑to‑medium companies and provides a detailed, hands‑on guide to installing, configuring, and using the N9e monitoring platform—including architecture options, component setup, and adding data sources—so readers can achieve integrated alerting, metrics, logs, and tracing in a single pane.

N9e · Operations · monitoring
0 likes · 13 min read
Efficient Ops
Apr 12, 2023 · Operations

Building Highly Available Prometheus Monitoring with Thanos: A Practical Guide

This article explains why native Prometheus HA solutions fall short for large, multi‑region clusters and shows how to use Thanos components—including sidecar, query, store gateway, and compactor—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive integration with existing Prometheus deployments.

Kubernetes · Prometheus · Thanos
0 likes · 22 min read
dbaplus Community
Apr 10, 2023 · Operations

Can Ops Roles Disappear? Exploring Self‑Service Platforms, COE Experts, and SaaS in Modern Monitoring

The article examines whether traditional operations positions can become obsolete by analyzing a self‑service platform + COE + Business Partner model, detailing essential monitoring tools, the role of COE specialists, SaaS alternatives, and practical career pathways for newcomers, mid‑level, and senior engineers.

COE · Operations · SaaS
0 likes · 8 min read
ITPUB
Apr 10, 2023 · Operations

How Bytecode Enhancement Enables Zero‑Intrusion Monitoring for Microservices

This article, based on a SACC 2022 talk by Huolala architect Cao Wei, explains the principles of bytecode‑enhancement, its practical implementation for large‑scale microservice monitoring, compares enhancement frameworks, shares best‑practice patterns, and explores broader applications such as service‑mesh sidecars.

Backend · Instrumentation · Java Agent
0 likes · 18 min read
Architecture Digest
Apr 10, 2023 · Operations

Comparison of Common Log Management Tools: Filebeat, Graylog, LogDNA, ELK, Loki, Datadog, Logstash, Fluentd, and Splunk

This article provides a detailed comparison of nine popular log management solutions—Filebeat, Graylog, LogDNA, ELK Stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their core features, pricing models, advantages, and drawbacks to help readers choose the right tool for centralized logging.

ELK · Log Management · cloud
0 likes · 13 min read
DataFunTalk
Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow, covering workflow‑level failures, Spark engine anomalies, resource usage, and offering one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark
0 likes · 13 min read
Refining Core Development Skills
Apr 4, 2023 · Cloud Native

Understanding Container CPU Utilization: Accurate Measurement Methods and the Missing Nice/IRQ/SoftIRQ Metrics

This article explains how to correctly obtain CPU utilization inside containers, compares host and container metrics, describes the use of lxcfs and cgroup files (including cgroup V1/V2) for accurate measurement, and clarifies why container statistics omit nice, irq, and softirq fields.

Cloud Native · Container · cgroup
0 likes · 16 min read
DevOps Cloud Academy
Mar 25, 2023 · Operations

Essential Skills for DevOps Engineers

The article outlines the key competencies DevOps engineers must master—including cloud computing, automation, containerization, CI/CD, monitoring, logging, and infrastructure-as-code—to accelerate, stabilize, and scale software delivery in modern development environments.

DevOps · Infrastructure as Code · automation
0 likes · 6 min read
MaGe Linux Operations
Mar 24, 2023 · Operations

How to Reduce False Alarms in Distributed Systems with Interval Detection

This article explains the challenges of monitoring highly distributed applications, why static alert thresholds often fail, and how interval detection using algorithms like Local Outlier Factor can improve alert accuracy while reducing noise across tools such as Grafana, Zabbix, and Open‑Falcon.

Alerting · Operations · interval detection
0 likes · 16 min read
MaGe Linux Operations
Mar 24, 2023 · Operations

Why Most Monitoring Strategies Fail and How the CAR Framework Fixes Them

This article explains why typical monitoring approaches miss the mark, outlines four root causes of persistent incidents, and introduces the CAR framework—Customer, Application, Resource—to build user‑centric observability that reduces noise, restores trust, and improves reliability.

CAR framework · Operations · incident management
0 likes · 11 min read
ITPUB
Mar 24, 2023 · Cloud Native

Why Open‑Falcon Stalled and How Cloud‑Native Monitoring Is Evolving

This article reviews the evolution of monitoring in the cloud‑native era, analyzes Open‑Falcon’s architecture, strengths, and shortcomings, explains why its development hit a bottleneck, and outlines the design principles and features of the Nightingale monitoring system as a modern, open‑source alternative.

Microservices · Open-Falcon · architecture
0 likes · 15 min read
dbaplus Community
Mar 20, 2023 · Operations

How Xianyu’s Messaging Team Built a Zero‑Incident System with Gray Releases, Monitoring, and Automated Regression

The article details how Xianyu’s messaging team systematically improved system stability by classifying risks, implementing gray‑release traffic, establishing dedicated monitoring and alerting dashboards, integrating automated regression into CI/CD, and managing strong‑weak dependencies, ultimately reducing online incidents to near zero.

Operations · automated regression · dependency management
0 likes · 10 min read
Architecture Digest
Mar 18, 2023 · Operations

Understanding Log Importance and Operations in Distributed Architecture

This article explains what logs are, why they are crucial in large‑scale distributed systems, outlines the requirements for effective log operations, reviews common tooling such as ELK, Prometheus and tracing solutions, provides a Go example for batch log retrieval, and shares best‑practice guidelines to achieve observability.

APM · monitoring
0 likes · 19 min read
NetEase Smart Enterprise Tech+
Mar 15, 2023 · Operations

How Yidun Automates Performance Testing to Overcome Real‑World Pain Points

This article explains performance testing fundamentals, why it matters, the specific challenges Yidun faced such as complex execution, human‑dependent monitoring, data isolation, and cost loss, and describes their automated, gradient‑based testing platform with quantified monitoring and future visualisation plans.

Data Isolation · Operations · Performance Testing
0 likes · 8 min read
dbaplus Community
Mar 14, 2023 · Backend Development

How to Detect and Solve Java Application Performance Bottlenecks: A Practical Guide

This article walks through the evolution of a system’s performance concerns, defines speed and pressure dimensions, explains how to calculate RT, QPS and concurrency, compares QPS with TPS, and provides step‑by‑step methods using tools like Arthas, JMeter and JVM diagnostics to identify and fix CPU, memory and pressure issues before applying layered optimization strategies.

Arthas · JMeter · Profiling
0 likes · 12 min read
Tencent Cloud Developer
Mar 13, 2023 · Cloud Computing

Design Principles for High‑Availability System Architecture

The article outlines a comprehensive high‑availability architecture framework across six layers—development standards, application services, storage, product fallback, operations deployment, and emergency response—detailing design principles such as stateless services, elastic scaling, redundant storage, robust monitoring, gray releases, and chaos engineering to ensure resilient, continuously available systems.

Deployment · Scalability · System Architecture
0 likes · 25 min read
Open Source Linux
Mar 9, 2023 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins for Modern Ops?

An in‑depth comparison of Prometheus and Zabbix examines their histories, architectures, data storage, scalability, and container support, highlighting Prometheus’s cloud‑native pull model and Go‑based performance versus Zabbix’s mature, relational‑database approach, to help teams choose the right monitoring solution.

Prometheus · Time Series Database · Zabbix
0 likes · 8 min read
Top Architect
Mar 8, 2023 · Databases

Deep Dive into Prometheus V2 Storage Engine and Query Process

This article explains the internal storage layout, on‑disk and in‑memory data structures, and the query execution flow of Prometheus V2, illustrating how blocks, chunks, WAL, indexes and postings are organized and accessed to serve time‑series queries efficiently.

Go · Prometheus · Storage Engine
0 likes · 15 min read
AntTech
Mar 7, 2023 · Cloud Native

Introduction to HoloInsight: A Cloud‑Native Lightweight Observability Platform

HoloInsight is an open‑source, cloud‑native observability platform derived from Ant Group's AntMonitor, offering integrated log‑based monitoring, business metric analysis, and AI‑driven AIOps capabilities while providing a lightweight, modular architecture and extensive extensibility for modern software stacks.

aiops · cloud-native · log analysis
0 likes · 13 min read
Architect
Feb 27, 2023 · Databases

Understanding Prometheus V2 Storage Engine and Query Process

This article explains the architecture of Prometheus V2, detailing its on‑disk block layout, chunk and index formats, the inverted index mechanism, and how queries locate and retrieve time‑series data, while also covering in‑memory structures and practical usage patterns.

CloudNative · Prometheus · StorageEngine
0 likes · 14 min read
DeWu Technology
Feb 27, 2023 · Operations

Message Push Monitoring and SLA Practices

The team implemented SLA‑based, node‑level monitoring for mobile push messages—splitting the workflow, measuring latency, blocking volume, and success rates, isolating metrics with Spring AOP, and tracking third‑party vendors—resulting in clear latency standards, doubled peak throughput, faster issue resolution, and improved overall reliability.

Message Push · Operations · SLA
0 likes · 11 min read
ITPUB
Feb 24, 2023 · Databases

How Ctrip Migrated MySQL to OceanBase: Tools, Process, and Lessons Learned

Ctrip evaluated and extended OceanBase Migration Assessment tools, built a one‑click migration workflow, implemented comprehensive monitoring and automatic fault‑diagnosis pipelines, and addressed compatibility challenges such as .NET charset issues and Druid parser errors, ultimately achieving a smooth MySQL‑to‑OceanBase transition.

OceanBase · Performance Diagnosis · database migration
0 likes · 18 min read
Baidu Geek Talk
Feb 20, 2023 · Operations

Deep Dive into Logging Operations and Observability in Distributed Systems

The article examines logging’s critical role in distributed systems, detailing its purpose, severity levels, and value for debugging, performance, security, and auditing, while highlighting challenges of inconsistent formats and traceability, and reviewing observability pillars, ELK and tracing tools, and practical implementation best practices.

APM · ELK · Prometheus
0 likes · 19 min read
21CTO
Feb 16, 2023 · Operations

Which Log Management Tool Is Right for You? A Comprehensive Comparison of 9 Solutions

This article provides a detailed comparison of nine popular log management tools—including Filebeat, Graylog, LogDNA, ELK, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing, advantages, and disadvantages to guide readers in selecting the most suitable solution for their needs.

ELK · Log Management · Operations
0 likes · 16 min read
Xianyu Technology
Feb 16, 2023 · Operations

Stability Governance of Xianyu Messaging System

Since launching a systematic stability‑governance program in August 2022, Xianyu’s messaging system has employed gray releases, dedicated monitoring, daily automated regression, dependency reviews and drills, resulting in near‑zero online incidents within six months and demonstrating that continuous, context‑specific measures and vigilant change management are essential for reliable C2C transactions.

Messaging · automation · dependency management
0 likes · 7 min read
macrozheng
Feb 11, 2023 · Operations

Deploy and Use Uptime Kuma: A Simple, Beautiful Open‑Source Monitoring Tool

This article introduces Uptime Kuma, an open‑source monitoring tool, and provides step‑by‑step instructions for installing it via Docker or manually, configuring monitors, and using its dashboard, highlighting its simplicity, visual appeal, and support for multiple services and notification methods.

Docker · Self-hosted · Uptime Kuma
0 likes · 3 min read
MaGe Linux Operations
Feb 10, 2023 · Cloud Native

How to Detect and Prevent OOM and CPU Throttling in Kubernetes Pods

This article explains why Kubernetes pods encounter out‑of‑memory errors and CPU throttling, how limits and requests influence resource allocation, and provides practical monitoring techniques using Prometheus and cAdvisor to proactively identify and mitigate these issues before they impact performance or cause pod eviction.

CPU throttling · OOM · cAdvisor
0 likes · 9 min read
Alibaba Cloud Native
Feb 8, 2023 · Cloud Native

Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark

This article benchmarks Alibaba Cloud Prometheus against the open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns, revealing that while the open‑source version remains stable under light load, its CPU and memory usage grow non‑linearly with high cardinality, whereas Alibaba's managed service delivers higher compatibility, better query performance, and more predictable scaling.

Cloud Native · Metrics · Prometheus
0 likes · 30 min read
dbaplus Community
Feb 6, 2023 · Operations

How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services

This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.

Operations · System Architecture · aiops
0 likes · 18 min read
ByteFE
Feb 5, 2023 · Frontend Development

Front‑End Development Insights: Monitoring, Low‑Code, TypeScript, Trends, and Performance Optimizations

This collection presents a range of front‑end development insights, including ByteDance’s monitoring practices, low‑code product considerations, DeepKit’s TypeScript runtime capabilities, emerging 2023 trends, web development forecasts, the need for diverse JavaScript frameworks, markdown‑to‑PPT tools, and Vue 3 table performance optimizations.

Vue3 · low-code · monitoring
0 likes · 5 min read
政采云技术
Feb 2, 2023 · Operations

Distributed Tracing Overview and SkyWalking Architecture

This article explains the fundamentals of distributed tracing, introduces the Dapper and OpenTracing models, and details SkyWalking's data collection, cross‑process propagation, bytecode enhancement, architecture components, monitoring, alerting, and performance characteristics for microservice environments.

Distributed Tracing · Microservices · OpenTracing
0 likes · 10 min read
HelloTech
Jan 31, 2023 · Operations

Stability Assurance Practices for Large‑Scale Promotional Events

The article outlines a comprehensive stability‑assurance framework for large‑scale promotional events—detailing planning, capacity and pressure‑test rehearsals, strict change‑freeze, internal gray releases, coordinated on‑call response, thorough link and capacity analysis, monitoring, emergency procedures, cross‑team collaboration, external partner coordination, and post‑event review to ensure resilient system performance.

Large-Scale Events · Performance Testing · capacity planning
0 likes · 17 min read
Architect
Jan 30, 2023 · Backend Development

Unified Exception Monitoring and Reporting with ASM and JavaAgent

This article explains how to use Java bytecode instrumentation with ASM and a JavaAgent to automatically monitor, capture, and report exceptions across a backend system, covering exception fundamentals, best‑practice handling, and practical implementation steps.

ASM · Exception Handling · JavaAgent
0 likes · 14 min read
Architect
Jan 29, 2023 · Backend Development

DynamicTp: A Dynamic ThreadPoolExecutor Extension for Real‑time Monitoring and Configuration

DynamicTp is a Java library that extends ThreadPoolExecutor to enable dynamic, configuration‑center‑driven adjustment of thread‑pool parameters, real‑time monitoring, alert notifications, and metric collection, providing a lightweight, zero‑intrusion solution for microservice architectures.

Dynamic Configuration · SpringBoot · ThreadPool
0 likes · 9 min read
IT Architects Alliance
Jan 27, 2023 · Backend Development

Comprehensive Guide to Building a Backend Technology Stack for Startup Companies

This article provides a detailed overview of the essential backend technology stack for startups, covering language choices, components, processes, systems, and cloud services, and offers practical recommendations for selecting databases, messaging, monitoring, CI/CD, and deployment tools to build a robust, scalable infrastructure.

Backend · Technology Stack · database
0 likes · 28 min read
NetEase Cloud Music Tech Team
Jan 17, 2023 · Big Data

How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance

This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.

Big Data · Resource Optimization · baseline governance
0 likes · 11 min read
dbaplus Community
Jan 16, 2023 · Operations

Beyond Success‑Ratio: How User‑Uptime Reveals Real Product Availability

The article reviews traditional availability metrics such as Success‑Ratio, Error‑Budget, MTTR/MTTF, SLA/SLO, and highlights their limitations, then introduces Google’s User‑Uptime and Windowed User‑Uptime metrics, explains their definitions, challenges, experimental results, and why they provide a more user‑centric view of service reliability.

Availability · Metrics · SRE
0 likes · 27 min read
Efficient Ops
Jan 16, 2023 · Operations

How China Mobile’s Centralized AIOps Platform Achieved Top‑Tier Evaluation

This article presents an interview with China Mobile Information about its centralized AIOps platform, covering the recent excellent‑level assessment by the China Academy of Information and Communications Technology, the system's key modules, future plans, and the broader significance of AI‑driven IT operations.

Artificial Intelligence · IT Operations · Root Cause Analysis
0 likes · 11 min read
DataFunSummit
Jan 16, 2023 · Big Data

Building an O2O Industry Data Platform: From Monitoring to Diagnosis

This article shares practical insights on constructing an O2O industry data platform, detailing user classification, business pain points, and a three‑step strategy—monitoring, analysis, and diagnosis—to extract core metrics, implement tailored reporting, conduct operational and pricing analyses, and drive data‑driven product improvements.

Analysis · Business Intelligence · Data Platform
0 likes · 15 min read
Code Ape Tech Column
Jan 14, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Pros and Cons

This article provides a detailed comparison of nine popular log management solutions—including Filebeat, Graylog, LogDNA, the ELK stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing models, advantages, and disadvantages to help readers choose the right tool for their needs.

ELK · Log Management · monitoring
0 likes · 16 min read
Data Thinking Notes
Jan 10, 2023 · Big Data

How Bilibili Built a Scalable Data Quality Platform for Billions of Events

This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.

Big Data · Data Quality · Root Cause Analysis
0 likes · 21 min read
Alibaba Cloud Developer
Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Operations · capacity planning · incident response
0 likes · 25 min read
Efficient Ops
Jan 9, 2023 · Operations

Boost Ops Efficiency: 5 Python Scripts Every Sysadmin Should Use

This article explains how Python can automate common operations tasks—remote command execution, log parsing, system monitoring with alerts, bulk software deployment, and backup/restore—providing code examples for each and highlighting additional tools that help sysadmins improve efficiency and reduce errors.

Backup · Deployment · Python
0 likes · 9 min read

Loggie: A High-Performance Log Collection Agent System Design and Implementation

Loggie is a cloud-native, Go-based log-collection agent that replaces Filebeat and Flume by using a micro-kernel producer-consumer architecture with hot-swappable pipelines, achieving 2 GB/s read speeds, 1.6‑2.6× higher throughput while using only a quarter of the CPU, and providing built-in observability, reliability, and latency monitoring for large-scale enterprise deployments.

Go · Operations · log agent
0 likes · 16 min read
Architecture Digest
Jan 8, 2023 · Operations

Design and Evolution of Vivo Server Monitoring System

This article systematically presents the business background, basic monitoring workflow, usage guidelines, OpenTSDB fundamentals, code precision issues, vmonitor collector architecture, old and new system designs, core alerting metrics, demo illustrations, and a comparison with mainstream monitoring solutions, offering insights for technology selection.

Alerting · OpenTSDB · Server
0 likes · 18 min read
vivo Internet Technology
Jan 4, 2023 · Artificial Intelligence

Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis

The article describes a root‑cause localization algorithm implemented in vivo’s monitoring platform that automatically analyzes latency spikes by splitting service timelines, computing variance, clustering results with K‑means, and recursively tracing downstream services. It achieves over 85% accuracy for dependency failures, still requires human verification, and the article outlines planned AI‑driven enhancements.

Fault Localization · K-Means · Root Cause Analysis
0 likes · 13 min read
Architecture Digest
Dec 30, 2022 · Operations

Vivo Monitoring Platform: Architecture, Evolution, and Future Directions

The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AI‑ops enabled system that ensures service availability across massive infrastructure.

Reliability · aiops · cloud-native
0 likes · 16 min read
vivo Internet Technology
Dec 28, 2022 · Operations

Monitoring Service System Construction and Exploration Practice

The article outlines vivo’s evolution from simple Zabbix monitoring to a self‑built, unified monitoring platform that now covers infrastructure, containers, databases and user experience at massive scale, integrating AI‑ops, cloud‑native observability and unified alerting to ensure end‑to‑end service reliability and future intelligent, one‑stop monitoring.

Vivo · aiops · architecture
0 likes · 28 min read
Top Architect
Dec 26, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Networking, Fault Handling, Monitoring, Governance and Deployment

This article provides a thorough overview of backend development, covering system design principles such as high cohesion and low coupling, architectural patterns for high concurrency and availability, network communication techniques, common fault scenarios, monitoring and alerting strategies, service governance practices, and deployment workflows.

Backend · Deployment · Scalability
0 likes · 30 min read
Yunxuetang Frontend Team
Dec 23, 2022 · Frontend Development

Build the Ultimate Front-End Monitoring System & Master Advanced JS Tricks

This article introduces a series of front‑end engineering articles covering a comprehensive monitoring architecture, CSS‑based click throttling, drag‑and‑drop implementation, 30 essential JavaScript concepts, and a simple responsive data‑dashboard solution, aiming to deepen developers’ skills and showcase resume‑worthy expertise.

JavaScript · UI · Web Development
0 likes · 3 min read
Sohu Tech Products
Dec 21, 2022 · Frontend Development

Design and Implementation of a Front‑End Monitoring Platform and SDK

This article presents a comprehensive guide to building a front‑end monitoring system—including pain points, error‑reconstruction techniques, data collection methods, performance metrics, user‑behavior tracking, and a modular SDK architecture—illustrated with detailed code examples for Vue, React, XHR, fetch, and cross‑origin handling.

SDK · error tracking · frontend
0 likes · 32 min read
Architecture Digest
Dec 21, 2022 · Operations

Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design from development standards, capacity planning, application services, storage, product strategies, operations deployment, to incident response, presenting key concepts, architectural patterns, and practical guidelines for building resilient services.

Deployment · Operations · System Design
0 likes · 27 min read
Huawei Cloud Developer Alliance
Dec 20, 2022 · Operations

How Huawei Cloud SRE Scaled Monitoring with openGemini: A Real‑World Performance Case Study

Facing hundreds of terabytes of daily monitoring data, Huawei Cloud SRE replaced HBase with the open‑source time‑series database openGemini, conducting extensive write and query performance tests that demonstrated linear scaling, superior query speed, and significant reductions in storage, CPU, and memory usage.

Performance Testing · cloud operations · monitoring
0 likes · 8 min read
Efficient Ops
Dec 19, 2022 · Operations

How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

CDN · Operations · SRE
0 likes · 21 min read
Zhuanzhuan Tech
Dec 6, 2022 · Databases

Migrating MySQL Monitoring from Zabbix to Prometheus Using mysqld_exporter: Multi‑Instance Setup and Troubleshooting

This article explains how to replace Zabbix with Prometheus for MySQL monitoring by configuring mysqld_exporter to collect metrics from multiple MySQL instances, details the required user accounts, shows common errors, and provides step‑by‑step solutions including building a newer exporter, adjusting configuration files, and using auth_module for password management.

Configuration · Exporter · Multi-Instance
0 likes · 14 min read
Laravel Tech Community
Dec 5, 2022 · Databases

Using MySQL Built‑in Commands for Comprehensive Database Monitoring

This article explains how to collect extensive MySQL performance metrics—including connections, buffer cache, locks, SQL status, statement counts, throughput, server configuration, and slow‑query logs—using only MySQL's native SHOW commands and the performance_schema, providing practical code snippets and optimization tips.

Performance Schema · database · monitoring
0 likes · 10 min read
DeWu Technology
Dec 5, 2022 · Operations

Evolution of Application Monitoring at 得物: From CAT to OpenTelemetry

After rebuilding its transaction system in 2020, 得物 progressed from the basic CAT monitoring tool to OpenTracing with Prometheus, and finally adopted OpenTelemetry to unify metrics, traces, and logs via a custom vmagent‑Kafka‑Flink pipeline, dynamic sampling, and extensible javaagents, positioning the platform for a performance‑analysis‑driven future.

CAT · Microservices · OpenTelemetry
0 likes · 18 min read
ITPUB
Dec 4, 2022 · Cloud Native

How Qunar Scaled Container Monitoring with VictoriaMetrics: A Cloud‑Native Case Study

This article details Qunar's migration from a Prometheus‑based monitoring stack to VictoriaMetrics, describing the limitations they faced, the architectural redesign using vmagent, vmcluster, and vmalert, and the resulting performance improvements and operational benefits for large‑scale Kubernetes environments.

Cloud Native · Kubernetes · Prometheus
0 likes · 14 min read
Tencent Cloud Developer
Dec 2, 2022 · Big Data

Design and Implementation of a Hundred‑Billion‑Scale Real‑Time Monitoring System

The paper presents the design and deployment of a hundred‑billion‑scale real‑time monitoring platform that meets stringent data‑collection, analysis, storage, alerting, and visualization requirements. It compares Oceanus + Elastic Stack against a Zabbix‑Prometheus‑Grafana stack, selects the former, and details the performance and cost optimizations that enable massive, low‑latency monitoring while maintaining high availability.

Elasticsearch · Flink · Oceanus
0 likes · 20 min read