Tagged articles
2179 articles
dbaplus Community
Sep 2, 2019 · Operations

How Qunar Leverages AI‑Driven Fault Prediction and Health Management to Boost System Reliability

This article summarizes Zhang Yan's presentation at the 2019 Gdevops Global Agile Operations Summit, detailing Qunar's OPS goals, evolution of its automation platform, the adoption of PHM concepts from aerospace to internet services, and practical fault‑prediction workflows, metrics, and challenges for achieving higher availability.

PHM · Qunar · aiops
0 likes · 24 min read
macrozheng
Aug 30, 2019 · Backend Development

How to Build and Secure a Spring Boot Admin Dashboard with Eureka Integration

This tutorial walks through setting up Spring Boot Admin as a monitoring server and client, integrating it with Eureka for service discovery, adding Spring Security for authentication, and configuring email and custom notifications, complete with Maven and YAML configurations and Java code examples.

Spring Boot · eureka · java
0 likes · 23 min read
转转QA
Aug 28, 2019 · Frontend Development

Using Puppeteer for UI Automation: Challenges, Solutions, and a Monitoring System

This article examines the difficulties of UI automation such as high script costs, instability, and rapid UI changes, and presents practical solutions using Puppeteer—including device emulation, robust test architecture with Mocha, error handling, dynamic selector strategies, and a monitoring system that captures screenshots and reports failures.

monitoring · nodejs · testing
0 likes · 11 min read
Youzan Coder
Aug 23, 2019 · Big Data

How to Build a Robust Event Logging Quality System with Real‑Time Validation

This article outlines common event‑logging quality problems, a systematic registration and real‑time validation framework built on Flink, configurable rule syntax, explainable results, continuous monitoring, targeted optimizations, and an evaluation model that together form a comprehensive quality‑center for big‑data platforms.

Big Data · Data Quality · Flink
0 likes · 11 min read
58 Tech
Aug 20, 2019 · Frontend Development

Architecture Design of a Front-End Monitoring Platform

This article describes the design and architecture of a front‑end monitoring platform, detailing its JS SDK, data analyzer, web UI, reference log‑collection architectures, use of Kafka, MySQL, Hive and HBase, scaling considerations, storage conventions, and operational best practices.

JavaScript · frontend · logging
0 likes · 8 min read
Programmer DD
Aug 13, 2019 · Operations

Mastering Prometheus Histograms: How Cumulative Buckets Simplify Metrics

This article explains the fundamentals of Prometheus histogram metrics, illustrates why they are cumulative, shows how to drop unwanted buckets with relabeling, and demonstrates quantile calculations using the histogram_quantile function, providing practical examples and code snippets for effective monitoring.

Histogram · Metrics · Prometheus
0 likes · 7 min read
DevOps
Aug 13, 2019 · Operations

Comprehensive DevOps Toolset Overview

This article presents a detailed, categorized list of DevOps tools—including version control, automated build and testing, CI/CD, container platforms, configuration management, micro‑service platforms, logging, and monitoring solutions—providing concise descriptions for each to help teams select appropriate utilities for modern software delivery pipelines.

Configuration Management · DevOps · automation
0 likes · 14 min read
dbaplus Community
Jul 29, 2019 · Operations

How to Build a Cost‑Effective, Multi‑Layer Monitoring System for Distributed Applications

This article explains why comprehensive, multi‑layer monitoring is essential for distributed systems, outlines environment, program, and business metrics, recommends practical tools such as Zabbix, Open‑Falcon, Prometheus, and Grafana, and provides a step‑by‑step evolution plan and alerting strategy.

Distributed Systems · Metrics · Prometheus
0 likes · 10 min read
58 Tech
Jul 23, 2019 · Operations

Design and Implementation of an Open Alarm Platform for Monitoring Systems

The Open Alarm Platform provides a flexible data model, modular architecture, and robust stability features to enable various business lines to integrate their custom monitoring systems via APIs, offering alert convergence, merging, multi‑channel delivery, and comprehensive management while reducing development and maintenance costs.

Alerting · Operations · Scalability
0 likes · 9 min read
Suning Technology
Jul 17, 2019 · Artificial Intelligence

What the 2019 International AIOps Challenge Reveals About AI‑Driven Operations

The 2019 International AIOps Challenge, co‑hosted by Suning Technology, the China Computer Federation, Tsinghua, Nankai and Huawei, showcased AI‑powered solutions for KPI anomaly detection, highlighted academic‑industry collaboration, and underscored the growing impact of intelligent monitoring on modern IT operations.

AI · Intelligent Operations · aiops
0 likes · 6 min read
21CTO
Jul 13, 2019 · Operations

How to Set Up Automated Linux Memory & Swap Monitoring with Email Alerts

Learn step‑by‑step how to install the msmtp email client, configure mutt, use the free command to monitor Linux memory and swap usage, write Bash scripts that log and email the results, and schedule these checks with cron for continuous system health alerts.

Bash · Email · Linux
0 likes · 7 min read
360 Tech Engineering
Jul 12, 2019 · Operations

StackStorm‑Based Monitoring Alert Auto‑Remediation Solution

This article introduces a StackStorm‑driven monitoring and alert auto‑remediation architecture that converges alarms, performs root‑cause analysis, and executes self‑healing actions, detailing its components, workflow, configuration examples, and real‑world deployment outcomes.

Auto‑Remediation · Operations Automation · StackStorm
0 likes · 7 min read
Meitu Technology
Jul 9, 2019 · Backend Development

Performance Optimization Practices in Meitu XiuXiu Community

The Meitu XiuXiu community tackled performance bottlenecks caused by rapid user growth by deploying end‑to‑end monitoring (client Hubble and RED‑based server metrics), full‑link load testing, DNS and image‑delivery optimizations, and server‑side tuning such as biased‑locking removal and JIT warm‑up, with an emphasis on user experience and cross‑team collaboration.

Backend · DNS Optimization · full‑link testing
0 likes · 25 min read
dbaplus Community
Jul 8, 2019 · Big Data

How to Use ClickHouse Sampling and Materialized Views for Real‑Time Monitoring of Billion‑Scale Ad Traffic

This article explains how to handle high‑volume advertising monitoring by storing raw request logs in ClickHouse, enabling sampling and materialized views, and using TP999 metrics, aggregating tables, and Grafana queries to achieve fast, flexible, and low‑impact real‑time analytics on billions of events.

Sampling · big-data · clickhouse
0 likes · 10 min read
Architecture Digest
Jul 8, 2019 · Backend Development

Evolution and Architecture of MaFengWo Payment Center (Version 1.0 → 2.0)

The article details the evolution of MaFengWo's payment center from a basic payment‑refund module (1.0) to a comprehensive, modular platform (2.0), describing its core capabilities, layered architecture, customizable checkout, routing management, monitoring system, and future micro‑service roadmap.

Backend Architecture · Scalability · monitoring
0 likes · 15 min read
System Architect Go
Jul 5, 2019 · Backend Development

Key Monitoring Metrics for Node.js Applications and Open‑Source Tools

This article explains why monitoring is essential for Node.js applications, outlines the most important performance metrics such as CPU usage, memory usage, garbage collection, event‑loop latency, clustering, and request/response latency, and introduces several ready‑to‑use open‑source monitoring tools.

Node.js · Open-source · monitoring
0 likes · 6 min read
Architecture Digest
Jul 5, 2019 · Operations

The Story of Elasticsearch and the Elastic Stack: From Origins to ELK

This article narrates the origin and evolution of Elasticsearch, its underlying Lucene technology, the surrounding Elastic Stack components such as Logstash, Kibana, and Beats, and illustrates how they together provide powerful search, logging, monitoring, and analytics solutions for modern applications.

Beats · Elastic Stack · Kibana
0 likes · 11 min read
Tencent IMWeb Frontend Team
Jul 4, 2019 · Cloud Computing

Migrating a Lightweight Web App to Serverless on Tencent Cloud: A Step‑by‑Step Guide

This article explains the fundamentals of Serverless architecture, its pros and cons, and provides a detailed, practical walkthrough for migrating a lightweight web application to Tencent Cloud's Serverless Cloud Function platform, covering architecture redesign, data storage, performance tuning, debugging, deployment, logging, and monitoring.

Deployment · debugging · monitoring
0 likes · 22 min read
ITPUB
Jul 2, 2019 · Databases

How ClickHouse Powers Ctrip’s Hotel Data Platform for Billions of Daily Updates

This article explains how Ctrip’s hotel data intelligence platform handles over ten billion daily data updates and nearly a million queries by adopting ClickHouse, detailing the system's background, the reasons for choosing ClickHouse over other solutions, the data ingestion pipelines, monitoring strategies, operational practices, and performance outcomes.

Big Data · Real-time analytics · clickhouse
0 likes · 13 min read
Java High-Performance Architecture
Jul 2, 2019 · Operations

How to Build Highly Available Systems: 8 Essential Strategies

This article outlines eight practical high‑availability techniques—multiple replicas, isolation, rate limiting, circuit breaking, degradation, gray releases with rollback, comprehensive monitoring, and proactive log alerting—to help engineers design systems that are both efficient and reliable under heavy load.

System Design · circuit breaker · degradation
0 likes · 7 min read
Architecture Digest
Jul 2, 2019 · Fundamentals

Key Practices for High Availability, Isolation, and Data Consistency in Large‑Scale Internet Systems

The article outlines essential techniques for building highly available internet services, covering system availability metrics, multi‑level caching, database and service isolation, concurrency control, gray‑release deployment, comprehensive monitoring, graceful degradation, asynchronous design, and data‑consistency scenarios for both real‑time and offline big‑data workloads.

Data Consistency · System Architecture · high availability
0 likes · 8 min read
Tencent Cloud Developer
Jul 1, 2019 · Information Security

How to Detect and Prevent Cloud Data Leaks: Practical Strategies and Rule Configurations

This guide explains recent cloud‑based data‑leak incidents, categorizes common leak vectors, analyzes technical and managerial root causes, and provides actionable monitoring techniques, rule‑configuration examples, and incident‑response steps using Tencent Cloud Security Operations Center.

GitHub · Security Operations · Tencent Cloud
0 likes · 19 min read
dbaplus Community
Jun 27, 2019 · Artificial Intelligence

How AI Powers Intelligent Multi-Modal Financial Data Quality Monitoring

This article presents the design, implementation, and evaluation of X‑monitor, an AI‑driven, adaptive, multi‑modal financial data quality monitoring platform that combines rule‑based and self‑learning strategies to improve detection efficiency, accuracy, and flexibility for large‑scale securities‑firm data streams.

AI · big-data · data-quality
0 likes · 24 min read
Sohu Tech Products
Jun 26, 2019 · Operations

Distributed Tracing and Observability: Principles, OpenTracing Standard, and Open‑Source Solutions Comparison

This article explains how microservice complexity drives the need for observability, outlines its three pillars—logging, metrics, and tracing—describes OpenTracing concepts and APIs, and compares major open‑source distributed tracing systems to help engineers choose the right solution for fault localization, performance analysis, and capacity planning.

OpenTracing · monitoring
0 likes · 11 min read
Architecture Digest
Jun 25, 2019 · Operations

Design and Implementation of a Unified Monitoring and Alert System for MaFengWo Large Transportation Business

This article describes the motivation, architecture, key components, rule engine, alert actions, and practical lessons learned while building a unified monitoring and alarm system for MaFengWo's large‑scale transportation platform, highlighting data collection, Elasticsearch storage, scheduling, and future enhancements.

Alerting · Elasticsearch · architecture
0 likes · 13 min read
DevOps Cloud Academy
Jun 20, 2019 · Operations

Step-by-Step Installation and Configuration of Node Exporter, Alertmanager, Prometheus, and Grafana for Monitoring and Alerting

This guide walks through downloading, extracting, and setting up Node Exporter, Alertmanager, Prometheus, and Grafana on a Linux server, configuring their systemd services, customizing alert rules, and verifying the monitoring and alerting pipeline with screenshots of each verification step.

Alertmanager · Grafana · Operations
0 likes · 7 min read
ITPUB
Jun 20, 2019 · Operations

Essential Ops Lessons: Avoid Disasters with Backups, Permissions, and Monitoring

This article shares hard‑earned operational guidelines for Linux servers, covering safe testing, cautious use of rm ‑rf, the importance of backups, strict access control, SSH hardening, firewall rules, intrusion detection, systematic monitoring, performance tuning, and maintaining a calm mindset to prevent costly incidents.

Operations · Server Administration · monitoring
0 likes · 12 min read
Architecture Digest
Jun 19, 2019 · Big Data

Design and Optimization of Large‑Scale Log Systems for High‑Volume Data

This article examines the challenges of handling massive log data in large‑scale e‑commerce platforms, outlines a baseline ELK‑based architecture, discusses real‑time versus near‑real‑time requirements, and presents four optimization strategies—including basic tuning, platform scaling, data partitioning, and system degradation—to improve performance, resource utilization, and reliability.

ELK · Log Management · System optimization
0 likes · 17 min read
Java Backend Technology
Jun 19, 2019 · Backend Development

Enterprise Redis: Scaling, Monitoring, and Business Isolation

This article explores how enterprises can effectively use Redis by partitioning clusters for independent or shared use, addressing key naming conflicts, implementing graceful scaling with Zookeeper, monitoring performance via Open-Falcon, and quickly isolating problematic business traffic to maintain system stability.

Business Isolation · Cluster · monitoring
0 likes · 10 min read
Alibaba Cloud Developer
Jun 18, 2019 · Operations

Why Designing for Failure Is the Key to Resilient Systems

The article explains how anticipating and engineering for diverse failure scenarios—from hardware faults and software bugs to traffic spikes and external attacks—can dramatically improve system reliability, reduce downtime, and protect business continuity in modern distributed and cloud environments.

disaster recovery · failure design · monitoring
0 likes · 12 min read
Meitu Technology
Jun 12, 2019 · Cloud Computing

Meitu's Cloud-Based Image Beautification and Large-Scale Video Processing Architecture

Meitu replaced on-device beautification and video processing with a cloud-native architecture that routes requests by region, uses a dedicated upload SDK for detailed monitoring, employs edge-computing, a configuration-driven plug-in framework and Kubernetes-based elastic scaling, enabling fast, reliable, globally-distributed image and video services.

Edge Computing · Meitu · Video processing
0 likes · 12 min read
Architecture Digest
Jun 12, 2019 · Fundamentals

Comprehensive Guide to Distributed System Theory – Curated Article Collection

This resource compiles a complete series of articles on distributed system theory covering consistency, consensus, high availability, scalability, performance, testing, and operations, offering both quick overviews for newcomers and in‑depth readings for practitioners seeking to master modern distributed architectures.

Consistency · Scalability · architecture
0 likes · 8 min read
DevOps Cloud Academy
Jun 9, 2019 · Operations

Prometheus Metric Definitions, Types, and Data Samples

This article explains Prometheus metric naming conventions, label usage, metric types such as Counter, Gauge, Summary, and Histogram, and describes the structure of data samples, providing examples and best‑practice guidelines for defining and classifying metrics in monitoring systems.

Metrics · Operations · Prometheus
0 likes · 5 min read
dbaplus Community
Jun 3, 2019 · Operations

Top 5 Open‑Source Log Analysis Tools Every Ops Team Should Try

Monitoring network activity and ensuring compliance requires effective log analysis, and this article reviews five open‑source tools—Graylog, Nagios, Elastic Stack, LOGalyze, and Fluentd—detailing their features, strengths, and use cases for operations and security teams.

log analysis · monitoring
0 likes · 11 min read
MaGe Linux Operations
May 28, 2019 · Operations

What Skills and Knowledge Do You Need to Master Large‑Scale Website Operations?

This article explains what large‑scale website operations entail, outlines the product lifecycle and the crucial role of operations engineers, lists essential technical skills and personal qualities, and discusses current challenges, future prospects, and key technical topics such as cluster management, monitoring, fault handling, and automation.

Cluster Management · DevOps · Site Operations
0 likes · 18 min read
Big Data Technology & Architecture
May 23, 2019 · Backend Development

Error Handling Strategies for Kafka Connectors: Immediate Stop, Silent Ignoring, and Dead‑Letter Queue

This article explains how to configure Kafka Connect error handling options—including stopping on failure, silently ignoring malformed messages, and routing failed records to a dead‑letter queue—while providing practical examples, monitoring techniques, and code snippets for robust data pipelines.

Configuration · Dead Letter Queue · error-handling
0 likes · 21 min read
dbaplus Community
May 22, 2019 · Operations

Designing a Scalable Monitoring System: From Data Collection to Alerting

This article explains how to build a comprehensive monitoring system for distributed applications by classifying monitoring functions, describing data quadrants, outlining core modules such as collection, processing, feature extraction, and visualization, and reviewing typical implementations for metrics, logs, tracing, alerting, and the key open‑source components involved.

Distributed Systems · Metrics · monitoring
0 likes · 18 min read
Efficient Ops
May 21, 2019 · Operations

Essential Linux Ops Tools: Nethogs, IOZone, IOTop, and More

This guide introduces a dozen practical Linux operations tools, including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, and Tmux, with concise descriptions, download links, and ready‑to‑run installation commands to help system administrators improve monitoring, performance testing, and security on their servers.

Linux · Operations · monitoring
0 likes · 12 min read
Architects' Tech Alliance
May 13, 2019 · Operations

Comprehensive Guide to System Monitoring: Objectives, Methods, Tools, Processes, and Best Practices

This article provides a thorough overview of system monitoring, covering its objectives, practical methods, core concepts, a comparison of popular open‑source and commercial tools, detailed monitoring processes (using Zabbix as an example), key metrics, alerting strategies, interview tips, and a summary of how organizations extend monitoring solutions.

Alerting · Zabbix · monitoring
0 likes · 17 min read
Qu Tech
May 7, 2019 · Frontend Development

How to Pinpoint JavaScript Errors in Production Using Source Maps

This article explains how to use SourceMap files to trace minified JavaScript errors back to their original source lines, covering overall design, code examples, error reporting workflow, CI integration, storage strategies, and future monitoring enhancements.

error tracking · frontend debugging · monitoring
0 likes · 7 min read
Efficient Ops
May 6, 2019 · Operations

How Live Streaming Ops Ensure Real-Time Reliability at Scale

Zhang Guanshi, the operations director at Huya Live, shares how his team designs a hybrid‑cloud architecture, implements a six‑pillar reliability framework, and leverages real‑time monitoring, AIOps, and rapid‑recovery tools to maintain stable, low‑latency live video streams for millions of viewers.

Operations · cloud architecture · live streaming
0 likes · 22 min read
Efficient Ops
May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

Operations · PHM · aiops
0 likes · 18 min read
dbaplus Community
Apr 24, 2019 · Operations

Choosing and Tuning Open‑Source Monitoring Stacks for Large‑Scale Operations

This article reviews common open‑source monitoring tools, shares the evolution of China Unicom's big‑data platform monitoring, and provides practical guidance on selecting collectors, databases, and visualization components, with detailed configurations for Prometheus, Alertmanager, and Grafana, plus automated recovery techniques.

Alertmanager · Grafana · InfluxDB
0 likes · 19 min read
21CTO
Apr 19, 2019 · Operations

From Junior to Senior Ops Engineer: Master the Skills to Level Up

This guide walks you through the career ladder from junior to senior operations engineer, covering essential Linux, networking, monitoring, container, automation, and security skills, while offering practical advice on job roles, learning paths, and professional growth.

Containerization · DevOps · Operations
0 likes · 13 min read
ITPUB
Apr 19, 2019 · Operations

How to Level Up from Junior to Senior DevOps Engineer: A Complete Roadmap

This guide outlines the career stages, skill sets, and practical tasks for DevOps engineers—from entry‑level troubleshooting to senior‑level architecture, automation, and performance optimization—providing concrete learning paths, tools, and personal development advice to help engineers advance their operations careers.

Containerization · DevOps · Linux
0 likes · 12 min read
Efficient Ops
Apr 18, 2019 · Operations

Choosing the Right Monitoring Stack: From Nagios to Prometheus & Grafana

This article reviews common open‑source monitoring combinations, compares their strengths and weaknesses, and shares practical guidance on selecting collectors, storage back‑ends, and visualization tools such as Telegraf, InfluxDB, Prometheus, Grafana, and Alertmanager for large‑scale data platform operations.

Grafana · InfluxDB · Nagios
0 likes · 12 min read
Mafengwo Technology
Apr 18, 2019 · Frontend Development

How to Build an Efficient Front‑End Monitoring Data Collection System

This article explains why front‑end monitoring is essential for user experience, outlines the key data types to collect, and provides practical AOP‑based implementations for route changes, JavaScript errors, performance metrics, resource failures, API calls, and reliable log reporting.

JavaScript · aop · data collection
0 likes · 14 min read
ITPUB
Apr 15, 2019 · Operations

Essential Practices to Prevent Operational Failures and Boost System Availability

This guide outlines six practical strategies—rollback testing, cautious destructive actions, clear command prompts, verified backups, careful handovers, and proactive monitoring—to help operations teams minimize outages and maintain high system availability.

Availability · Operations · backup verification
0 likes · 6 min read
ITFLY8 Architecture Home
Apr 14, 2019 · Operations

8 Essential DevOps Skills Every Engineer Should Master

Shane Boulden, a Red Hat DevOps certification expert, outlines the eight most valuable DevOps skills—from mastering Kubernetes and micro‑service scaling to automation, container optimization, multi‑runtime interaction, identity management, OS expertise, and effective learning strategies—providing a practical roadmap for 2019 and beyond.

Containers · Kubernetes · ci/cd
0 likes · 7 min read
Efficient Ops
Apr 1, 2019 · Operations

Beyond Linux: Mastering Modern Operations – From Deployment to Cloud

This article explores the full spectrum of modern operations, covering environment deployment, troubleshooting, backup, high availability, monitoring, security, automation, virtualization, and cloud services, while highlighting essential tools and best practices for both Linux and Windows environments.

Deployment · Operations · automation
0 likes · 8 min read
Efficient Ops
Mar 31, 2019 · Operations

How to Design Actionable Alerts and Effective Monitoring Strategies

This article explains why most alerts are poorly designed, defines actionable alerts, outlines monitoring objectives, discusses metric selection, and presents simple yet powerful algorithms for anomaly detection to improve system reliability and operational efficiency.

Metrics · Operations · alert design
0 likes · 21 min read
Architecture Digest
Mar 29, 2019 · Backend Development

Building Large-Scale Go Microservices at Toutiao: Architecture, Concurrency, Performance, and Monitoring

This article describes how Toutiao migrated its backend to Go, detailing the reasons for choosing Go, the design of a five‑tuple microservice architecture, concurrency models, timeout and performance optimizations, monitoring techniques, and engineering practices for large‑scale cloud‑native services.

cloud-native · monitoring · performance
0 likes · 16 min read
Ctrip Technology
Mar 28, 2019 · Operations

Comprehensive Guide to Enterprise WiFi Planning, Deployment, and Operations – Practices from Ctrip

This article presents a detailed, practice‑driven guide for enterprise WiFi, covering network planning, full‑coverage design, channel optimization, security, KPI‑based monitoring, probe‑based measurement, troubleshooting techniques, and real‑world case studies from Ctrip, highlighting how systematic operations can ensure high‑quality wireless service.

Enterprise · Operations · WiFi
0 likes · 16 min read
58 Tech
Mar 25, 2019 · Artificial Intelligence

Machine Learning‑Based Threshold‑Free Monitoring for Business Metrics

This article describes a monitoring system that leverages machine learning to perform threshold‑free, real‑time anomaly detection on macro business indicators such as network traffic and access volume, detailing its architecture, sample labeling, model training, and multi‑level alarm strategies.

AI · Operations · anomaly detection
0 likes · 7 min read
58 Tech
Mar 25, 2019 · Operations

Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

Operations, alarm convergence, alert merging
0 likes · 9 min read
Efficient Ops
Mar 23, 2019 · Operations

How to Build a Bank Ops SWAT Team for 5‑Minute Incident Recovery

This article explains how a bank can create a specialized Operations SWAT team, define its role, adopt seven essential “weapons” such as layered monitoring, intelligent alerts, communication protocols, automation, and disaster‑recovery tactics, and continuously train the team to meet strict five‑minute recovery targets.

SWAT team, automation, bank operations
0 likes · 21 min read
Tencent Music Tech Team
Mar 22, 2019 · Frontend Development

How to Build a Frontend User‑Behavior Tracing System for Debugging External Network Issues

This article analyzes the challenges of reproducing external‑network bugs, outlines common failure causes, and presents a complete design for a JavaScript SDK that records environment data, AJAX calls, errors, and user actions, stores them in IndexedDB, and visualizes the timeline for efficient troubleshooting.

IndexedDB, JavaScript, User Behavior Tracking
0 likes · 15 min read
转转QA
Mar 20, 2019 · Operations

Real-time Monitoring of H5 Pages Using Headless Browser and Puppeteer

This article describes a real‑time monitoring solution for large numbers of H5 pages that combines Python's Requests library for data crawling with a headless Chrome browser driven by Puppeteer to detect resource errors, API failures, and DOM anomalies, automatically alerting stakeholders.

Headless Browser, Node.js, Puppeteer
0 likes · 8 min read
Efficient Ops
Mar 18, 2019 · Operations

How to Build a Bank Ops SWAT Team for Rapid Incident Recovery

This article explains how a bank can create a specialized SWAT‑style operations team, define its roles, adopt seven essential "weapons" such as monitoring and intelligent alerts, and apply ten tactical processes—from communication to automation—to meet strict five‑minute recovery and regulatory requirements.

SWAT team, automation, bank operations
0 likes · 21 min read
Alibaba Cloud Developer
Mar 18, 2019 · Operations

Alibaba Hema’s 7‑Layer Funnel & 23 Tactics for Ultra‑Fast Delivery Stability

The article outlines Alibaba’s Hema delivery platform’s end‑to‑end stability strategy, detailing a 7‑layer funnel review process, three core norms (development, architecture, stability), and 23 practical tactics—including core‑noncore isolation, proactive monitoring, fault prevention, rapid recovery, and service‑level controls—to ensure reliable 30‑minute deliveries despite complex logistics and external disruptions.

Operations, architecture, delivery
0 likes · 13 min read
QQ Music Frontend Team
Mar 17, 2019 · Frontend Development

How to Build a Front‑End User Behavior Tracing System for Faster Issue Diagnosis

This article explains the design and implementation of a front‑end user behavior tracing system, covering common external network problems, the importance of collecting runtime environment, data, JS errors, and interaction logs, and detailing SDK data collection, reporting strategies, server processing, and query platform visualization.

IndexedDB, User Behavior Tracking, ajax
0 likes · 14 min read
iQIYI Technical Product Team
Mar 15, 2019 · Cloud Computing

Design and Architecture of QLive Large‑Scale Live Streaming Service

The QLive service powers iQIYI’s massive live‑streaming events—such as the Spring Festival Gala—by combining vertical and horizontal scaling, a three‑layer architecture with dual data‑center isolation, multi‑level caching, circuit‑breaker/degradation controls, and a Flume‑Kafka‑Hive monitoring pipeline to sustain over 400 k QPS and 99.9999 % availability.

Vertical Scaling, caching, fault tolerance
0 likes · 9 min read
Xianyu Technology
Mar 14, 2019 · Operations

Ensuring High Availability of Search Engine Services: A Case Study of Xianyu's Search System

The article explains how Xianyu guarantees high‑availability of its core Ha3‑based search engine through independent gateway deployment, multi‑datacenter disaster recovery, traffic isolation, comprehensive monitoring, pressure testing, gray releases, and automated/manual failover, enabling rapid issue detection, recovery, and continuous service stability.

System Architecture, disaster recovery, emergency response
0 likes · 19 min read
JD Tech
Mar 13, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

The article chronicles the design, evolution, and lessons learned of JD Digital Technology’s self‑built host monitoring platform “DiTing”, detailing its initial requirements, V1 architecture, subsequent V2 and V3 redesigns, encountered challenges, and future directions toward intelligent operations.

Big Data, Operations, System Architecture
0 likes · 12 min read
Efficient Ops
Mar 10, 2019 · Operations

Essential Linux and Java Debugging Tools for Rapid Issue Diagnosis

This guide compiles a practical toolbox of Linux commands and Java utilities—including tail, grep, awk, find, tsar, jstack, jmap, jstat, btrace, Greys, JProfiler, and RateLimiter—to help engineers quickly locate, analyze, and resolve performance and stability problems in production environments.

debugging, monitoring, tools
0 likes · 12 min read
dbaplus Community
Mar 10, 2019 · Operations

How Alibaba’s Table Store Auto‑Solves Hotspot Issues with Real‑Time Load Balancing

This article explains the architecture and mechanisms of Alibaba Cloud's Table Store load‑balancing system, detailing how it collects metrics, detects user‑access and machine hotspots, and automatically applies actions such as partition moves, splits, merges, and isolation to maintain high availability and performance.

Alibaba Cloud, NoSQL, hotspot mitigation
0 likes · 17 min read
Efficient Ops
Mar 6, 2019 · Databases

How NetEase Built an Automated DBA Platform with AIOps for Massive Scale

This article details NetEase's journey in designing and implementing a large‑scale database automation platform, covering its requirements, tool‑based operations, architecture, AIOps integration, and the practical lessons learned for managing thousands of database clusters efficiently.

Operations, Scalability, aiops
0 likes · 20 min read
HomeTech
Feb 28, 2019 · Artificial Intelligence

How to Systematically Test and Monitor AI Models in Large‑Scale Production

This article presents a comprehensive approach to testing, automating, and monitoring AI prediction models in a high‑traffic environment, covering background, challenges, evaluation metrics, data sampling methods, automated test scripts, and online monitoring to ensure model accuracy, performance, and reliability.

AI testing, Big Data, Metrics
0 likes · 13 min read
Liulishuo Tech Team
Feb 19, 2019 · Backend Development

My Journey as a New Backend Engineer: Project Setup, Testing Approaches, and Monitoring at Liulishuo

As a fresh graduate joining a new team and project at Liulishuo, I describe the supportive environment, codebase initialization, various HTTP testing strategies using Go and Gin, the adoption of OpenCensus, Prometheus, and Sentry for monitoring, and how iterative development accelerated my growth as a backend engineer.

Microservices, monitoring, testing
0 likes · 9 min read
MaGe Linux Operations
Feb 4, 2019 · Operations

60+ Essential Open‑Source DevOps Tools Every Engineer Should Know

This guide compiles over sixty top open‑source DevOps utilities—including version control, build automation, CI/CD platforms, container orchestration, configuration management, monitoring, and logging tools—to help developers and operations teams streamline development, deployment, and maintenance workflows.

Deployment, DevOps, automation
0 likes · 14 min read
ITPUB
Jan 31, 2019 · Operations

Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators—error, latency, traffic, and saturation—to guide effective observability.

Metrics, Operations, SRE
0 likes · 10 min read
Efficient Ops
Jan 30, 2019 · Operations

From Rookie to Ops Manager: Key Lessons on Linux, Infrastructure, and Career Growth

The author shares a journey from a college Linux basics class to becoming an operations manager, detailing early hands‑on tasks, challenges in chaotic server environments, the creation of monitoring systems, and three key career lessons about learning, deepening technical understanding, and evaluating workplace fit.

Linux, Operations, System Administration
0 likes · 6 min read
MaGe Linux Operations
Jan 24, 2019 · Operations

What Does It Take to Master Large‑Scale Website Operations?

This article explores the definition, responsibilities, required skills, career challenges, and key technologies of large‑scale website operations, offering a comprehensive guide for aspiring and current operations engineers to understand and excel in this demanding field.

Career Development, Cluster Management, automation
0 likes · 20 min read
Efficient Ops
Jan 23, 2019 · Operations

Designing an Operations Monitoring Platform: Tools & Best Practices

This article explores the essential concepts for selecting and building an operations monitoring platform, reviewing popular tools such as Cacti, Nagios, Zabbix, Ganglia, Centreon, Prometheus, and Grafana, and outlines a six‑layer architecture and practical strategies for scaling, alerting, and high‑availability in diverse environments.

Alerting, DevOps, Infrastructure
0 likes · 19 min read
Youzan Coder
Jan 16, 2019 · Big Data

How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.

Big Data, Flink, Real-time Streaming
0 likes · 19 min read
StarRing Big Data Open Lab
Jan 16, 2019 · Big Data

What’s New in Transwarp TDH 5.2.3? Key Performance and Stability Enhancements

TDH 5.2.3 introduces a series of stability and performance upgrades—including transaction and compaction optimizations, enhanced error handling, SQL length protection, improved Oracle‑compatible UDFs, default resource pool support, Guardian caching, TxSQL monitoring, and workflow and OLAP engine fixes—aimed at delivering a more reliable big‑data platform.

Big Data, database, monitoring
0 likes · 10 min read
Architects Research Society
Jan 9, 2019 · Operations

Enterprise Azure Governance Framework: Scaffolding, Policies, Security, Cost Management, and Automation

This guide explains how enterprises can build a comprehensive Azure governance scaffold—covering hierarchy, naming standards, policies, initiatives, identity and access management, security, monitoring, cost control, automation, and DevOps—to balance agility with control and risk mitigation across cloud workloads.

Azure, Cost Management, automation
0 likes · 29 min read
Didi Tech
Jan 7, 2019 · Operations

Data‑Driven Risk Quantification Platform for SRE at Didi

Didi’s data‑driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts and core services, turning operational best‑practice adoption into a competitive game that has raised scores, cut incident rates despite higher change volume, and paves the way for broader risk‑management across the organization.

Risk Quantification, SRE, data-driven operations
0 likes · 9 min read
Ctrip Technology
Jan 7, 2019 · Artificial Intelligence

AIOps Practices and Exploration at Ctrip: Challenges, Solutions, and Future Outlook

This article presents Ctrip's extensive AIOps exploration, detailing operational challenges caused by massive monitoring data, the evolution of DevOps practices, the design of intelligent anomaly detection and diagnosis systems, practical use cases, and a forward‑looking perspective on the future of AI‑driven operations.

Fourier Transform, Operations, aiops
0 likes · 20 min read
JD Tech
Jan 3, 2019 · Operations

Comprehensive Monitoring Strategies for E‑commerce Platforms: Black‑Box and White‑Box Approaches

This article systematically explains how to enhance e‑commerce platform availability by implementing both black‑box monitoring to detect functional failures and white‑box monitoring to pinpoint root causes, detailing core order‑process metrics, common issues, mitigation strategies, and illustrative Grafana dashboards.

Grafana, Operations, SRE
0 likes · 9 min read
Efficient Ops
Jan 2, 2019 · Operations

Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

This guide outlines critical operational practices for Linux server management, emphasizing thorough testing, cautious command execution, regular backups, strict access controls, comprehensive monitoring, performance tuning, and a disciplined mindset to avoid costly incidents and ensure system stability.

Operations, monitoring, security
0 likes · 12 min read
58 Tech
Dec 26, 2018 · Operations

Overview of the 58 Intelligent Monitoring System and Its Multi‑Dimensional Architecture

The 58 Intelligent Monitoring System provides a flexible, 24/7, multi‑dimensional monitoring solution that covers network, server, system, application and business layers, incorporates AI‑driven prediction, anomaly detection, alarm merging, root‑cause analysis and self‑healing, and offers both PC and WeChat interfaces for operators.

Alerting, Operations, System Architecture
0 likes · 16 min read
Ops Development Stories
Dec 21, 2018 · Operations

How to Install Zabbix Server, MySQL, Nginx, PHP, and Elasticsearch on CentOS

This comprehensive tutorial walks you through adding the Zabbix repository, installing Zabbix server and web interface, setting up MySQL 5.7, configuring Nginx and PHP from source, deploying the Zabbix agent, installing Elasticsearch with the head plugin, and finally storing Zabbix history data in Elasticsearch on a CentOS system.

CentOS, Elasticsearch, PHP
0 likes · 20 min read