Tagged articles

Monitoring

2256 articles · Page 22 of 23
Efficient Ops
Apr 16, 2017 · Operations

How China Life Built a Self‑Developed Automated Ops Platform from Scratch

China Life’s Shanghai Data Center team transformed chaotic, multi‑system operations into a unified, automated platform by standardizing hardware, processes, and tools, leveraging OpenStack, Docker, Zabbix, and custom scripts, ultimately achieving efficient monitoring, change management, and a mobile‑enabled DevOps workflow.

Automation · Cloud · Monitoring
17 min read

dbaplus Community
Apr 13, 2017 · Backend Development

Scalable Small .NET E‑Commerce Architecture: Monitoring, DB Master‑Slave & Capacity Planning

This guide walks through the evolution of a small .NET‑based e‑commerce system, covering its initial LAMP‑style setup, detailed backend architecture, logging and monitoring solutions, master‑slave database design, shared‑storage image server, mobile M‑site construction, capacity estimation methods, and caching strategies.

.NET · Monitoring · architecture
22 min read

Efficient Ops
Apr 12, 2017 · Operations

Mastering Enterprise Monitoring: From Basics to Advanced Toolchains

This comprehensive guide explains why monitoring is vital for operations, outlines clear objectives and methods, compares popular open‑source and commercial tools, details a Zabbix‑based workflow, and covers hardware, system, application, network, security, API, performance, and business metrics with practical alerting strategies.

Alerting · Monitoring · Operations
21 min read

Tongcheng Travel Technology Center
Apr 10, 2017 · Operations

Sentinel Monitoring System: Real‑Time Business Log Monitoring and Incident Detection for an Airline Ticket Platform

The Sentinel system was built to provide real‑time, zero‑modification monitoring of airline ticket business services by consuming Tianwang logs through a Storm cluster, offering flexible rule configuration, addressing performance pitfalls, and planning future enhancements such as custom monitoring scripts and visual dashboards.

Log Processing · Monitoring · Real-time
6 min read

Efficient Ops
Apr 9, 2017 · Cloud Native

How Ctrip Achieved Seconds‑Level Scaling with a Container Cloud

Ctrip built a private container cloud to handle massive seasonal traffic spikes, enabling rapid, automated scale‑out and scale‑in of resources, improving deployment speed, resource utilization, and operational intelligence across more than 20 business units.

Ctrip · Monitoring · cloud-native
16 min read

Alibaba Cloud Developer
Apr 6, 2017 · Backend Development

How Alibaba Scaled GitLab to Support Millions of Users with Sharding and High‑Availability

This article details Alibaba Group's journey of transforming its GitLab deployment from a single‑node setup to a distributed, sharded architecture that handles tens of millions of daily requests, achieves near‑perfect reliability, and incorporates performance, monitoring, and disaster‑recovery innovations.

GitLab · High Availability · Monitoring
15 min read

360 Zhihui Cloud Developer
Apr 6, 2017 · Operations

How SRE’s Dialectical Thinking Redefines Modern Operations

An insightful reflection on Google’s SRE philosophy shows how dialectical thinking—questioning absolute stability, embracing limited toil, prioritizing simple monitoring, recognizing automation’s hidden risks, and practicing real‑world failure drills—can reshape operations, encouraging smarter, more resilient system design.

Automation · Monitoring · Reliability
7 min read

Efficient Ops
Mar 30, 2017 · Backend Development

Designing a Scalable, Configurable Distributed Web Crawler

This article outlines the motivation, requirements, modular decomposition, and architecture of a distributed web crawling platform that emphasizes reusability, lightweight modules, real‑time monitoring, and easy configuration for diverse data‑collection tasks.

Configuration · Monitoring · backend-architecture
10 min read

Baidu Waimai Technology Team
Mar 30, 2017 · Backend Development

Design and Implementation of a Unified Voucher Issuance Platform for Baidu Waimai

This article describes the design, architecture, and operational features of Baidu Waimai's unified voucher issuance platform, detailing its four‑layer backend structure, permission and strategy configurations, flow‑control mechanisms, service isolation, monitoring visualizations, and re‑entrancy safeguards to support large‑scale marketing distribution.

Flow Control · Monitoring · System Design
7 min read

Qunar Tech Salon
Mar 23, 2017 · Cloud Native

Ctrip Container Cloud: Architecture, Scaling, and Operational Practices

The article explains how Ctrip's rapid business growth created the need for elastic scaling and how container technology delivered second‑level provisioning. It then describes the design of the container cloud platform (deployment principles, network choices, orchestration evaluations, monitoring solutions, and an overview of CDOS), offering practical insights for large‑scale cloud‑native operations.

Monitoring · Orchestration · cloud-native
16 min read

Baidu Intelligent Testing
Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Alerting · Metrics · Monitoring
12 min read

ITPUB
Mar 20, 2017 · Operations

How to Diagnose Linux Performance in the First 60 Seconds

Learn the essential Linux command-line tools and step-by-step commands you need to run within the first minute of logging into a server to quickly assess process activity, resource usage, and potential bottlenecks, enabling effective performance troubleshooting in production environments.

Command-line · Monitoring · performance
12 min read

Architecture Digest
Mar 18, 2017 · Backend Development

Technical Strategies for Startup Engineering Teams: Simplicity, Cloud Servers, Databases, Caching, and DevOps

The article outlines practical engineering guidelines for internet startups, emphasizing simplicity, rapid development, resource efficiency, and the use of cloud servers, MySQL, caching, asynchronous processing, logging, monitoring, documentation, and integrated build‑deploy pipelines to build stable, low‑cost backend systems.

Backend Development · Caching · Monitoring
16 min read

Ctrip Technology
Mar 17, 2017 · Cloud Computing

Ctrip Container Cloud: Architecture, Elastic Scaling, and Monitoring Practices

This article details Ctrip's journey in building a private container cloud to support rapid business growth, covering elasticity challenges, container deployment principles, orchestration platform choices, network design, operational issues, custom executors, monitoring solutions, and the overarching CDOS system.

Cloud Computing · Container Orchestration · Docker
16 min read

High Availability Architecture
Mar 16, 2017 · Operations

Stormcrow: Dropbox’s Scalable Feature‑Flag Platform for Rapid Deployment and A/B Testing

The article describes Dropbox’s Stormcrow system, a configurable feature‑gate platform that enables fast, safe rollout of new functionality across web, desktop, and mobile clients, supports granular A/B testing, leverages custom data fields, and integrates deployment, monitoring, and audit tooling for large‑scale operations.

A/B testing · Monitoring · Scalable Systems
15 min read

Efficient Ops
Mar 1, 2017 · Operations

How Metrics-Driven Development Transforms Software Iteration and Ops

Metrics‑Driven Development (MDD) extends test‑driven principles by embedding real‑time monitoring into design, enabling rapid, precise, and granular software iterations, improving early problem detection, decision support, and aligning development with DevOps culture.

Metrics · Monitoring · Observability
13 min read

Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Monitoring · Operations · capacity planning
18 min read

Meituan Technology Team
Feb 24, 2017 · Operations

Improvements and Architecture of Mt-Falcon Monitoring System

Mt‑Falcon, Meituan’s re‑engineered successor to Zabbix, introduces a modular architecture—Agent, Transfer, HBS, Judge, Graph, Alarm, Portal—and extensive refactorings that boost memory efficiency, asynchronous data handling, multi‑condition alerts, and API exposure, enabling over one million QPS, 200 million metrics, and robust, scalable monitoring across the company.

Alerting · Monitoring · architecture
24 min read

Efficient Ops
Feb 21, 2017 · Mobile Development

How Alibaba Scales Mobile App Ops: Gray Release, Monitoring, and Rapid Fixes

This article details Alibaba's mobile app operational practices, covering the challenges of client-side maintenance, their high‑frequency release pipeline, gray‑release mechanisms, monitoring, trace systems, remote logging, and rapid issue resolution to ensure stability and performance at massive scale.

Monitoring · Operations · Trace
21 min read

转转QA
Feb 13, 2017 · Databases

Redis Connection Pool Saturation: A Debugging Tale

A developer recounts how a Redis connection pool overflow across dozens of clusters was traced to a single misbehaving service, diagnosed with netstat and ps commands, and resolved by adjusting configuration and stopping the offending process, illustrating practical troubleshooting of connection limits.

Connection Pool · Monitoring · Operations
4 min read

dbaplus Community
Feb 9, 2017 · Operations

Scalable Monitoring for Massive Physical & Container Clusters: JD's MDC Insights

This article shares JD’s large‑scale monitoring system (MDC) design, covering its three‑tier architecture, agent‑based data collection, performance optimizations for SNMP/IPMI, low‑overhead deployment, high‑availability strategies, and practical lessons on scaling monitoring across thousands of physical machines and containers.

JD · MDC · Monitoring
10 min read

dbaplus Community
Feb 6, 2017 · Operations

How JD’s CallGraph Transforms Distributed Tracing for Real‑Time Operations

CallGraph, JD.com’s in‑house distributed tracing platform, provides low‑intrusion, high‑performance monitoring for micro‑service ecosystems, enabling real‑time call‑graph analysis, TP metrics, flexible configuration, and future extensions such as deep‑learning‑driven insights.

Distributed Tracing · Log Processing · Monitoring
15 min read

Node Underground
Jan 24, 2017 · Operations

11 Essential Practices to Master Node.js Application Monitoring

Effective Node.js monitoring boosts competitiveness, user experience, and cost efficiency, and this guide outlines eleven key recommendations—from tracking downtime and response thresholds to linking performance with business metrics and leveraging third‑party APM tools—ensuring robust, noise‑free alerts and secure, scalable applications.

APM · Monitoring · Node.js
3 min read

Efficient Ops
Jan 22, 2017 · Operations

What 2016 Ops Teams Learned About Monitoring Tools and Alert Patterns

The 2016 Ops Alert Report reveals Zabbix’s dominance, preferred notification channels, monthly and daily alert trends, peak alert times, regional distribution, and quirky usage statistics, offering valuable insights for operations teams to optimize monitoring and incident response.

Monitoring · Operations · Zabbix
5 min read

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

InfluxDB · Monitoring · Operations
12 min read

Liulishuo Tech Team
Dec 31, 2016 · Cloud Native

Designing Scalable and Reliable Backend Services at English Fluently: Architecture, Service Discovery, Monitoring, and Autoscaling

This article shares the engineering team’s experience of building a high‑growth, reliable backend for English Fluently, covering inter‑service communication with gRPC, service discovery, Docker‑based deployment, health‑checking, monitoring, autoscaling, Kubernetes orchestration, and multi‑cell availability strategies.

Docker · Microservices · Monitoring
10 min read

dbaplus Community
Dec 26, 2016 · Databases

How to Build a Scalable, Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL management at scale, covering dedicated instance deployment, configuration consistency, multi‑instance creation, metadata collection, backup, monitoring, high‑availability with Zookeeper, and task orchestration using DBTask to achieve rapid, reliable database services.

Automation · DBTask · Database operations
12 min read

Alibaba Cloud Developer
Dec 26, 2016 · Operations

How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba · Monitoring · Operations
18 min read

Weidian Tech Team
Dec 15, 2016 · Databases

How to Build a Scalable Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL operations—including multi‑instance deployment, metadata collection, monitoring, backup, and high‑availability using Zookeeper—so that large‑scale database services can be provisioned, managed, and scaled with minimal human intervention.

Database operations · High Availability · Monitoring
11 min read

Alibaba Cloud Developer
Dec 12, 2016 · Cloud Native

How Alibaba Built a Decade-Long Microservice Architecture: Challenges and Lessons

This article chronicles Alibaba's ten‑year journey from monolithic Java EE deployments to a cloud‑native microservice ecosystem, detailing the technical challenges, the evolution of its EDAS RPC frameworks, comprehensive monitoring, capacity planning, and the strategies that enabled resilient large‑scale services during massive traffic events.

Cloud Native · Monitoring · Service Governance
11 min read

Ctrip Technology
Dec 2, 2016 · Backend Development

Challenges and Practices in Service‑Oriented Splitting of Qunar Payment System

The article details the technical challenges encountered during the service‑oriented decomposition of Qunar's payment platform, covering Dubbo and HTTP service conventions, database sharding and read/write separation, asynchronous processing, multi‑system management, and comprehensive monitoring and alerting solutions.

Monitoring · asynchronous processing · database sharding
10 min read

ITFLY8 Architecture Home
Nov 21, 2016 · Operations

Taobao’s Scaling Secrets: Stateless Sessions, Caching, Service Splitting & Sharding

This article explains how Taobao achieves horizontal scalability by adopting stateless session handling, efficient client‑side cookie storage, multi‑level caching, service splitting with HSF, database sharding via TDDL, asynchronous messaging, unstructured data storage, and comprehensive monitoring and configuration management.

Caching · Monitoring · Service Splitting
18 min read

Efficient Ops
Nov 20, 2016 · Operations

Why Most Log‑Analysis Features Are Overrated and What Really Matters

The article critiques popular but unnecessary log‑analysis features—such as sub‑second alerts, endless pagination, flashy maps, full SQL support, bulk downloads, and live tail—arguing that focusing on practical alert content, efficient querying, and proper architecture yields far more value for IT operations.

Alerting · Data Visualization · Monitoring
10 min read

ITFLY8 Architecture Home
Nov 20, 2016 · Backend Development

How Meizu Scales Real‑Time Push to 6 M Messages/min: Architecture, Pitfalls & Solutions

The article details Meizu’s real‑time push system that supports 25 million online users and 6 million messages per minute, describing its four‑layer architecture, power‑saving strategies, network‑instability fixes, massive‑connection handling, monitoring practices, and gray‑release deployment techniques.

High concurrency · Monitoring · distributed systems
12 min read

Efficient Ops
Nov 17, 2016 · Operations

How Qunar Built an Automated Network Device Operations Platform to Boost Efficiency

This article explains how Qunar tackled growing network device management workload, low‑efficiency manual processes, and operational risk by designing an integrated platform that automates common tasks, enforces permission‑based controls, records audits, and provides real‑time monitoring and scalable data collection.

Automation · Monitoring · Permission control
8 min read

Efficient Ops
Nov 14, 2016 · Operations

How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.

Automation · Monitoring · Operations Management
13 min read

Qunar Tech Salon
Nov 12, 2016 · Backend Development

Challenges and Solutions in Service‑Oriented Splitting of Qunar Payment System

The article examines the technical challenges encountered during the service‑oriented decomposition of Qunar's payment platform—including development efficiency, interface conventions, concurrency, security, monitoring, database sharding, read‑write separation, and asynchronous processing—and presents concrete solutions and best‑practice recommendations.

Backend Development · Monitoring · asynchronous processing
10 min read

ITPUB
Nov 11, 2016 · Databases

Essential Oracle SQL Queries for Performance Monitoring and Troubleshooting

This guide compiles a comprehensive set of Oracle SQL statements and explanations for detecting fragmented tables, index fragmentation, high clustering factor tables, session and process mapping, DML lock analysis, DDL lock inspection, active SQL tracking, resource usage statistics, and various performance‑related metrics, helping DBAs diagnose and tune database behavior efficiently.

Administration · Monitoring · Oracle
26 min read

Architecture Digest
Nov 10, 2016 · Operations

Interview with Lu Pengcheng on Mogu Street’s Monitoring System Architecture and Evolution

In this interview, Lu Pengcheng, a platform architect at Mogu Street, discusses the company’s large‑scale e‑commerce architecture, the evolution of its monitoring platform, design choices for high‑availability distributed systems, and future open‑source plans, providing practical insights for engineers and technical managers.

C++ · High Availability · Monitoring
9 min read

Nightwalker Tech
Nov 9, 2016 · Operations

Best Practices for Service Monitoring and Alerting in E‑commerce Systems

The discussion outlines essential service‑monitoring techniques for reliably detecting and responding to incidents in e‑commerce environments: health checks, JVM metrics, period‑over‑period (ring‑ratio) comparison of traffic and payments, client‑side exception tracking, third‑party CDN monitoring, alert thresholds, instrumentation via AOP or SDKs, and tooling such as Datadog, Zabbix, and the Elastic stack.

Alerting · Logging · Monitoring
10 min read

Efficient Ops
Oct 29, 2016 · Databases

Why Your System Slows Down: Uncover Hidden Database Bottlenecks

The article explains how unnoticed database issues often cause system slowness, outlines key diagnostic questions for operations teams, and presents a three‑step approach—discover, solve, prevent—to regularly health‑check and optimize databases for reliable performance.

Databases · Monitoring · Operations
8 min read

Meituan Technology Team
Oct 28, 2016 · Big Data

Design and Architecture of the CAT Real-Time Monitoring System

The CAT real‑time monitoring system, open‑sourced in 2014 for Java applications, combines a lightweight ThreadLocal‑based client SDK, Netty‑driven asynchronous transport, and a highly scalable backend that processes ~100 TB of logs daily across 70 machines, using custom binary serialization, in‑memory modeling, segmented storage with 48‑bit indexing, and hourly aggregation to provide near‑full‑volume fault detection, localization, and performance analysis.

Java · Monitoring · Real-time
18 min read

360 Zhihui Cloud Developer
Oct 19, 2016 · Operations

Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

Alerting · Automation · Monitoring
7 min read

Efficient Ops
Oct 17, 2016 · Operations

How Shanda Games Built a Scalable Automated Operations System

This article details Shanda Games' journey in designing and implementing a comprehensive automated operations platform—including installation, deployment, security, client and server updates, data analysis, backup, and monitoring—to efficiently manage hundreds of games across diverse hardware and operating systems.

Automation · Monitoring · Operations
22 min read

360 Zhihui Cloud Developer
Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Monitoring · Operations · incident response
5 min read

360 Zhihui Cloud Developer
Sep 18, 2016 · Artificial Intelligence

How Linear Regression Can Tame Your Nighttime Alert Fatigue

This article explores how historical monitoring alerts can be analyzed and predicted using linear regression, guiding operations engineers to preprocess data, build regression models, and forecast future alert trends to reduce manual alarm handling and improve system stability.

Monitoring · Operations · alert prediction
8 min read

Architecture Digest
Sep 7, 2016 · Backend Development

Design and Maintenance of High‑Peak E‑Commerce Systems for Traditional Enterprises

The article examines common pitfalls and best‑practice solutions for traditional enterprises building e‑commerce platforms that must handle traffic spikes, covering large‑scale query optimization, distributed architecture, database design, service degradation strategies, and comprehensive monitoring and operations.

Caching · Monitoring · e-commerce
13 min read

High Availability Architecture
Aug 30, 2016 · Operations

Evolution of Meizu Flyme Operations Architecture and High‑Availability Practices

The article details Meizu's Flyme operations platform evolution—from a single‑cabinet setup in 2011 to a multi‑IDC, 6000‑server infrastructure—highlighting challenges, architectural upgrades, monitoring, cost control, automation, and future high‑availability directions for large‑scale internet services.

Cost Control · High Availability · Monitoring
13 min read

Efficient Ops
Aug 22, 2016 · Cloud Computing

Mastering OpenStack Monitoring: Key Metrics and Best Practices

This article explains what OpenStack is, outlines its core modules, and details the most important monitoring metrics for Nova, Neutron, Keystone, hypervisors, tenants, and RabbitMQ, helping engineers build a robust, scalable OpenStack monitoring solution.

Cloud Computing · Metrics · Monitoring
11 min read

Efficient Ops
Aug 21, 2016 · Operations

How to Build a Standardized CI/CD Pipeline for Enterprise Delivery

This article explains how enterprises can overcome manual deployment challenges by standardizing their tech stack, defining mutable and immutable deployment modes, designing an XY‑axis model for system components, and implementing continuous integration, delivery, feedback, and monitoring using tools like Jenkins.

CI/CD · Continuous Delivery · Jenkins
21 min read

ITPUB
Aug 21, 2016 · Backend Development

How to Diagnose and Prevent Redis Data Loss in Production

This article examines common causes of Redis data loss, walks through a real‑world incident where 90,000 keys vanished, and provides concrete monitoring, configuration, and operational safeguards to detect and avoid such failures.

Data loss · Monitoring · Operations
11 min read

ITPUB
Aug 18, 2016 · Backend Development

Step-by-Step Guide to Building a Django‑Based Operations Platform

This article fills a gap in detailed tutorials by documenting, in chronological order, how the author built a Django‑powered operations platform and added monitoring, dashboards, log viewing, task submission, and various management features, each illustrated with screenshots and brief explanations.

Backend Development · Django · Monitoring
8 min read

Baidu Intelligent Testing
Aug 16, 2016 · Mobile Development

Building a Comprehensive Monitoring System for Mobile Apps: Problem Discovery, Localization, and Damage Control

This article explains how to design a complete mobile app monitoring framework that covers problem discovery through key quality metrics and user feedback, systematic log instrumentation, effective issue localization methods, and rapid damage‑control strategies such as cloud‑based feature toggles and hot‑fix mechanisms.

Logging · Monitoring · crash-analysis
12 min read

Ctrip Technology
Aug 12, 2016 · Big Data

Ctrip's Real-Time Data Platform: Architecture, Practices, and Lessons Learned

This article details Ctrip's journey building a unified real-time data platform—covering business motivations, architectural requirements, technology choices like Kafka and Storm, implementation of Avro schemas, monitoring, alerting, operational lessons, and future explorations such as Streaming CQL and JStorm.

Alerting · Big Data · Monitoring
0 likes · 15 min read
Baidu Intelligent Testing
Aug 11, 2016 · Mobile Development

Common Mobile App Quality Issues and Monitoring Challenges

The article outlines typical mobile app quality problems such as adaptation, user experience, and traffic consumption, discusses their characteristics and impact assessment, and emphasizes the need for a comprehensive monitoring system to quickly detect, locate, and mitigate issues in production.

Monitoring · app quality · mobile
0 likes · 8 min read
Architecture Digest
Aug 7, 2016 · Operations

Designing a Three‑Dimensional High‑Availability Architecture for Alibaba's Game Integration System

The article describes how Alibaba's game integration platform achieved business‑oriented high availability by abandoning traditional system‑centric designs and implementing a three‑dimensional architecture that combines clear HA goals, multi‑active deployment, client‑side retries, functional isolation, automated monitoring, and rapid fault recovery, ultimately meeting a 3‑minute issue‑location and 5‑minute business‑recovery target.

High Availability · Monitoring · Operations
0 likes · 21 min read
Architecture Digest
Jul 30, 2016 · Frontend Development

Evolution and Architecture of Taobao Home Page: From PHP to Node, Performance Optimization, Stability, and Agile Operations

This article details the evolution of Taobao's home page over a year and a half, covering its background, migration from PHP to Node, modular architecture, performance tuning, stability mechanisms, and agile operational practices that keep a billion‑scale front‑end service reliable and fast.

CDN · Monitoring · nodejs
0 likes · 18 min read
Architecture Digest
Jul 19, 2016 · Operations

Designing a Multi‑Dimensional High‑Availability Architecture for a Game Access System

The article presents a business‑oriented, three‑layer high‑availability architecture for a large‑scale game access platform, detailing measurable goals, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid fault detection, isolation, and recovery.

Monitoring · Operations · distributed-systems
0 likes · 20 min read
High Availability Architecture
Jul 15, 2016 · Backend Development

High‑Availability Architecture for Weibo Paid Reading and Tipping Services

The article describes the high‑availability, high‑concurrency backend architecture of Weibo's paid reading and tipping platform, covering layered design, database sharding, asynchronous processing, monitoring, idempotency, distributed transaction handling, and security measures for a large‑scale internet‑finance system.

Monitoring · backend · distributed-systems
0 likes · 8 min read

Designing a Business‑Oriented High Availability Architecture for a Game Access System

The article presents a business‑centric high‑availability solution for a large‑scale game access platform, detailing measurable goals, a three‑dimensional architecture that includes client‑side retry, HTTP‑DNS, functional separation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid problem detection, recovery, and minimal outage frequency.

Monitoring · business continuity · distributed systems
0 likes · 23 min read
DevOps
Jun 20, 2016 · Operations

A Comprehensive Overview of Popular DevOps Tools for IT Operations

This article provides a detailed overview of widely used DevOps tools—including monitoring solutions like Microsoft SCOM, Vistara, SolarWinds, Nimsoft, ServiceNow, automation platforms Chef and Puppet, container platform Docker, orchestration systems Apache Mesos and Kubernetes, as well as performance monitoring tools New Relic and Graphite/Grafana—highlighting their features, typical use cases, and important considerations.

Automation · IT Operations · Monitoring
0 likes · 10 min read
MaGe Linux Operations
Jun 15, 2016 · Operations

Build Python Scripts for Real-Time Linux Server Monitoring

This article explains how to create Python scripts that monitor Linux server CPU, load, memory, and network usage by reading data from the /proc virtual filesystem, providing step‑by‑step code examples and illustrating each script’s output with screenshots.

Linux · Monitoring · Python
0 likes · 11 min read
Java High-Performance Architecture
Jun 14, 2016 · Backend Development

How Hotjar Scaled to 500M Daily Requests: 8 Lessons for Rapid Backend Growth

This article chronicles Hotjar's evolution from a simple two‑server setup to a robust, eight‑server architecture handling about 500 million daily requests, sharing eight practical lessons on scaling, CDN usage, language choice, data storage, monitoring, and cost‑effective optimizations for fast‑growing web services.

CDN · Monitoring · Performance Optimization
0 likes · 7 min read
Liulishuo Tech Team
May 27, 2016 · Mobile Development

Evolution of the Android Architecture of the English Fluency App

This article details the step‑by‑step evolution of the English Fluency Android app’s architecture, covering its early broadcast‑based design, the adoption of a plugin‑based modular core, multi‑process integration, auxiliary systems such as asynchronous loading, event bus, monitoring, and support components for file storage, DNS protection, image loading, and downloading.

Android · Mobile Development · Monitoring
0 likes · 13 min read
MaGe Linux Operations
May 23, 2016 · Operations

Top 20 Linux Monitoring Tools Every Sysadmin Should Know

This guide surveys more than twenty essential Linux monitoring utilities—covering system, network, log, and infrastructure tools such as top, htop, ntopng, Nagios, and Zabbix—to help administrators efficiently diagnose performance issues and maintain reliable services.

Linux · Monitoring · performance
0 likes · 9 min read
Architecture Digest
May 22, 2016 · Big Data

Design and Architecture of Youzan Unified Log Platform

The article details the design, components, and operational challenges of Youzan's unified log platform, describing its multi‑layer architecture, ingestion methods using rsyslog/logstash and Flume‑NG, Kafka‑based log center, processing pipelines with Storm/Spark, and storage in HDFS and Elasticsearch.

Flume · Monitoring · distributed systems
0 likes · 10 min read
Efficient Ops
May 16, 2016 · Cloud Native

How JD Scaled to 100,000 Docker Containers: Lessons in Cloud‑Native Operations

This article details JD.com's journey from physical servers to a massive Docker‑based cloud‑native platform, covering challenges, architecture, elastic scheduling, monitoring, and resource‑driven operations that support tens of thousands of containers across multiple data centers.

Docker · Monitoring · Resource Management
0 likes · 26 min read
21CTO
May 14, 2016 · Backend Development

How We Scaled a Billion‑User System: From Monolith to Microservices

This article recounts how a rapidly growing online platform transformed a tightly coupled, fragile architecture into a scalable, high‑availability system by applying dynamic/static separation, read‑write splitting, caching, load‑balancing, intelligent monitoring, and finally migrating to a micro‑service architecture.

Cloud Native · Microservices · Monitoring
0 likes · 11 min read
MaGe Linux Operations
May 10, 2016 · Operations

10 Essential Practices to Prevent Operational Failures in Database Management

This article outlines ten practical guidelines for operations engineers—ranging from mandatory rollback testing and cautious handling of destructive commands to robust backup verification, vigilant monitoring, and disciplined handover procedures—to dramatically reduce system outages and improve overall reliability.

Automation · Monitoring · Operations
0 likes · 18 min read
Efficient Ops
May 7, 2016 · Operations

400+ Free DevOps Tools & Resources Every Sysadmin Should Know

This article compiles a curated list of over 400 free DevOps and system administration resources—including CI/CD services, monitoring tools, crash handling platforms, IaaS, PaaS, and DBaaS solutions—to help engineers streamline workflows and improve operational efficiency.

CI/CD · IaaS · Monitoring
0 likes · 7 min read
Baidu Intelligent Testing
Apr 28, 2016 · Operations

Testing and Evaluation Practices for Baidu Doctor Platform

This article details Baidu Doctor’s comprehensive testing and monitoring strategies, covering user experience data analysis, source data trust, online monitoring systems, log‑based automated checks, retrieval backend testing, evaluation metrics, bad‑case mining, and user search habit analysis to ensure high‑quality medical O2O services.

Monitoring · data analysis · medical platform
0 likes · 14 min read
Architecture Digest
Apr 21, 2016 · Backend Development

Evolution and Refactoring of Autohome Mobile Backend Architecture

The article chronicles Autohome's mobile backend transformation from a monolithic ALL‑IN‑ONE design to a modular, high‑availability microservice architecture, detailing the challenges of traffic surge, resource coupling, and rapid releases, and describing the adopted solutions such as service decomposition, stateless design, Java migration, RPC framework, asynchronous components, and comprehensive monitoring and tracing.

Microservices · Monitoring · mobile
0 likes · 11 min read
Big Data and Microservices
Apr 18, 2016 · Operations

Designing a Unified IT Operations Monitoring Indicator System for Banks

The article presents a comprehensive, business‑oriented IT operations monitoring framework for banks, detailing its lifecycle relevance, regulatory drivers, hierarchical AHP‑based design, indicator categories, weighting methods, SMART evaluation, and practical implementation steps to enhance risk control and service quality.

AHP · IT Operations · ITIL
0 likes · 12 min read
Big Data and Microservices
Apr 1, 2016 · Operations

How to Build a Business‑Transaction‑Centric IT Operations Monitoring System

This article outlines a comprehensive approach for designing an IT operations monitoring platform that focuses on real‑time business transaction metrics, automatic topology discovery, event‑transaction correlation, deep component diagnostics, and unified data processing to improve availability, performance, and fault‑resolution speed in large‑scale data centers.

Automation · Business Transaction · Fault diagnosis
0 likes · 15 min read
Efficient Ops
Mar 31, 2016 · Operations

Rethinking CMDB: Building Scalable, Automated Configuration Management for Modern Ops

This talk explores the challenges of building and maintaining a CMDB, proposes a goal‑driven, industry‑referenced modeling approach, and outlines practical steps such as tagging, relationship mapping, dynamic attributes, automation, and visualization to create a service‑oriented, scalable configuration management database.

CMDB · Monitoring · modeling
0 likes · 11 min read
21CTO
Mar 22, 2016 · Operations

Build a Scalable Unified Monitoring & Alert Platform with Ganglia & Centreon

This article explains how to design and implement a unified operations monitoring and alerting platform by combining Ganglia for data collection with Centreon for alerting, covering architecture layers, module functions, integration steps, and practical Q&A for large‑scale deployments.

Alerting · Automation · Centreon
0 likes · 20 min read
Big Data and Microservices
Mar 19, 2016 · Operations

Essential Linux Commands for Comprehensive System Inspection

This guide compiles essential Linux commands for inspecting system details, resources, disks, networks, processes, users, services, and installed programs, providing concise descriptions that help administrators quickly gather kernel, hardware, memory, storage, and runtime information.

Linux · Monitoring · command-line
0 likes · 6 min read