Tagged articles
2179 articles
MaGe Linux Operations
Dec 19, 2018 · Operations

Top 30 Linux Monitoring Tools Every Sysadmin Should Know

This article compiles over 80 Linux monitoring tools—including system, network, log, and infrastructure utilities—providing detailed descriptions and usage tips to help administrators efficiently manage and troubleshoot their servers.

Linux · System Tools · monitoring
0 likes · 10 min read
MaGe Linux Operations
Dec 19, 2018 · Operations

20 Essential Python Libraries for Sysadmins and DevOps Engineers

This article lists twenty powerful Python libraries—from psutil and dnspython to Ansible and SaltStack—detailing their functions for system monitoring, automation, networking, and configuration management, and explains why mastering these tools can boost a sysadmin’s efficiency and technical depth.

Python · Sysadmin · monitoring
0 likes · 5 min read
360 Tech Engineering
Dec 18, 2018 · Cloud Native

Design and Implementation of 360 Container Platform Monitoring System

The article describes how 360 built a Kubernetes‑based container platform monitoring system using Prometheus, ELK, Grafana and custom components, detailing its architecture, monitoring dimensions, log collection, alerting, selection rationale, high‑availability design, and future evolution for scalable cloud‑native operations.

Kubernetes · Prometheus · container platform
0 likes · 12 min read
JD Tech
Dec 13, 2018 · Operations

Monitoring Puppet Configuration Management: Workflow, Metrics, and Troubleshooting

This article explains how to monitor the Puppet configuration management system, covering its request‑response‑execution‑report workflow, key monitoring metrics, black‑box and white‑box monitoring approaches, common issues, and practical solutions for ensuring large‑scale cluster consistency.

Configuration Management · Operations · Puppet
0 likes · 8 min read
Programmer DD
Dec 12, 2018 · Databases

Essential Redis Health Metrics Every Engineer Should Monitor

This guide explains how to monitor critical Redis health indicators—including ping response, client connections, blocked clients, memory usage, fragmentation, cache hit rate, OPS, persistence status, expired keys, and slow logs—to ensure optimal performance and prevent failures.

Cache · Persistence · memory
0 likes · 7 min read
Tongcheng Travel Technology Center
Nov 30, 2018 · Operations

Automated Data Center Management System: Architecture, Implementation Steps, and Operational Benefits

The article describes a comprehensive data‑center automation solution that standardizes hardware, implements a CMDB‑driven workflow, integrates procurement, visualization, fault diagnosis, and fine‑grained component management to improve efficiency, accuracy, and reliability of large‑scale operations.

CMDB · data center · monitoring
0 likes · 12 min read
Meituan Technology Team
Nov 22, 2018 · Databases

Meituan's TiDB Adoption: Architecture, Deployment, and Operational Practices

Facing MySQL limits, Meituan selected TiDB for its MySQL compatibility, strong consistency, and online scaling, deploying it on ten clusters and 200 nodes, building automated deployment, monitoring, and data‑sync tools, resolving performance issues, and planning broader adoption and joint development of future TiDB features.

Meituan · TiDB · database migration
0 likes · 20 min read
HomeTech
Nov 16, 2018 · Operations

Open-Sourcing Windows Agent for Open-Falcon Monitoring

The article announces the open-source release of the Windows Agent component under the Apache license, its integration into the Open-Falcon community, future feature enhancements, and gratitude to contributors, while providing links to the source code and related documentation.

Apache License · Operations · Windows Agent
0 likes · 5 min read
Efficient Ops
Nov 14, 2018 · Operations

How Zabbix Tackles FinTech Monitoring Challenges in the VUCA Era

This article explores how the VUCA-driven volatility of modern FinTech demands robust, multi‑layered monitoring solutions and explains why Zabbix, with its open‑source flexibility, automated discovery, and deep integration capabilities, is a compelling choice for achieving resilient, automated operations.

FinTech · Operations · VUCA
0 likes · 19 min read
21CTO
Nov 14, 2018 · Operations

Master Linux Performance: From 5W2H Methodology to Flame Graphs

This comprehensive guide explains how to diagnose Linux performance issues using a structured 5W2H approach, introduces essential monitoring tools for CPU, memory, disk I/O, and network, and demonstrates practical flame‑graph techniques—including on‑CPU, off‑CPU, memory, and differential analyses—to quickly locate and resolve bottlenecks.

flamegraph · monitoring · performance
0 likes · 20 min read
58 Tech
Nov 12, 2018 · Operations

Key Takeaways from the 58 Group Technical Salon on Monitoring Platforms

The article summarizes the 58 Group technical salon where experts from Momo and 58 shared practical experiences on monitoring platform architectures, coverage, alarm configurations, convergence techniques, custom dimensions, multi‑view dashboards, and future directions for intelligent and automated monitoring across the company.

Alerting · DevOps · Operations
0 likes · 9 min read
21CTO
Nov 9, 2018 · Backend Development

How Meituan Optimized High‑Traffic Backend Performance: Real‑World Strategies and Case Studies

This article shares Meituan's practical performance‑optimization techniques—including code analysis, database tuning, caching strategies, asynchronous processing, JVM adjustments, multithreading, and monitoring—illustrated with real case studies that reduced job runtimes from over 40 minutes to under 15 minutes.

Database Tuning · asynchronous processing · caching
0 likes · 25 min read
JD Retail Technology
Nov 9, 2018 · Operations

JD Finance Technical Operations and System Optimization for the 11.11 Promotion

The JD Finance technical teams—including Wealth R&D, Consumer Finance, Payment, Middle‑Platform, and Crowdfunding—conducted comprehensive system reviews, performance stress tests, capacity expansions, monitoring enhancements, and emergency downgrade plans to ensure stable, high‑throughput service during the 11.11 shopping festival.

11.11 promotion · Performance Testing · System optimization
0 likes · 8 min read
High Availability Architecture
Nov 9, 2018 · Backend Development

Scaling Coinbase’s Platform for Spikes in Customer Demand: Lessons, Monitoring, and Traffic Replay

Since 2017, Coinbase has faced rapid cryptocurrency‑driven traffic growth, prompting a series of backend engineering improvements—including database upgrades, monitoring enhancements, relationship refactoring, caching, and a custom traffic capture‑replay system—to ensure reliability and scalability during demand spikes.

Backend · MongoDB · caching
0 likes · 9 min read
JD Tech
Oct 29, 2018 · Operations

SGM Service Governance Monitoring Platform: Design, Features, and Use Cases

The article introduces SGM, a comprehensive service governance and monitoring solution that addresses scaling, dependency complexity, and operational challenges by providing automated topology, real‑time tracing, capacity planning, root‑cause analysis, and extensive monitoring features such as performance metrics, JVM stats, call‑chain visualization, business dashboards, and intelligent alerting.

Alerting · Operations · call chain
0 likes · 13 min read
Architects' Tech Alliance
Oct 27, 2018 · Cloud Native

Design and Architecture of Ping An Cloud Container Service Platform

The article outlines Ping An Cloud’s container service platform, describing its positioning, multi‑tenant design, architecture, key components such as CaaS portal, Docker Server, Rancher orchestration, networking, storage, logging, monitoring, and discusses the technologies and implementation choices behind each layer.

Container · Docker · cloud-native
0 likes · 15 min read
58 Tech
Oct 24, 2018 · Backend Development

Overview of the SCF RPC Framework: Architecture, Call Modes, Serialization, Service Registration, and Monitoring

This article introduces the SCF RPC framework developed by 58, covering its overall architecture, synchronous and callback call modes, timeout handling, custom serialization techniques, service registration and discovery using etcd, as well as data collection, storage, and monitoring mechanisms for large‑scale distributed services.

Distributed Systems · RPC · SCF
0 likes · 16 min read
dbaplus Community
Oct 21, 2018 · Artificial Intelligence

How Weibo’s Hubble Platform Uses AI for Real‑Time Monitoring and Trend Forecasting

The article details Weibo Advertising's Hubble monitoring system, describing its three‑layer architecture, metric taxonomy, AI‑driven trend prediction with LSTM models, dynamic alert thresholds, and performance testing using GoReplay, illustrating how large‑scale data and machine learning enable proactive operations.

AI · LSTM · Operations
0 likes · 22 min read
21CTO
Oct 19, 2018 · Big Data

How Meituan Scales Real‑Time Computing with Flink: Architecture, Challenges & Solutions

This article summarizes Meituan’s real‑time computing platform, detailing its layered architecture built on Kafka, Flink on YARN, state management, resource isolation, fault tolerance, monitoring, and the Petra metric aggregation system, while highlighting the challenges faced and the solutions implemented to achieve high‑throughput, low‑latency stream processing at massive scale.

Big Data · Flink · Real-time Streaming
0 likes · 18 min read
MaGe Linux Operations
Oct 18, 2018 · Fundamentals

Master Python File I/O and System Scripting with 10 Practical Exercises

This guide walks you through Python's built‑in open() function, file modes, reading and writing techniques, directory traversal with os.walk, and a series of practical scripts including a triangle printer, number‑guessing game, log analysis, IP counting, prime number generator, command‑line argument handling, web page fetching, process memory aggregation, port monitoring, and SNMP‑based CPU and network traffic monitoring.

Networking · Python · file-io
0 likes · 17 min read
Efficient Ops
Oct 16, 2018 · Operations

How Tencent Built an AI‑Powered Network Fault Detection System in Minutes

In this talk, Tencent’s infrastructure lead explains how their team created an AI‑driven, three‑minute fault detection and recovery pipeline—combining high‑precision Meshping monitoring, multi‑KPI analytics, and automated Moveout isolation—to dramatically shorten network outage resolution from hours to minutes.

aiops · automation · monitoring
0 likes · 18 min read
Efficient Ops
Oct 15, 2018 · Operations

How Automated Operations Transform Enterprise IT: Trends, Tools, and Best Practices

This article examines the current state and future trends of enterprise operations, outlines common challenges and requirements, explains the importance of standardizing processes and management policies, compares leading open‑source automation tools, and provides a practical SaltStack deployment guide for building an automated operations platform.

IT Operations · ITIL · automation
0 likes · 25 min read
MaGe Linux Operations
Oct 15, 2018 · Operations

Essential Bash Scripting Practices for Linux Operations and Monitoring

This guide presents practical Bash scripting techniques for Linux operations, covering script conventions, random string/number generation, colored output functions, batch user creation, package and service checks, host liveness testing, CPU/memory/disk monitoring, disk‑usage surveys across hosts, and website availability verification.

Bash · monitoring · shell-scripting
0 likes · 5 min read
Meitu Technology
Oct 10, 2018 · Backend Development

How Meitu Scaled Twemproxy with Multi‑Process Architecture and Live Reload

This article details Meitu's engineering of a Redis/Memcached proxy platform, describing why twemproxy was chosen, the limitations of its upstream version, the multi‑process redesign with live configuration reload, added latency metrics, reuse‑port handling, Redis master‑slave support, performance testing, and remaining challenges.

Memcached · Proxy · Twemproxy
0 likes · 12 min read
Efficient Ops
Oct 9, 2018 · Operations

How Tencent Scales Automated Operations for Massive Services

Tencent’s architecture platform team explains how they monitor, automate, and secure billions of daily operations across storage, CDN, and live services, using multi‑dimensional metrics, real‑time and instant computation, AI‑driven anomaly detection, and a custom control platform for safe changes.

Operations · aiops · automation
0 likes · 23 min read
Tencent Cloud Developer
Oct 9, 2018 · Cloud Native

A Comprehensive List of 50+ Useful Docker Tools

This guide catalogs over fifty essential Docker‑related tools—including orchestration platforms like Kubernetes and Swarm, CI/CD services such as Jenkins and GitLab, monitoring solutions like Prometheus, logging utilities, security scanners, storage plugins, and networking options—helping developers, DevOps, SREs, and architects select the right solution for each stage of container development.

DevOps · Docker · Orchestration
0 likes · 27 min read
Architects' Tech Alliance
Oct 8, 2018 · Backend Development

Publishing, Registering, Discovering, Monitoring, Tracing and Governing RPC Services in Microservice Architecture

This article explains how to describe, publish, register, discover, invoke, monitor, trace, and govern RPC services in a microservice architecture, covering RESTful API, XML configuration, IDL files, registry principles, Zookeeper deployment, connection methods, server processing models, monitoring metrics, tracing concepts, and common governance techniques such as load balancing and fault tolerance.

RPC · Service Registration · monitoring
0 likes · 31 min read
360 Tech Engineering
Sep 29, 2018 · Operations

Design and Implementation of a Multi‑Task Scheduling System Based on Apache Mesos

This article describes how we identified underutilized CPU and memory resources in our company's servers, evaluated Kubernetes versus Apache Mesos, and built a non‑intrusive, Mesos‑based multi‑task scheduling system with dynamic resource reservation, monitoring, task isolation, and cluster‑wide observability, while addressing deployment challenges.

Cluster Management · Docker alternative · Mesos
0 likes · 11 min read
Tongcheng Travel Technology Center
Sep 26, 2018 · Databases

Design and Implementation of the Thor System for Containerized Management of TiDB

This article describes the challenges of scaling MySQL workloads, introduces TiDB’s distributed architecture, and details the Thor system’s container‑orchestrated design—including scheduling, cluster and database management, data synchronization with Hamal, and integrated monitoring and alerting—to achieve efficient, automated operation of large‑scale TiDB clusters.

TiDB · container orchestration · distributed database
0 likes · 9 min read
iQIYI Technical Product Team
Sep 21, 2018 · Databases

Case Study of iQIYI’s Adoption of TiDB for Scalable High‑Availability Database Services

iQIYI migrated its critical Edge Control, Video Transcoding, and User Login services from MySQL to TiDB, gaining automatic sharding, high‑availability multi‑datacenter replication, and stable query performance that eliminated storage bottlenecks, complex sharding logic and frequent downtime, while enabling future OLTP/OLAP integration.

Data Migration · Scalability · TiDB
0 likes · 10 min read
Efficient Ops
Sep 17, 2018 · Operations

How Alibaba Scales Monitoring: From CMDB to AI‑Driven Full‑Link Observability

Alibaba’s monitoring evolution—from fragmented early tools to the standardized Sunfire platform and now AI‑powered full‑link observability—addresses scaling challenges, introduces business‑centric metrics, automated traceability, and intelligent anomaly detection, illustrating how massive, multi‑tenant infrastructures achieve unified, proactive operations at scale.

Alibaba · Operations · aiops
0 likes · 19 min read
Liangxu Linux
Sep 16, 2018 · Operations

Essential Linux Performance Tools and How to Use Them

A practical guide covering common Linux performance commands—uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, and top—explaining their output, key columns to monitor, and how to interpret results for system troubleshooting.

System Administration · monitoring
0 likes · 16 min read
Mike Chen's Internet Architecture
Sep 13, 2018 · Operations

Common Open‑Source Monitoring Systems and Zabbix Monitoring Process

The article introduces common open‑source monitoring tools such as Zabbix and Nagios, explains why distributed systems need proactive health checks, compares features, and provides a detailed Zabbix monitoring workflow including data collection, storage, visualization, alerting, and specific metrics for servers, networks, JVM and MySQL.

Distributed Systems · Nagios · Operations
0 likes · 8 min read
21CTO
Aug 30, 2018 · Operations

Inside Google’s Production: How Requests Travel Through Its Massive Infrastructure

Google’s production environment spans a global edge network, massive data centers, sophisticated job scheduling with Borg, distributed storage systems like Bigtable and Spanner, and comprehensive monitoring, illustrating how user requests traverse multiple layers—from ISP to edge, GFE, load balancers, and finally to services.

Deployment · Google · Infrastructure
0 likes · 9 min read
Alibaba Cloud Developer
Aug 28, 2018 · Operations

How Alibaba Achieves Full‑Link Business Monitoring: A Practical Guide

Alibaba’s infrastructure team introduces a full‑link business monitoring approach that visualizes end‑to‑end health from a business perspective, unifies metrics, automates data collection, and leverages intelligent baseline alerts, enabling rapid issue detection, precise root‑cause analysis, and fine‑grained dimension monitoring across services.

Alibaba · Operations · business metrics
0 likes · 11 min read
Big Data and Microservices
Aug 14, 2018 · Cloud Native

Building Enterprise-Ready Spring Cloud Microservices: Core Components & Best Practices

This article reviews the essential Spring Cloud microservice stack for enterprise use, covering core gateway, service discovery, configuration, security, monitoring, tracing, and alerting components, and explains why tools like Apollo, Consul, Kafka, ELK, Pinpoint, InfluxDB, and Prometheus are preferred in production environments.

Backend · Configuration · Microservices
0 likes · 10 min read
MaGe Linux Operations
Aug 14, 2018 · Backend Development

Mastering Microservice Architecture: 10 Essential Design Principles

This article outlines ten crucial design principles for building robust microservice systems, covering API gateways, stateless services, database scaling, caching, service splitting, orchestration, configuration management, logging, resilience patterns, and comprehensive monitoring to ensure high performance and reliability.

Backend Architecture · Microservices · api-gateway
0 likes · 12 min read
ITFLY8 Architecture Home
Aug 13, 2018 · Operations

Why IT Operations Must Embrace Automation: Benefits and Architecture

This article explains why IT operations must adopt automation, describing its definition, benefits such as zero‑delay response and fault prediction, essential operational components, the self‑built and open‑source infrastructure, and detailed automation frameworks for development, testing, release, monitoring, and service governance.

IT Operations · monitoring
0 likes · 11 min read
JD Tech
Aug 13, 2018 · Backend Development

Building Scalable High‑Concurrency Backend Systems: Guarding the Baseline, Raising Throughput, and Horizontal Expansion

This article shares practical guidance on designing, protecting, and continuously improving high‑concurrency backend services—covering baseline capacity, rate limiting, data‑structure optimization, stateless architecture, and horizontal scaling—to help engineers evolve small systems into robust, production‑grade platforms.

Backend · Microservices · Scalability
0 likes · 8 min read
Meituan Technology Team
Aug 9, 2018 · Frontend Development

Improving Front-End Service Availability in Meituan Financial Payments

The article outlines Meituan Finance’s front‑end availability challenges in its million‑order payment service and presents a disciplined, end‑to‑end approach—standardized release processes, simple fallback designs, automated testing, robust monitoring, and regular fault‑drill simulations—to ensure stable user experiences across diverse client environments.

Availability · Meituan · best practices
0 likes · 17 min read
DataFunTalk
Aug 3, 2018 · Databases

HBase in Practice: Performance Tuning, Monitoring, and Issue Diagnosis

This article presents a comprehensive guide to HBase performance optimization, covering I/O throttling, compaction and flush settings, multi‑WAL strategies, SSD usage, version‑specific pitfalls, key monitoring metrics, log analysis, and practical troubleshooting techniques for production clusters.

monitoring · performance
0 likes · 12 min read
Efficient Ops
Aug 1, 2018 · Operations

How Tencent Revolutionized Monitoring: From IDC Crises to AI‑Driven AIOps

This talk by Tencent’s monitoring R&D lead outlines a decade of evolution in large‑scale monitoring, covering real‑world incident cases, the three drivers behind architectural upgrades, the implementation of a three‑dimensional monitoring framework, and the application of AI‑powered AIOps for precise, rapid anomaly detection.

Big Data · Operations · aiops
0 likes · 18 min read
Qunar Tech Salon
Jul 31, 2018 · Operations

Best Practices for Container Operations: Logging, Monitoring, Security, and Immutability

This article outlines essential container operation best practices—including native logging, JSON log formatting, sidecar aggregators, stateless and immutable design, avoiding privileged containers, effective monitoring, health checks, non‑root execution, and careful image tagging—to help developers build secure, maintainable, and observable workloads on Kubernetes.

Containers · Kubernetes · best practices
0 likes · 17 min read
Architect's Tech Stack
Jul 26, 2018 · Operations

Deploying Pinpoint for Distributed Tracing of Dubbo Services

This guide explains how to install, configure, and use the open‑source Pinpoint APM tool to monitor Java‑based Dubbo applications, covering environment preparation, downloading binaries, modifying configuration files, deploying collector and web components, installing agents, and adding startup parameters for both Tomcat and Spring Boot deployments.

APM · Distributed Tracing · Dubbo
0 likes · 9 min read
360 Quality & Efficiency
Jul 23, 2018 · Operations

Introduction to X Monitoring System: Architecture, Modules, and Implementation Details

The article presents a detailed overview of the internally developed X Monitoring system, covering its architecture, configuration, reporting and monitoring modules, the use of Redis, Qbus, Elasticsearch and MySQL, as well as both server‑side (API/DB) and agent‑side (PHP) monitoring features, data collection commands, alert thresholds, and overall operational benefits.

Alerting · PHP · System Architecture
0 likes · 5 min read
Youzan Coder
Jul 20, 2018 · Big Data

How Youzan Built a Scalable Big Data Development Platform (DP)

This article details the design, architecture, and operational experience of Youzan's Data Platform (DP), covering its scheduling, data‑sync, service, and monitoring modules, the custom Airflow‑based task scheduler, current production metrics, supported task types, and future improvement plans.

Airflow · Big Data · Data Platform
0 likes · 12 min read
Qunar Tech Salon
Jul 10, 2018 · Artificial Intelligence

Design and Implementation of Qunar's Algorithm Service Platform for Machine Learning

The article describes the background, design, key components, and current status of Qunar's algorithm service platform, which provides a unified, scalable, and automated environment for feature engineering, model training, deployment, monitoring, and management of machine‑learning projects within the company's large‑accommodation division.

Model Management · feature engineering · machine learning
0 likes · 15 min read
Efficient Ops
Jul 8, 2018 · Operations

How to Seamlessly Take Over New Ops Responsibilities: A Practical Checklist

This guide outlines a step‑by‑step approach for taking over new operational responsibilities, covering communication with development leaders, business overview, asset inventory, basic and business‑specific monitoring, standardization, SOP creation, failure drills, cost and capacity planning, and effective cross‑team communication.

Operations · asset management · handovers
0 likes · 10 min read
MaGe Linux Operations
Jul 7, 2018 · Operations

How to Seamlessly Take Over a New Service: An Operations Playbook

This guide outlines a step‑by‑step operations playbook for assuming responsibility of a new business service, covering initial communication, asset inventory, monitoring setup, standardization, SOP creation, incident drills, ongoing optimization, and effective cross‑team communication to ensure stable, low‑cost, and high‑quality service delivery.

SOP · asset management · incident response
0 likes · 9 min read
Tencent Cloud Developer
Jul 5, 2018 · Cloud Native

Overview of Tencent Cloud Managed Kubernetes Service and Its Integration

Tencent Cloud’s fully managed Kubernetes service, launched in 2016, delivers one‑click, VPC‑isolated cluster deployment with automated lifecycle management, integrated monitoring, logging, storage (CBS/CFS), and CI/CD. It adds custom components for metrics and storage, flat VPC networking, CSI drivers, and flexible master deployment models that simplify scaling and upgrades and let developers focus on their applications.

Cloud Native · Container Service · Kubernetes
0 likes · 18 min read
JD Tech
Jul 5, 2018 · Backend Development

Design and Optimization of JD's High‑Availability Open Gateway System

This article describes how JD's open gateway handles billions of requests during major sales events by employing a multi‑layer architecture, Nginx + Lua unified access, NIO asynchronous processing, service isolation, dynamic routing, degradation, rate‑limiting, circuit‑breaking, fast‑fail mechanisms, and comprehensive monitoring to ensure high performance and reliability.

Circuit Breaking · asynchronous processing · gateway
0 likes · 16 min read
Ctrip Technology
Jul 3, 2018 · Big Data

Ctrip's Presto Engine: Challenges, Improvements, and Upgrade Roadmap

This article details Ctrip's experience with the Presto distributed SQL engine, outlining the initial performance and stability issues, the comprehensive enhancements made in security, resource control, compatibility, and monitoring, and the multi‑stage upgrade plan that guides its future evolution.

Big Data · Kerberos · Presto
0 likes · 11 min read
DataFunTalk
Jun 24, 2018 · Big Data

OPPO Big Data Platform Operations and R&D Practices: Architecture, Scaling, and Monitoring

This article summarizes OPPO's rapid growth of its big‑data platform, detailing the three‑layer architecture, the evolution from Flume‑Kafka to NiFi for data ingestion, the upgrade of the OFlow task scheduler, comprehensive monitoring of data, resources and task SLA, and the development of a self‑service analytics tool called InnerEye to ensure stability, efficiency, and security.

Airflow · Big Data · NiFi
0 likes · 10 min read
ITPUB
Jun 23, 2018 · Operations

How to Diagnose Server Failures Within the First 5 Minutes

This guide walks you through a systematic, step‑by‑step process for quickly identifying the root cause of a server outage, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O performance, filesystem mounts, and relevant logs.

Operations · monitoring · server troubleshooting
0 likes · 8 min read
JD Tech
Jun 22, 2018 · Operations

JDOS Operations Platform: Managing Millions of Containers at JD.com

The article describes how JD.com built and operates the JDOS Operations Platform to manage a multi‑million‑container Docker and Kubernetes fleet, detailing the challenges of massive scale, the architectural components such as the configuration center, operation center, inspection system, gossip‑based communication, and an intelligent alerting system that together enable efficient, automated, and reliable large‑scale container operations.

Container Management · Gossip Protocol · Kubernetes
0 likes · 12 min read
Tencent Cloud Developer
Jun 14, 2018 · Operations

Tencent Cloud Database Massive Operations: Team Building, Automated Operations Platform, and Intelligent Practices

Tencent Cloud Database’s massive‑operation strategy combines a dedicated architect team, a three‑layer automated platform for resource, task and health management, and AI‑driven intelligent services that customize workloads, automate tuning, and enable proactive scaling and self‑healing across hundreds of thousands of instances.

AI · Operations · automation
0 likes · 11 min read
JD Tech
Jun 14, 2018 · Operations

Design and Implementation of a Lightweight Service Monitoring and Traffic Management System

This article shares the design and implementation of a lightweight, robust, and low‑intrusion monitoring management system for microservice traffic, detailing data collection via client filters, Redis‑based structured storage, alerting, rate‑limiting, degradation, and authorization mechanisms, and discusses performance optimizations and future improvements.

Microservices · Operations · monitoring
0 likes · 11 min read
Architecture Digest
Jun 14, 2018 · Frontend Development

Key Concerns and Challenges for Front‑End Architecture

The article examines front‑end architecture by highlighting its primary focus on user experience, performance, reliability, tooling choices, and engineering difficulties such as monitoring and caching, arguing that front‑end architects must balance these factors just like their back‑end counterparts.

Tooling · User experience · architecture
0 likes · 6 min read
ITPUB
Jun 5, 2018 · Operations

How Meituan Achieved Near‑Zero Downtime for Its Account Service

This article details Meituan's practical approaches to boosting account service reliability, covering MTBF/MTTR metrics, business‑level monitoring, flexible availability with circuit‑breaker patterns, cross‑region active‑active deployment, data synchronization techniques, and the measurable performance gains achieved.

Active-Active · Distributed Systems · circuit breaker
0 likes · 13 min read
Programmer DD
Jun 3, 2018 · Backend Development

Designing a China‑Style Microservice Stack 2.0: Practical Component Guide

This article presents a practical, China‑focused microservice reference stack built on Spring Cloud, detailing core support components such as Zuul, Eureka, Apollo, and Spring Boot, as well as monitoring tools like Kafka, ELK, CAT, KairosDB, ZMon, and Hystrix, and explains when and how to apply each in production environments.

Apollo · Backend Architecture · Kafka
0 likes · 20 min read
Java Backend Technology
May 31, 2018 · Backend Development

Designing a China‑Style Microservices Stack: 11 Essential Components

This article presents a practical, China‑centric microservices reference stack built on Spring Cloud, detailing eleven core components—including Zuul, Eureka, Apollo, Kafka, ELK, and Hystrix—while comparing them with alternatives and offering guidance for architects to avoid common pitfalls and accelerate production‑grade deployments.

Backend Architecture · Microservices · Spring Cloud
0 likes · 17 min read
Meituan Technology Team
May 31, 2018 · Mobile Development

High Availability Architecture for Meituan Waimai Mobile Client

Meituan Waimai’s mobile client employs a high‑availability architecture built on loosely‑coupled teams, comprehensive monitoring, encrypted logging, multi‑layer disaster recovery, gray‑release strategies, and an incident‑response workflow, enabling rapid detection and resolution of failures while supporting 20 million daily orders.

disaster recovery · high availability · logging
0 likes · 16 min read
ITPUB
May 24, 2018 · Operations

Mastering Modern Operations: From Deployment to Automation and High Availability

This article outlines the essential facets of modern IT operations, covering environment deployment, troubleshooting and performance tuning, backup strategies, high‑availability clustering, monitoring and alerting, security and auditing, as well as automation, DevOps practices, virtualization, and cloud services, providing practical insights and tool recommendations.

Deployment · automation · high availability
0 likes · 9 min read
Efficient Ops
May 23, 2018 · Operations

How Alibaba Guarantees High‑Availability Ops for New Retail

This article explains Alibaba's GOC‑driven operation‑assurance solution for new retail, covering the sector's evolution, unique reliability challenges, a four‑pillar support framework—including high‑availability, mobile ops, emergency response, and change control—and real‑world best practices from Hema Fresh.

Alibaba · Operations · emergency response
0 likes · 19 min read
MaGe Linux Operations
May 16, 2018 · Operations

How to Build an Automated Fault‑Healing System for Enterprise Ops

This article explores the end‑to‑end design of an enterprise‑grade fault‑self‑healing solution, covering the basic workflow, abstraction of alert handling, CMDB‑based resource mapping, internal gateway integration, monitoring platform adapters like Zabbix and Open‑Falcon, convergence logic, complex alarm orchestration, and the overall technical architecture.

CMDB · aiops · fault automation
0 likes · 9 min read
dbaplus Community
May 2, 2018 · Big Data

Why Big Data Clusters Need a Robust Automated Monitoring & Alerting System

The article explains the unique challenges of monitoring and alerting in large‑scale big‑data environments, outlines the evolution and architecture of such systems, and provides detailed guidance on data collection, time‑series storage, rule definition, and alert actions for reliable operations.

Operations · architecture · monitoring
0 likes · 17 min read
System Architect Go
May 1, 2018 · Operations

How to Set Up Real-Time Logging with Slack

This guide explains step‑by‑step how to configure Slack as a real‑time log channel by creating a workspace, setting up a channel, generating an incoming webhook URL, and posting JSON log messages via HTTP so you can monitor application logs instantly.

Operations · Real-time logging · Slack
0 likes · 2 min read
Efficient Ops
Apr 23, 2018 · Operations

Unlocking Ops Automation: Real-World Architectures and Practical Insights

This article explores the essence of operations automation by presenting three real-world platform case studies, analyzing their architectures, tools, and implementation challenges, and then discusses universal automation principles, intelligent ops concepts, and career guidance, blending technical depth with personal motivation.

Deployment · Infrastructure · Operations Automation
0 likes · 17 min read
ITPUB
Apr 23, 2018 · Operations

Diagnosing Linux CPU Spikes with top, Thread Dumps, and jstack

This guide walks through real‑world Linux performance troubleshooting, showing how to use top to pinpoint high‑CPU processes, convert thread IDs, capture multiple jstack thread dumps, and interpret key top metrics such as load average, task states, and memory usage.

jstack · monitoring · thread-dump
0 likes · 7 min read
ITPUB
Apr 21, 2018 · Operations

Essential Ops Checklist: Avoid Disasters with Proven Practices

A seasoned operations engineer shares a comprehensive guide covering online operation standards, data handling, security hardening, daily monitoring, performance tuning, and the right mindset to prevent costly incidents and ensure stable, secure, and efficient production environments.

incident response · monitoring
0 likes · 14 min read
Meituan Technology Team
Apr 19, 2018 · Operations

How Meituan‑Dianping Built a 100% High‑Availability Core Transaction System

This article analyzes the rapid growth challenges of Meituan‑Dianping's core payment flow, explains key availability metrics such as MTBF and MTTR, and presents a comprehensive set of architectural, operational, and tooling strategies—including dependency decoupling, timeout tuning, circuit breaking, and full‑link stress testing—to achieve stable, fault‑tolerant transactions.

Microservices · Operations · circuit breaker
0 likes · 20 min read
Meituan Technology Team
Apr 19, 2018 · Big Data

Design and Implementation of Meituan Hotel Real-Time Operation Reach System

The article describes Meituan’s hotel real‑time reach platform, which replaces numerous hard‑coded Storm topologies with a unified Storm‑Aviator rule engine supporting time‑window and delayed triggers, offering configurable scenes, custom functions, monitoring, and alerting, and now processes nearly a billion daily events with improved conversion and scalability.

CEP · Meituan · monitoring
0 likes · 16 min read
Alibaba Cloud Native
Apr 11, 2018 · Cloud Native

How LXCFS Enables Accurate /proc Views in PouchContainer: A Deep Dive

Starting from version 0.3.0, PouchContainer integrates the open‑source LXCFS FUSE filesystem to isolate /proc views inside containers, allowing existing monitoring and deployment tools to read container‑specific metrics without modification, and the article details the use cases, command‑line integration, and stability improvements.

Container · LXCFS · Linux
0 likes · 10 min read
Efficient Ops
Apr 8, 2018 · Operations

Why ELK Is the Ultimate Solution for Log Management and Monitoring

This article introduces the ELK stack—Elasticsearch, Logstash, and Kibana—explaining its core components, architecture, comparison with databases and grep, typical use cases across security, networking, and application monitoring, deployment considerations, challenges, SaaS prospects, and recommended learning resources.

ELK · Elasticsearch · Log Management
0 likes · 10 min read
Youzan Coder
Apr 8, 2018 · Fundamentals

Testing Asynchronous Systems: Strategies and Best Practices

Testing asynchronous systems requires specialized strategies—monitoring callbacks with synchronization primitives and reliable polling with timeouts, delays, and frequencies—to handle nondeterministic execution, avoid flaky assertions, and improve testability by decoupling business logic from periodic scheduling, as demonstrated by real‑world polling implementations for Elasticsearch and MySQL/Redis jobs.

Java Testing · Polling · asynchronous testing
0 likes · 6 min read
Suning Technology
Apr 3, 2018 · Product Management

Revamping Monitoring Product UX: Redesign, Process Boost, Smart Alerts, Data Viz

This article examines how the user experience of Suning's monitoring product was elevated through four key interventions: page reconstruction of call‑chain views, backend performance workflow optimization, intelligent alerting with real‑time and offline analytics, and a data‑visualization decision platform. Together they illustrate the shift from basic usability to emotionally engaging, self‑actualizing interactions.

UX design · monitoring · process optimization
0 likes · 8 min read
转转QA
Apr 3, 2018 · Backend Development

Overview of the Commercial Testing Platform and Its Future Roadmap

The article introduces a commercial testing platform used by an advertising team, detailing its architecture, core components, monitoring and scheduling mechanisms, current advantages and shortcomings, and outlines planned enhancements to improve data construction, result recording, and anti‑fraud coverage.

Advertising · Backend · automation
0 likes · 8 min read
Efficient Ops
Apr 2, 2018 · Operations

How Bilibili Revamped Its Monitoring Architecture: From Zabbix to Dapper

An in‑depth look at Bilibili’s multi‑layer monitoring overhaul, detailing the shift from a monolithic Zabbix setup to micro‑service‑based ELK, Dapper, Misaka, Traceon and Lancer systems, and how layered observability improves fault detection across business, application, and infrastructure levels.

Distributed Tracing · Microservices · Operations
0 likes · 10 min read
MaGe Linux Operations
Mar 31, 2018 · Operations

Essential Linux Ops Tools: Monitoring, Performance, and Security Utilities

This article introduces a curated set of practical Linux operations tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, Tmux, NMON, MultiTail, NMap, and Httperf—detailing their purpose, installation steps, key command‑line options, and usage examples to help system administrators monitor bandwidth, disk I/O, processes, logs, and security on Linux servers.

Linux · Operations · monitoring
0 likes · 11 min read
JD Tech
Mar 30, 2018 · Backend Development

Effective Logging Practices and Standards for Java Backend Systems

This article explains why proper logging is crucial for Java backend maintenance, defines useful log levels, outlines team rules and best‑practice implementations—including traceId usage, log file organization, and real‑time monitoring—to enable fast issue diagnosis and improve overall engineering quality.

Backend · Operations · java
0 likes · 10 min read
Efficient Ops
Mar 15, 2018 · Operations

Mastering Large-Scale Command Execution: From Basics to Baidu’s Cluster Control System

This article explores the fundamentals of command execution, examines the challenges of scaling command delivery across hundreds of thousands of servers, and details Baidu’s Cluster Control System architecture that enables efficient, flexible, and extensible distributed command management for operations teams.

Command Execution · Deployment · Distributed Systems
0 likes · 10 min read