Tagged articles
2180 articles
Efficient Ops
Mar 15, 2018 · Operations

Mastering Large-Scale Command Execution: From Basics to Baidu’s Cluster Control System

This article explores the fundamentals of command execution, examines the challenges of scaling command delivery across hundreds of thousands of servers, and details Baidu’s Cluster Control System architecture that enables efficient, flexible, and extensible distributed command management for operations teams.

Command Execution · Deployment · Distributed Systems
0 likes · 10 min read
ITPUB
Mar 14, 2018 · Operations

Top 7 Linux Ops Interview Questions and How to Answer Them

This article shares a Linux operations engineer’s interview experience, presenting seven common interview questions—self‑introduction, gray‑release implementation, MongoDB deployment, Jenkins‑based release and rollback, Tomcat work modes, monitoring solutions, and data backup—along with concise, practical answers and preparation tips.

Jenkins · MongoDB · NGINX
0 likes · 13 min read
Dada Group Technology
Mar 8, 2018 · Operations

Effective Logging Practices and Standards for Java Backend Systems

This article explains why proper logging is crucial for rapid issue diagnosis, defines useful log levels, outlines team-wide logging rules, describes log format standardization, introduces traceId for request tracing, and presents monitoring and alerting strategies to improve overall system reliability.

best practices · java · log levels
0 likes · 11 min read
ITFLY8 Architecture Home
Mar 7, 2018 · Operations

8 Essential Metrics to Monitor During a Software Deployment

This article outlines eight critical aspects—error rates, web traffic, performance scores, server load, database queries, dependency health, internal communication, and regression testing—that developers should continuously monitor to ensure smooth and reliable software deployments.

Deployment · monitoring · software
0 likes · 8 min read
Efficient Ops
Mar 7, 2018 · Operations

Mastering Log Collection: From Daily Ops to the ELK Stack

This article explores the everyday challenges of operations teams handling system, access, runtime, error, and business logs, outlines the pain points of log collection and standardization, and provides a comprehensive guide to implementing the ELK (Elastic) stack—including Elasticsearch, Logstash, and Kibana—for effective monitoring and analysis.

ELK · Kibana · Logstash
0 likes · 13 min read
Efficient Ops
Feb 28, 2018 · Operations

How Meituan Scaled Delivery Ops with Automated Monitoring and Full‑Link Testing

This article explains how Meituan's food delivery platform built an automated operations system—covering complex workflows, traffic spikes, rapid growth, pain‑point analysis, core goals, system architecture, and automation techniques such as anomaly detection, service‑protection triggers, and full‑link testing—to improve reliability and reduce manual effort.

Meituan · Operations · automation
0 likes · 17 min read
JD Tech
Feb 28, 2018 · Operations

CallGraph: JD.com's Distributed Tracing and Service Governance Platform

CallGraph is JD.com's internally developed distributed tracing and service governance platform that addresses the challenges of monitoring complex microservice architectures by providing low‑intrusion, low‑latency tracing, real‑time analytics, configurable sampling, and integration with JMQ, Storm, Spark, HBase, and JimDB for both operational insight and performance optimization.

Big Data · Distributed Tracing · Microservices
0 likes · 12 min read
MaGe Linux Operations
Feb 10, 2018 · Operations

60+ Must-Have Open‑Source DevOps Tools for Seamless Automation

This article compiles over 60 top open‑source DevOps tools—from version control systems and build automation to CI/CD platforms, container runtimes, configuration managers, monitoring solutions, and log collectors—providing concise descriptions to help engineers streamline automation, deployment, and operations workflows.

Container · automation · monitoring
0 likes · 14 min read
Efficient Ops
Feb 5, 2018 · Operations

How WeChat Scales Massive Real-Time Monitoring: Design & Practices

This article details the architecture and practical techniques behind WeChat's large‑scale monitoring system, covering lightweight data collection, classification of real‑time, non‑real‑time and user‑specific metrics, anomaly detection algorithms, automated configuration, and high‑performance storage solutions for billions of events per minute.

Operations · Real-Time · data collection
0 likes · 14 min read
MaGe Linux Operations
Feb 4, 2018 · Operations

Essential Operations Tools Every DevOps Engineer Should Master

This article outlines the key categories of operations tools—including process management, release automation, configuration handling, resource isolation, and comprehensive monitoring and alerting solutions—providing a practical guide for building reliable, automated infrastructure workflows.

Infrastructure · Operations · automation
0 likes · 8 min read
ITPUB
Feb 2, 2018 · Operations

Essential Unix/Linux Command‑Line Tools Every Sysadmin Should Know

This article compiles a curated list of 28 useful Unix/Linux command‑line utilities—including performance monitors, multiplexers, editors, network tools, backup solutions, and fun programs—providing brief descriptions, official website links, and usage examples to help system administrators discover and adopt valuable tools for daily operations.

Sysadmin · command-line · monitoring
0 likes · 13 min read
Efficient Ops
Jan 30, 2018 · Operations

Scaling Event Operations for Ten‑Million Online Securities Users

This article details how Ping An Securities built a technology‑first event‑handling team, created new reporting channels, developed a data‑construction platform, and implemented proactive monitoring to efficiently support over ten million internet securities users.

ITSM · Service Center · data construction
0 likes · 21 min read
dbaplus Community
Jan 29, 2018 · Operations

How Data‑Driven Monitoring Unlocks Real Value for Ops Teams

This article explains why quantifiable data is essential for evaluating the impact of operational changes, outlines common data‑collection stacks, defines core business and user‑centric metrics, and demonstrates practical monitoring techniques such as PCU analysis, simulated user flows, and intelligent scaling to turn ops work into measurable business value.

DevOps · Operations · business metrics
0 likes · 15 min read
Meitu Technology
Jan 24, 2018 · Backend Development

Inside Meitu’s Backend: How 1.5 B Users Power Meipai’s High‑Concurrency Systems

The 8th Meitu Internet Technology Salon in Shenzhen showcased four expert talks covering Meipai’s high‑concurrency prop‑trading system, a comprehensive monitoring platform, the evolution of Meitu’s IM architecture, and live‑streaming optimization, revealing how the company supports over 1.5 billion users with robust backend engineering.

Architecture · Backend · Meitu
0 likes · 7 min read
Meitu Technology
Jan 24, 2018 · Operations

Meitu Monitoring Practice: Building a Holistic Monitoring System

Meitu’s Meipai service, serving over 150 million monthly users with a hybrid private‑public cloud architecture, spent three years building a comprehensive, three‑dimensional monitoring platform that unifies client‑to‑server metrics, alerts and reporting to ensure resilient, scalable operations and rapid business growth.

Cloud Services · Meitu · Operations
0 likes · 2 min read
Efficient Ops
Jan 16, 2018 · Operations

How Tencent Secures Game Operations: Real Cases, Challenges, and Data‑Driven Solutions

This article shares a comprehensive overview of game operation security at Tencent, covering personal background, real‑world incident cases, the inherent challenges of large‑scale game services, past monitoring efforts, and a new data‑driven alerting framework that dramatically reduces false alarms while protecting game economies.

Alerting · Big Data · Game Security
0 likes · 25 min read
dbaplus Community
Jan 15, 2018 · Operations

How JD Finance Achieves Real-Time Capacity Assessment and Smart Alerting

This article explains JD Finance's operational challenges in a rapidly expanding micro‑service environment and presents a comprehensive approach that combines offline and online load testing, precise capacity calculations, and intelligent root‑cause alert analysis using both rule‑based and machine‑learning techniques.

Load Testing · Operations · Root Cause Analysis
0 likes · 15 min read
Architects' Tech Alliance
Jan 14, 2018 · Operations

Why Some Developers Keep Coding After 40 and How Grafana Powers Their Monitoring Projects

While many believe software development ends after age 40, the article highlights veteran programmers who treat coding as a lifelong passion and showcases Dennis’s Grafana‑based monitoring solutions for Huawei storage, illustrating how open‑source dashboards, SNMP data collection, and comparisons with Kibana empower modern ops.

DevOps · Grafana · Kibana
0 likes · 7 min read
dbaplus Community
Jan 11, 2018 · Cloud Native

Essential Docker Ecosystem Tools: A Comprehensive Guide for Developers and Ops

This article provides a detailed, curated list of the most popular Docker‑related tools across categories such as orchestration, CI/CD, monitoring, security, storage, networking and management, including brief descriptions, official links and cost information to help developers, DevOps engineers and platform architects choose the right solutions for every stage of the container lifecycle.

CI/CD · Container · Orchestration
0 likes · 29 min read
Efficient Ops
Jan 7, 2018 · Operations

How Tencent Leverages AI to Simplify Massive-Scale Service Monitoring and Root‑Cause Analysis

Tencent's SNG social platform team tackles billion‑scale traffic by integrating AI‑driven anomaly detection, multi‑dimensional monitoring, and decision‑tree based root‑cause analysis, turning complex backend architectures and massive alert volumes into streamlined, actionable insights for faster issue resolution.

AI · Operations · anomaly detection
0 likes · 16 min read
Taobao Frontend Technology
Jan 5, 2018 · Operations

Why Metrics Matter: A Deep Dive into Pandora.js’s Measurement System

Metrics act as health checks for applications, enabling developers to monitor performance, track changes, and assess stability; this article explains Pandora.js’s metric naming conventions, types like Gauge, Counter, Histogram, and Meter, and provides practical Node.js code examples for implementing these measurements.

Metrics · Observability · Performance
0 likes · 13 min read
Efficient Ops
Jan 3, 2018 · Operations

How QQ Space Photo Album Handled a 4‑Fold Traffic Surge on New Year’s Day

On December 30, 2017, a sudden wave of users uploading and downloading photos of themselves at age 18 caused QQ Space's album service to see a fourfold spike in download traffic and a twelvefold surge in posts, prompting the operations and development teams to use capacity monitoring, elastic scaling, flexible architecture, and targeted optimizations to maintain service stability and user experience.

Operations · QQ Space · capacity planning
0 likes · 10 min read
37 Interactive Technology Team
Dec 25, 2017 · Operations

Design and Implementation of a Unified Monitoring Dashboard – A Case Study from 37 Interactive Entertainment

In just over a month, 37 Interactive Entertainment transformed its fragmented monitoring wall into a unified, twelve‑screen dashboard by consolidating game and service data into Elasticsearch, creating a single API, and employing modular JavaScript, custom ECharts visualizations and a 3D map, delivering real‑time insights with a cohesive sci‑fi inspired UI.

Dashboard · Data visualization · ELK
0 likes · 7 min read
Dada Group Technology
Dec 22, 2017 · Operations

Performance Testing Process, Plans, and Best Practices for High‑Traffic Events

This article explains the purpose of performance (stress) testing, compares four testing approaches, details the chosen proportional‑deployment strategy, and provides comprehensive preparation steps, script guidelines, metric analysis, and practical tips for ensuring system stability during large‑scale traffic spikes.

Load Testing · Operations · capacity planning
0 likes · 10 min read
21CTO
Dec 21, 2017 · Operations

Why We Switched to Nginx for L4 Load Balancing: A Practical Migration Guide

This article details a company's migration from commercial load balancers to an open‑source Nginx‑based Layer‑4 solution, covering project background, technical selection, architecture design, network and Nginx configurations, operational scripts, health‑check automation, performance testing, and data analysis using Elasticsearch and Grafana.

L4 · OSPF · Systemd
0 likes · 11 min read
Alibaba Cloud Infrastructure
Dec 21, 2017 · Operations

Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four‑layer monitoring architecture—including customer & sentiment, business, system water‑level, and infrastructure monitoring—along with data archiving and system‑level reliability measures to detect, respond to, and mitigate issues far faster than traditional manual processes.

Operations · big-data · incident response
0 likes · 14 min read
Efficient Ops
Dec 18, 2017 · Operations

How WiFi Key Built a Million‑User Monitoring Platform: Architecture and Best Practices

This article describes how WiFi Master Key (WiFi 万能钥匙) designed and implemented the Roma monitoring platform to handle billions of daily requests, covering background challenges, architectural principles, component design, data collection, transmission, storage, alerting, and future directions for large‑scale observability.

Architecture · Microservices · Observability
0 likes · 16 min read
dbaplus Community
Dec 14, 2017 · Big Data

Scaling Vipshop’s Big Data Platform: Monitoring, Multi‑HDFS, Yarn Optimization & Capping

In 2017 Vipshop’s senior big‑data architect shares how the company grew its Hadoop‑based platform from zero to a thousand‑node cluster, detailing cluster health monitoring, multi‑HDFS deployment via Hive, Yarn container allocation improvements, and a hook‑driven Capping resource‑control system to boost stability and efficiency.

Big Data · HDFS · capping
0 likes · 15 min read
dbaplus Community
Dec 11, 2017 · Backend Development

How 58 Express Scaled from Startup to Industry Leader: Architecture, Sharding, and AI Dispatch

This article recounts the technical evolution of 58 Express from its early startup days through rapid growth to an intelligent dispatch era, detailing challenges, database sharding, service decomposition, big‑data analytics, AI‑driven order routing, monitoring, and lessons learned for building a high‑performance backend system.

System Architecture · database sharding · intelligent dispatch
0 likes · 21 min read
Efficient Ops
Dec 7, 2017 · Operations

How Multi-Dimensional Root Cause Analysis Boosts Monitoring Efficiency with AI

This article introduces the challenges of multi-dimensional monitoring, explains the limitations of traditional alerting, and presents the MDRCA algorithm—combining K‑means clustering, Explanatory Power, and Surprise metrics—to pinpoint root causes efficiently, while sharing practical AI integration experiences for large‑scale monitoring platforms.

AI · Big Data · KMeans
0 likes · 15 min read
360 Zhihui Cloud Developer
Dec 7, 2017 · Operations

How 360’s Private Cloud Powers Elasticsearch: Architecture, Security, and Scaling

This article explains how 360’s Hulk private cloud platform deploys Elasticsearch with a dedicated master architecture, load‑balancing, per‑business isolated clusters, SearchGuard security, dynamic tokenization, self‑service user features, and advanced monitoring to achieve high‑performance, scalable search services.

Elasticsearch · monitoring · private cloud
0 likes · 6 min read
Node Underground
Dec 7, 2017 · Backend Development

Build a Node.js Performance Tracing Tool with Async Hooks and Performance API

This article explains how to combine Node.js's experimental Async Hooks and Performance Timing APIs to create a simple tracing and performance monitoring tool, eliminating manual timing and offering a foundation that can be extended into a custom solution, while also noting an open‑source Pandora.js utility.

Backend · Node.js · Performance API
0 likes · 3 min read
360 Quality & Efficiency
Nov 23, 2017 · Operations

Ten Micro‑Metrics to Strengthen Performance Testing Reports

This article explains why traditional macro performance metrics are insufficient, introduces ten essential micro‑metrics covering memory, thread, and network aspects, and shows how to capture them using GC logs, thread dumps, and tools like netstat or open‑source APM solutions.

APM · Threads · memory
0 likes · 8 min read
Tongcheng Travel Technology Center
Nov 23, 2017 · Backend Development

Design and Implementation of a Search Open Platform for Rapid Interface Provision

The article describes the requirements, architecture, data‑sync strategy, monitoring, and operational workflow of a search open platform that enables fast, zero‑code creation of searchable interfaces, supporting real‑time indexing, customizable ranking, and extensible backend services.

Search Platform · backend-development · data synchronization
0 likes · 12 min read
UCloud Tech
Nov 22, 2017 · Backend Development

Master Go Microservices: gRPC, TLS, Tracing & Prometheus Monitoring

This article shares practical Go microservice building experiences, covering gRPC-based communication, TLS security, request tracing, and comprehensive monitoring with Prometheus, including metric selection, alerting, and log management using Logrus and Graylog, to help reduce coupling and improve system observability.

Microservices · Prometheus · gRPC
0 likes · 10 min read
Zhuanzhuan Tech
Nov 21, 2017 · Frontend Development

Building an Efficient Operational Frontend Architecture: Component Platform, Node.js Middleware, and Monitoring

The article describes how a startup tackled rapid MVP development and frequent changes by establishing a component‑based frontend platform, a Node.js middleware layer for Java services, and performance and error monitoring to streamline operations and improve development efficiency.

monitoring · nodejs · operational tools
0 likes · 10 min read
dbaplus Community
Nov 19, 2017 · Operations

Designing Scalable Monitoring with ELK and GPE: A Practical Guide

This article outlines a large‑scale monitoring solution for distributed microservice environments, comparing traditional ELK logging with a custom GPE stack (Grafana, Prometheus, Exporter, Consul), detailing architecture, components, workflows, and practical considerations for reliable observability.

ELK · Grafana · Prometheus
0 likes · 10 min read
ITPUB
Nov 17, 2017 · Operations

Master Bash Scripting: Tips and Ready‑to‑Use Monitoring Scripts

This guide presents essential Bash scripting best practices and a collection of practical monitoring scripts—including random string generation, user creation, package checks, service status, host reachability, CPU/memory/disk usage, and website availability—complete with debugging tips and naming conventions for reliable automation.

Bash · Sysadmin · automation
0 likes · 5 min read
Qunar Tech Salon
Nov 8, 2017 · Operations

Evolution of Ele.me's Operations Infrastructure: From 1.0 to 2.0 – Standardization, Automation, and Data‑Driven Management

The article recounts Ele.me's rapid growth and the resulting operational challenges, describing how the company progressed from ad‑hoc 1.0 practices to a standardized, automated 2.0 infrastructure built on ZStack private cloud, fine‑grained operations, and data‑driven management to improve quality, efficiency, and cost.

Resource Management · monitoring · standardization
0 likes · 21 min read
MaGe Linux Operations
Oct 31, 2017 · Operations

Build Custom Zabbix Dashboards with Python and ECharts

This tutorial walks through using the Zabbix API with the pyzabbix Python library to retrieve monitoring data, then visualizes it with ECharts, showing step‑by‑step how to create personalized monitoring pages for better operational insight.

API · ECharts · Python
0 likes · 9 min read
Architecture Digest
Oct 27, 2017 · Operations

Key Practices and Principles of DevOps from the “Cloud Development and Operations Best Practices” Talk

The article summarizes a DevOps talk, outlining eight guiding principles—configuration over hard‑coding, redundancy over single points, restartability, whole‑stack delivery, statelessness, standardization, automation, and unattended operation—while sharing concrete tools, architectures, and real‑world experiences from a cloud provider.

Infrastructure · Operations · automation
0 likes · 16 min read
Meituan Technology Team
Oct 26, 2017 · Operations

Evolution of Payment Channel Automation Management at Meituan-Dianping

Meituan‑Dianping’s payment team progressed from manual fault alerts to a fully automated channel management system that detects failures, disables affected banks, conducts controlled ramp‑up tests, and restores service, dramatically cutting response times, manpower costs, and secondary‑failure risks while boosting overall availability.

Operations · System Design · fault management
0 likes · 14 min read
Qunar Tech Salon
Oct 26, 2017 · Operations

Evolution of Pinterest's Monitoring System: From Time-Series Metrics to Distributed Tracing

Over seven years, Pinterest’s monitoring team built and refined a three‑pronged observability platform—time‑series metrics, log search, and distributed tracing—scaling from a single‑machine system to handling millions of data points per second across tens of thousands of AWS VMs, while addressing reliability, cost, and usability challenges.

Distributed Tracing · Observability · SRE
0 likes · 19 min read
MaGe Linux Operations
Oct 25, 2017 · Operations

80+ Essential Linux Monitoring Tools Every Sysadmin Should Know

Discover a comprehensive collection of over 80 Linux monitoring and debugging utilities—including command‑line, system, network, and log tools—detailing their purpose, key features, and typical use cases to help you efficiently manage and troubleshoot server performance.

System Administration · monitoring · network-tools
0 likes · 13 min read
Efficient Ops
Oct 24, 2017 · Operations

How Pinterest Scaled Its Monitoring, Logging, and Tracing Over Seven Years

This article chronicles Pinterest's seven‑year evolution from a single‑machine time‑series monitor to a multi‑component system that integrates metrics, log search, and distributed tracing, sharing architectural choices, scaling challenges, and lessons learned for building reliable, high‑performance operations platforms.

Distributed Tracing · Operations · SRE
0 likes · 24 min read
MaGe Linux Operations
Oct 20, 2017 · Operations

Essential Linux Ops Skills: 10 Must‑Master Tools for Every Sysadmin

This article shares a seasoned Linux sysadmin’s ten‑point roadmap, covering rsync, network services, scripting, sed/awk, MySQL, firewalls, monitoring tools, clustering, and backup, plus the security and operational mindset needed to thrive in modern infrastructure.

automation · monitoring · tools
0 likes · 19 min read
Efficient Ops
Oct 18, 2017 · Operations

How Bilibili Scaled Its Log System to 10TB Daily with Elastic Stack

This article details Bilibili's Billions log platform—from its fragmented origins and design goals to the elastic‑stack‑based architecture, shard management, log sampling, custom Go splitters, and monitoring enhancements—highlighting the challenges faced and the roadmap for future improvements.

Big Data · Elastic Stack · Log Management
0 likes · 17 min read
Qunar Tech Salon
Oct 18, 2017 · Cloud Computing

Gome Group’s Cloud Computing and Operations Automation Practices

This article details Gome Group’s transition to cloud computing and operations automation, describing its corporate background, new operational strategies, the establishment of Gome Cloud, IaaS product architecture, monitoring solutions, automation standards, and deployment practices such as gray releases and Docker integration.

Cloud Computing · DevOps · IaaS
0 likes · 15 min read
MaGe Linux Operations
Oct 17, 2017 · Operations

Step-by-Step Guide: Build a Zabbix Monitoring System from Scratch

This article walks you through the complete process of setting up Zabbix on a Linux server—including preparing the environment, installing LAMP, configuring the Zabbix server and agent, creating databases, defining templates, items, triggers, graphs, and custom script alerts—to achieve real‑time network traffic monitoring and automated notifications.

Alerting · Network Traffic · monitoring
0 likes · 9 min read
dbaplus Community
Oct 16, 2017 · Operations

How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

Operations · capacity planning · incident management
0 likes · 14 min read
Efficient Ops
Oct 16, 2017 · Cloud Computing

How Gome Used Cloud Computing & Automation to Revolutionize IT Ops

At Gome Group, a traditional retailer with over 30,000 employees, the IT team built a unified cloud platform and automated operations, consolidating resources across dozens of subsidiaries to cut costs, boost efficiency, and enable rapid service delivery through IaaS, standardized processes, and custom monitoring tools.

Cloud Computing · Operations Automation · enterprise IT
0 likes · 16 min read
ITFLY8 Architecture Home
Oct 12, 2017 · Backend Development

How Taobao Scaled Its Backend Architecture Over Time

This article outlines Taobao's learning objectives, traces the evolution of its backend architecture from V1.0 to V3.0, highlights the technical challenges faced at each stage, and explains the architectural decisions—such as modularization, service‑oriented frameworks, distributed storage, and large‑scale monitoring—that enabled massive scalability, reliability, and performance improvements.

Architecture · Backend · Big Data
0 likes · 6 min read
58 Tech
Oct 12, 2017 · Cloud Computing

Design and Implementation of 58 Private Cloud Platform Using Container Technology

The article details 58's private cloud platform built on container technology, explaining the motivations, overall architecture, and core module designs such as container management, network model, image repository, logging, and monitoring, illustrating how Docker and Kubernetes enable efficient resource utilization, rapid scaling, and streamlined deployment.

Container · cloud architecture · monitoring
0 likes · 12 min read
ITPUB
Oct 7, 2017 · Operations

13 Must‑Have Linux Ops Tools and Quick Installation Guides

This guide introduces thirteen essential Linux operation utilities—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap and Httperf—providing brief descriptions, download links and step‑by‑step commands to install and use each tool for monitoring, performance testing, security and session management.

Linux · Sysadmin · monitoring
0 likes · 12 min read
Tongcheng Travel Technology Center
Sep 29, 2017 · Big Data

Evolution of Monitoring Architecture and Traffic Alert Algorithms at Tongcheng Travel

This article describes how Tongcheng Travel’s monitoring system evolved from a monolithic design to a distributed and big‑data‑based architecture, introducing real‑time processing with Storm, machine‑learning‑enhanced alerts, and a multivariate linear regression model that dramatically improves traffic anomaly detection accuracy.

Big Data · Real-time Processing · architecture evolution
0 likes · 10 min read
Dada Group Technology
Sep 29, 2017 · Operations

Overwatch: A Distributed System Monitoring Platform for Real‑Time RPC Visibility

Overwatch is an open‑source distributed monitoring platform built by Dada‑Jingdong Home that collects, aggregates, and visualizes RPC traffic across thousands of micro‑services in real time, enabling engineers to quickly pinpoint the root cause of system failures using directed‑graph visualizations and CQRS‑based data queries.

CQRS · Kafka · RPC
0 likes · 10 min read
Meitu Technology
Sep 28, 2017 · Operations

Inside Meipai’s 3‑D Monitoring System: Scaling 150M Users with Unified Observability

This article examines how Meipai, a popular live‑streaming and short‑video platform with over 150 million monthly active users, engineered a comprehensive, three‑dimensional monitoring architecture that spans client to server, integrates unified dashboards, and leverages both private and public cloud resources to ensure reliable, scalable operations.

DevOps · Infrastructure · Meipai
0 likes · 3 min read
Meitu Technology
Sep 28, 2017 · Industry Insights

Inside Meitu’s 6th Tech Salon: Deep Dive into Meipai’s Recommendation, Monitoring, and Live‑Streaming Architecture

The sixth Meitu Internet Technology Salon in Beijing showcased Meipai’s evolution, with senior engineers detailing the platform’s recommendation system, real‑time background segmentation, monitoring framework, live‑streaming and bullet‑screen architecture, offering practical insights and best‑practice lessons for building and optimizing large‑scale video services.

Industry Insights · Meipai · Video platform
0 likes · 7 min read
21CTO
Sep 26, 2017 · Operations

Why You Should Never Trust Any Component in Your System—and How to Protect It

In programming and operations, every element—from services and dependencies to requests, machines, data centers, power, networks, and humans—can fail unexpectedly, so you must assume distrust and implement defensive measures such as monitoring, redundancy, rate limiting, fallback strategies, backups, and automated deployment.

Operations · Reliability · fault tolerance
0 likes · 9 min read
Efficient Ops
Sep 25, 2017 · Operations

How Qunar Scaled Application Ops Automation from Hundreds to Tens of Thousands of Servers

This article details Qunar's journey of automating application operations, covering the evolution of their host‑management system, unified monitoring/alert platform, and data‑interchange mechanisms that enabled the company to grow from a few hundred to over ten thousand servers with a stable six‑person ops team.

Data Integration · Operations Automation · Qunar
0 likes · 25 min read
Qunar Tech Salon
Sep 18, 2017 · Operations

Integrated Code Quality Monitoring and Crash Management Solution

This article describes an integrated solution that combines code quality monitoring during development with automated crash issue tracking after deployment, using a custom platform, Jenkins, Gradle plugins, static analysis tools, and rule-based filtering to continuously improve project reliability and performance.

code quality · continuous integration · crash management
0 likes · 13 min read
Architecture Digest
Sep 16, 2017 · Backend Development

Essential Backend Infrastructure and Services for Internet Companies

This article outlines the essential backend infrastructure components and best‑practice patterns—such as API gateways, service frameworks, caching, databases, search engines, message queues, authentication, configuration, service governance, scheduling, logging, and monitoring—required to build stable, scalable, and maintainable internet applications.

Backend · Infrastructure · Microservices
0 likes · 31 min read
Efficient Ops
Sep 10, 2017 · Operations

How We Built a Scalable, High‑Availability Monitoring Platform with Service Trees

This article details the challenges of traditional monitoring systems, the design and implementation of a custom high‑availability monitoring platform using a Golang‑based service tree, Raft‑backed storage, InfluxDB for time‑series data, and a modular architecture that supports Windows agents, third‑party reporting, and AI‑driven future enhancements.

InfluxDB · Ops · aiops
0 likes · 13 min read
ITPUB
Sep 7, 2017 · Operations

Essential Command-Line Tools Every Linux Sysadmin Should Know

Sysadmins need reliable command-line utilities to keep services running 24/7, and this guide compiles the most commonly used networking, security, storage, logging, backup, performance, efficiency, package-management, and hardware inspection tools on Linux, explaining each command’s purpose and typical use cases.

CLI tools · Sysadmin · monitoring
0 likes · 15 min read
Efficient Ops
Sep 3, 2017 · Operations

How to Design an Enterprise‑Grade Monitoring & Alerting System from Scratch

This article introduces the fundamental concepts, methods, types, goals, and product attributes of enterprise monitoring and alerting, explains the perspective differences between users and builders, and outlines a comprehensive monitoring system architecture for large‑scale operations.

Alerting · Enterprise · Operations
0 likes · 14 min read
360 Zhihui Cloud Developer
Aug 30, 2017 · Operations

Mastering Prometheus: From Metrics Basics to High‑Availability Monitoring

This article shares practical experiences of using Prometheus for monitoring complex services, covering metric types, PromQL query techniques, naming conventions, service discovery with file‑based configs, high‑availability sharding, alerting via Alertmanager, and visualisation with Grafana, providing actionable guidance for reliable observability.

Grafana · PromQL · Prometheus
0 likes · 15 min read
Efficient Ops
Aug 21, 2017 · Operations

How AI-Driven Automation Transforms Tencent Game Operations

This article explains how Tencent Game operations moved from manual, threshold‑based monitoring to an AI‑powered, data‑driven workflow that automates scaling, improves online‑curve monitoring, enables full‑dimensional analysis, and reduces time, labor, and cost while enhancing player experience.

Gaming · Operations · automation
0 likes · 16 min read
Qunar Tech Salon
Aug 18, 2017 · Operations

Hardware Automation Operations System at Qunar: Design, Implementation, and Lessons Learned

This article details Qunar's hardware automation operations platform, covering the hardware scope, pain points of manual processes, a five‑stage lifecycle, automated testing, data collection, fault handling, and the underlying Mesos‑Marathon‑Docker infrastructure that together improve efficiency, reliability, and cost control.

data collection · fault handling · hardware automation
0 likes · 21 min read
Ctrip Technology
Aug 17, 2017 · Operations

Design, Evolution, and Future of Ctrip's Operations Workflow Platform

This article details the challenges, architectural evolution, key components, implementation experiences, and future directions of Ctrip's operations workflow platform, illustrating how a multi‑stage, layered design and standardized services have transformed manual IT operations into an automated, observable, and scalable system.

Operations Automation · Process Design · Service Integration
0 likes · 16 min read
Efficient Ops
Aug 16, 2017 · Operations

How Qunar Built an Automated Hardware Operations Platform to Boost Efficiency

This article details Qunar's end‑to‑end hardware automation system, covering background challenges, lifecycle management, automated testing, data collection, fault detection, and visualized monitoring, and explains how the integrated platform reduces manual effort, improves reliability, and cuts operational costs.

CMDB · Operations · fault management
0 likes · 22 min read
Efficient Ops
Aug 13, 2017 · Operations

22 Essential Ops Manager Tips for Building Resilient Web Infrastructure

This article compiles 22 practical recommendations from an operations manager covering domain management, CDN usage, image servers, data center selection, monitoring, security, redundancy, high‑availability architecture, disaster‑recovery planning, and team coordination to help ensure stable and secure online services.

Infrastructure · Operations · disaster recovery
0 likes · 12 min read
Meituan Technology Team
Aug 10, 2017 · Frontend Development

Front-End Service Availability: Definition, Measurement, and Assurance Practices at Meituan-Dianping Checkout

The article outlines Meituan‑Dianping’s approach to front‑end service availability for its checkout system, defining availability across code, static resources, and network links, measuring failure duration, identifying typical bugs, and implementing a three‑stage assurance strategy using people processes, engineering tools, lightweight technology choices, and concrete practices such as TypeScript adoption, automated testing, health‑checks, DNS protection, and post‑incident monitoring.

Availability · Frontend · SSR
0 likes · 15 min read
MaGe Linux Operations
Aug 8, 2017 · Operations

Essential Automation Ops Resources: Books, Tools, and News Sources

This guide highlights the urgent need for automation in modern operations and curates essential books, documentation, and information sources covering Puppet, Nagios, Zabbix, Linux scripting, high‑availability servers, and Python‑based automation to help seasoned engineers and newcomers alike.

Books · monitoring · tools
0 likes · 11 min read
High Availability Architecture
Aug 8, 2017 · Big Data

Practical Big Data Architecture Evolution and Lessons Learned

The article reviews the evolution of big‑data architectures from a simple RDB‑centric pipeline to a SaaS‑based solution, highlighting common bottlenecks such as scaling, integration, cost, and operational complexity, and shares practical experiences and best‑practice recommendations for building efficient, maintainable data platforms.

Architecture · Big Data · SaaS
0 likes · 12 min read
Architecture Digest
Aug 7, 2017 · Operations

Website Availability and High‑Availability Architecture Overview

This article explains website availability metrics, fault‑weight scoring, layered high‑availability architecture, session management strategies, reusable service design, data redundancy, quality assurance processes, and monitoring practices essential for maintaining reliable large‑scale web systems.

Availability · Operations · Session Management
0 likes · 9 min read
ITFLY8 Architecture Home
Aug 6, 2017 · Backend Development

How Meizu Scales Real‑Time Push to 600 M Messages/min: Architecture, Pitfalls & Solutions

The article details Meizu's massive real‑time push system handling 25 million online users and 600 million messages per minute, explains its four‑layer architecture, and shares how the team tackled phone power consumption, mobile network instability, massive connections, monitoring, and gray‑release deployment.

Distributed Systems · Mobile Optimization · gray release
0 likes · 13 min read
Efficient Ops
Aug 4, 2017 · Operations

How Tencent’s ZhiYun Platform Powered the “Military Photo” Campaign with 4,000 Servers

This article details how Tencent's SNG operations team leveraged the ZhiYun intelligent operations platform—through standardized processes, massive IaaS provisioning, CMDB management, automated workflows, and real‑time capacity monitoring—to support the high‑traffic “Military Photo” H5 campaign, scaling up to 4,000 servers and 24 GB bandwidth.

CMDB · Cloud Computing · IaaS
0 likes · 10 min read
Efficient Ops
Aug 2, 2017 · Operations

Essential Ops Playbook: 6 Key Practices to Prevent Disasters

Drawing from a year‑and‑a‑half of ops experience, this guide outlines six practical categories—online operation standards, data handling, security, daily monitoring, performance tuning, and mindset—to help engineers avoid costly mistakes and maintain stable, secure systems.

Backup · Operations · Performance Tuning
0 likes · 12 min read
ITPUB
Jul 17, 2017 · Operations

Essential Linux Ops Tools Every Sysadmin Should Master

This guide outlines the core Linux system fundamentals, networking services, scripting languages, text‑processing utilities, database handling, firewall configuration, monitoring solutions, clustering, and backup techniques that form the essential toolkit for aspiring Linux operations engineers.

Linux · Operations · Sysadmin
0 likes · 7 min read
MaGe Linux Operations
Jul 15, 2017 · Fundamentals

Master Python File Operations and System Automation with Practical Code Examples

This article presents a comprehensive collection of Python tutorials and scripts covering file I/O modes, directory traversal, log analysis, simple games, command‑line argument handling, process monitoring, port checking, authentication loops, and SNMP‑based CPU and network traffic monitoring, providing a solid foundation for automation and operations tasks.

Sysadmin · file-io · monitoring
0 likes · 15 min read
360 Zhihui Cloud Developer
Jul 13, 2017 · Cloud Computing

Inside 360’s Ultron: How OpenStack Powers a Scalable Private Cloud

This article details the evolution, architecture, deployment, monitoring, and performance optimization of Ultron—360’s internal OpenStack‑based virtualization platform—covering its three development stages, technical stack, automation with Ansible, advanced features like VXLAN and Ceph, and lessons learned from large‑scale operations.

Ansible · Ceph · DPDK
0 likes · 19 min read
DevOps
Jul 12, 2017 · Cloud Native

Container Monitoring: Challenges, Metrics Collection, and Best Practices

This article examines the unique challenges of monitoring containers, outlines three categories of metrics to collect, compares host‑centric and layered monitoring architectures, provides detailed methods for gathering CPU, memory, I/O and network data via cgroup files and Docker commands, and shares practical insights, tooling recommendations, and a Q&A session for effective container observability.

Docker · Ops · Prometheus
0 likes · 18 min read
MaGe Linux Operations
Jul 9, 2017 · Operations

Mastering Game Operations: From Legacy Servers to Modern Cloud Strategies

An in‑depth look at the evolution of game operations—from early PC and web games to today’s mobile and cloud‑based titles—covering architecture, Tcaplus storage, CMDB building, automated deployment, performance monitoring, data warehousing, and the essential skills and challenges faced by game ops engineers.

CMDB · game operations · game server architecture
0 likes · 27 min read