Tagged articles
360 Zhihui Cloud Developer
Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Operations · incident response · monitoring
ITFLY8 Architecture Home
Oct 11, 2016 · Operations

How to Gracefully Degrade Services When Server Load Spikes

This article explains various service degradation strategies—including interface and page refusal, delayed persistence, and persistent‑layer restrictions—along with management approaches and implementation points such as middleware control, NGINX+LUA page blocking, and data‑operation rules, to keep core functions running under high server pressure.

Operations · asynchronous queue · caching
Efficient Ops
Oct 8, 2016 · Operations

How to Boost Server Resource Utilization: Strategies, Trade‑offs, and Metrics

This article explains why servers often run far below their theoretical capacity, defines the concept of highest usable resource utilization, and offers practical and advanced techniques—such as multithreading, workload consolidation, resource layering, and overselling—to improve utilization while weighing performance, cost, and reliability impacts.

Operations · Resource Efficiency · performance optimization
DevOps
Oct 8, 2016 · Operations

What Is DevOps? Origins, Key Issues, Benefits, and Adoption

The article explains DevOps as the integration of development and operations, tracing its origins, outlining its cultural and technical challenges, detailing its benefits such as faster, more reliable releases, and reviewing the tools and global adoption trends, including a new Chinese survey initiative.

Culture · DevOps · Operations
Java Backend Technology
Oct 8, 2016 · Backend Development

Understanding Nginx: Core Concepts, Features, and Architecture

This article explains Nginx's role as a high‑performance HTTP and reverse‑proxy server, its event‑driven design, key features, internal process model, request handling flow, and real‑world deployments, providing a comprehensive overview for developers and operations engineers.

Operations · backend-development · reverse proxy
Efficient Ops
Oct 6, 2016 · Operations

How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Operations · automation · capacity planning
Efficient Ops
Oct 5, 2016 · Operations

5 Must‑Read Ops ‘Prescriptions’ to Boost Your Infrastructure Skills

This article curates five technical reads—covering network operations, Google’s production environment, massive cost‑saving strategies, IDC automation, and Docker‑based RDS—each presented as a “medicine” with a brief description and a link for deeper insight.

Cost Optimization · Docker · Operations
ITPUB
Oct 4, 2016 · Operations

How to Build a Resilient High‑Traffic Website: A Complete Operations Guide

This guide outlines a step‑by‑step strategy for designing a highly available, secure, and scalable website architecture, covering domain acquisition, CDN deployment, image caching, data center selection, monitoring, DDoS mitigation, redundancy, server configuration, database replication, testing environments, and operational best practices.

Operations · high availability · security
Meituan Technology Team
Oct 1, 2016 · Operations

How Meituan Scaled Its Mobile Deal System for Mega‑Promotions: Traffic Modeling & Capacity Planning

This article details Meituan's technical approach to handling massive traffic spikes during large‑scale promotions, covering background of the O2O deal platform, traffic‑model construction, capacity‑budget calculations, micro‑service architecture evolution, pressure‑test strategies, and the PTP performance‑testing environment used to validate system limits.

Load Testing · Microservices · Operations
Efficient Ops
Sep 28, 2016 · Operations

Can Precise Operations Transform IT Service Delivery? A Deep Dive

This article explains the concept of precise operations, detailing how integrating business demand as a variable can make IT maintenance more proactive, value‑driven, and synchronized with business needs, and outlines a step‑by‑step framework with real‑world examples.

IT-service · Operations · business-alignment
Ctrip Technology
Sep 23, 2016 · Backend Development

Reconstructing Ctrip's Payment Engine: Architecture, Services, and Operational Practices

This article presents Hong Guangming's detailed overview of Ctrip's payment engine reconstruction, covering the legacy system's challenges, the new split‑service architecture, channel‑specific services, settlement reconciliation, and the multi‑layered strategies employed to achieve high availability and operational stability.

Backend · Ctrip · Operations
Art of Distributed System Architecture Design
Sep 22, 2016 · Industry Insights

How WhatsApp Scaled to 450 Million Users with Erlang: Architecture and Lessons

This article dissects WhatsApp’s high‑reliability architecture that supports 450 million users, detailing its Erlang‑based backend, hardware choices, scaling techniques, monitoring tools, and the engineering lessons learned from pushing a single server to two‑million concurrent connections.

Backend Architecture · Erlang · Operations
dbaplus Community
Sep 22, 2016 · Operations

How Microsoft and Xiaomi Mastered DevOps: Practical Lessons for Global Scale

This article summarizes Ouyang Chen's GDevOps 2016 talk, covering the definition of DevOps, four personal viewpoints, Microsoft's three‑phase transformation, Xiaomi's rapid release pipeline, key principles, metrics such as time‑to‑detect, and essential tools for building an efficient DevOps culture.

DevOps · Microsoft · Operations

WhatsApp Scaling Architecture: Lessons from Two Years of Growth

Over the past two years WhatsApp has dramatically expanded its user base, hardware, and traffic while maintaining a tiny engineering team, highlighting the challenges of massive scalability, Erlang‑based distributed design, Mnesia database bottlenecks, decoupling strategies, and operational patches required to keep the service reliable.

Erlang · Mnesia · Operations
Practical DevOps Architecture
Sep 20, 2016 · Operations

Troubleshooting PPPoE Dial‑up Failure on Huawei AR2240 Gateway

The article explains why a Huawei AR2240 gateway using PPPoE on an on‑demand dial‑bundle fails to obtain an IP address, identifies the misconfiguration of the internal Ethernet interface, and provides step‑by‑step commands to switch the PPPoE mode to permanent online for successful dialing.

Dial-up Configuration · Huawei AR2240 · Operations
360 Zhihui Cloud Developer
Sep 18, 2016 · Artificial Intelligence

How Linear Regression Can Tame Your Nighttime Alert Fatigue

This article explores how historical monitoring alerts can be analyzed and predicted using linear regression, guiding operations engineers to preprocess data, build regression models, and forecast future alert trends to reduce manual alarm handling and improve system stability.

Operations · alert prediction · linear regression
Qunar Tech Salon
Sep 18, 2016 · Operations

Analyzing Nginx Access Logs for Traffic, Performance, and Optimization

This article explains how to extract valuable performance and traffic insights from Nginx access logs using shell commands and awk, covering request volume, peak rates, bandwidth usage, slow‑query detection, URL normalization, and practical optimization recommendations for web operations.

Operations · Performance Monitoring · Shell scripting
Efficient Ops
Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies—disaster planning, post‑mortem culture, automation, and data‑driven decision‑making—are adopted, adapted, or contrasted in sectors such as manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance, highlighting key similarities, differences, and lessons for Google and the broader tech industry.

Operations · SRE · automation
Architecture Digest
Sep 11, 2016 · Operations

Designing and Operating High‑Scale E‑commerce Systems: Insights from Dangdang

The article details Dangdang's 15‑year evolution from a monolithic platform to a distributed, SOA‑based architecture, describing system tiering, front‑end and back‑end scaling techniques, asynchronous processing, data‑flow optimization, and operational practices that enable stable handling of ten‑fold traffic spikes during major sales events.

Operations · SOA · Scalability
360 Zhihui Cloud Developer
Sep 8, 2016 · Operations

How to Diagnose App “Unavailable” Issues with Ping&DNS: A Step‑by‑Step Guide

This article walks readers through practical techniques for identifying and resolving common “unavailable” app problems—such as white screens, slow loading, or error codes—by using the lightweight Ping&DNS Android tool to check domain names, IP connectivity, DNS resolution, and traceroute data, empowering both novice users and professionals.

DNS · Operations · mobile app
Qunar Tech Salon
Sep 7, 2016 · Databases

Principles and Practices of MySQL Database Sharding

The article explains when MySQL sharding is needed, outlines five practical principles for deciding to split databases or tables, provides real‑world examples, and shares operational tips for implementing horizontal partitioning to improve performance, availability, and manageability.

Operations · Partitioning · database scaling
Architects' Tech Alliance
Sep 7, 2016 · Operations

How Agentless Backup Works in Cloud Environments and Its Trade‑offs

The article examines agentless backup technology, comparing its implementation in virtualized and physical environments, detailing supported interfaces, evaluating a real‑world Asigra Cloud Backup case, and discussing security risks, performance impacts, and when traditional agents remain necessary.

Cloud Backup · Data Protection · Operations
Meituan Technology Team
Sep 2, 2016 · Databases

Automated Database Operations Platform Overview

An automated database operations platform shifts routine DBA tasks to developer self‑service and fully automates processes such as cluster provisioning, scaling, backup, migration, and sharding, using a stateless workflow center, job queue, and unified SqlEditor to improve efficiency, safety, and auditability.

Data Migration · Job Center · Operations
Efficient Ops
Aug 28, 2016 · Operations

Six Proven Methods to Optimize Server Capacity and Cut Costs in Large‑Scale Social Networks

Tencent's SNG team shares six practical capacity‑management techniques—performance, density, feature, fragmentation, barrel, and hardware selection methods—that helped reduce operational expenses by over a hundred million yuan annually while supporting hundreds of millions of daily active users.

Cost Optimization · Operations · capacity management
Qunar Tech Salon
Aug 26, 2016 · Backend Development

Reconstructing Ctrip's Payment Engine: Architecture, Design, and Operational Practices

This presentation details Ctrip's payment engine reconstruction, describing the limitations of the legacy system, the new service‑oriented architecture separating payment planning and execution, channel‑specific services, the reconciliation system, and the operational measures taken to achieve high payment availability.

Ctrip · Operations · System Architecture
Ctrip Technology
Aug 26, 2016 · Information Security

Automated Firewall Operations and Management System at Ctrip

The article describes how Ctrip’s network security team built an automated, centralized firewall management platform that handles multi‑brand firewalls, streamlines policy queries, generation, and deployment, integrates with change‑ticket workflows, and dramatically improves operational efficiency while reducing human error.

Ctrip · Infrastructure · Operations
Efficient Ops
Aug 23, 2016 · Mobile Development

How Tencent Cut Mobile QQ/Qzone Lag with Network & Client Optimizations

This article details Tencent's practical approaches to reducing user‑perceived latency in mobile QQ and Qzone by analyzing server, network, and client delays, employing private protocols, multi‑path connection strategies, real‑time monitoring, and big‑data clustering to identify and fix performance bottlenecks.

Operations · big data analysis · client monitoring
ITPUB
Aug 21, 2016 · Backend Development

How to Diagnose and Prevent Redis Data Loss in Production

This article examines common causes of Redis data loss, walks through a real‑world incident where 90,000 keys vanished, and provides concrete monitoring, configuration, and operational safeguards to detect and avoid such failures.

Backend · Data loss · Operations
Efficient Ops
Aug 18, 2016 · Operations

How Ant Financial Scales to 86,000 TPS: Cloud‑Native Operations Lessons

This article details Ant Financial's evolution from supporting 20,000 transactions per minute in 2010 to 86,000 transactions per second in 2015, describing its multi‑active architecture, financial‑grade operation platform, and organizational mechanisms that enable high‑availability, automated capacity management and fault handling in a cloud‑native environment.

Operations · financial technology
Baidu Intelligent Testing
Aug 18, 2016 · Mobile Development

Establishing an Effective Mobile App Quality Monitoring System: Standards, Metrics, and Data Utilization

This article explains how to build a comprehensive mobile app quality monitoring framework by defining quality standards, setting capability indicators, and leveraging data acquisition, analysis, and visualization to continuously improve product reliability and user experience across different development stages.

Metrics · Operations · Quality Monitoring
ITPUB
Aug 16, 2016 · Databases

Achieving Seamless MySQL HA with Pacemaker and MHA: Lessons from DTCC 2016

This article details a MySQL high‑availability solution built on Pacemaker, Corosync and MHA, explains why earlier keepalived‑based designs suffered split‑brain issues, and walks through the architecture, quorum handling, resource agents, failover workflow, testing methodology, and practical lessons learned.

MHA · Operations · Pacemaker
Efficient Ops
Aug 7, 2016 · Operations

Automated Operations Platforms: Stages, Pain Points, and Design Blueprint

This article outlines the evolution of enterprise operations through four stages, identifies seven common operational pain points, and presents a comprehensive model for building an automated operations platform that integrates design, deployment, monitoring, optimization, and troubleshooting.

CMDB · IT infrastructure · Operations
Architecture Digest
Aug 7, 2016 · Operations

Designing a Three‑Dimensional High‑Availability Architecture for Alibaba's Game Integration System

The article describes how Alibaba's game integration platform achieved business‑oriented high availability by abandoning traditional system‑centric designs and implementing a three‑dimensional architecture that combines clear HA goals, multi‑active deployment, client‑side retries, functional isolation, automated monitoring, and rapid fault recovery, ultimately meeting a 3‑minute issue‑location and 5‑minute business‑recovery target.

Operations · System Architecture · business‑oriented HA
Efficient Ops
Aug 3, 2016 · Operations

How Harbor Enables Seamless Container Image Replication Across Registries

This article explains the design and implementation of Harbor's policy‑based Docker image replication, detailing its architecture, job service workflow, state‑machine handling, and how it reduces storage‑specific dependencies while simplifying large‑scale container registry synchronization.

Cloud Native · Container Registry · Harbor
MaGe Linux Operations
Aug 1, 2016 · Operations

Mastering MHA Core Parameters: Complete Guide to MySQL HA Configuration

This article provides a detailed walkthrough of MHA's core configuration parameters—including server scopes, connection settings, candidate master rules, failover scripts, and monitoring options—explaining where each setting belongs and how to fine‑tune MySQL high‑availability behavior.

Configuration · MHA · Operations
Efficient Ops
Aug 1, 2016 · Operations

How Tencent Shifted 70M Users During Tianjin Explosion – A Multi‑Active Ops Playbook

This article details how Tencent's operations team orchestrated a seamless, zero‑impact migration of over 70 million users across three data centers during the 2015 Tianjin explosion, highlighting the four key capabilities—distribution, scheduling, data synchronization, and automated operations—that enabled multi‑active disaster recovery at massive scale.

Distributed Systems · Operations · data synchronization
21CTO
Jul 30, 2016 · Operations

Building a 3‑Minute Fault Detection, 5‑Minute Recovery HA System for Games

This article explains how Alibaba’s NineGame platform achieved ultra‑high availability by shifting from system‑centric to business‑centric design, defining measurable goals (3‑minute issue detection, 5‑minute recovery, bi‑monthly incidents) and implementing a layered, automated, visual monitoring, client‑side retry, HTTP‑DNS, functional isolation, and multi‑site active‑active architecture.

Operations · business‑centric design · fault tolerance
Baidu Intelligent Testing
Jul 28, 2016 · Operations

Ensuring Store Data Quality in O2O Products: Processes and Rules

This article outlines the importance of store data in O2O products and presents a comprehensive workflow—including single‑attribute rules, multi‑attribute cross‑validation, and auxiliary checks—to detect and remediate low‑quality or erroneous store information, thereby improving user experience.

Data Quality · O2O · Operations
Efficient Ops
Jul 27, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Traffic Systems Alive

This article explores practical service‑degradation techniques, including automatic and manual switches, read/write fallback, and multi‑level strategies, that keep core functionality available during traffic spikes, failures, or resource constraints in high‑concurrency systems.

Backend · Operations · fallback strategies
Architects' Tech Alliance
Jul 20, 2016 · Operations

How Distributed Indexing Improves Backup Performance and Scalability

The article explains how traditional centralized backup indexes become performance bottlenecks as data grows, and details Simpana's two‑level distributed indexing architecture—primary and secondary indexes—showing how it enhances backup speed, reduces network load, and simplifies recovery across multi‑site environments.

Backup · Data Recovery · Operations
Efficient Ops
Jul 19, 2016 · Operations

How Alibaba Games Built a 4‑9 High‑Availability System: Architecture, HTTP‑DNS & Ops Practices

This article details Alibaba Games' journey to achieve four‑nine reliability through a business‑focused high‑availability architecture, including system analysis, a four‑layer design, HTTP‑DNS client retry, service decoupling, multi‑active deployment, comprehensive monitoring, and measurable operational goals.

Operations · System Architecture · http-dns
Architecture Digest
Jul 19, 2016 · Operations

Designing a Multi‑Dimensional High‑Availability Architecture for a Game Access System

The article presents a business‑oriented, three‑layer high‑availability architecture for a large‑scale game access platform, detailing measurable goals, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid fault detection, isolation, and recovery.

Operations · distributed-systems · fault-tolerance
Efficient Ops
Jul 11, 2016 · Operations

How Tencent's Intelligent Monitoring Transforms Ops Automation

Leveraging Tencent's extensive experience in social platform operations, this talk explores intelligent monitoring practices—covering active, passive, and side‑channel techniques, full‑link observability, data processing pipelines, and alert convergence—to enhance reliability, availability, and user experience while reducing noise for ops teams.

Alert Management · Big Data · Operations
Efficient Ops
Jul 10, 2016 · Operations

How We Cut Game Server Downtime from 1.5 Hours to 0.3 Hours

This article details how a Tencent game operations team reduced a major online game's scheduled maintenance window from 1.5 hours to just 0.3 hours by redesigning the checklist, separating pre‑ and post‑maintenance tasks, and switching to a rename‑based update method across thousands of servers.

Operations · downtime reduction · game server
Baidu Intelligent Testing
Jul 5, 2016 · Operations

O2O Data Quality Assurance Process for Online Movie Seat Selection

The article outlines a comprehensive O2O data quality assurance workflow for online movie seat selection, detailing background challenges, a three‑stage process, evaluation metrics, and a concrete case study that demonstrates how real‑time data monitoring and issue handling improve user experience.

Data Quality · O2O · Operations
Efficient Ops
Jul 3, 2016 · Operations

Memory Myths, Subnet Mask Mistakes, and Telnet Tricks: Ops Lessons

This article shares real‑world ops stories about a disputed memory upgrade, explains how Linux calculates usable memory, clarifies common subnet‑mask misunderstandings, and demonstrates why Telnet cannot test UDP ports, highlighting practical troubleshooting lessons for system administrators.

Linux Memory · Operations · Subnet Mask
Qunar Tech Salon
Jul 1, 2016 · Operations

Optimizing Jenkins CI/CD Architecture with Docker and Container Orchestration

The article explains Jenkins' single‑node and master‑slave deployment models, outlines the scalability and resource challenges of traditional setups, and proposes replacing test machines with Docker containers managed by Kubernetes or Swarm to improve efficiency, maintainability, and resource utilization.

Docker · Jenkins · Kubernetes
ITPUB
Jun 28, 2016 · Operations

Seamless Tomcat Webapp Migration with Docker and Layered Configuration

This guide explains how to simplify and accelerate Tomcat web application migration by separating static binaries from external configurations, using Docker containers or Juju packages, applying layered configuration, managing persistent data with volumes, and automating deployment, scaling, and rollback operations.

Application Migration · Configuration Management · Containers
Efficient Ops
Jun 20, 2016 · Operations

From Ops Soldier to DevOps General: How to Start Reading Open‑Source Code

This guide shows ops engineers how to shift from routine maintenance to DevOps expertise by adopting the right mindset, mastering open‑source community resources, contributing code, and understanding design patterns, concurrency, modularity, data structures, algorithms, and system calls.

Design Patterns · Operations · System Calls
Meituan Technology Team
Jun 17, 2016 · Operations

How to Prevent and Recover from Cache‑Induced Service Overload

Service overload caused by cache failures can cripple dependent systems, but by adopting smart cache get patterns, proactive client‑side checks, traffic throttling, service degradation, and dynamic scaling, developers can both prevent overload and recover gracefully when it occurs.

Backend · Cache · Operations
21CTO
Jun 15, 2016 · Operations

How JD.com Leverages User Experience to Beat the Competition

The article examines JD.com's strategic focus on user experience—covering pricing, logistics, service, and product quality—and explains how its integrated systems and "no‑no" policy drive operational efficiency and competitive advantage in China's e‑commerce market.

Business strategy · Logistics · Operations
Efficient Ops
Jun 14, 2016 · Operations

Automate Fault Root‑Cause Detection in Massive IT Operations

This article explains how large‑scale internet companies can reduce alarm storms and speed up incident resolution by creating an operations ecosystem centered on automated fault root‑cause localization, detailing the challenges, architecture, decision‑tree algorithms, and a four‑step implementation guide.

IT infrastructure · Operations · Root Cause Analysis
Qunar Tech Salon
Jun 12, 2016 · Operations

18 Command‑Line Tools to Monitor Linux Performance

This article presents a curated list of 18 lesser‑known command‑line utilities for Linux/Unix administrators, explaining their purpose, typical usage scenarios, and how they help monitor system resources, network activity, and security events.

Linux · Operations · Performance Monitoring
Efficient Ops
Jun 6, 2016 · Operations

How a Single Space in ifconfig Crashed an Oracle RAC Cluster

A tiny typo in an ifconfig command set all IPs to 0.0.0.0, causing an Oracle RAC 10.2.0.4 cluster on Solaris 10 to collapse instantly, illustrating the critical need for meticulous command‑level precision in system operations.

Operations · RAC · ifconfig
Efficient Ops
Jun 2, 2016 · Databases

Mastering Redis Cluster in Production: Real-World Practices from VIPShop

This article shares VIPShop's extensive production experience with Redis Cluster, covering use cases, storage architecture evolution, detailed best‑practice guidelines, common pitfalls, operational automation, monitoring strategies, and useful open‑source tools for large‑scale deployments.

Operations · Redis Cluster · best-practices
DevOps
May 31, 2016 · Operations

Understanding the DevOps Toolchain: SCM, Automation, and Cloud

This article explains the DevOps toolchain by breaking it into three core components—SCM, automation, and cloud—detailing their roles, typical tools, and how they interoperate to enable continuous delivery and scalable, self‑service infrastructure.

DevOps · Operations · SCM
0 likes · 6 min read
Efficient Ops
May 26, 2016 · Operations

12 Essential Linux Command-Line Tools for Performance Monitoring

This article presents a curated list of twelve powerful command-line utilities, including lsof, htop, iotop, IPTraf, Monit, NetHogs, iftop, and Monitorix, that Linux system administrators can use to monitor, diagnose, and optimize system and network performance.

Linux · Operations · Performance Monitoring
0 likes · 9 min read
Efficient Ops
May 23, 2016 · Operations

Mastering strace: Diagnose Linux Process Issues with Real-World Examples

This article explains what strace is, how it works, and provides step‑by‑step examples—including fixing a failed service start, tracing nginx, diagnosing process crashes, shared‑memory errors, and performance analysis—to help operations engineers quickly locate and resolve Linux system problems.

Linux · Operations · debugging
0 likes · 18 min read
Efficient Ops
May 17, 2016 · Operations

When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents—a network outage caused by an improperly configured portfast on a trunk link and an NFS failure that crippled an API service—then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.

ITIL · NFS · Operations
0 likes · 11 min read
ITPUB
May 17, 2016 · Operations

Master Linux File Search: locate and find Commands Explained

This guide explains how to use the Linux locate and find commands, compares their speed and database usage, details common options, pattern syntax, file‑type and size filters, time‑based searches, permission checks, and shows practical examples of combining conditions and actions.

File Search · Linux · Operations
0 likes · 9 min read
21CTO
May 16, 2016 · Operations

How to Centralize Logs from Dockerized Services Using Flume and Kafka

This article explains a practical architecture for aggregating logs from distributed Docker containers by employing Flume NG as a lightweight log collector, Kafka as a high‑throughput message bus, and custom sinks to store logs per service, module and day with low latency and minimal resource impact.

Docker · Flume · Kafka
0 likes · 17 min read
360 Quality & Efficiency
May 13, 2016 · Operations

Practical Thoughts on Applying ELK for Log Monitoring

This article shares the author's experience and lessons learned while building a log‑monitoring framework with the ELK stack, discussing performance issues, configuration of Logstash filters using Grok, and practical tips for deploying Elasticsearch, Logstash, and Kibana in production environments.

ELK · Elasticsearch · Kibana
0 likes · 8 min read
Efficient Ops
May 11, 2016 · Operations

How to Build an Automated Operations Platform: Insights from Tencent's Experience

This article shares Peng Lihang's practical insights on operations automation, covering the essential trio of configuration, state, and change management, the evolution of ops practices, platform design principles, and concrete steps for building scalable, business‑driven ops platforms.

Configuration Management · Operations · automation
0 likes · 24 min read
MaGe Linux Operations
May 10, 2016 · Operations

10 Essential Practices to Prevent Operational Failures in Database Management

This article outlines ten practical guidelines for operations engineers—ranging from mandatory rollback testing and cautious handling of destructive commands to robust backup verification, vigilant monitoring, and disciplined handover procedures—to dramatically reduce system outages and improve overall reliability.

Backup · Operations · automation
0 likes · 18 min read
21CTO
May 10, 2016 · Operations

7 Proven Scalability Practices from eBay’s Architecture

This article shares eBay’s seven core scalability best practices—including functional partitioning, horizontal sharding, avoiding distributed transactions, asynchronous decoupling, stream processing, virtualization, and smart caching—to help architects design highly available, cost‑effective systems that can handle billions of daily requests.

Operations · System Design · best practices
0 likes · 15 min read
Baidu Intelligent Testing
May 5, 2016 · Operations

Preventing Avalanche Effect in Distributed Storage Systems: Replication Strategies, Flow Control, and Safety Mode

The article analyzes distributed storage replication methods, explains how large‑scale replica recovery can trigger an avalanche effect, and proposes operational safeguards such as cross‑rack replica selection, flow‑control mechanisms, predictive fault handling, and a safety mode to maintain system stability.

Flow Control · Operations · Replication
0 likes · 15 min read
ITPUB
May 4, 2016 · Operations

Why a Wrong mount_maxsize Crashed Our TFS Cluster and How We Fixed It

A misconfigured mount_maxsize limited each Data Server to 20 GB, pushing storage usage to 96%; correcting the setting then led to block corruption that required a custom cleanup script, illustrating the importance of proper storage settings and automated remediation in TFS operations.

Linux · Operations · TFS
0 likes · 7 min read
Alibaba Cloud Infrastructure
May 4, 2016 · Cloud Computing

Alibaba Zeus Resource Scheduling System: Architecture, Virtualization, and Operational Practices

The article examines Alibaba's Zeus resource scheduling platform, detailing its background, problem analysis, container‑based virtualization, distributed architecture, strategies for improving resource utilization such as overselling and hybrid deployment, as well as stability measures and automation for large‑scale operations.

Alibaba · Operations · cloud computing
0 likes · 12 min read
High Availability Architecture
Apr 27, 2016 · Operations

Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.

Cluster Management · Framework · Marathon
0 likes · 14 min read
Qunar Tech Salon
Apr 23, 2016 · Operations

Linux Shell Tips and Tricks: 73 Useful Commands

This article compiles 73 practical Linux shell tips covering network checks, process control, file manipulation, system monitoring, version control, and various command-line shortcuts, providing concise examples and commands to enhance productivity and troubleshooting for system administrators and developers.

Linux · Operations · Shell
0 likes · 12 min read
21CTO
Apr 20, 2016 · Operations

How Spotify Scaled Machine Management: From Ops Chaos to Cloud Automation

This article chronicles Spotify's evolution in server operations—from a manual Ops team and ad‑hoc tools in the early years, through automated DNS, provisioning, and self‑service platforms, to a hybrid cloud strategy that reduced resource‑request turnaround from weeks to minutes.

DevOps · Infrastructure · Operations
0 likes · 14 min read
Architecture Digest
Apr 20, 2016 · Operations

Evolution of Machine Management at Spotify: From ServerDb to Cloud Migration

This article chronicles Spotify's journey from a manual, fire‑fighting Ops team and the early ServerDb tool to automated DNS updates, provisioning systems like provcannon, Neep and Sid, and finally a cloud‑native migration using Google Cloud Platform, highlighting the challenges, solutions, and impact on resource delivery speed.

Infrastructure Automation · Operations · Spotify
0 likes · 13 min read
ITPUB
Apr 19, 2016 · Operations

What the Worst WTF Moments Reveal About Software Operations

A collection of real‑world programming mishaps—from mixing test and production data to dangerous rm commands—illustrates why strict environment separation, cautious command execution, and disciplined code management are essential for reliable software operations.

DevOps · Operations · System Administration
0 likes · 10 min read
Baidu Intelligent Testing
Apr 14, 2016 · Operations

Choosing and Analyzing Operational Metrics for Product Success

The article explains why operators should start from clear goals rather than events, defines meaningful metrics such as user retention and API call volume, shows how to break down and evaluate these metrics, and offers practical advice on data collection, benchmarking, and continuous improvement.

KPIs · Metrics · Operations
0 likes · 6 min read
21CTO
Apr 13, 2016 · Operations

Designing a Highly Available Transaction System: Real‑World Evolution

This article examines how a large‑scale e‑commerce transaction platform achieved high availability through iterative architectural evolution, from early .NET monoliths to vertical and horizontal micro‑service splits, and highlights practical strategies for fault detection, rapid recovery, scaling, and operational best practices.

Microservices · Operations · System Architecture
0 likes · 15 min read
ITPUB
Apr 12, 2016 · Operations

Essential Linux Daemons: Functions and Use Cases Explained

A comprehensive overview of common Linux daemon processes, detailing each service’s purpose, typical use cases, and key configuration notes for system administrators seeking to understand and manage background services effectively.

Linux · Operations · System Administration
0 likes · 12 min read
Architecture Digest
Apr 8, 2016 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

This article shares the author’s experience building fault‑tolerance for Tencent’s activity operations platform, covering retry strategies, automatic removal of unhealthy machines, timeout tuning, asynchronous processing, anti‑replay mechanisms, service degradation, service decoupling, and business‑level safeguards to reduce manual alarm handling and improve system robustness.

Distributed Systems · Operations · Retry
0 likes · 21 min read
Efficient Ops
Apr 7, 2016 · Cloud Computing

How LeTV E‑Commerce Cloud Scales High‑Traffic Shopping with Microservices

This article, based on the Efficient Operations Community Talk, outlines the evolution of e‑commerce systems, the challenges faced during rapid growth, and how LeTV’s e‑commerce cloud leverages micro‑service architecture, container technology, and hybrid cloud solutions to address scalability, security, and operational efficiency.

Container · Microservices · Operations
0 likes · 30 min read
Efficient Ops
Apr 5, 2016 · Operations

How to Define and Implement Effective Deployment Standards

This article explains what deployment specifications are, outlines the key components of a good spec, shares a real-world CodeDeploy example, and provides practical steps for designing, building, and rolling out deployment standards that balance flexibility, non‑intrusiveness, and ease of use.

Deployment · Operations · code deploy
0 likes · 13 min read