Tagged articles

Operations

3329 articles · Page 32 of 34

Jul 19, 2016 · Operations

Designing a Multi‑Dimensional High‑Availability Architecture for a Game Access System

The article presents a business‑oriented, three‑layer high‑availability architecture for a large‑scale game access platform, detailing measurable goals, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid fault detection, isolation, and recovery.

Operations · distributed-systems · fault-tolerance

0 likes · 20 min read

Efficient Ops

Jul 11, 2016 · Operations

How Tencent's Intelligent Monitoring Transforms Ops Automation

Leveraging Tencent's extensive experience in social platform operations, this talk explores intelligent monitoring practices—covering active, passive, and side‑channel techniques, full‑link observability, data processing pipelines, and alert convergence—to enhance reliability, availability, and user experience while reducing noise for ops teams.

Alert Management · Automation · Big Data

0 likes · 22 min read

Efficient Ops

Jul 10, 2016 · Operations

How We Cut Game Server Downtime from 1.5 Hours to 0.3 Hours

This article details how a Tencent game operations team reduced a major online game's scheduled maintenance window from 1.5 hours to just 0.3 hours by redesigning the checklist, separating pre‑ and post‑maintenance tasks, and switching to a rename‑based update method across thousands of servers.

Operations · downtime reduction · game server

0 likes · 10 min read

Baidu Intelligent Testing

Jul 5, 2016 · Operations

O2O Data Quality Assurance Process for Online Movie Seat Selection

The article outlines a comprehensive O2O data quality assurance workflow for online movie seat selection, detailing background challenges, a three‑stage process, evaluation metrics, and a concrete case study that demonstrates how real‑time data monitoring and issue handling improve user experience.

Data Quality · O2O · Operations

0 likes · 6 min read

Efficient Ops

Jul 3, 2016 · Operations

Memory Myths, Subnet Mask Mistakes, and Telnet Tricks: Ops Lessons

This article shares real‑world ops stories about a disputed memory upgrade, explains how Linux calculates usable memory, clarifies common subnet‑mask misunderstandings, and demonstrates why Telnet cannot test UDP ports, highlighting practical troubleshooting lessons for system administrators.

Linux Memory · Operations · System Performance

0 likes · 12 min read

Qunar Tech Salon

Jul 1, 2016 · Operations

Optimizing Jenkins CI/CD Architecture with Docker and Container Orchestration

The article explains Jenkins' single‑node and master‑slave deployment models, outlines the scalability and resource challenges of traditional setups, and proposes replacing test machines with Docker containers managed by Kubernetes or Swarm to improve efficiency, maintainability, and resource utilization.

CI/CD · Docker · Jenkins

0 likes · 7 min read

ITPUB

Jun 28, 2016 · Operations

Seamless Tomcat Webapp Migration with Docker and Layered Configuration

This guide explains how to simplify and accelerate Tomcat web application migration by separating static binaries from external configurations, using Docker containers or Juju packages, applying layered configuration, managing persistent data with volumes, and automating deployment, scaling, and rollback operations.

Application Migration · Containers · Operations

0 likes · 9 min read

Efficient Ops

Jun 27, 2016 · Operations

Mastering tcpdump: From Basics to Automated Capture and Business Architecture Mapping

This article guides operations engineers through tcpdump fundamentals, advanced filtering techniques, the design of an automated packet‑capture tool, and how to transform captured traffic data into a visual business architecture tree for efficient fault isolation and resource optimization.

Automation · Network Monitoring · Operations

0 likes · 13 min read

Efficient Ops

Jun 23, 2016 · Operations

How to Diagnose and Resolve Complex Server Outages: A Step‑by‑Step Ops Guide

This article walks through a systematic, multi‑stage approach to identifying, reproducing, and fixing server‑side problems using Linux tools such as strace, lsof, and netstat, illustrated with real‑world case studies and practical command examples.

Operations · Performance · debugging

0 likes · 18 min read

Efficient Ops

Jun 20, 2016 · Operations

From Ops Soldier to DevOps General: How to Start Reading Open‑Source Code

This guide shows ops engineers how to shift from routine maintenance to DevOps expertise by adopting the right mindset, mastering open‑source community resources, contributing code, and understanding design patterns, concurrency, modularity, data structures, algorithms, and system calls.

Design Patterns · Operations · Source Code

0 likes · 14 min read

21CTO

Jun 20, 2016 · Backend Development

How We Scaled the Duosuo English App: Architecture Lessons from Day One to Four Months

This article details the technical background and evolution of the Duosuo English learning app, covering initial architecture, bandwidth estimation, risk control, database sharding, code refactoring, and operational lessons learned over four months of scaling.

Cloud · Operations · architecture

0 likes · 8 min read

Meituan Technology Team

Jun 17, 2016 · Operations

How to Prevent and Recover from Cache‑Induced Service Overload

Service overload caused by cache failures can cripple dependent systems, but by adopting smart cache get patterns, proactive client‑side checks, traffic throttling, service degradation, and dynamic scaling, developers can both prevent overload and recover gracefully when it occurs.

Cache · Operations · System Design

0 likes · 22 min read

Ctrip Technology

Jun 16, 2016 · Operations

Technical Overview of Multi‑Site Active‑Active Call Center Architecture and Implementation

The article explains how modern call centers evolve to multi‑site active‑active deployments, detailing the network design, disaster‑recovery mechanisms, login and monitoring logic, key features supporting automatic failover for thousands of agents, and future expansion possibilities.

Active-Active · Disaster Recovery · High Availability

0 likes · 6 min read

21CTO

Jun 15, 2016 · Operations

How JD.com Leverages User Experience to Beat the Competition

The article examines JD.com's strategic focus on user experience—covering pricing, logistics, service, and product quality—and explains how its integrated systems and "no‑no" policy drive operational efficiency and competitive advantage in China's e‑commerce market.

E‑Commerce · Operations · business strategy

0 likes · 7 min read

Efficient Ops

Jun 14, 2016 · Operations

Automate Fault Root‑Cause Detection in Massive IT Operations

This article explains how large‑scale internet companies can reduce alarm storms and speed up incident resolution by creating an operations ecosystem centered on automated fault root‑cause localization, detailing the challenges, architecture, decision‑tree algorithms, and a four‑step implementation guide.

Automation · Decision Tree · IT infrastructure

0 likes · 11 min read

Qunar Tech Salon

Jun 12, 2016 · Operations

18 Command‑Line Tools to Monitor Linux Performance

This article presents a curated list of 18 lesser‑known command‑line utilities for Linux/Unix administrators, explaining their purpose, typical usage scenarios, and how they help monitor system resources, network activity, and security events.

Command-line Tools · Linux · Operations

0 likes · 11 min read

21CTO

Jun 11, 2016 · Operations

Uncovering Hidden PHP‑CGI Deadlocks: Why Disk Space Stalls and How to Fix Them

A deep dive into a long‑standing PHP‑CGI deadlock that left deleted log files occupying disk space, explaining how signal‑unsafe functions caused the lock, how the issue was diagnosed with lsof, strace and gdb, and the practical steps to eliminate the deadlock.

CGI · Deadlock · Linux

0 likes · 8 min read

Efficient Ops

Jun 9, 2016 · Operations

How to Scale Internet Operations with Standardization, Config Management, and Monitoring

This article explores how large‑scale internet operations can achieve order and efficiency by applying entropy theory, standardizing configuration and monitoring, adopting automated deployment practices, and leveraging open‑source tools like Open‑Falcon to build a fully automated, resilient infrastructure.

Automation · Operations · configuration-management

0 likes · 13 min read

Efficient Ops

Jun 6, 2016 · Operations

How a Single Space in ifconfig Crashed an Oracle RAC Cluster

A tiny typo in an ifconfig command set all IPs to 0.0.0.0, causing an Oracle RAC 10.2.0.4 cluster on Solaris 10 to collapse instantly, illustrating the critical need for meticulous command‑level precision in system operations.

Operations · RAC · ifconfig

0 likes · 5 min read

Efficient Ops

Jun 2, 2016 · Databases

Mastering Redis Cluster in Production: Real-World Practices from VIPShop

This article shares VIPShop's extensive production experience with Redis Cluster, covering use cases, storage architecture evolution, detailed best‑practice guidelines, common pitfalls, operational automation, monitoring strategies, and useful open‑source tools for large‑scale deployments.

Operations · Production · Redis Cluster

0 likes · 19 min read

DevOps

May 31, 2016 · Operations

Understanding the DevOps Toolchain: SCM, Automation, and Cloud

This article explains the DevOps toolchain by breaking it into three core components—SCM, automation, and cloud—detailing their roles, typical tools, and how they interoperate to enable continuous delivery and scalable, self‑service infrastructure.

Automation · Cloud · Operations

0 likes · 6 min read

Java High-Performance Architecture

May 27, 2016 · Operations

How Twitter Handles Massive Traffic Surges with Stress Testing and Preparedness

Twitter keeps its platform stable during massive traffic spikes by regularly performing large‑scale stress and extreme tests, analyzing performance metrics, and maintaining detailed contingency plans that guide rapid response to unexpected events such as the record‑breaking “Castle in the Sky” incident.

Operations · Twitter · contingency planning

0 likes · 4 min read

Efficient Ops

May 26, 2016 · Operations

12 Essential Linux Command-Line Tools for Performance Monitoring

This article presents a curated list of twelve powerful command-line utilities—such as lsof, htop, iotop, IPTraf, Monit, NetHogs, iftop, and Monitorix—that Linux system administrators can use to monitor, diagnose, and optimize system and network performance.

Command-line Tools · Linux · Operations

0 likes · 9 min read

Efficient Ops

May 23, 2016 · Operations

Mastering strace: Diagnose Linux Process Issues with Real-World Examples

This article explains what strace is, how it works, and provides step‑by‑step examples—including fixing a failed service start, tracing nginx, diagnosing process crashes, shared‑memory errors, and performance analysis—to help operations engineers quickly locate and resolve Linux system problems.

Linux · Operations · debugging

0 likes · 18 min read

Efficient Ops

May 17, 2016 · Operations

When a Single Cable Crashes a Network: Real Ops Incident Lessons

This article recounts two real‑world operations incidents—a network outage caused by an improperly configured portfast on a trunk link and an NFS failure that crippled an API service—then distills practical lessons on pre‑incident procedures, monitoring, fault handling, recovery, and post‑mortem practices.

ITIL · Incident Management · NFS

0 likes · 11 min read

ITPUB

May 17, 2016 · Operations

Master Linux File Search: locate and find Commands Explained

This guide explains how to use the Linux locate and find commands, compares their speed and database usage, details common options, pattern syntax, file‑type and size filters, time‑based searches, permission checks, and shows practical examples of combining conditions and actions.

File Search · Linux · Operations

0 likes · 9 min read

21CTO

May 16, 2016 · Operations

How to Centralize Logs from Dockerized Services Using Flume and Kafka

This article explains a practical architecture for aggregating logs from distributed Docker containers by employing Flume NG as a lightweight log collector, Kafka as a high‑throughput message bus, and custom sinks to store logs per service, module and day with low latency and minimal resource impact.

Docker · Flume · Kafka

0 likes · 17 min read

360 Quality & Efficiency

May 13, 2016 · Operations

Practical Thoughts on Applying ELK for Log Monitoring

This article shares the author's experience and lessons learned while building a log‑monitoring framework with the ELK stack, discussing performance issues, configuration of Logstash filters using Grok, and practical tips for deploying Elasticsearch, Logstash, and Kibana in production environments.

ELK · Elasticsearch · Kibana

0 likes · 8 min read

Efficient Ops

May 11, 2016 · Operations

How to Build an Automated Operations Platform: Insights from Tencent's Experience

This article shares Peng Lihang's practical insights on operations automation, covering the essential trio of configuration, state, and change management, the evolution of ops practices, platform design principles, and concrete steps for building scalable, business‑driven ops platforms.

Automation · Change Management · Operations

0 likes · 24 min read

MaGe Linux Operations

May 10, 2016 · Operations

10 Essential Practices to Prevent Operational Failures in Database Management

This article outlines ten practical guidelines for operations engineers—ranging from mandatory rollback testing and cautious handling of destructive commands to robust backup verification, vigilant monitoring, and disciplined handover procedures—to dramatically reduce system outages and improve overall reliability.

Automation · Best Practices · Operations

0 likes · 18 min read

21CTO

May 10, 2016 · Operations

7 Proven Scalability Practices from eBay’s Architecture

This article shares eBay’s seven core scalability best practices—including functional partitioning, horizontal sharding, avoiding distributed transactions, asynchronous decoupling, stream processing, virtualization, and smart caching—to help architects design highly available, cost‑effective systems that can handle billions of daily requests.

Best Practices · Operations · System Design

0 likes · 15 min read

Baidu Intelligent Testing

May 5, 2016 · Operations

Preventing Avalanche Effect in Distributed Storage Systems: Replication Strategies, Flow Control, and Safety Mode

The article analyzes distributed storage replication methods, explains how large‑scale replica recovery can trigger an avalanche effect, and proposes operational safeguards such as cross‑rack replica selection, flow‑control mechanisms, predictive fault handling, and a safety mode to maintain system stability.

Distributed storage · Flow Control · Operations

0 likes · 15 min read

ITPUB

May 4, 2016 · Operations

Why a Wrong mount_maxsize Crashed Our TFS Cluster and How We Fixed It

A misconfigured mount_maxsize limited each Data Server to 20 GB and pushed storage usage to 96%; correcting the setting then led to block corruption that required a custom cleanup script, illustrating the importance of proper storage settings and automated remediation in TFS operations.

Linux · Operations · TFS

0 likes · 7 min read

Alibaba Cloud Infrastructure

May 4, 2016 · Cloud Computing

Alibaba Zeus Resource Scheduling System: Architecture, Virtualization, and Operational Practices

The article examines Alibaba's Zeus resource scheduling platform, detailing its background, problem analysis, container‑based virtualization, distributed architecture, strategies for improving resource utilization such as overselling and hybrid deployment, as well as stability measures and automation for large‑scale operations.

Alibaba · Cloud Computing · Operations

0 likes · 12 min read

High Availability Architecture

Apr 27, 2016 · Operations

Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.

Marathon · Operations · Resource Scheduling

0 likes · 14 min read

Qunar Tech Salon

Apr 23, 2016 · Operations

Linux Shell Tips and Tricks: 73 Useful Commands

This article compiles 73 practical Linux shell tips covering network checks, process control, file manipulation, system monitoring, version control, and various command-line shortcuts, providing concise examples and commands to enhance productivity and troubleshooting for system administrators and developers.

Linux · Operations · Shell

0 likes · 12 min read

Java High-Performance Architecture

Apr 20, 2016 · Operations

Why Modern Systems Need Log Analysis Platforms – Ctrip’s ELK Case Study

This article explains why log analysis platforms are essential as systems grow, outlines the benefits of centralized logging, presents Ctrip’s real‑world requirements and challenges, and introduces the ELK stack as a scalable solution for collecting, storing, and visualizing massive log data.

Case Study · Ctrip · ELK

0 likes · 6 min read

21CTO

Apr 20, 2016 · Operations

How Spotify Scaled Machine Management: From Ops Chaos to Cloud Automation

This article chronicles Spotify's evolution in server operations—from a manual Ops team and ad‑hoc tools in the early years, through automated DNS, provisioning, and self‑service platforms, to a hybrid cloud strategy that reduced resource‑request turnaround from weeks to minutes.

Automation · Cloud Migration · Operations

0 likes · 14 min read

Architecture Digest

Apr 20, 2016 · Operations

Evolution of Machine Management at Spotify: From ServerDb to Cloud Migration

This article chronicles Spotify's journey from a manual, fire‑fighting Ops team and the early ServerDb tool to automated DNS updates, provisioning systems like provcannon, Neep and Sid, and finally a cloud‑native migration using Google Cloud Platform, highlighting the challenges, solutions, and impact on resource delivery speed.

Cloud Migration · Infrastructure Automation · Operations

0 likes · 13 min read

ITPUB

Apr 19, 2016 · Operations

What the Worst WTF Moments Reveal About Software Operations

A collection of real‑world programming mishaps—from mixing test and production data to dangerous rm commands—illustrates why strict environment separation, cautious command execution, and disciplined code management are essential for reliable software operations.

Operations · Testing · devops

0 likes · 10 min read

Efficient Ops

Apr 18, 2016 · Operations

Designing the Blue Whale Ops Platform: Architecture, PaaS, and Automation Insights

An in‑depth overview of Tencent’s Blue Whale system reveals its positioning, design philosophy, PaaS and SaaS components, and how it enables scalable, unmanned operations across cloud and on‑premise environments, illustrating practical automation stages from scripting to intelligent orchestration.

Automation · Operations · PaaS

0 likes · 17 min read

Baidu Intelligent Testing

Apr 14, 2016 · Operations

Choosing and Analyzing Operational Metrics for Product Success

The article explains why operators should start from clear goals rather than events, defines meaningful metrics such as user retention and API call volume, shows how to break down and evaluate these metrics, and offers practical advice on data collection, benchmarking, and continuous improvement.

KPIs · Operations · Product Management

0 likes · 6 min read

21CTO

Apr 13, 2016 · Operations

Designing a Highly Available Transaction System: Real‑World Evolution

This article examines how a large‑scale e‑commerce transaction platform achieved high availability through iterative architectural evolution—from early .NET monoliths to vertical and horizontal micro‑service splits—highlighting practical strategies for fault detection, rapid recovery, scaling, and operational best‑practices.

High Availability · Microservices · Operations

0 likes · 15 min read

dbaplus Community

Apr 12, 2016 · Operations

Choosing the Right Docker Monitoring Solution: Self‑Hosted vs SaaS

This article explains why Docker services need monitoring, distinguishes black‑box and white‑box approaches, compares self‑hosted and SaaS monitoring stacks, and reviews key components and popular tools such as Prometheus, InfluxDB, Grafana, Datadog, and Sysdig.

Containers · Datadog · Docker

0 likes · 13 min read

ITPUB

Apr 12, 2016 · Operations

Essential Linux Daemons: Functions and Use Cases Explained

A comprehensive overview of common Linux daemon processes, detailing each service’s purpose, typical use cases, and key configuration notes for system administrators seeking to understand and manage background services effectively.

Daemon · Linux · Operations

0 likes · 12 min read

Architecture Digest

Apr 8, 2016 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

This article shares the author’s experience building fault‑tolerance for Tencent’s activity operations platform, covering retry strategies, automatic removal of unhealthy machines, timeout tuning, asynchronous processing, anti‑replay mechanisms, service degradation, service decoupling, and business‑level safeguards to reduce manual alarm handling and improve system robustness.

Operations · distributed systems · fault tolerance

0 likes · 21 min read

Efficient Ops

Apr 7, 2016 · Cloud Computing

How LeTV E‑Commerce Cloud Scales High‑Traffic Shopping with Microservices

This article, based on the Efficient Operations Community Talk, outlines the evolution of e‑commerce systems, the challenges faced during rapid growth, and how LeTV’s e‑commerce cloud leverages micro‑service architecture, container technology, and hybrid cloud solutions to address scalability, security, and operational efficiency.

E‑Commerce · Microservices · Operations

0 likes · 30 min read

Efficient Ops

Apr 5, 2016 · Operations

How to Define and Implement Effective Deployment Standards

This article explains what deployment specifications are, outlines the key components of a good spec, shares a real-world CodeDeploy example, and provides practical steps for designing, building, and rolling out deployment standards that balance flexibility, non‑intrusiveness, and ease of use.

Deployment · Operations · code deploy

0 likes · 13 min read

21CTO

Apr 5, 2016 · Operations

How Tencent’s AMS Achieved Fault Tolerance at Billion‑Request Scale

This article shares Tencent’s experience building fault‑tolerant mechanisms for the AMS activity platform, covering retry strategies, automatic machine exclusion, timeout tuning, service isolation, asynchronous processing, anti‑replay safeguards, and operational best practices that transformed a million‑request service into an 800‑million‑request system.

Operations · System Design · asynchronous processing

0 likes · 24 min read

Java High-Performance Architecture

Mar 30, 2016 · Operations

Monitor Nginx in Real-Time with ngxtop: Quick Guide & Practical Commands

This guide explains how to use the lightweight ngxtop tool to monitor Nginx in real time, showing frequent requests, high‑traffic IPs, and providing useful command examples for filtering logs, ordering results, and installing the utility via pip.

Nginx · Operations · log analysis

0 likes · 4 min read

Efficient Ops

Mar 29, 2016 · Operations

From the Stirrup to AI: How Automation Transforms Operations

At the GOPS2016 conference, speaker Cui Xiaochun likens the invention of the horse stirrup to modern automation, tracing the evolution of operations from manual scripts to AI-driven intelligent systems, and argues that embracing AI is the next revolutionary step for ops teams.

AI · Automation · Historical analogy

0 likes · 7 min read

21CTO

Mar 23, 2016 · Operations

When a Trading Glitch Costs Billions: Lessons from Japan’s Mizuho Fat‑Finger Disaster

A 2005 fat‑finger error by a Mizuho trader triggered a loss of roughly ¥40 billion, leading to a landmark court case that clarified liability for software bugs in financial systems and highlighted the need for rigorous testing and evidence preservation.

Operations · financial loss · legal case

0 likes · 12 min read

21CTO

Mar 22, 2016 · Operations

Build a Scalable Unified Monitoring & Alert Platform with Ganglia & Centreon

This article explains how to design and implement a unified operations monitoring and alerting platform by combining Ganglia for data collection with Centreon for alerting, covering architecture layers, module functions, integration steps, and practical Q&A for large‑scale deployments.

Alerting · Automation · Centreon

0 likes · 20 min read

21CTO

Mar 22, 2016 · Operations

Inside Facebook’s ‘Hotfix Bar’: Secrets of Massive Deployments

During an exclusive visit to Facebook’s Menlo Park campus, the author uncovers the company’s sophisticated release engineering practices—including the HipHop optimizer, a custom BitTorrent‑based deployment system, continuous testing, and a unique “Hotfix Bar” culture—revealing how billions of daily requests are reliably delivered at massive scale.

Deployment · Facebook · Operations

0 likes · 18 min read

Efficient Ops

Mar 21, 2016 · Operations

How to Build a High‑Performance Unified Monitoring & Alerting Platform

This article outlines a comprehensive design for a high‑performance, unified operations monitoring platform, detailing a six‑layer architecture, the roles of data collection (using Ganglia), data extraction, and alerting modules (with Centreon), and provides practical integration tips, deployment diagrams, and Q&A for large‑scale environments.

Alerting · Centreon · Ganglia

0 likes · 24 min read

21CTO

Mar 17, 2016 · Backend Development

Turn Java Enterprise Performance Tuning into a Scientific Process

This article explains a systematic, waiting‑point‑based approach to enterprise Java performance tuning, covering load‑test design, analysis of existing versus new applications, hierarchical and technical waiting points, pool and cache sizing, and a back‑tuning workflow to achieve measurable improvements.

Java · Operations · Performance Tuning

0 likes · 17 min read

21CTO

Mar 17, 2016 · Operations

How Vipshop’s Three‑Tier Monitoring System Keeps Services Running Smoothly

This article explains Vipshop’s multi‑layer monitoring architecture, detailing system‑level metrics, application‑level tracing with the Mercury platform, and business‑level KPI dashboards, while describing the data pipelines that collect, process, and alert on distributed logs to ensure reliable operations.

Operations · Vipshop · distributed systems

0 likes · 4 min read

Baidu Intelligent Testing

Mar 15, 2016 · Operations

Establishing an Operations Evaluation Model: Steps, Metrics, and Key Considerations

This article explains how to build an operations evaluation model within a quality competitiveness framework, detailing a three‑step process for defining metrics, evaluation methods, and quantification, and highlighting essential evaluation points, attention areas, and data collection practices for product operations.

Operations · Product Management · evaluation model

0 likes · 8 min read

Alibaba Cloud Infrastructure

Mar 15, 2016 · Operations

Innovative Cooling Techniques for Data Centers: Lake Water, Sea Water, Oil, and Hot Water Solutions

The article examines various unconventional data‑center cooling methods—including lake‑water, seawater, waste‑water, mineral‑oil, and hot‑water systems—highlighting their impact on Power Usage Effectiveness (PUE) and overall energy efficiency across major tech companies.

Cooling · Operations · PUE

0 likes · 7 min read

Java High-Performance Architecture

Mar 15, 2016 · Operations

Building a 3-Dimensional Automated Visual Monitoring System for High-Availability

The article describes a three-dimensional, automated, visual monitoring approach for high-availability systems, detailing a five-layer monitoring model, automated log collection using Logstash-Redis-Elasticsearch, and visualization techniques that together reduce fault-locating time and improve operational efficiency.

Automation · Operations · System Design

0 likes · 5 min read

Architecture Digest

Mar 12, 2016 · Operations

Stack Overflow Architecture Overview: Hardware, Scaling, and Infrastructure (2015)

This article provides a detailed overview of Stack Overflow's 2015 architecture, covering daily traffic growth, hardware upgrades, redundancy principles, DNS and ISP routing, HAProxy load balancing, IIS/ASP.NET web layer, Redis caching, WebSocket services, Elasticsearch search, SQL Server databases, and the open‑source tools that support the platform.

Operations · SQL Server · load balancing

0 likes · 17 min read

Java High-Performance Architecture

Mar 11, 2016 · Operations

Ensuring High Availability: Functional Separation and Degradation Strategies

The article explains how functional separation and degradation techniques—distinguishing core from non‑core services, isolating them logically and physically, and implementing manual or automatic fallback mechanisms—help maintain high availability in distributed systems during traffic spikes or component failures.

Operations · System Design · degradation

0 likes · 3 min read

Meitu Technology

Mar 11, 2016 · Databases

Designing Scalable Database Architecture for Rapid Product Growth: Lessons from Meipai

This talk explores how Meipai designed a scalable, reliable database architecture that balances SQL and NoSQL technologies to support rapid product iteration and high‑peak traffic, highlighting key considerations for storage service reliability and performance optimization.

Database Architecture · NoSQL · Operations

0 likes · 3 min read

Baidu Intelligent Testing

Mar 10, 2016 · Operations

Performance Testing Methodology and Case Study for a High‑Traffic Wallet System

This article outlines a complete performance‑testing workflow—including requirement analysis, overall and detailed design, environment selection, execution, result collection, and bottleneck analysis—using a wallet "red‑packet" project that must handle tens of thousands of requests per second.

Operations · load testing · performance testing

0 likes · 19 min read

21CTO

Mar 8, 2016 · R&D Management

Surviving Startup Chaos: Key Strategies for Project, Code, and Team Management

This article examines the common pitfalls faced by engineers in fast‑growing startups—from poor project planning and rushed code refactoring to unclear product requirements, weak organizational processes, hasty technology choices, operations overload, and people‑related challenges—offering practical guidance to navigate each issue.

Operations · Product Development · startup

0 likes · 14 min read

21CTO

Mar 6, 2016 · Operations

Inside Stack Overflow’s 2016 Architecture: Handling 61 Million Daily Requests

The article details Stack Overflow’s 2016 infrastructure upgrades—including hardware, networking, load balancing, caching, database, and service layers—that enabled the site to process over 61 million daily requests while reducing processing time by hundreds of hours.

Caching · Operations · architecture

0 likes · 12 min read

MaGe Linux Operations

Mar 6, 2016 · Operations

Master Linux Performance: Using sar to Identify System Bottlenecks

This guide explains how to install, configure, and use the sar (System Activity Reporter) tool on Linux to monitor CPU, memory, I/O, and load metrics, helping you pinpoint performance bottlenecks through various command options and detailed reports.

Linux · Operations · System Monitoring

0 likes · 7 min read

Architecture Digest

Mar 5, 2016 · Operations

Dianping Operations Architecture Overview and Best Practices

This article presents a comprehensive overview of Dianping's operations architecture, detailing team organization, multi‑data‑center infrastructure, monitoring layers, automation tools, configuration management systems, incident analysis, lessons learned, and future directions such as Docker and PaaS adoption.

Automation · Docker · Operations

0 likes · 16 min read

21CTO

Mar 5, 2016 · Backend Development

How to Choose, Use, and Extend Open‑Source Projects Without Reinventing the Wheel

This article explores the DRY principle in software development, explains why many open‑source projects violate it, and provides practical guidance on selecting, using, and customizing open‑source solutions through real‑world case studies, focusing on business fit, maturity, operational capability, and safe integration.

Best Practices · Operations · open source

0 likes · 12 min read

Qunar Tech Salon

Mar 5, 2016 · Operations

Common Linux Commands for Java Developers

This article provides Java developers with a concise reference of essential Linux shell commands, covering process inspection, file manipulation, permission changes, compression, networking checks, remote access, and other common operations needed for interacting with Linux servers during development and deployment.

CommandLine · Linux · Operations

0 likes · 7 min read

dbaplus Community

Mar 3, 2016 · Operations

Why Every Developer Must Master Core Ops Skills

The article explains why developers need to understand operations—covering resource usage, fault handling, platform basics, and essential ops tools—so they can write maintainable code, avoid common pitfalls, and collaborate effectively with ops teams for reliable, high‑performance services.

Operations · coding standards · monitoring

0 likes · 14 min read

DevOps

Mar 2, 2016 · Operations

Understanding DevOps: Principles, Practices, and Implementation

This article provides a comprehensive overview of DevOps, explaining its purpose, cultural challenges, core principles such as automation, standardization, and configuration, its relationship with cloud, lean and agile, practical steps, metrics, and how it transforms IT delivery into an end‑to‑end business value pipeline.

Agile · Automation · Cloud

0 likes · 17 min read

Efficient Ops

Feb 24, 2016 · Operations

Is Operations Automation Overhyped? A Pragmatic Look at Real‑World Practices

The article critiques the hype around operations automation, arguing that many tasks can be handled with simple shell scripts, that automation should solve error‑prone manual work rather than replace thoughtful architecture, and that choosing the most convenient tool is more valuable than chasing trendy solutions.

Automation · Operations · Shell Scripting

0 likes · 13 min read

Architecture Digest

Feb 24, 2016 · Backend Development

Lessons from 14 Years of Website Architecture Evolution

Drawing on fourteen years of hands‑on experience, the article chronicles how a website’s architecture matures from a simple personal homepage to a billion‑page‑view enterprise system, highlighting the essential principles, design patterns, operational practices, and scalability strategies that underpin successful large‑scale web platforms.

Backend Development · Operations · Performance Optimization

0 likes · 30 min read

ITPUB

Feb 18, 2016 · Operations

Building a Custom RPC Stress‑Testing Tool: Insights from Meituan

Meituan’s internal RPC services, largely built on Thrift, required a streamlined pressure‑testing solution, leading to the development of a custom tool that automates traffic capture, provides an intuitive UI, aggregates metrics via InfluxDB, and supports both Thrift and HTTP workloads, addressing the shortcomings of existing open‑source options.

Backend Tools · Operations · RPC

0 likes · 8 min read

Architects' Tech Alliance

Feb 17, 2016 · Cloud Computing

Overview of Hyper-V Features, Management, and Storage Capabilities

This article provides a comprehensive overview of Hyper-V, covering its extensive operating system support, virtual networking, management integration with System Center, dynamic memory, storage options, VM conversion tools, and key SMB 3.0 features for high‑availability and performance in virtualized environments.

Cloud Computing · Hyper-V · Operations

0 likes · 9 min read

Architecture Digest

Feb 17, 2016 · Backend Development

Evolution of VIP (Vipshop) Business Model and System Architecture

The article outlines VIP's transition from a simple outlet‑style e‑commerce platform to a multi‑brand flash‑sale service, detailing each architectural phase—from a monolithic LAMP stack through vertical silo and distributed service‑oriented designs—to a cloud‑native, platform‑plus‑application model that supports scalable, high‑availability operations.

Backend Development · Operations · Vipshop

0 likes · 11 min read

Java High-Performance Architecture

Feb 17, 2016 · Operations

Master Linux System Monitoring with dstat: A Quick Guide

This article introduces dstat, a versatile Linux monitoring tool that consolidates vmstat, iostat, and ifstat functions, demonstrates its default and customizable outputs, shows how to identify top resource‑consuming processes, and provides simple installation steps for CentOS.

Linux · Operations · Performance

0 likes · 3 min read

Efficient Ops

Feb 16, 2016 · Operations

Rethinking Platform Capability: A Holistic Model for Business, Development, and Operations

This article explores a new platform capability model that integrates business support, development support, and operational control, arguing that a platform should be treated like a product with clear responsibilities, measurable outcomes, and a closed‑loop approach to sustain long‑term value.

Operations · business support · software engineering

0 likes · 9 min read

Efficient Ops

Feb 15, 2016 · Operations

Can Operations Survive the Cloud Revolution? Strategies for the Next Decade

As cloud computing reshapes IT, traditional operations roles face unprecedented disruption, but by embracing cloud‑focused responsibilities, niche industry needs, or even a complete career pivot, ops professionals can secure their future within the next five to ten years.

IT infrastructure · Operations · career development

0 likes · 9 min read

MaGe Linux Operations

Feb 15, 2016 · Operations

Why Does Redis Return “Server Went Away”? Diagnosing Timeout and TCP Keepalive Issues

This article walks through a real‑world Redis timeout problem, examining client‑side socket settings, server configuration, system metrics, and kernel TCP parameters to pinpoint why connections are dropped and how adjusting timeout and tcp‑keepalive resolves the latency.

Operations · Troubleshooting · tcp-keepalive

0 likes · 6 min read

Efficient Ops

Feb 3, 2016 · Operations

Why Human Errors Still Plague Modern Ops and How to Prevent Them

This article examines recent high‑profile internet outages caused by human error, explores why operations teams are especially prone to mistakes despite automation and standards, and offers practical strategies—such as hiring the right people, fostering safety awareness, and turning professionalism into habit—to reduce future incidents.

Automation · Best Practices · Incident Management

0 likes · 14 min read

Efficient Ops

Feb 3, 2016 · Operations

Putting People First: Building a Human‑Centred Efficient Operations System

This article explores how a people‑centric mindset can transform operations by defining a three‑layer framework, clarifying why human factors matter, and offering concrete process, technology, and organizational practices such as streamlined approval flows, voice‑alert systems, and Docker‑based continuous deployment.

Automation · Operations · people‑centric

0 likes · 12 min read

Efficient Ops

Feb 2, 2016 · Operations

How Ops Professionals Can Boost Happiness and Efficiency: 4 Common Pitfalls and Practical Solutions

This article examines why many operations engineers feel unhappy, identifies four personal‑management problems—over‑pursuing tech, mis‑prioritizing tasks, poor communication, and chronic complaining—and offers concrete, actionable suggestions to improve productivity, satisfaction, and team collaboration.

Operations · communication · personal development

0 likes · 16 min read

Efficient Ops

Feb 2, 2016 · Operations

Unlocking Efficient Operations: 7 Secrets to Happy SysAdmins

This article explores why efficient operations are hard to achieve, identifies common pitfalls such as unclear responsibilities, communication gaps, and resource mismatches, and presents a practical framework—including clear roles, professional processes, and a good service interface—to help operations teams become more effective and satisfied.

Automation · Operations · communication

0 likes · 16 min read

Efficient Ops

Feb 2, 2016 · Operations

Operations 2.0: The Final Opportunity to Transform IT Ops in the Cloud Era

The article argues that traditional IT operations are facing a crisis and proposes Operations 2.0—a service‑oriented, business‑aware model that leverages cloud, open‑source and automation to shift focus from technical output to reliable, value‑adding services, outlining why it is essential and how to implement it.

Automation · IT transformation · Operations

0 likes · 14 min read

Efficient Ops

Jan 28, 2016 · Operations

Unlocking Performance: Practical Strategies for Application and Architecture Optimization

This article explores the benefits and trade‑offs of performance optimization, outlines single‑application and structural optimization approaches, details bottleneck identification methods, common tuning techniques, and illustrates architectural evolution with diagrams to guide effective ops improvements.

Operations · application scaling · architecture

0 likes · 6 min read

21CTO

Jan 28, 2016 · Operations

How to Build High‑Availability Systems: Lessons from a Transaction Platform Evolution

This article shares practical insights on achieving high availability by understanding goals, decomposing requirements, designing resilient architectures, ensuring operability, testing rigorously, and reducing release risk, illustrated through the multi‑stage evolution of a transaction system.

High Availability · Microservices · Operations

0 likes · 14 min read

Architect

Jan 26, 2016 · Operations

Evolution of Image Server Architecture: From Single‑Node to Distributed File System and CDN

The article examines how large‑scale web sites handle massive image resources, tracing the progression from simple single‑machine storage to clustered virtual directories, shared UNC storage, and finally a FastDFS‑based distributed file system combined with CDN acceleration, highlighting the architectural trade‑offs and operational considerations.

CDN · FastDFS · Operations

0 likes · 13 min read

Efficient Ops

Jan 25, 2016 · Operations

Why You Still Need a Dedicated Deployment System Beyond Jenkins

While Jenkins offers powerful deployment plugins, this article explains why a standalone deployment system remains essential for continuous delivery, covering decoupling builds, managing complex environments, supporting varied deployment strategies, enforcing standards, gathering operational data, and enabling service-oriented deployment across teams.

CI/CD · Jenkins · Operations

0 likes · 9 min read

MaGe Linux Operations

Jan 25, 2016 · Operations

How to Transfer a 200GB File Over LAN Fast: Comparing scp, rsync, wget, and bbcp

The article documents a LAN test copying a single 200 GB file with scp, rsync, wget, and bbcp, compares their speeds, provides installation steps for bbcp, explains key options, and concludes which tool works best for large file transfers.

Network Performance · Operations · SCP

0 likes · 6 min read

Node Underground

Jan 19, 2016 · Operations

Why Front‑End Developers Should Care About Docker: A Beginner’s Guide

This article explains how Docker’s build‑ship‑run model bridges front‑end development and containerization, covering Docker’s history, core concepts, a sample Dockerfile for a Node.js app, and practical scenarios where Docker improves environment consistency, resource efficiency, and scalability.

Docker · Operations · containerization

0 likes · 11 min read

Efficient Ops

Jan 18, 2016 · Operations

How Tencent Migrated 200M QQ Users After a Tianjin Explosion

When a massive container explosion threatened Tencent's Tianjin data center, the operations team executed a 24‑hour, cross‑region user migration that moved over 200 million QQ users to Shenzhen and Shanghai without service interruption, showcasing unprecedented disaster‑recovery capabilities.

Disaster Recovery · Large-Scale Migration · Operations

0 likes · 10 min read

21CTO

Jan 18, 2016 · Operations

Why Immutable Infrastructure Is the Future of Reliable Deployments

Immutable Infrastructure treats every server or container as a read‑only unit that is replaced rather than modified, offering repeatable configuration, faster CI/CD, easier rollback, and reduced operational complexity, while requiring stateless applications and automated provisioning templates to succeed.

Automation · Deployment · Operations

0 likes · 9 min read

Qunar Tech Salon

Jan 16, 2016 · Backend Development

From Zero to One: The Evolution of WeChat’s Backend System Architecture

This article chronicles the two‑month development of WeChat’s backend from its inception, detailing the design of its message model, data‑sync protocol, three‑tier architecture, asynchronous queues, rapid scaling, platformization, multi‑data‑center deployment, disaster‑recovery strategies, performance optimizations, security hardening, and emerging resource‑scheduling challenges.

Data synchronization · Operations · WeChat

0 likes · 28 min read

Efficient Ops

Jan 13, 2016 · Operations

Incremental vs Full Deployment: Which Strategy Wins for Modern Ops?

The article examines the trade‑offs between incremental and full deployment, outlining their workflows, advantages, and challenges, and concludes that full deployment is generally preferable for stateless units while incremental methods remain useful for stateful components like databases.

Deployment · Operations · full deployment

0 likes · 9 min read

Efficient Ops

Jan 6, 2016 · Operations

How Natural Cooling Can Cut Data Center Energy Costs by Over 20%

This article explains China's green data‑center policies, the importance of PUE, and demonstrates through calculations and real‑world Dalian case studies how natural cooling can halve cooling energy use, lower PUE from 2.5 to 2.0, and save millions in electricity bills.

Data Center · Operations · PUE

0 likes · 11 min read

21CTO

Jan 6, 2016 · Backend Development

Essential Best Practices for Accurate HTTP Load Testing

This article outlines ten practical guidelines—ranging from test environment consistency and dedicated hardware to network capacity checks, OS tuning, realistic workloads, proper test duration, and comprehensive result reporting—to ensure reliable and reproducible HTTP server performance benchmarks.

Operations · Performance · backend

0 likes · 13 min read

Java High-Performance Architecture

Jan 5, 2016 · Operations

How to Automate Daily Nginx Log Rotation with a Bash Script

This guide explains how to automatically rotate Nginx logs each day by renaming the current log file at midnight, creating dated backup directories, and signaling the Nginx master process with USR1 using a simple Bash script scheduled via a cron job.

Nginx · Operations · bash

0 likes · 2 min read

Java High-Performance Architecture

Jan 5, 2016 · Operations

How Service Degradation Keeps E‑commerce Platforms Stable During Traffic Surges

The article explains why service degradation is essential for large‑scale shopping events, outlines its different dimensions such as page, business module, and remote service downgrade, and describes both manual and automatic implementation methods to maintain system availability under heavy load.

E‑Commerce · Operations · service degradation

0 likes · 3 min read
