Tagged articles
3281 articles
21CTO
Apr 10, 2017 · Operations

Alibaba’s Secret to Scaling GitLab: Distributed Sharding and Performance Boosts

This article details how Alibaba Group transformed its GitLab deployment from a single‑node bottleneck into a horizontally scalable, sharded architecture that handles millions of daily requests with high availability, improved performance, and robust data safety.

GitLab · Operations · distributed-systems
0 likes · 15 min read
ITPUB
Apr 4, 2017 · Operations

Real‑World Ops Pitfalls and Proven Ways to Avoid Them

This article compiles practical experiences from system administrators about common operational pitfalls, their root causes, and concrete mitigation steps, ranging from misconfigured HAProxy timeouts and risky rm commands to ansible async quirks and cron‑job failures.

Ansible · DevOps · Linux
0 likes · 8 min read
Efficient Ops
Mar 30, 2017 · Operations

Why Ops Engineers Are Always the Scapegoat—and How to Turn That Into Value

The article reflects on the challenges faced by operations engineers in small companies, illustrating why they often become scapegoats, and offers practical advice on learning, risk control, communication, and disaster‑recovery drills to increase their value and effectiveness.

Operations · learning · risk management
0 likes · 18 min read
dbaplus Community
Mar 29, 2017 · Operations

Why Does Server IO Spike at 3 AM? Diagnose RAID Battery and Self‑Test Issues

This guide explains why server IO utilization spikes above 60% during early‑morning hours, covering hardware self‑test, RAID battery failures, cache policy misconfigurations, and step‑by‑step commands for MegaRAID and HP servers, plus BIOS adjustments and best‑practice recommendations to prevent performance degradation.

Hardware · MegaCli · Operations
0 likes · 16 min read
Alibaba Cloud Developer
Mar 29, 2017 · Operations

How Alibaba Built the ‘Nuclear Weapon’ Full‑Link Stress Test for Double 11

This article chronicles Alibaba's evolution of the full‑link pressure testing platform—from its 2013 inception tackling massive Double 11 traffic, through data construction, isolation, traffic generation, and platform upgrades—to a mature, automated, cloud‑native solution that safeguards large‑scale e‑commerce stability.

Alibaba · Operations · Performance Testing
0 likes · 13 min read
Efficient Ops
Mar 28, 2017 · Operations

How We Scaled Server Authentication with OpenLDAP: A Real‑World Operations Journey

This article walks through a vehicle‑networking company's four‑stage journey—selection, requirement analysis, implementation, and evolution—to replace fragmented SSH passwords with a centralized OpenLDAP authentication platform, covering cost decisions, deployment steps, security hardening, and management automation.

Authentication · OpenLDAP · Operations
0 likes · 13 min read
Baidu Intelligent Testing
Mar 27, 2017 · Operations

Gray Release (Canary Deployment) Strategies and Practices

The article explains gray release as a smooth, risk‑mitigating deployment method, outlines why it is needed, describes its limitations, and compares four practical gray‑release solutions—including code‑level flags, pre‑release machines, SET isolation, and dynamic routing—before recommending a combined approach.

Deployment Strategy · Operations · canary deployment
0 likes · 11 min read
DevOps
Mar 26, 2017 · Operations

DevOps Survey Findings: Adoption Rates, Benefits, Challenges, and Tool Usage

Based on a survey of 300 IT professionals, this report reveals growing DevOps adoption, key motivations such as quality and cost reduction, major obstacles like resource shortages, measurable benefits including cost savings and faster releases, preferred tools, error‑handling practices, and future investment plans.

Challenges · DevOps · Operations
0 likes · 11 min read
MaGe Linux Operations
Mar 23, 2017 · Operations

Why Operations Engineering Is the Hottest Career Path

The article reflects on eight years of operations experience, highlights the bright industry outlook, and outlines four key career paths—operations development, platform R&D, database engineering, and management—showing why skilled ops engineers are increasingly in demand.

IT jobs · Operations
0 likes · 5 min read
DevOps
Mar 21, 2017 · Operations

DevOps Evolution: Software Engineering Development, Transformation Pitfalls, Core Practices, and Ecosystem

This article traces the evolution of software engineering tools leading to DevOps, highlights common transformation pitfalls, outlines core DevOps practices such as autonomous small teams, traceable toolchains, real‑time metrics, and describes the surrounding ecosystem, offering practical guidance for organizations adopting DevOps.

Continuous Delivery · DevOps · Microservices
0 likes · 19 min read
Baidu Intelligent Testing
Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Alerting · Metrics · Operations
0 likes · 12 min read
DevOps
Mar 20, 2017 · Operations

What DevOps Really Is (and Isn’t): History, Principles, Tools, and Culture

This article explains the origins and background of DevOps, clarifies common misconceptions about its role and title, outlines its cultural principles, surveys the essential toolchain, and discusses how organizations can adopt DevOps practices beyond just development and operations.

Continuous Delivery · Culture · DevOps
0 likes · 13 min read
360 Zhihui Cloud Developer
Mar 20, 2017 · Operations

How 360’s DoctorStarange Boosts Ops with AI‑Driven Prediction, Correlation, and Resource Optimization

This article explains how 360’s DoctorStarange system combines time‑series forecasting, neural‑network predictions, alarm correlation, and a machine‑health scoring model to reduce false alerts, automate remediation, and maximize resource utilization across thousands of production servers.

ARIMA · Neural Networks · Operations
0 likes · 14 min read
High Availability Architecture
Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

DevOps · Google · Netflix
0 likes · 6 min read
Efficient Ops
Mar 12, 2017 · Operations

How Tencent Saved 8 Million QQ Users by Migrating Legacy Services

This article recounts how Tencent's operations team tackled the urgent migration of aging data‑center infrastructure to preserve service for 8 million legacy QQ users, detailing the challenges, strategic choices, IP‑level network relocation, and the DevOps practices that ensured a successful cut‑over.

Legacy Migration · Operations · Tencent
0 likes · 15 min read
ITPUB
Mar 9, 2017 · Operations

How the Four‑Eyes Principle Saves IT Ops from Costly Mistakes

The article shares frontline IT operations experiences, emphasizing careful command execution, mandatory operation logs, two‑person verification, and backup strategies to prevent disastrous errors, illustrated by real incidents like a massive Deutsche Bank loss caused by a simple input mistake.

IT best practices · Operations · backup strategy
0 likes · 4 min read
MaGe Linux Operations
Mar 8, 2017 · Operations

Master Linux ‘top’ Command: Real‑Time Process Monitoring Guide

This article explains how to use the Linux top command for real‑time system and process monitoring, covering its interface, statistical and process sections, interactive shortcuts, command‑line options, and internal commands to customize and sort the displayed information.

Operations · process management · system-monitoring
0 likes · 8 min read
DevOps
Mar 5, 2017 · Operations

Controlling Work‑in‑Progress: Delay Start and Focus on Completion

The article explains how to control work‑in‑progress by postponing new starts and concentrating on finishing existing tasks, emphasizing that WIP should be measured in delivered user value rather than task count, and outlines practical control techniques for lean product development.

Kanban · Lean · Operations
0 likes · 7 min read
Architecture Digest
Mar 3, 2017 · Operations

High-Concurrency Architecture: Strategies, Testing, and Practical Solutions

This article outlines the design and implementation of high‑concurrency systems, covering server architecture, load balancing, database clustering, caching strategies, message‑queue based asynchronous processing, static data handling, and operational best practices such as monitoring, redundancy, and automation.

Message Queue · Operations · Server Architecture
0 likes · 18 min read
DevOps
Feb 28, 2017 · Operations

Designing a Team Kanban Wall and System: Step-by-Step Guide

This article walks readers through a three-step process for designing a team’s Kanban wall and system, teaching how to analyze value streams, select appropriate visual elements, and create a customized board that supports efficient workflow management.

Kanban · Operations · Process Design
0 likes · 3 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Operations · capacity planning · e‑commerce
0 likes · 18 min read
Efficient Ops
Feb 26, 2017 · Operations

How Alibaba Scales Massive Data Platforms: Lessons in Automated Operations

This article explores the challenges of operating Alibaba's large‑scale data platforms, describes the automation platform built to address them, and shares data‑driven, fine‑grained operational practices that enable stable, efficient, and cost‑effective service delivery.

Big Data · Operations · Scalability
0 likes · 22 min read
DevOps
Feb 23, 2017 · Operations

Comparing ITIL and DevOps: Principles, Automation, and Integration Models

The article examines the conflict and convergence between ITIL and DevOps in modern operations, outlining DevOps principles, automation in deployment and operations, and three integration models that balance management and execution, while highlighting the distinct values and scenarios for each approach.

Continuous Delivery · DevOps · ITIL
0 likes · 12 min read
Efficient Ops
Feb 21, 2017 · Mobile Development

How Alibaba Scales Mobile App Ops: Gray Release, Monitoring, and Rapid Fixes

This article details Alibaba's mobile app operational practices, covering the challenges of client-side maintenance, their high‑frequency release pipeline, gray‑release mechanisms, monitoring, trace systems, remote logging, and rapid issue resolution to ensure stability and performance at massive scale.

Mobile · Operations · gray release
0 likes · 21 min read
Ctrip Technology
Feb 16, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

The article presents a comprehensive, application‑centric approach to automated capacity management that analyzes why server utilization is low, defines safe usage thresholds, describes a load‑balancer‑driven stress‑testing workflow with regression modeling, and explains how this practice improves resource efficiency, cost savings, and developer‑ops collaboration.

DevOps · Operations · Performance Testing
0 likes · 14 min read
Efficient Ops
Feb 15, 2017 · Operations

Mastering the One‑Second Rule: Boost Mobile User Experience

This article explains how mobile network characteristics, the one‑second rule, and targeted optimizations in access scheduling, protocols, and business logic can dramatically improve download success, startup speed, and overall user experience for mobile services.

Mobile · Operations · network
0 likes · 24 min read
Qunar Tech Salon
Feb 14, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

This article explains how to automate application‑centric capacity assessment, identify the safe utilization thresholds, use load‑balancer‑driven stress testing and regression modeling to pinpoint resource bottlenecks, and improve server usage while maintaining service reliability through close DevOps collaboration.

DevOps · Operations · Performance Testing
0 likes · 15 min read
转转QA
Feb 13, 2017 · Databases

Redis Connection Pool Saturation: A Debugging Tale

A developer recounts how a Redis connection pool overflow across dozens of clusters was traced to a single misbehaving service, diagnosed with netstat and ps commands, and resolved by adjusting configuration and stopping the offending process, illustrating practical troubleshooting of connection limits.

Connection Pool · Operations · monitoring
0 likes · 4 min read
Efficient Ops
Feb 9, 2017 · Operations

Automating Application‑Based Capacity Management to Boost Resource Utilization

This article explains how to automate capacity management focused on application performance, identifies common causes of low resource utilization, proposes safe utilization thresholds, describes a testing framework that uses load‑balancer weighting and real‑time monitoring to pinpoint bottlenecks, and outlines how ops and developers can collaborate to improve efficiency.

Operations · Performance Testing · automation
0 likes · 18 min read
Efficient Ops
Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems
0 likes · 25 min read
21CTO
Feb 2, 2017 · Operations

What GitLab’s 300 GB Data Loss Teaches About Backup and Ops Discipline

The GitLab production database was mistakenly deleted during a manual fix, exposing gaps in backup strategies, PostgreSQL configuration, and operational practices, and prompting a detailed post‑mortem that highlights the need for automated recovery, proper tooling, and transparent incident handling.

Data loss · Database Backup · Operations
0 likes · 15 min read
Efficient Ops
Jan 24, 2017 · Databases

Essential DBA Holiday Checklist: Keep Your Databases Safe During Chinese New Year

This guide outlines the critical tasks DBA teams should perform before, during, and after the Chinese New Year holiday, including daily security practices, pre‑holiday inspections, on‑call rotations, post‑holiday reviews, and detailed checklist scripts to ensure database reliability and prevent incidents.

DBA · Holiday · Operations
0 likes · 13 min read
Efficient Ops
Jan 22, 2017 · Operations

What 2016 Ops Teams Learned About Monitoring Tools and Alert Patterns

The 2016 Ops Alert Report reveals Zabbix’s dominance, preferred notification channels, monthly and daily alert trends, peak alert times, regional distribution, and quirky usage statistics, offering valuable insights for operations teams to optimize monitoring and incident response.

Operations · Zabbix · alerts
0 likes · 5 min read

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Grafana · InfluxDB · Operations
0 likes · 12 min read
MaGe Linux Operations
Jan 8, 2017 · Operations

Master Ansible: From Basics to Advanced Modules for Efficient Operations

This guide introduces Ansible for operations, covering its core features, installation, host preparation, key management, essential modules, playbook structure, YAML syntax, handlers, tags, variables, templates, loops, and conditional execution, with practical command examples and visual illustrations.

Ansible · Configuration Management · DevOps
0 likes · 8 min read
Efficient Ops
Jan 8, 2017 · Operations

Why Global Server Load Balancing (GSLB) Is Hard: Technical Challenges and Solutions

This article explains what GSLB (Global Server Load Balancing) is, why achieving high availability, low latency, and accurate traffic distribution is difficult due to DNS limitations, caching, and routing constraints, and explores architectural and network‑level techniques such as feedback loops, anycast, and BGP routing to mitigate these challenges.

Anycast · DNS · GSLB
0 likes · 16 min read
DevOps
Jan 4, 2017 · Operations

The Third Way of DevOps: Continuous Learning and Docker as Lab Equipment

The article explains the Third Way of DevOps—continuous learning through Kaizen and the PDSA cycle—showing how Docker serves as laboratory equipment that enables rapid, reproducible experiments, illustrated with examples from a financial institution and a personal baseball‑statistics project.

DevOps · Docker · Lean
0 likes · 8 min read
21CTO
Jan 4, 2017 · Operations

How to Build Truly High‑Availability Systems: Principles and Practices

This article explains what high availability means for distributed systems, outlines common availability tiers, and describes how redundancy, load balancing, and automatic failover across a typical Internet architecture can achieve reliable, scalable services.

Distributed Systems · Operations · Reliability
0 likes · 6 min read
DevOps
Jan 3, 2017 · Operations

Applying the DevOps “Second Way” with Docker: Accelerating Feedback Loops

This article explains the DevOps “Second Way,” emphasizing faster, bidirectional feedback loops, and shows how Docker’s immutable containers, streamlined packaging, and embedded metadata reduce variation, accelerate defect detection, and shorten lead times in service delivery.

Continuous Delivery · DevOps · Docker
0 likes · 7 min read
Efficient Ops
Dec 29, 2016 · Operations

Meet the Top Operations Experts of 2016: Profiles and Must‑Read Articles

This article introduces the standout operations professionals featured by the Efficient Ops community in 2016, summarizing each expert’s background, key achievements, and a curated list of their most influential technical articles for readers seeking deep insights into modern ops practices.

Operations · automation · cloud computing
0 likes · 12 min read
Efficient Ops
Dec 28, 2016 · Operations

Transforming Financial Application Operations: Lessons from a European Rollout

This article shares a detailed case study of how a financial services team restructured European application operations, applied lean retrospectives, built a top‑down monitoring system, and introduced systematic stakeholder collaboration to dramatically improve incident response, system robustness, and user satisfaction.

DevOps · Operations · application monitoring
0 likes · 14 min read
ITFLY8 Architecture Home
Dec 27, 2016 · Operations

How Dangdang Scaled Its E‑Commerce Platform for 10× Traffic Peaks

This article details Dangdang's 15‑year evolution from a monolithic system to a distributed, SOA‑based architecture, outlining the challenges of high‑traffic e‑commerce events and the strategies—system grading, decoupling, asynchronous processing, batching, and rate limiting—used to achieve reliable, scalable operations.

Operations · SOA · e‑commerce
0 likes · 19 min read
Efficient Ops
Dec 26, 2016 · Operations

How Tencent Scaled Social Data Storage While Cutting Costs

Facing massive user growth, Tencent’s social network team redesigned its KV storage architecture—introducing CKV and Grocery, automating capacity planning, data migration, and backup reuse—to dramatically lower costs, improve operational efficiency, and maintain high service quality across millions of devices.

Cost Optimization · Operations · Scalability
0 likes · 21 min read
Alibaba Cloud Developer
Dec 26, 2016 · Operations

How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba · Operations · Real-Time
0 likes · 18 min read
Efficient Ops
Dec 21, 2016 · Operations

Measure Your Continuous Delivery Maturity with a 47‑Item Checklist

Learn how to assess your Continuous Delivery maturity using a 47‑item checklist, understand its purpose for aligning goals, improving processes, and boosting value delivery, and calculate your score as a percentage to guide technical and organizational improvements.

Operations · maturity checklist · software delivery
0 likes · 2 min read
Efficient Ops
Dec 19, 2016 · Operations

What 16 Major 2016 Outages Teach Us About Disaster Recovery

This article reviews sixteen notable 2016 service outages across finance, cloud, and entertainment, analyzes their causes—ranging from power failures to DDoS attacks—and highlights the critical need for robust disaster‑recovery and information‑security practices.

Operations · incident management · information security
0 likes · 11 min read
DevOps
Dec 18, 2016 · Operations

Introduction to DevOps and Docker: Concepts, Components, and Implementation

This article explains the principles of DevOps, its technical, process, and organizational considerations, and introduces Docker as a key tool, detailing its architecture, components, native utilities, suitable scenarios, and how it enables continuous integration, delivery, and efficient operations.

Cloud Native · DevOps · Docker
0 likes · 14 min read
DevOps
Dec 13, 2016 · Operations

DevOps Is Not About Automation Tools, But They Are a Prerequisite

DevOps is a methodology that emphasizes collaboration between development and operations to accelerate software delivery, and while tools alone don’t constitute DevOps, automation and container technologies are essential prerequisites that reduce manual hand‑offs, enable self‑service, and improve feedback loops.

Continuous Delivery · DevOps · Operations
0 likes · 7 min read
DevOps
Dec 11, 2016 · Operations

The Evolution of DevOps: From Agile Foundations to CALMS, Containerization, and Enterprise Best Practices

From its origins at the 2008 Agile conference to the modern CALMS framework, this article traces DevOps’s evolution, compares traditional, DevOps 1.0 and 2.0 approaches, and outlines key Chinese practices such as containers, continuous deployment, micro‑services, and enterprise best‑practice recommendations.

CALMS · Continuous Delivery · DevOps
0 likes · 11 min read
ITFLY8 Architecture Home
Dec 8, 2016 · Operations

How CAT Enables Scalable Real‑Time Monitoring for Distributed Systems

This article introduces CAT, an open‑source Java‑based distributed real‑time monitoring platform, detailing its design goals, architecture, message processing pipeline, logging instrumentation, API, real‑time analysis, report modeling, storage challenges, and key takeaways for building highly available, scalable monitoring solutions.

Distributed Monitoring · Operations · System Architecture
0 likes · 13 min read
Alibaba Cloud Developer
Dec 7, 2016 · Operations

How Alibaba Automates Its Network for Double 11 Traffic Surges

This article outlines Alibaba researcher Zhang Ming’s presentation on the network automation system that enables Alibaba’s infrastructure to handle the massive traffic and rapid fault recovery required during the Double 11 shopping festival, highlighting the challenges, detection methods, and automated tools used across routers, switches, and L4‑L7 devices.

Alibaba · Operations · fault detection
0 likes · 3 min read
Efficient Ops
Dec 5, 2016 · Operations

From PHP Monolith to Java Microservices: Mogujie's Ops Evolution and Lessons

This article recounts Mogujie's journey from a small PHP‑based LNMP stack to a Java‑driven micro‑service architecture, detailing the operational challenges, standardization efforts, continuous integration pipeline, and full‑link tracing techniques that enabled scalable, reliable e‑commerce services.

Full‑Link Tracing · Java migration · Operations
0 likes · 17 min read
Efficient Ops
Dec 4, 2016 · Operations

How Ctrip Built a Seamless Multi‑Region Dual‑Active Call Center

This article details Ctrip's evolution from a single‑site call‑center to a fully dual‑active, multi‑region architecture, covering the overall system design, public network, application, and client layers, unified login mechanisms, heartbeat monitoring, and future software‑only and mobile‑first directions.

Dual-Active · Operations · SRE
0 likes · 27 min read
Qunar Tech Salon
Dec 1, 2016 · Backend Development

How to Prevent Service Failures: Suspect Third‑Party, Guard Users, and Perfect Your Own Service

The article shares practical strategies for preventing service failures by doubting third‑party services, protecting against misuse by consumers, and improving one’s own code and architecture, covering fallback plans, timeout settings, retry policies, API design, traffic control, and resource limits.

API-design · Operations · Reliability
0 likes · 16 min read
Efficient Ops
Nov 27, 2016 · Operations

When Ops Heroes Burn Out: Tackling Personal Heroism in Operations

The article explores personal heroism in operations, defining it as reliance on individual effort to keep flawed systems appearing normal, examines its short‑term benefits and long‑term drawbacks for companies, teams, and the heroes themselves, and offers practical strategies to eliminate this risky mindset.

Operations · SLA · Team Culture
0 likes · 10 min read
dbaplus Community
Nov 23, 2016 · Operations

How to Rapidly Deploy DCOS Services with Ansible and Docker

This guide walks through an automated, fast‑track deployment of DCOS components—including service selection, Docker‑based containers, host initialization, system checks, Ansible provisioning, Consul service discovery, HAProxy load balancing, MySQL HA, and Zookeeper/Marathon integration—providing concrete commands, configuration snippets, and practical tips.

Ansible · Consul · DCOS
0 likes · 12 min read
Efficient Ops
Nov 21, 2016 · Operations

7 Proven Bandwidth Optimization Strategies to Cut Social Platform Costs by 2 Billion

This article shares Tencent's seven practical bandwidth‑saving techniques—ranging from disabling auto‑play to intelligent pre‑push, file compression, on‑demand usage, segmented download, technical breakthroughs, and content compliance—to dramatically reduce operational costs while maintaining user experience.

Cost reduction · Operations · bandwidth optimization
0 likes · 9 min read
dbaplus Community
Nov 20, 2016 · Operations

Top Insights from the 2016 Global Agile Operations Summit

The 2016 Global Agile Operations Summit in Shanghai concluded with a series of expert sessions covering agile DevOps trends, cloud‑native automation platforms, database performance tuning, container orchestration, and real‑world case studies from leading companies, followed by the award ceremony honoring ten MVPs who drove innovation across operations and infrastructure.

Container · DevOps · MVP
0 likes · 15 min read
Qunar Tech Salon
Nov 18, 2016 · Operations

Design and Implementation of Ctrip's Predictive Outbound Call Platform

This article describes Ctrip's large‑scale predictive outbound call platform, covering its underlying algorithms, SoftPBX integration, system architecture, concurrency enhancements, deployment experience, and measurable improvements in call success rates and agent efficiency.

Operations · call center · outbound algorithm
0 likes · 8 min read
Efficient Ops
Nov 14, 2016 · Operations

What Ancient Medicine Teaches About Modern IT Risk Management

Using the classic tale of Bian Que, this article explains how proactive, mid‑stage, and reactive risk controls in IT operations prevent small issues from becoming catastrophic failures, illustrated with real‑world storage, cloud, and equipment‑selection case studies.

IT infrastructure · Operations · preventive control
0 likes · 7 min read
StarRing Big Data Open Lab
Nov 14, 2016 · Operations

Master Real-Time Hadoop Alerts with Transwarp Manager

Deploying the Transwarp Manager alert system within Hadoop clusters lets operators monitor resource shortages, failures, and health issues in real time. It offers alert browsing, configurable thresholds, and instant email or script notifications, so problems can be identified and resolved before they impact services.

Alert Monitoring · Hadoop · Operations
0 likes · 9 min read
Architecture Digest
Nov 10, 2016 · Operations

Interview with Lu Pengcheng on Mogu Street’s Monitoring System Architecture and Evolution

In this interview, Lu Pengcheng, a platform architect at Mogu Street, discusses the company’s large‑scale e‑commerce architecture, the evolution of its monitoring platform, design choices for high‑availability distributed systems, and future open‑source plans, providing practical insights for engineers and technical managers.

C++ · Distributed Systems · Operations
0 likes · 9 min read
Efficient Ops
Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Operations · SLA · SLI
0 likes · 11 min read
Node Underground
Nov 9, 2016 · Operations

4 Common Node.js Ops Issues and How to Fix Them

This article outlines four frequent Node.js operational problems—memory leaks, CPU bottlenecks, back‑pressure, and security risks—and provides practical solutions such as heap‑dump analysis, CPU profiling, APM monitoring, and using private npm registries with tools like Snyk to secure dependencies.

Node.js · Operations · memory leak
0 likes · 4 min read
ITPUB
Nov 9, 2016 · Operations

Diagnosing and Resolving High CPU Usage in a Linux Gateway Process

This article walks through a real‑world remote debugging session where a high‑CPU issue in a gateway service was reproduced, analyzed with top, gstack, gcore, strace and gdb, and traced to a buffer overflow causing an infinite loop, then fixed.

CPU · Operations · gdb
0 likes · 7 min read
Efficient Ops
Nov 7, 2016 · Operations

How to Train New SREs Effectively: Proven Practices and Playbooks

This article outlines a systematic approach to onboarding and training new Site Reliability Engineers, covering trust building, readiness assessment, diverse learning methods, structured curricula, on‑call milestones, project‑focused work, reverse‑engineering skills, statistical thinking, and improvisation techniques to develop high‑performing SRE teams.

On-Call · Operations · SRE
0 likes · 17 min read
ITPUB
Nov 2, 2016 · Operations

Monitor Linux System Resources with Simple Shell Scripts

This guide shows how to write Bash functions that retrieve process IDs, CPU, memory, file‑descriptor usage, port status, system load and disk space on a Linux server, and how to combine them with conditional checks to generate alerts when thresholds are exceeded.

Linux · Operations · Shell
0 likes · 16 min read

JEN: JD Extended Nginx Platform for Scalable Management and Automation

The article introduces JEN, JD's extended Nginx platform that centralizes configuration, monitoring, traffic splitting, rate limiting and automated operations through a web console and Ansible integration, addressing the complexity, restart requirements, and scaling challenges of large‑scale Nginx deployments.

Configuration Management · Nginx · Operations
0 likes · 14 min read
ITFLY8 Architecture Home
Oct 31, 2016 · Cloud Computing

How Taobao Scaled from LAMP to Cloud: Lessons in Cloud Migration Architecture

This article examines the evolution of Taobao's technical architecture—from a LAMP stack through Oracle‑based mainframes to a cloud‑native platform—highlighting the performance, scalability, and cost challenges of traditional IT and offering best‑practice strategies for migrating enterprise systems to the cloud.

Big Data · Operations · architecture migration
0 likes · 15 min read
Efficient Ops
Oct 29, 2016 · Databases

Why Your System Slows Down: Uncover Hidden Database Bottlenecks

The article explains how unnoticed database issues often cause system slowness, outlines key diagnostic questions for operations teams, and presents a three‑step approach—discover, solve, prevent—to regularly health‑check and optimize databases for reliable performance.

Operations · databases · monitoring
0 likes · 8 min read
Efficient Ops
Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

Operations · SRE · incident management
0 likes · 14 min read
Architecture Digest
Oct 21, 2016 · Operations

Dynamic Configuration Management for Distributed Systems: Concepts, Challenges, and Practices

The article explains the importance of configuration in software, distinguishes static and dynamic configuration, discusses the challenges of managing configuration in large distributed systems, and describes the evolution, design principles, and practical solutions of configuration centers such as Alibaba's Diamond.

Configuration · Diamond · Operations
0 likes · 21 min read
360 Zhihui Cloud Developer
Oct 19, 2016 · Operations

Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

Alerting · Infrastructure · Open-Falcon
0 likes · 7 min read
Efficient Ops
Oct 17, 2016 · Operations

How Shanda Games Built a Scalable Automated Operations System

This article details Shanda Games' journey in designing and implementing a comprehensive automated operations platform—including installation, deployment, security, client and server updates, data analysis, backup, and monitoring—to efficiently manage hundreds of games across diverse hardware and operating systems.

Deployment · Operations · System Design
0 likes · 22 min read