Tagged articles
137 articles
MaGe Linux Operations
Dec 5, 2019 · Operations

When Alipay Crashed: Lessons on High Availability and Disaster Recovery

On December 5, Alipay suffered a brief outage that sent users into a panic. The article offers a humorous account of personal losses, shares meme images, and stresses the critical importance of high‑availability architecture and disaster‑recovery planning for large‑scale financial services.

Alipay outage · Financial Services · Operations
3 min read
JD Retail Technology
Oct 22, 2019 · Industry Insights

How JD.com Prepares Its Systems for 11.11: Stress Tests, Forcebot Evolution, and Quality Controls

JD.com's Retail Technology and Data Platform orchestrated a full‑chain, four‑entry‑point stress test for the 11.11 shopping festival, introduced an upgraded Forcebot traffic‑recording tool, and implemented a "Quality Month" with ten safeguards to ensure system stability and prevent incidents during the massive sales event.

DevOps · Operations · e‑commerce
7 min read
ITFLY8 Architecture Home
Aug 16, 2019 · Operations

Building Scalable Degradation Plans: Lessons from Tong‑Cheng Yilong

At QCon Beijing 2019, senior architect Wang Junxiang shared Tong‑Cheng Yilong’s end‑to‑end degradation‑plan architecture, covering system design, data collection, metric computation, resource recovery, link‑level pre‑plan management, fault diagnosis, strategy extensibility, and high‑availability platform construction, offering practical insights for complex distributed systems.

Distributed Systems · degradation · high availability
4 min read
Alibaba Cloud Developer
Jun 18, 2019 · Operations

Why Designing for Failure Is the Key to Resilient Systems

The article explains how anticipating and engineering for diverse failure scenarios—from hardware faults and software bugs to traffic spikes and external attacks—can dramatically improve system reliability, reduce downtime, and protect business continuity in modern distributed and cloud environments.

disaster recovery · failure design · monitoring
12 min read
Efficient Ops
May 14, 2019 · Operations

How to Master Multi‑Cloud Operations: Lessons from a Gaming Company’s Hybrid Architecture

This talk shares a senior director’s experience building a hybrid multi‑cloud infrastructure for a game company, covering stability, efficiency, cost challenges, design‑for‑failure principles, standardization, resource automation, and the cultural and organizational factors that affect successful cloud operations.

Cost Optimization · DevOps · Operations
20 min read
Efficient Ops
May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

Operations · PHM · aiops
18 min read
21CTO
Apr 3, 2019 · Operations

Why Software Quality Fails: Black Swans, Butterfly Effects, and Technical Debt

The article explores how unpredictable black‑swan events, the butterfly effect, Murphy's law, rapid business growth, technical debt, tool choices, complex domains, documentation, and leadership all combine to threaten software stability, and proposes agile, systematic, and quality‑centric approaches to mitigate these risks.

Software Engineering · Software quality · Technical Debt
22 min read
Architecture Digest
Mar 26, 2019 · Operations

Didi's Full‑Chain Load Testing Architecture and Implementation

The article details Didi's end‑to‑end load‑testing strategy—including online environment testing, data isolation with virtual orders, trace‑based traffic marking, and a distributed virtual driver/passenger tool—describing its design, deployment stages, findings, and future reliability applications.

Data Isolation · Didi · Load Testing
12 min read
Qunar Tech Salon
Feb 19, 2019 · Operations

Forbidden City Night Festival Ticketing Chaos and How to Recover a Crashed Website

The article recounts the Forbidden City’s first night‑time Lantern Festival event, the overwhelming demand that caused the museum’s ticketing website to crash, and includes an interview with a senior operations engineer who explains the causes of such overloads and outlines rapid mitigation and scaling strategies.

Operations · scaling · system reliability
6 min read
Alibaba Cloud Developer
Nov 6, 2018 · Operations

How Alibaba Scaled Double 11: Lessons from a Decade of E‑commerce Mega‑Events

From its humble 2009 launch to the 2018 tenth anniversary, Alibaba’s Double 11 shopping festival evolved through relentless technical challenges—system crashes, CDN bottlenecks, over‑selling bugs, and massive load‑testing innovations—offering a decade‑long case study in operations, scalability, and resilience for large‑scale e‑commerce platforms.

Load Testing · Operations · Scalability
16 min read
Alibaba Cloud Developer
Nov 5, 2018 · Operations

How Alibaba Conquered Double 11: A Decade of Scaling, Crises, and Lessons

From the humble 2009 launch of Double 11 to the massive, cloud-native, multi-region architecture of 2018, Alibaba’s engineers chronicle yearly technical hurdles—traffic spikes, system crashes, CDN limits, over-selling, and the evolution of stress-testing, capacity planning, and operational safeguards that turned the shopping festival into a global engineering showcase.

Operations · Performance Testing · Scalability
17 min read
Mike Chen's Internet Architecture
Aug 7, 2018 · Backend Development

Understanding Distributed Caching: Use Cases, Memcached vs Redis Comparison, and Common Challenges

This article explains why distributed caching is essential for high‑concurrency systems, outlines typical use cases, compares Memcached and Redis across features and performance, and discusses common problems such as cache avalanche, penetration, warm‑up, update strategies, and degradation.

Backend Performance · Memcached · caching strategies
8 min read
360 Tech Engineering
Aug 3, 2018 · Operations

Guidelines for Building Long‑Lived, Stable Systems: Goals, Practices, and Continuous Improvement

This article shares practical methodologies for designing, deploying, and maintaining systems that can reliably operate for ten years, covering goal setting, holistic design considerations, carrier and data‑center choices, active‑active architecture, server and platform selection, monitoring, and continuous personal improvement.

best practices · system reliability
6 min read
360 Zhihui Cloud Developer
Aug 2, 2018 · Operations

How to Build Systems That Run Stably for 10 Years

This article shares practical methodologies for building software systems that remain stable for a decade, covering goal setting, holistic design, carrier and data‑center choices, cross‑region active‑active challenges, server and platform selection, comprehensive monitoring, and the importance of continuous personal improvement.

Continuous Improvement · Operations · Software Architecture
7 min read
Meituan Technology Team
Apr 19, 2018 · Operations

How Meituan‑Dianping Built a 100% High‑Availability Core Transaction System

This article analyzes the rapid growth challenges of Meituan‑Dianping's core payment flow, explains key availability metrics such as MTBF and MTTR, and presents a comprehensive set of architectural, operational, and tooling strategies—including dependency decoupling, timeout tuning, circuit breaking, and full‑link stress testing—to achieve stable, fault‑tolerant transactions.

Microservices · Operations · circuit breaker
20 min read
Baidu Intelligent Testing
Apr 16, 2018 · Operations

Online Load‑Testing Practices for Baidu Nuomi Marketing Activities

This article presents a comprehensive case study of Baidu Nuomi's online load‑testing methodology for high‑traffic marketing events, covering capacity estimation, test planning, execution, anti‑attack measures, platform architecture, and lessons learned to ensure system reliability and performance under peak loads.

Load Testing · capacity planning · online testing
16 min read
ITFLY8 Architecture Home
Mar 24, 2018 · Operations

How Service Degradation and Fault‑Tolerance Keep Large‑Scale Systems Resilient

This article explains how setting low timeouts for non‑core services, decoupling and physically isolating micro‑services, separating light and heavy workloads, and implementing automated configuration checks together enhance system reliability and reduce both technical and human errors in high‑traffic environments.

Configuration Management · fault tolerance · system reliability
9 min read
Efficient Ops
Jan 31, 2018 · Operations

85 Essential Ops Rules Every Engineer Should Follow

This article presents a comprehensive list of 85 practical operations rules covering capacity planning, monitoring, automation, security, documentation, budgeting, team management, and incident handling, offering actionable guidance for building reliable, scalable, and efficient IT infrastructure.

IT Management · Operations · best practices
20 min read
Efficient Ops
Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Operations · SRE · incident management
7 min read
Architecture Digest
Oct 13, 2017 · Operations

Load Balancing, Reverse Proxy, and Isolation Techniques

This article explains how load balancing and reverse proxy mechanisms such as Nginx, Consul, and Hystrix work together with various isolation strategies—including thread, process, cluster, data‑center, and resource isolation—to improve system reliability and scalability in large‑scale web architectures.

Consul · Hystrix · Isolation
10 min read
21CTO
Apr 16, 2017 · Operations

Which Load‑Balancing Strategy Guarantees the Highest Reliability?

This article explains common load‑balancing strategies—round‑robin, random, minimum response time, minimum concurrency, and hash—detailing their principles, advantages, drawbacks, and mathematical reliability analysis, including probability formulas and visual illustrations to help choose the most fault‑tolerant approach for distributed systems.

Distributed Systems · Round Robin · load balancing
9 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Operations · capacity planning · e‑commerce
18 min read
Tencent Cloud Developer
Feb 10, 2017 · Backend Development

Design and Implementation of QQ Game Spring Festival Red Packet System: Resilience, Overload Protection, and Monitoring

The QQ Game Spring Festival Red Packet system was engineered with multi‑data‑center deployment, global load balancing, layered overload protection, flexible critical‑path redundancy, three‑dimensional monitoring, and extensive rehearsal testing, delivering high‑availability and fault‑tolerant service even under extreme traffic spikes.

overload-protection · system reliability
16 min read
360 Zhihui Cloud Developer
Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Operations · incident response · monitoring
5 min read
ITFLY8 Architecture Home
Oct 12, 2016 · Backend Development

How to Implement and Manage Feature Toggles in Java for Scalable Systems

This article explains how to design and operate feature toggles in Java applications, covering single‑instance implementation, cross‑instance synchronization via a meta‑server or Diamond, handling composite switches, avoiding security pitfalls, and automating degradation and upgrade based on runtime metrics.

Auto Scaling · Configuration Management · feature toggle
8 min read
ITFLY8 Architecture Home
Oct 11, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Concurrency Systems Running

This article explores comprehensive service degradation techniques—including automatic and manual switchovers, read/write and multi‑level fallback strategies, and practical examples like timeout, failure count, and traffic throttling—to ensure core functionality remains available during traffic spikes or component failures in high‑concurrency systems.

backend operations · fallback strategies · high concurrency
14 min read
360 Quality & Efficiency
Sep 2, 2016 · Operations

Linux Test Project (LTP): Installation, Usage, and Stress Testing Guide

This article provides a comprehensive guide to the Linux Test Project (LTP), covering its purpose, supported architectures, directory layout, installation steps, test categories, execution scripts, stress‑testing commands, result analysis, and troubleshooting tips for improving kernel stability and reliability.

LTP · Linux · kernel testing
7 min read
Efficient Ops
Jul 27, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Traffic Systems Alive

This article explores practical service‑degradation techniques—including automatic and manual switches, read/write fallback, and multi‑level strategies—to ensure core functionality remains available during traffic spikes, failures, or resource constraints in high‑concurrency systems.

Backend · Operations · fallback strategies
11 min read
Baidu Intelligent Testing
May 9, 2016 · Backend Development

Effective Cache Strategies for High‑Concurrency Systems

The article explains how proper cache usage can dramatically improve resource utilization, response time, and reliability in high‑concurrency front‑end and back‑end systems, while also addressing cache hit‑rate optimization, data consistency, and mitigation techniques for cache penetration and avalanche scenarios.

Cache Consistency · cache-avalanche · cache-penetration
5 min read
Architecture Digest
Apr 8, 2016 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

This article shares the author’s experience building fault‑tolerance for Tencent’s activity operations platform, covering retry strategies, automatic removal of unhealthy machines, timeout tuning, asynchronous processing, anti‑replay mechanisms, service degradation, service decoupling, and business‑level safeguards to reduce manual alarm handling and improve system robustness.

Distributed Systems · Operations · Retry
21 min read
Architect
Jan 22, 2016 · Operations

System Reliability and Availability: Insights from the Alipay Outage and YunOS

The article examines system reliability concepts such as availability, MTBF, MTTR, and outage classifications, analyzes the Alipay service interruption, discusses various redundancy and failover strategies, and explores YunOS reliability testing and design practices to improve overall system robustness.

Availability · MTBF · YunOS
15 min read
Java High-Performance Architecture
Jan 5, 2016 · Operations

How Service Degradation Keeps E‑commerce Platforms Stable During Traffic Surges

The article explains why service degradation is essential for large‑scale shopping events, outlines its different dimensions such as page, business module, and remote service downgrade, and describes both manual and automatic implementation methods to maintain system availability under heavy load.

Operations · e‑commerce · service degradation
3 min read
21CTO
Sep 28, 2015 · Operations

Mastering Log Management: 16 Rules to Boost System Reliability

This article presents a comprehensive set of logging best‑practice rules—from defining log levels and classifications to using RequestIDs, monitoring alerts, and managing log size—aimed at improving system reliability, troubleshooting speed, and operational efficiency.

Debugging · Log Management · Operations
23 min read

Bridging Development and Operations: Challenges and Principles for System Scalability and Reliability

The article examines the differing challenges faced by development and operations teams, explains key concepts of system performance, scalability, stateless design, and session replication, and offers practical principles to align both sides for reliable, cost‑effective software delivery.

session replication · system reliability
8 min read