Tagged articles

system reliability

138 articles · Page 2 of 2

Mar 11, 2020 · Operations

How to Elevate Your Monitoring System: Proven Practices from Top DevOps Models

This article explains why modern services depend on highly available, scalable monitoring, outlines a systematic way to assess and improve monitoring capabilities using open‑source tools and the DevOps Capability Maturity Model, and details concrete improvement points across data collection, management, and application.

Observability · Operations · devops

0 likes · 9 min read

MaGe Linux Operations

Dec 5, 2019 · Operations

When Alipay Crashed: Lessons on High Availability and Disaster Recovery

On December 5th, Alipay experienced a brief outage that sent users into a panic, prompting a humorous account of personal losses, meme images, and a reminder of the critical importance of high‑availability architecture and disaster‑recovery planning for large‑scale financial services.

Alipay outage · Disaster Recovery · Operations

0 likes · 3 min read

JD Retail Technology

Oct 22, 2019 · Industry Insights

How JD.com Prepares Its Systems for 11.11: Stress Tests, Forcebot Evolution, and Quality Controls

JD.com's Retail Technology and Data Platform orchestrated a full‑chain, four‑entry‑point stress test for the 11.11 shopping festival, introduced an upgraded Forcebot traffic‑recording tool, and implemented a "Quality Month" with ten safeguards to ensure system stability and prevent incidents during the massive sales event.

Operations · devops · e‑commerce

0 likes · 7 min read

ITFLY8 Architecture Home

Aug 16, 2019 · Operations

Building Scalable Degradation Plans: Lessons from Tong‑Cheng Yilong

At QCon Beijing 2019, senior architect Wang Junxiang shared Tong‑Cheng Yilong’s end‑to‑end degradation‑plan architecture, covering system design, data collection, metric computation, resource recovery, link‑level pre‑plan management, fault diagnosis, strategy extensibility, and high‑availability platform construction, offering practical insights for complex distributed systems.

High Availability · degradation · distributed systems

0 likes · 4 min read

Alibaba Cloud Developer

Jun 18, 2019 · Operations

Why Designing for Failure Is the Key to Resilient Systems

The article explains how anticipating and engineering for diverse failure scenarios—from hardware faults and software bugs to traffic spikes and external attacks—can dramatically improve system reliability, reduce downtime, and protect business continuity in modern distributed and cloud environments.

Disaster Recovery · Failure Design · Monitoring

0 likes · 12 min read

Efficient Ops

May 14, 2019 · Operations

How to Master Multi‑Cloud Operations: Lessons from a Gaming Company’s Hybrid Architecture

This talk shares a senior director’s experience building a hybrid multi‑cloud infrastructure for a game company, covering stability, efficiency, cost challenges, design‑for‑failure principles, standardization, resource automation, and the cultural and organizational factors that affect successful cloud operations.

Hybrid Cloud · Multi-Cloud · Operations

0 likes · 20 min read

Efficient Ops

May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

AIOps · Monitoring · Operations

0 likes · 18 min read

21CTO

Apr 3, 2019 · Operations

Why Software Quality Fails: Black Swans, Butterfly Effects, and Technical Debt

The article explores how unpredictable black‑swan events, the butterfly effect, Murphy's law, rapid business growth, technical debt, tool choices, complex domains, documentation, and leadership all combine to threaten software stability, and proposes agile, systematic, and quality‑centric approaches to mitigate these risks.

Agile · black swan · software engineering

0 likes · 22 min read

Architecture Digest

Mar 26, 2019 · Operations

Didi's Full‑Chain Load Testing Architecture and Implementation

The article details Didi's end‑to‑end load‑testing strategy—including online environment testing, data isolation with virtual orders, trace‑based traffic marking, and a distributed virtual driver/passenger tool—describing its design, deployment stages, findings, and future reliability applications.

Data Isolation · Didi · distributed simulation

0 likes · 12 min read

Qunar Tech Salon

Feb 19, 2019 · Operations

Forbidden City Night Festival Ticketing Chaos and How to Recover a Crashed Website

The article recounts the Forbidden City’s first night‑time Lantern Festival event, the overwhelming demand that caused the museum’s ticketing website to crash, and includes an interview with a senior operations engineer who explains the causes of such overloads and outlines rapid mitigation and scaling strategies.

Operations · scaling · system reliability

0 likes · 6 min read

Alibaba Cloud Developer

Nov 6, 2018 · Operations

How Alibaba Scaled Double 11: Lessons from a Decade of E‑commerce Mega‑Events

From its humble 2009 launch to the 2018 tenth anniversary, Alibaba’s Double 11 shopping festival evolved through relentless technical challenges—system crashes, CDN bottlenecks, over‑selling bugs, and massive load‑testing innovations—offering a decade‑long case study in operations, scalability, and resilience for large‑scale e‑commerce platforms.

Operations · e-commerce · load testing

0 likes · 16 min read

Alibaba Cloud Developer

Nov 5, 2018 · Operations

How Alibaba Conquered Double 11: A Decade of Scaling, Crises, and Lessons

From the humble 2009 launch of Double 11 to the massive, cloud-native, multi-region architecture of 2018, Alibaba’s engineers chronicle yearly technical hurdles—traffic spikes, system crashes, CDN limits, over-selling, and the evolution of stress-testing, capacity planning, and operational safeguards that turned the shopping festival into a global engineering showcase.

Cloud Computing · Operations · e-commerce

0 likes · 17 min read

Mike Chen's Internet Architecture

Aug 7, 2018 · Backend Development

Understanding Distributed Caching: Use Cases, Memcached vs Redis Comparison, and Common Challenges

This article explains why distributed caching is essential for high‑concurrency systems, outlines typical use cases, compares Memcached and Redis across features and performance, and discusses common problems such as cache avalanche, penetration, warm‑up, update strategies, and degradation.

Backend Performance · Memcached · Redis

0 likes · 8 min read

360 Tech Engineering

Aug 3, 2018 · Operations

Guidelines for Building Long‑Lived, Stable Systems: Goals, Practices, and Continuous Improvement

This article shares practical methodologies for designing, deploying, and maintaining systems that can reliably operate for ten years, covering goal setting, holistic design considerations, carrier and data‑center choices, active‑active architecture, server and platform selection, monitoring, and continuous personal improvement.

best practices · system reliability

0 likes · 6 min read

360 Zhihui Cloud Developer

Aug 2, 2018 · Operations

How to Build Systems That Run Stably for 10 Years

This article shares practical methodologies for building software systems that remain stable for a decade, covering goal setting, holistic design, carrier and data‑center choices, cross‑region active‑active challenges, server and platform selection, comprehensive monitoring, and the importance of continuous personal improvement.

Operations · continuous improvement · software architecture

0 likes · 7 min read

JD Retail Technology

Jun 12, 2018 · Operations

JD Life Technology Service Platform's 6.18 Preparation and System Upgrade

This article details how JD's Life Technology Service Platform prepared for the 6.18 shopping festival through system upgrades, stress testing, monitoring enhancements, and emergency drills to ensure high availability, performance, and reliable service for users.

6.18 · High Availability · JD

0 likes · 6 min read

Meituan Technology Team

Apr 19, 2018 · Operations

How Meituan‑Dianping Built a 100% High‑Availability Core Transaction System

This article analyzes the rapid growth challenges of Meituan‑Dianping's core payment flow, explains key availability metrics such as MTBF and MTTR, and presents a comprehensive set of architectural, operational, and tooling strategies—including dependency decoupling, timeout tuning, circuit breaking, and full‑link stress testing—to achieve stable, fault‑tolerant transactions.

High Availability · Microservices · Monitoring

0 likes · 20 min read

Baidu Intelligent Testing

Apr 16, 2018 · Operations

Online Load‑Testing Practices for Baidu Nuomi Marketing Activities

This article presents a comprehensive case study of Baidu Nuomi's online load‑testing methodology for high‑traffic marketing events, covering capacity estimation, test planning, execution, anti‑attack measures, platform architecture, and lessons learned to ensure system reliability and performance under peak loads.

Online Testing · capacity planning · load testing

0 likes · 16 min read

ITFLY8 Architecture Home

Mar 24, 2018 · Operations

How Service Degradation and Fault‑Tolerance Keep Large‑Scale Systems Resilient

This article explains how setting low timeouts for non‑core services, decoupling and physically isolating micro‑services, separating light and heavy workloads, and implementing automated configuration checks together enhance system reliability and reduce both technical and human errors in high‑traffic environments.

configuration management · fault tolerance · system reliability

0 likes · 9 min read

Efficient Ops

Jan 31, 2018 · Operations

85 Essential Ops Rules Every Engineer Should Follow

This article presents a comprehensive list of 85 practical operations rules covering capacity planning, monitoring, automation, security, documentation, budgeting, team management, and incident handling, offering actionable guidance for building reliable, scalable, and efficient IT infrastructure.

IT Management · Operations · best practices

0 likes · 20 min read

Efficient Ops

Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Incident Management · Operations · SRE

0 likes · 7 min read

Architecture Digest

Oct 13, 2017 · Operations

Load Balancing, Reverse Proxy, and Isolation Techniques

This article explains how load balancing and reverse proxy mechanisms such as Nginx, Consul, and Hystrix work together with various isolation strategies—including thread, process, cluster, data‑center, and resource isolation—to improve system reliability and scalability in large‑scale web architectures.

Consul · Hystrix · Reverse Proxy

0 likes · 10 min read

ITFLY8 Architecture Home

Sep 30, 2017 · Operations

Why Trust Less? Defensive Strategies for High‑Performance, High‑Availability Systems

The article explores how adopting a "don't trust" mindset—through rigorous input validation, defensive coding, thorough testing, gradual rollouts, and comprehensive monitoring—helps build resilient, high‑performance systems and avoid common pitfalls in development and operations.

Defensive Programming · Monitoring · deployment

0 likes · 13 min read

21CTO

Apr 16, 2017 · Operations

Which Load‑Balancing Strategy Guarantees the Highest Reliability?

This article explains common load‑balancing strategies—round‑robin, random, minimum response time, minimum concurrency, and hash—detailing their principles, advantages, drawbacks, and mathematical reliability analysis, including probability formulas and visual illustrations to help choose the most fault‑tolerant approach for distributed systems.

Round Robin · distributed systems · load balancing

0 likes · 9 min read

Efficient Ops

Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Monitoring · Operations · capacity planning

0 likes · 18 min read

Tencent Cloud Developer

Feb 10, 2017 · Backend Development

Design and Implementation of QQ Game Spring Festival Red Packet System: Resilience, Overload Protection, and Monitoring

The QQ Game Spring Festival Red Packet system was engineered with multi‑data‑center deployment, global load balancing, layered overload protection, flexible critical‑path redundancy, three‑dimensional monitoring, and extensive rehearsal testing, delivering high‑availability and fault‑tolerant service even under extreme traffic spikes.

overload-protection · system reliability

0 likes · 16 min read

360 Zhihui Cloud Developer

Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Monitoring · Operations · incident response

0 likes · 5 min read

ITFLY8 Architecture Home

Oct 12, 2016 · Backend Development

How to Implement and Manage Feature Toggles in Java for Scalable Systems

This article explains how to design and operate feature toggles in Java applications, covering single‑instance implementation, cross‑instance synchronization via a meta‑server or Diamond, handling composite switches, avoiding security pitfalls, and automating degradation and upgrade based on runtime metrics.

Auto Scaling · configuration management · feature toggle

0 likes · 8 min read

ITFLY8 Architecture Home

Oct 11, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Concurrency Systems Running

This article explores comprehensive service degradation techniques—including automatic and manual switchovers, read/write and multi‑level fallback strategies, and practical examples like timeout, failure count, and traffic throttling—to ensure core functionality remains available during traffic spikes or component failures in high‑concurrency systems.

High concurrency · backend operations · fallback strategies

0 likes · 14 min read

360 Quality & Efficiency

Sep 2, 2016 · Operations

Linux Test Project (LTP): Installation, Usage, and Stress Testing Guide

This article provides a comprehensive guide to the Linux Test Project (LTP), covering its purpose, supported architectures, directory layout, installation steps, test categories, execution scripts, stress‑testing commands, result analysis, and troubleshooting tips for improving kernel stability and reliability.

LTP · Linux · kernel testing

0 likes · 7 min read

Efficient Ops

Jul 27, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Traffic Systems Alive

This article explores practical service‑degradation techniques—including automatic and manual switches, read/write fallback, and multi‑level strategies—to ensure core functionality remains available during traffic spikes, failures, or resource constraints in high‑concurrency systems.

High concurrency · Operations · backend

0 likes · 11 min read

Java High-Performance Architecture

May 27, 2016 · Operations

How Twitter Handles Massive Traffic Surges with Stress Testing and Preparedness

Twitter keeps its platform stable during massive traffic spikes by regularly performing large‑scale stress and extreme tests, analyzing performance metrics, and maintaining detailed contingency plans that guide rapid response to unexpected events such as the record‑breaking “Castle in the Sky” incident.

Operations · Twitter · contingency planning

0 likes · 4 min read

Baidu Intelligent Testing

May 9, 2016 · Backend Development

Effective Cache Strategies for High‑Concurrency Systems

The article explains how proper cache usage can dramatically improve resource utilization, response time, and reliability in high‑concurrency front‑end and back‑end systems, while also addressing cache hit‑rate optimization, data consistency, and mitigation techniques for cache penetration and avalanche scenarios.

Cache Avalanche · Cache Penetration · Cache consistency

0 likes · 5 min read

Architecture Digest

Apr 8, 2016 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

This article shares the author’s experience building fault‑tolerance for Tencent’s activity operations platform, covering retry strategies, automatic removal of unhealthy machines, timeout tuning, asynchronous processing, anti‑replay mechanisms, service degradation, service decoupling, and business‑level safeguards to reduce manual alarm handling and improve system robustness.

Operations · distributed systems · fault tolerance

0 likes · 21 min read

Architect

Jan 22, 2016 · Operations

System Reliability and Availability: Insights from the Alipay Outage and YunOS

The article examines system reliability concepts such as availability, MTBF, MTTR, and outage classifications, analyzes the Alipay service interruption, discusses various redundancy and failover strategies, and explores YunOS reliability testing and design practices to improve overall system robustness.

Cloud Computing · Disaster Recovery · MTBF

0 likes · 15 min read

Java High-Performance Architecture

Jan 5, 2016 · Operations

How Service Degradation Keeps E‑commerce Platforms Stable During Traffic Surges

The article explains why service degradation is essential for large‑scale shopping events, outlines its different dimensions such as page, business module, and remote service downgrade, and describes both manual and automatic implementation methods to maintain system availability under heavy load.

Operations · e-commerce · service degradation

0 likes · 3 min read

21CTO

Sep 28, 2015 · Operations

Mastering Log Management: 16 Rules to Boost System Reliability

This article presents a comprehensive set of logging best‑practice rules—from defining log levels and classifications to using RequestIDs, monitoring alerts, and managing log size—aimed at improving system reliability, troubleshooting speed, and operational efficiency.

Logging · Monitoring · Operations

0 likes · 23 min read

Art of Distributed System Architecture Design

Jun 3, 2015 · Operations

Bridging Development and Operations: Challenges and Principles for System Scalability and Reliability

The article examines the differing challenges faced by development and operations teams, explains key concepts of system performance, scalability, stateless design, and session replication, and offers practical principles to align both sides for reliable, cost‑effective software delivery.

session replication · system reliability

0 likes · 8 min read
