Tagged articles
137 articles
Architect
Mar 28, 2026 · Artificial Intelligence

Why AI Agents Need a Harness: From Model Power to System Reliability

The article analyzes how the growing strength of large language models shifts engineering bottlenecks from model capabilities to system stability, introducing the concept of a "Harness" that integrates models into real‑world workflows through state management, constraints, feedback loops, and verification mechanisms.

AI Engineering · AI Ops · Agent Harness
0 likes · 18 min read
FunTester
Feb 10, 2026 · Operations

Why Performance Testing Matters and How to Get Started: A Step‑by‑Step Guide

This article explains what performance testing is, why it’s essential for preventing system crashes under load, and provides a practical, step‑by‑step roadmap—including goal definition, test types, tool selection, metric interpretation, protection mechanisms, and result recording—to help developers and ops teams reliably assess and improve application performance.

Load Testing · Performance Testing · monitoring
0 likes · 13 min read
DevOps Coach
Dec 10, 2025 · Operations

Why Rust Saved Cloudflare’s Edge While a Lua Nil Pointer Crashed It

The article examines Cloudflare’s December 2025 outage caused by a nil‑pointer bug hidden in Lua code, compares how Go and Rust would handle the same scenario, and extracts key operational lessons about global configuration, dynamic language risks, and the safety benefits of strong type systems.

Go · Lua · Rust
0 likes · 8 min read
Efficient Ops
Oct 19, 2025 · Operations

How Ningbo Bank Boosted System Reliability with SRE: Lessons from a 3‑Level Assessment

Ningbo Bank’s personal mobile banking system passed the SRE Level‑3 assessment, showcasing how systematic SRE practices, metric‑driven reliability engineering, and cross‑team collaboration can dramatically improve system stability, reduce failures, and support digital transformation in the financial sector.

Banking Operations · Digital Transformation · IT stability
0 likes · 16 min read
JD Tech
Sep 26, 2025 · Operations

Avoiding High‑Availability Pitfalls: Real‑World JD Lessons and Solutions

This article examines common high‑availability challenges across applications, databases, caches, message queues, containers, and GC, presenting real JD engineering cases, root‑cause analyses, and practical mitigation strategies to help engineers design more resilient systems.

Message Queue · database · fault tolerance
0 likes · 37 min read
Architecture and Beyond
May 10, 2025 · Operations

What Heinrich’s 1:29:300 Rule Reveals About Preventing Online Outages

The article explains Heinrich's Law, its 1:29:300 accident pyramid, and how applying its principles—tracking minor incidents, hidden hazards, and systemic risks—can help software teams anticipate, diagnose, and prevent major online failures through systematic safety management and data‑driven practices.

Heinrich's Law · Operations · incident management
0 likes · 15 min read
FunTester
Apr 12, 2025 · Operations

How to Design Effective Fault‑Testing Cases for Resilient Distributed Systems

This article explains why fault testing is essential for modern distributed and cloud environments, outlines core goals, design principles, common fault categories, practical implementation strategies such as chaos engineering and gray releases, and shows how to analyze results to continuously improve system reliability.

Distributed Systems · chaos engineering · fault testing
0 likes · 18 min read
FunTester
Mar 31, 2025 · Operations

Performance Testing and Fault Testing: Complementary Pillars for System Stability

The article explains how performance testing measures system efficiency under load while fault testing validates resilience under abnormal conditions, highlighting their shared goals, differences, overlapping toolchains, and how their combined use drives architecture optimization and improves service level agreements in modern complex software systems.

Fault Injection · Load Testing · Operations
0 likes · 14 min read
Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · Operations · SRE
0 likes · 6 min read
Bilibili Tech
Mar 18, 2025 · Operations

Technical Practices for Ensuring Stability of Bilibili’s 2025 Spring Festival Gala Live Stream

Bilibili’s engineering team built a scenario‑metadata and one‑click fault‑drill platform, implemented multi‑tier degradation, dynamic capacity planning, and extensive automated fault‑injection testing to guarantee zero‑severity incidents during the high‑traffic 2025 Spring Festival Gala live stream.

Fault Injection · Operations · high concurrency
0 likes · 16 min read
iQIYI Technical Product Team
Mar 13, 2025 · Operations

Automated Load Testing and Circuit Breaker Process for System Stability

To prevent performance degradation as systems scale, the team added an automated load‑testing and circuit‑breaker workflow to the release pipeline. It compares real‑time CPU, QPS, memory, and latency metrics against a baseline, blocks releases that show a drop of more than 10 %, and logs the issues it finds. The workflow has run thousands of tests, led to dozens of bug fixes, and made wordlist creation up to 90 % faster.

Automation · Load Testing · Performance Testing
0 likes · 6 min read
FunTester
Feb 13, 2025 · Operations

Why Fault Testing Is Critical for Modern Online Systems

In today's digital era, online services face increasing fault risks, and systematic fault testing—through chaos engineering, fault injection, stress testing, and disaster recovery drills—helps teams anticipate, evaluate, and improve system resilience, ultimately reducing downtime and protecting business continuity.

Automation · Cloud Native · Operations
0 likes · 9 min read
FunTester
Jan 15, 2025 · Operations

How to Combine Performance Testing and Chaos Engineering for Rock‑Solid Systems

Drawing lessons from the 2021 AWS outage, this article explains how integrating performance testing with fault‑injection (chaos engineering) in microservice and Kubernetes environments can identify bottlenecks, validate resilience, and build a continuous stability strategy that balances speed and reliability.

Kubernetes · Microservices · Operations
0 likes · 13 min read
MaGe Linux Operations
Jan 6, 2025 · Operations

What 2024 Outages Teach Us About Building Resilient Systems

A review of major 2024 service disruptions—from Alibaba Cloud to OpenAI—highlights key lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning to improve system reliability and reduce future downtime.

disaster recovery · outage analysis · system reliability
0 likes · 5 min read
360 Zhihui Cloud Developer
Dec 4, 2024 · Operations

Service Reliability Essentials: Rate Limiting, Circuit Breaking & Degradation

This article explains common service problems and presents practical solutions such as rate limiting, circuit breaking, and degradation, detailing their principles, implementation methods—including Nginx, token‑bucket, sliding‑window algorithms, and Go‑zero code examples—while highlighting key considerations for building resilient microservice systems.

go-zero · rate limiting · service degradation
0 likes · 15 min read
Liangxu Linux
Oct 1, 2024 · Operations

10 Proven Practices to Prevent System Failures for Ops Teams

This guide outlines ten practical strategies—including rollback testing, safe handling of destructive commands, prompt customization, robust backup and verification, production environment discipline, thorough handover, proactive monitoring, cautious auto‑failover, meticulous execution, and simplicity—to help operations engineers dramatically reduce system outages and improve reliability.

Backup · Operations · best practices
0 likes · 17 min read
dbaplus Community
Sep 8, 2024 · Operations

10 Essential Ops Practices to Prevent System Failures

This article compiles ten practical operations‑engineer guidelines—ranging from change rollbacks and safe command aliases to backup verification, monitoring, and cautious automated failover—to help maintain high availability and avoid costly production incidents.

Automation · Linux · monitoring
0 likes · 18 min read
Bilibili Tech
Sep 6, 2024 · Operations

Design and Implementation of a Cross‑Platform Real‑Time Troubleshooting System for Live Streaming

The team built a cross‑platform real‑time troubleshooting system for live streaming. It adds critical‑business monitoring and a unified trace_id‑based tracing framework, simplifies OpenTracing, iterates on the reporting components, handles multi‑threading, and stitches telemetry into searchable event chains. With dashboards on top, diagnosis time fell from two hours to five minutes and the fault‑resolution rate reached 91%.

Distributed Tracing · Performance Monitoring · live streaming
0 likes · 15 min read
DevOps Operations Practice
Sep 2, 2024 · Operations

How a Strong Operations Team Drives Business Success

In the digital era, a capable IT operations team ensures system stability, reduces costs, accelerates issue resolution, strengthens security, supports product development, and improves user experience, making it a critical driver of overall business value.

DevOps · IT Operations · business support
0 likes · 6 min read
Cognitive Technology Team
Aug 25, 2024 · Operations

Fault Isolation Techniques for High Availability in Distributed Systems

The article explains fault isolation as a key technique for improving distributed system availability, detailing multiple isolation levels—from data‑center to user‑level—and complementary strategies such as circuit breakers, timeouts, fast‑fail, load balancing, caching, and degradation switches.

Distributed Systems · Resource Isolation · circuit breaker
0 likes · 10 min read
Liangxu Linux
Aug 24, 2024 · Fundamentals

How a Simple Data‑Type Conversion Bug Sank the Ariane 5 Rocket

The 1996 Ariane 5 launch failed when reused navigation code converted 64‑bit floating‑point velocity data to a 16‑bit signed integer; the resulting overflow disabled the guidance system and led to the rocket's explosion, a failure that highlights critical software engineering lessons.

Ariane 5 · Software Engineering · bug
0 likes · 6 min read
Open Source Linux
Aug 23, 2024 · Operations

10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Backup · Linux · Operations
0 likes · 17 min read
Architect
Aug 6, 2024 · Operations

Handling Interface-Level Failures: Degradation, Circuit Breaking, Rate Limiting, and Queuing

The article explains what interface‑level failures are, why they occur due to internal bugs or external overload, and presents four practical mitigation techniques—degradation, circuit breaking, rate limiting, and queuing—detailing their principles, implementation options, and trade‑offs for reliable system operation.

Queue · circuit breaker · degradation
0 likes · 16 min read
IT Services Circle
Aug 5, 2024 · Fundamentals

Ariane 5 Rocket Explosion Caused by a Software Integer‑Overflow Bug

The 1996 Ariane 5 launch failed and exploded due to a single line of legacy code that caused a 64‑bit floating‑point to 16‑bit signed integer conversion overflow in the guidance system, highlighting the dangers of unchecked code reuse, inadequate error handling, and insufficient testing in critical software.

Software Engineering · data type conversion · integer overflow
0 likes · 6 min read
ITPUB
Jul 2, 2024 · Operations

Why Cost‑Cutting Undermines Tech Reliability: Lessons from Massive Internet Outages

The article examines how unrealistic cost‑reduction targets, ignored expert advice, and short‑term resource cuts have repeatedly caused large‑scale outages in major internet platforms, highlighting the labor‑, knowledge‑, and asset‑intensive nature of technical reliability and proposing sustained, expert‑led planning as a remedy.

IT Management · large-scale systems · system reliability
0 likes · 11 min read
Efficient Ops
Jun 25, 2024 · Operations

Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

SRE · monitoring · system reliability
0 likes · 20 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Availability · Fault Injection · SRE
0 likes · 7 min read
iQIYI Technical Product Team
May 10, 2024 · Operations

Full‑Link Load Testing of iQIYI Playback Service: Process, Tools, and Outcomes

iQIYI implemented full‑link load testing of its playback service, using LoadMaker for traffic generation and Rover for link control. The team mapped the topology, created weighted user scenarios, and safely applied load to production‑like environments. The tests validated capacity at several times the historical peak, uncovered bottlenecks, and enabled several performance and disaster‑recovery improvements without impacting real users.

Load Testing · capacity planning · iQIYI
0 likes · 10 min read
Efficient Ops
Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article explains the definitions of stability and high availability, distinguishes their relationship, outlines key performance indicators, and provides a comprehensive framework—including fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices—to help teams build reliable, highly available systems.

SRE · capacity planning · high availability
0 likes · 10 min read
AntTech
Apr 3, 2024 · Artificial Intelligence

Post‑Mortem of an AI‑Generated Flash‑Sale System Failure at Ant Internal Network

The article analyzes a recent outage of Ant's internal flash‑sale service built with AI‑generated low‑code, explains why the AI‑written business logic was not the cause, details the database capacity bottleneck that triggered a snowball effect, and discusses future automation and operational strategies to prevent similar failures.

AI · Operations · database scaling
0 likes · 12 min read
Efficient Ops
Mar 25, 2024 · Operations

How CAICT’s SRE Standards Strengthen System Reliability and Continuity

This article outlines the rising frequency of system outages, explains the key characteristics and challenges of modern large‑scale distributed systems, introduces China’s CAICT SRE framework and its two‑part reliability model, showcases a successful SRE case, and announces the 2024 SRE maturity assessment program.

Digital Governance · SRE · software reliability
0 likes · 12 min read
Test Development Learning Exchange
Mar 7, 2024 · Operations

Full‑Chain Load Testing: Definition, Challenges, and Best Practices

This article explains the concept of full‑chain load testing for e‑commerce systems, outlines why it is essential, discusses major challenges such as coordination and data isolation, and provides practical steps and optimization strategies to reliably simulate real‑world traffic and improve system stability.

Load Testing · Performance Optimization · e‑commerce
0 likes · 9 min read
ITPUB
Feb 17, 2024 · Operations

Why Ops Professionals Must Look Up: The 4+1+1+1 Framework Explained

The article reflects on the relentless challenges of IT operations, outlines the never‑ending skill gaps, standards, trends and blame, and introduces a 4+1+1+1 model that separates developers, testers, security staff from four core ops responsibilities to guide systematic ops system construction.

4+1+1+1 model · IT ops · Infrastructure Management
0 likes · 6 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

emergency planning · fault handling · incident response
0 likes · 14 min read
Efficient Ops
Jan 23, 2024 · Operations

Why Building Truly High‑Availability Systems Is Harder Than You Think

The article examines why 2023 saw a surge in major online outages, linking layoffs and cost‑cutting to lost expertise, and explores how entropy and Murphy's law make perpetual high availability impossible without continuous, systematic investment and cultural change.

SRE · Technical Debt · high availability
0 likes · 13 min read
DevOps
Jan 12, 2024 · Operations

Why Building a Never‑Failing System Is Impossible and How to Pursue Continuous High Availability

The article analyses why truly never‑failing systems cannot exist—citing entropy and Murphy’s laws—examines the organizational and technical obstacles to continuous high availability, and offers practical cultural and engineering practices such as testing, code review, monitoring, and regular system health checks to mitigate risk.

Murphy's Law · Operations · SRE
0 likes · 14 min read
Huolala Tech
Dec 21, 2023 · Operations

How to Accurately Evaluate and Guarantee System Capacity for High‑Traffic Services

This article explains why capacity assessment and guarantee are essential for high‑traffic services, outlines the core factors influencing system capacity such as thread count, response time, CPU, memory and database resources, presents calculation formulas, describes load‑testing methods, shares practical benchmark results for Tomcat and Undertow, and offers actionable recommendations for improving throughput and stability.

Load Testing · Performance Optimization · backend services
0 likes · 16 min read
21CTO
Dec 6, 2023 · Operations

How One Line of C Code Crippled AT&T’s Network for 9 Hours

A 1990 AT&T network outage caused by an untested C code change led to a nine‑hour service collapse, massive financial loss, and widespread disruption, illustrating how a single software bug can trigger cascading failures in large‑scale telecommunications systems.

AT&T · C programming · network outage
0 likes · 6 min read
Architecture and Beyond
Dec 2, 2023 · Operations

Postmortem Analysis of the Yuque Service Outage and Lessons on Complex Systems and the KISS Principle

The article reviews the October 23 Yuque service outage, analyzes root causes such as a buggy upgrade tool and outdated storage, extracts operational lessons on testing, disaster recovery, high‑availability, communication, and advocates the KISS principle to simplify complex systems for improved reliability.

Complex Systems · KISS principle · Operations
0 likes · 10 min read
Bilibili Tech
Nov 28, 2023 · Operations

Technical Assurance Practices for the 13th League of Legends World Championship Live Stream

For the 13th League of Legends World Championship live stream on Bilibili, a comprehensive technical‑assurance framework—covering pre‑event traffic buildup, in‑event experience, and post‑event replay—mapped over 60 business functions, applied a traffic‑estimation model, executed fault‑injection drills, load tests, strict SOPs and change control, and real‑time monitoring, enabling 120 million viewers and a peak of 460 million concurrent users.

Fault Injection · Operations · Performance Testing
0 likes · 19 min read
Baidu Geek Talk
Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes—such as Gaokao, the Tokyo and Beijing Olympics—by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99 % uptime and near‑real‑time data updates.

backend operations · fault handling · large-scale traffic
0 likes · 9 min read
Efficient Ops
Nov 6, 2023 · Operations

How Beijing Mobile Achieved Leading SRE Maturity: Insights from a Level‑3 Assessment

Beijing Mobile’s Order Center project passed the CAICT’s Level‑3 System Reliability and Continuity Engineering assessment, showcasing how SRE practices, cultural shifts, and tool automation boosted system stability, reduced incidents by 77%, cut recovery time by 54%, and set a benchmark for large‑scale IT operations in China’s telecom sector.

China Mobile · DevOps · Digital Transformation
0 likes · 17 min read
JD Tech
Oct 30, 2023 · Operations

High‑Availability Assurance for E‑Commerce Mega‑Promotion Systems

This article outlines a systematic approach to ensuring high availability for e‑commerce mega‑promotion events, covering historical context, business model analysis, goal setting, strategic planning, tactical execution, and growth, with detailed evaluation of marketing, transaction, fulfillment, and monitoring processes.

Performance Monitoring · e‑commerce · high availability
0 likes · 22 min read
Architecture and Beyond
Oct 29, 2023 · Operations

Postmortem of the October 23 Yuque Service Outage: Lessons on Complex Systems and the KISS Principle

The October 23 Yuque outage, caused by a buggy upgrade tool and outdated storage hardware, highlighted the importance of thorough testing, robust disaster‑recovery, high‑availability architecture, clear communication, continuous learning, and applying the KISS principle to simplify complex systems and improve operational stability.

Complex Systems · KISS principle · Operations
0 likes · 10 min read
Bilibili Tech
Sep 26, 2023 · Backend Development

Applying CQRS Architecture to Live Streaming Room Service: Design, Evolution, and Operational Practices

The live‑streaming room service was re‑architected using CQRS, dividing read‑heavy viewer functions from write‑intensive broadcaster operations, splitting the monolith into focused Go micro‑services, adding multi‑level caching, event‑driven sync, extensive observability, and automated incident‑response to achieve massive scalability and rapid fault recovery.

CQRS · Observability · live streaming
0 likes · 18 min read
Didi Tech
Sep 5, 2023 · Operations

Observability and Stability Engineering in Didi Ride‑Hailing Platform

At Didi, observability and stability engineering combine automated, AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article emphasizes systematic investment and the evolution toward AIOps, and closes with a recruitment call for backend and test engineers.

Didi · Distributed Systems · Observability
0 likes · 16 min read
JD Tech
Aug 28, 2023 · Backend Development

Handling Large Payload Issues in JD Logistics: Causes, Impacts, and Mitigation Strategies

The article analyzes the root causes and system‑wide consequences of oversized messages in JD Logistics, explains middleware limits of JMQ and JSF, and presents design principles, code‑level checks, and practical mitigation techniques such as pagination, claim‑check pattern, batch sizing, and monitoring to prevent service outages.

JMQ · JSF · Message Queue
0 likes · 32 min read
JD Retail Technology
Aug 24, 2023 · Operations

High‑Availability Strategies for E‑commerce Large‑Scale Promotion Systems

This article outlines a comprehensive framework for preparing e‑commerce platforms for major sales events, covering the history of promotions, business models, system chain segmentation, stability goals, strategic planning, tactical measures, growth promotion, and reference resources to ensure high availability and reliable user experience.

e‑commerce · high availability · large‑scale promotion
0 likes · 19 min read
Python Programming Learning Circle
Jul 10, 2023 · Fundamentals

Famous Software Bugs That Shaped History

This article reviews several notorious software bugs—from the Y2K millennium bug and a missile defense timing error that cost lives, to a Mars probe navigation mishap, a false Cold‑War alarm, and a costly Pepsi promotion glitch—illustrating how tiny code flaws can cause massive real‑world consequences.

Software Engineering · Y2K · software bugs
0 likes · 6 min read
Efficient Ops
Jun 25, 2023 · Operations

How to Build a Next‑Gen “Big Operations” System for Reliability and Observability

This article outlines the evolution from manual operations to DevOps and SRE‑driven “big operations,” detailing system reliability and continuity practices, observability concepts, and the development of AIOps maturity standards, offering a comprehensive guide for building stable, efficient, and secure operational frameworks.

DevOps · Observability · Operations
0 likes · 14 min read
dbaplus Community
Jun 5, 2023 · Operations

Mastering Production Faults: Diagnose and Fix Network, Server, Database Issues

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides step‑by‑step methods to detect, troubleshoot, and resolve each problem, helping maintain system stability and reliability.

Operations · Server · database
0 likes · 30 min read
Wukong Talks Architecture
May 17, 2023 · Operations

Common Production Faults and Their Handling Guide

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides detailed steps for detecting, diagnosing, and resolving each type to maintain system stability and reliability.

Operations · fault handling · production
0 likes · 30 min read
DeWu Technology
Apr 26, 2023 · Operations

Stability and Alerting Practices for E‑commerce Order Submission Service

The article details how a high‑throughput e‑commerce checkout pipeline achieves stability by combining fine‑grained metrics, custom trace logs, version‑based data validation, and targeted alert rules that detect latency spikes, error‑code surges, and downstream service failures, enabling rapid incident localization and reliable order processing.

Alerting · e‑commerce · monitoring
0 likes · 12 min read
JD Tech
Mar 14, 2023 · Operations

Introduction to Chaos Engineering and Its Practical Exercise Workflow

This article offers a comprehensive overview of chaos engineering, explaining its definition, why it is needed, the value it brings, a detailed step‑by‑step practice workflow—including preparation, execution, recovery and review phases—typical drill scenarios, key assessment metrics, and risk‑control measures to improve system reliability and high‑availability.

Fault Injection · chaos engineering · risk management
0 likes · 11 min read
Top Architect
Dec 25, 2022 · Backend Development

Why Use Message Queues? Benefits, Challenges, and Practical Solutions

This article explains why message queues are essential for decoupling services, enabling asynchronous processing, and smoothing traffic spikes, then details the new challenges they introduce—such as availability, complexity, duplicate consumption, ordering, and data consistency—and offers concrete mitigation strategies for each issue.

Data Consistency · Decoupling · Idempotency
0 likes · 15 min read
dbaplus Community
Oct 25, 2022 · Operations

How a Government System’s Week‑Long Outage Exposed Critical Backend and Load‑Balancing Flaws

A government information system suffered a week of instability, including service deadlocks, Tomcat memory overflows, and load‑balancing failures, prompting a deep forensic analysis that uncovered database lock‑ups, faulty front‑end loops, inadequate monitoring, and misconfigured logging, leading to concrete remediation steps and lessons for future reliability.

Operations · Tomcat · incident analysis
0 likes · 21 min read
DeWu Technology
Oct 17, 2022 · Operations

High Availability: Principles and Practices for System Stability

High availability—measured in nines of uptime—requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, capacity planning, rapid scaling, traffic shaping, resource isolation, global protection, observability, and disciplined change management to achieve stable, resilient services.

Observability · capacity planning · change management
0 likes · 10 min read
Efficient Ops
Oct 13, 2022 · Operations

Essential Guide to Effective Monitoring in Operations: Goals, Methods, and Tools

This article outlines the essential components of operational monitoring, covering monitoring objectives, methods, core processes, key tools, metrics for hardware, system, application, network, and business layers, as well as alerting, handling, and best practices for building a comprehensive, reliable monitoring solution.

Alerting · metrics · system reliability
0 likes · 7 min read
ITPUB
Oct 4, 2022 · Operations

What Makes a System Truly High‑Availability? Lessons from B‑Station’s Outage

The article examines B‑Station’s July 2021 outage, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to achieve resilient systems.

MTBF · MTTR · circuit breaker
0 likes · 15 min read
Qunar Tech Salon
Jun 16, 2022 · Operations

Practical Chaos Engineering Practices at Qunar Travel: Architecture, Scenarios, and Automation

This article details Qunar Travel's mature chaos engineering platform built on chaosblade, covering value analysis, system architecture, shutdown and dependency drills, automated closed‑loop testing, attack‑defense exercises, and the measurable reliability improvements achieved across thousands of services.

Distributed Systems · Fault Injection · Operations
0 likes · 18 min read
JD Retail Technology
Jun 10, 2022 · Operations

JD International 618 Promotion: Technical Operations, DDoS Protection, and Performance Testing

The article details JD International's technical preparation for the 618 sales event, covering operational planning, DDoS mitigation with Cloudflare, performance testing methodologies, cross‑team collaboration, and real‑time monitoring to ensure system stability and a seamless shopping experience across multiple regions.

DDoS protection · Performance Testing · cloud computing
0 likes · 13 min read
21CTO
Mar 31, 2022 · Operations

What Caused the Biggest 2021 Outages? Lessons from Bilibili, Facebook, AWS, and More

The article reviews ten major 2021 service outages, from Chinese platforms like Bilibili and Futu to global giants such as Facebook, Roblox, and AWS, analyzing their root causes and redundancy failures and drawing out the operational lessons needed to prevent future black‑swan events.

high availability · incident response · outage analysis
0 likes · 15 min read
NetEase Cloud Music Tech Team
Mar 4, 2022 · Operations

Cloud Music's Pre-Plan Platform for Stability Management

Since 2018, Cloud Music has progressively built a comprehensive stability management capability—moving from no safeguards to an integrated pre‑plan platform that standardizes incident response, automates execution, and shares best practices—thereby enhancing service reliability and guiding future stability initiatives.

cloud computing · pre-plan platform · stability management
0 likes · 17 min read
TAL Education Technology
Feb 10, 2022 · Operations

Client‑Side Circuit Breaking Strategies: State Machine, Google SRE Breaker, and Mitigation Techniques

This article explains why client‑side circuit breaking is essential, describes common state‑machine and Google SRE breaker strategies, provides practical pseudocode, and discusses mitigation methods such as Gutter mode, jittered exponential backoff, and graceful degradation to protect system stability.

Circuit Breaking · Google SRE · client-side
0 likes · 14 min read
Programmer DD
Feb 8, 2022 · Operations

What Triggered the Biggest Internet Outages of 2021? Lessons from 10 Major Incidents

A comprehensive review of ten major 2021 internet outages—from domestic platforms like Bilibili and Futu to global services such as Facebook, Roblox, and AWS—examines their root causes, the role of infrastructure design, and the operational lessons needed to improve system resilience.

cloud infrastructure · incident response · outage analysis
0 likes · 16 min read
Java Backend Technology
Feb 7, 2022 · Operations

Why Did the Internet Crash in 2021? 10 Major Outage Lessons

The article reviews ten significant 2021 internet outages—both domestic and international—analyzing their root causes, from server room power failures to configuration bugs, and highlights the operational lessons engineers can learn to improve system resilience.

Case Study · Operations · Outage
0 likes · 17 min read
ITPUB
Jan 5, 2022 · Operations

Why Contingency Planning Beats System Optimization: Lessons from Xi'an One‑Code Collapse

The recent collapse of Xi'an’s One‑Code health system highlighted that system failures often stem from blocked pipelines rather than database overload, and the article argues that robust manual contingency plans—such as alternative mini‑programs or simple backup apps—are essential to prevent small glitches from becoming crises.

IT infrastructure · contingency planning · disaster recovery
0 likes · 9 min read
Programmer DD
Dec 22, 2021 · Operations

Why Did Xi’an’s Health‑Code App Crash? A Deep Dive into the Failure

The article analyzes the Xi’an “Yima Tong” health‑code system outage, detailing the symptoms, root‑cause factors such as rate‑limiting gaps, server overload, architectural coupling, and ISP differences, and then offers short‑term, long‑term, design, high‑availability, and testing recommendations to prevent future crashes.

Cloud Native · incident analysis · performance
0 likes · 13 min read
Baidu Geek Talk
Oct 20, 2021 · Operations

Practical Strategies for Building High‑Availability Systems

This article presents a comprehensive, step‑by‑step guide on improving system reliability through early fault detection, scope reduction, frequency reduction, and rapid incident handling, using real‑world practices from Baidu's commercial hosting platform.

Log Standardization · Operations · capacity planning
0 likes · 20 min read
Efficient Ops
Oct 9, 2021 · Operations

10 Essential Ops Rules Every Engineer Should Follow

The article shares ten practical operations principles—from avoiding duplicate work and embracing mistakes to establishing backup roles, monitoring bottlenecks, valuing platform tools, clarifying responsibilities, encouraging knowledge sharing, holding regular meetings, balancing performance metrics, and continuously optimizing processes for reliable, efficient system management.

system reliability · team management
0 likes · 10 min read
DevOps
Aug 11, 2021 · Operations

Introduction to Chaos Engineering – Part 2: Four Steps for Disrupting Complex Systems

This article explains that chaos engineering is not a magic cure but a disciplined practice for testing distributed systems by designing and running controlled experiments, outlining four essential steps—observability, defining steady state, hypothesizing events, and executing experiments—to gain confidence in system resilience.

Observability · Operations · chaos engineering
0 likes · 11 min read
HelloTech
Jul 30, 2021 · Operations

Foundations of High Availability: Defining and Managing Strong and Weak Service Dependencies

The article defines strong versus weak service dependencies, outlines governance through discovery, fault injection, and refactoring, recommends front‑end and back‑end fault‑tolerance measures such as timeouts and circuit breakers, describes isolation and artificial degradation switches, verifies classifications, and notes current middleware gaps and hiring information.

Backend · Fault Injection · Service Dependency
0 likes · 10 min read
IT Architects Alliance
Jun 29, 2021 · Operations

Understanding High Availability: Compute and Storage Strategies Explained

This article defines high availability, explains why achieving four nines is a common goal, and categorizes HA into compute and storage solutions, detailing common architectures such as active‑passive, master‑slave, symmetric and asymmetric clusters, as well as various storage replication patterns.

Infrastructure · compute HA · high availability
0 likes · 3 min read
High Availability Architecture
Jun 2, 2021 · Operations

Design and Implementation of Full‑Link Load Testing at Dada Group

This article describes Dada Group’s evolution from a simple 1:1 test environment to a sophisticated machine‑labeling load‑testing solution, detailing core design, isolation techniques, custom testing platform, model construction, pre‑heat strategies, and post‑test analysis that ensure system stability during high‑traffic events.

Distributed Systems · Load Testing · Microservices
0 likes · 16 min read
ITFLY8 Architecture Home
Apr 23, 2021 · Operations

How JD’s Open Platform Guarantees Reliable Message Delivery with Dynamic BMQ Design

This article explains how the Business Message Queue (BMQ) architecture of JD’s Open Platform, together with its dynamic channels, retry and downgrade mechanisms, and real‑time monitoring, ensures reliable, low‑risk message delivery across thousands of merchants while simplifying integration and scaling for future growth.

Alerting · Dynamic Configuration · JD Open Platform
0 likes · 10 min read
dbaplus Community
Mar 25, 2021 · Operations

Mastering High‑Quality Service Architecture: Load Balancing, Rate Limiting, Retries & Timeouts

This article distills a Bilibili technical director's insights on building high‑quality service architectures, covering systematic load‑balancing strategies, sophisticated rate‑limiting mechanisms, robust retry policies, precise timeout controls, and comprehensive approaches to preventing cascading failures in large‑scale systems.

Backend Architecture · SRE · load balancing
0 likes · 14 min read
Alibaba Cloud Native
Jan 16, 2021 · Backend Development

How Real-World High‑Concurrency Challenges Shaped My Coding Skills

The author recounts four pivotal experiences—from handling a billion‑scale transaction system and joining Taobao’s ad‑hoc “firefighter” squad, to rewriting a communication framework and deep‑diving into JVM internals—illustrating how real‑world challenges and collaborative learning dramatically sharpened his coding and system‑reliability skills.

JVM · Java · high concurrency
0 likes · 11 min read
Youzan Coder
Dec 30, 2020 · Operations

ERROR Log Governance and Monitoring Alerting Practice at Youzan

Youzan’s log‑governance guide uses a car‑dashboard analogy to show why precise ERROR logs and sensible alerts matter, defines INFO/WARN/ERROR levels, sets daily reduction targets, leverages top‑error analysis and water‑level monitoring, and ultimately cut daily ERROR entries from thousands to about one hundred while catching issues before incidents.

Alerting · Error Handling · Log Management
0 likes · 9 min read
macrozheng
Nov 12, 2020 · Operations

Red Cliffs Battle: Lessons on Service Avalanche and Circuit Breakers

Using the historic Red Cliffs battle as a metaphor, this article explains how linked services can cause a cascading failure—service avalanche—in microservice architectures, and details prevention techniques such as rate limiting, isolation, and especially circuit breaker mechanisms with their principles and recovery algorithms.

Service Avalanche · circuit breaker · system reliability
0 likes · 13 min read
21CTO
Sep 15, 2020 · Fundamentals

20 Most Catastrophic Software Bugs That Shook the World

From rockets exploding to financial crashes, this article chronicles twenty historic software bugs, detailing their losses, how they unfolded, and the programming mistakes that caused them, illustrating the massive economic and societal impact of faulty code.

Economic Impact · Software Engineering · Technology History
0 likes · 15 min read
Youku Technology
Jul 15, 2020 · Backend Development

Designing for Failure: How Streaming Control Systems Stay Resilient

This article explains the concept of failure‑oriented design, why it matters for large‑scale streaming services, and details concrete architectural patterns—such as layered services, database fallback, cache redundancy, consistency checks, and dynamic traffic switching—used by a production playback control platform.

Backend Architecture · cache redundancy · database fallback
0 likes · 9 min read
JD Retail Technology
Jun 11, 2020 · Operations

How JD Health Engineered System Stability for the 618 Mega‑Sale

Facing unprecedented traffic during the 2020 618 shopping festival, JD Health’s product R&D team implemented comprehensive rehearsals, stress testing, architecture reviews, dual‑channel risk controls, and 24‑hour monitoring to ensure system stability and rapid response for its health‑care e‑commerce platforms.

618 promotion · JD Health · Operations
0 likes · 5 min read
Tencent Cloud Developer
Apr 22, 2020 · Cloud Native

Designing High‑Quality Service Architecture Under Traffic Peaks: Load Balancing, Rate Limiting, Retries, Timeouts, and Failure Mitigation

Drawing on Google SRE principles, Bilibili’s technical director outlines a systematic, cloud‑native framework for high‑quality service architecture during traffic peaks, covering frontend and internal load balancing, distributed rate limiting, controlled retries, fail‑fast timeouts, and comprehensive failure‑mitigation strategies.

SRE · cloud-native · load balancing
0 likes · 13 min read
21CTO
Mar 24, 2020 · Operations

Mastering System Resilience: Rate Limiting, Circuit Breaking, and Degradation

To keep systems highly available under sudden traffic spikes, developers employ three core strategies: rate limiting to control request flow, circuit breaking to isolate failures, and service degradation to gracefully reduce functionality. The article explains each with practical examples and algorithmic approaches.

Circuit Breaking · Operations · rate limiting
0 likes · 5 min read
Efficient Ops
Mar 11, 2020 · Operations

How to Elevate Your Monitoring System: Proven Practices from Top DevOps Models

This article explains why modern services depend on highly available, scalable monitoring, outlines a systematic way to assess and improve monitoring capabilities using open‑source tools and the DevOps Capability Maturity Model, and details concrete improvement points across data collection, management, and application.

DevOps · Observability · Operations
0 likes · 9 min read