Tagged articles
37 articles
Fighter's World
Apr 26, 2026 · Artificial Intelligence

How to Make AI Agents Reliable: Skillify’s 10‑Step Continuous Improvement Process

Agent systems often repeat the same failures, like missing historical calendar data or miscalculating time zones, but Garry Tan’s Skillify framework turns each error into a testable skill with a ten‑step checklist—including contracts, deterministic scripts, unit and integration tests, LLM evals, resolver checks, DRY audits, smoke tests, and knowledge‑base filing—to make agents structurally unable to repeat mistakes.

AI Agents · Continuous Improvement · LLM evaluation
0 likes · 22 min read
Continuous Delivery 2.0
Dec 9, 2025 · Operations

How Tencent Interactive Entertainment Scaled SRE: From Traditional Ops to Modern Reliability Engineering

This article examines Tencent Interactive Entertainment's eight‑year journey from a traditional operations team to a 400‑person SRE organization, detailing timeline milestones, the shift in mindset and practices, management challenges, and the broader industry trends driving reliability engineering adoption.

Operations · SRE · Tencent
0 likes · 13 min read
Baidu Intelligent Cloud Tech Hub
Oct 29, 2025 · Operations

How to Prevent Avalanche Failures in Large‑Scale Microservice Systems

This article explains how Baidu's SRE team identified the root causes of avalanche failures in massive microservice architectures, modeled system limits with Little’s Law, and implemented engineering practices such as retry budgets, queue throttling, and global TTL controls to achieve self‑healing and eliminate avalanche incidents.

Microservices · SRE · avalanche failure
0 likes · 9 min read
Programmer DD
Oct 3, 2025 · Operations

How Netflix Turned Incident Management into a Scalable Engineer‑Owned Process

This article explains how Netflix’s engineering teams shifted incident handling from a centralized SRE function to a company‑wide, engineer‑driven practice by selecting the right tooling, standardizing processes, and reshaping culture, enabling rapid, reliable responses for hundreds of millions of viewers.

Netflix · SRE · Tool Selection
0 likes · 10 min read
JD Tech Talk
Feb 6, 2025 · Operations

Stability Assurance Mechanisms and Practices for Site Reliability Engineering (SRE)

This article outlines comprehensive stability assurance mechanisms—including standards, process workflows, the distinction between developers and SREs, personal responsibilities, and practical construction directions—to guide teams in building resilient, high‑availability systems through proactive, daily, and incident‑response practices.

SRE · process · reliability engineering
0 likes · 10 min read
Efficient Ops
Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing the major service disruptions—from Alibaba Cloud to OpenAI—this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Operations · Outage Management · SRE
0 likes · 6 min read
Alibaba Cloud Observability
Aug 12, 2024 · Operations

How iLogtail Achieves Million‑Scale Observability with SRE Best Practices

This article explains how iLogtail, Alibaba Cloud's high‑performance observability agent, tackles reliability challenges at million‑scale deployments through a comprehensive SRE workflow that spans design, development, testing, gray‑release, operations, and continuous customer support, all while leveraging cloud‑native tools and automation.

Cloud Native · DevOps · SRE
0 likes · 31 min read
Alibaba Cloud Native
Aug 7, 2024 · Operations

How iLogtail Achieves Million‑Scale Observability with SRE Practices

This article details how Alibaba Cloud's iLogtail agent, serving tens of thousands of hosts and containers, overcomes unique stability challenges through a comprehensive SRE approach that spans design, development, testing, gray‑release, operations, and customer‑support, ultimately boosting reliability and reducing incident rates.

Cloud Native · Observability · SRE
0 likes · 32 min read
Cognitive Technology Team
May 16, 2024 · Operations

Guide to Building Stability in Distributed Systems

This guide presents comprehensive principles, best practices, and techniques for designing, deploying, and maintaining stable distributed systems, covering fault tolerance, monitoring, capacity planning, incident response, and operational reliability to help engineers achieve high availability.

Distributed Systems · Operations · reliability engineering
0 likes · 1 min read
Efficient Ops
Dec 20, 2023 · Operations

How Bilibili Implements SLO Engineering to Boost Service Reliability

This article details Bilibili's practical SLO engineering approach, covering foundational components, SLI selection, application and business level SLIs, alerting strategies, SLO‑driven quality operations, and the GOC framework for rapid fault discovery, localization, and recovery, illustrating how reliability is systematically improved.

Operations · SLO · reliability engineering
0 likes · 16 min read
Efficient Ops
Nov 26, 2023 · Operations

Beijing Mobile’s SRE Success: Automation, Cloud‑Native Ops & Reliability

The article details how Beijing Mobile’s SRE Smart Operations team applied SRE principles, automation, and cloud‑native tools to transform traditional DevOps into a reliable, scalable operation, highlighting their fault‑prevention, monitoring, incident response, and continuous improvement practices that earned them the 2023 IT Technology Leadership award.

Automation · Operations · SRE
0 likes · 7 min read
Bilibili Tech
Nov 17, 2023 · Operations

Bilibili CDN/SLB Outage Analysis and Cloud‑Edge Coordination Strategies

The August 4, 2023 Bilibili outage, triggered by automatic back‑origin and domain‑disable policies that flooded the BFS load balancer with traffic, caused widespread white‑screens, but was mitigated within the 1‑5‑10 framework through rapid CDN switching, rate‑limit enhancements, storage backup, and client‑side fallback, illustrating the need for tighter cloud‑edge coordination.

CDN · SLB · reliability engineering
0 likes · 19 min read
Meituan Technology Team
Oct 12, 2023 · Operations

Pattern-Based Reliability Governance for Billion-Scale Traffic Systems

The article analyzes reliability governance challenges in Meituan's billion‑scale traffic systems, introduces pattern mining as a way to uncover common reliability issues, and presents three concrete case studies (idempotency, dependency, and over‑privilege governance) that demonstrate how large‑scale traffic data and environment isolation enable low‑cost, automated reliability solutions.

Idempotency · access control · dependency governance
0 likes · 19 min read
FunTester
Jul 13, 2023 · Industry Insights

How HuoLala Built a 0‑to‑1 Stability Metric System and Cut Faults by 78%

In this detailed case study, HuoLala's stability leader shares how a two‑year, zero‑to‑one stability metric framework was designed, implemented, and iterated—covering the why, the pain points, the metric definition process, data collection platform, cultural adoption, and the resulting 78% fault reduction and SLA improvement from three to four nines.

Case Study · Operations · Performance Monitoring
0 likes · 18 min read
dbaplus Community
Jun 20, 2023 · Operations

How Agricultural Bank Built a Chaos Engineering Platform for Resilience

The article outlines the Agricultural Bank of China's initiative to adopt chaos engineering, describing the challenges of modern distributed systems, the design and capabilities of their in‑house chaos platform, product research, industry comparisons, practical use cases across development, operations and disaster recovery, and future development directions.

Cloud Native · Distributed Systems · Platform Development
0 likes · 14 min read
Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget · MTBF · Operations
0 likes · 14 min read
Efficient Ops
May 31, 2023 · Operations

How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end‑to‑end SRE quality‑operation framework built on Service Level Objectives (SLO) and On‑Call, detailing industry background, problem statements, SLO management, On‑Call benefits, product architecture, large‑scale deployment, and future plans for reliability engineering.

On-Call · Quality Operations · SLO
0 likes · 11 min read
Efficient Ops
Mar 28, 2023 · Operations

Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

Observability · Operations · SRE
0 likes · 14 min read
Bilibili Tech
Oct 29, 2022 · Operations

Stability Building and SLO Operations After the “713 Incident”

The deck outlines post‑incident stability enhancements and the adoption of Service Level Objectives after the “713” fault, detailing failure analysis, reliability upgrades, monitoring practices, and the definition and operation of SLOs to sustain system quality, illustrated through architecture diagrams and reliability metrics.

SLO · reliability engineering · site reliability
0 likes · 1 min read
Bilibili Tech
Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Cloud Native · Error Budget · Operations
0 likes · 21 min read
Bilibili Tech
Jun 21, 2022 · Cloud Native

Evolution of SRE in the Cloud‑Native Era – Insights from Industry Experts

Industry experts from Zhejiang Mobile, Bilibili, and Xiaomi discuss how SRE has evolved in the cloud‑native era, sharing concrete frameworks, observability practices, and cost‑focused platforms while emphasizing stability, metrics, on‑call processes, and the need to adapt Google’s model to real‑world product and operational contexts.

Cloud Native · DevOps · SRE
0 likes · 31 min read
DataFunTalk
Jun 19, 2022 · Artificial Intelligence

FMEA Knowledge Graph: Integrating Failure Analysis with AI for Intelligent Manufacturing

This article explains how integrating FMEA with knowledge graph and AI technologies can enhance product quality and reliability across high‑end manufacturing sectors such as semiconductors, automotive, and medical devices, presenting case studies, standards, and a platform built by Daguan Data.

FMEA · Manufacturing · reliability engineering
0 likes · 19 min read
HelloTech
Jul 12, 2021 · Operations

Introduction to System Stability: Concepts, Metrics, and Practices

The article explains Haro’s approach to system stability—defining high‑availability, key metrics such as SLA, RPO/RTO, MTTR/MTBF, and the 5‑5‑10 rule—while outlining cultural and technical safeguards, full‑team participation, process integration, and incremental tooling to prevent faults and ensure rapid recovery.

MTTR · RPO · RTO
0 likes · 11 min read
Efficient Ops
Jun 7, 2021 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System: Lessons for Large R&D Teams

This article summarizes Alibaba Cloud Elastic Compute Service's four‑year SRE journey, covering why ECS created its own SRE organization, the five‑layer SRE framework, standards, automation platforms, empowerment practices, and team‑building insights that can guide large development teams toward reliable, high‑availability operations.

SRE · reliability engineering
0 likes · 24 min read
Efficient Ops
Jan 19, 2021 · Operations

How SRE Bridges Development and Operations to Boost System Reliability

This article explores the role of Site Reliability Engineering (SRE) as a bridge between product development and operations, detailing its responsibilities, core principles, lifecycle perspective, stability value, and practical frameworks for controllability, observability, and best‑practice implementation to enhance system reliability.

Observability · SRE · reliability engineering
0 likes · 13 min read
dbaplus Community
Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error Budget · MTBF · MTTR
0 likes · 30 min read
Efficient Ops
Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring
0 likes · 13 min read
21CTO
Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance
0 likes · 13 min read
Efficient Ops
May 6, 2019 · Operations

How Live Streaming Ops Ensure Real-Time Reliability at Scale

Zhang Guanshi, the operations director at Huya Live, shares how his team designs a hybrid‑cloud architecture, implements a six‑pillar reliability framework, and leverages real‑time monitoring, AIOps, and rapid‑recovery tools to maintain stable, low‑latency live video streams for millions of viewers.

Operations · cloud architecture · live streaming
0 likes · 22 min read
MaGe Linux Operations
May 5, 2018 · Operations

Inside a High-Stakes Trading System Upgrade: Lessons from a Day-Long Ops Marathon

This article recounts a securities firm's intensive one‑day system upgrade, detailing the weeks‑long preparation, the meticulous Saturday‑morning execution, and the post‑upgrade verification that together illustrate the critical role of disciplined operations in maintaining ultra‑reliable trading platforms.

Case Study · IT Operations · Trading Platform
0 likes · 7 min read
Efficient Ops
Dec 29, 2017 · Operations

How Alibaba’s Global Operations Center Guarantees Seamless Service During Double‑11

This article outlines Alibaba's Global Operations Center (GOC) practices for ensuring stable, high‑performance online services during massive traffic spikes like Double‑11, covering current challenges, the operational assurance framework, best‑practice implementations, and future directions such as automation and AI‑driven monitoring.

reliability engineering
0 likes · 28 min read
Efficient Ops
Jul 27, 2015 · Operations

What Google SREs Do: Inside the Role that Powers Reliable Services

This article explains the responsibilities, requirements, and daily work of Google Site Reliability Engineers, contrasts them with Software Engineers, outlines key internal infrastructure components, and discusses the future direction of operations engineering in the cloud era.

Google · Infrastructure · Operations
0 likes · 11 min read