Tagged articles

SRE

414 articles · Page 1 of 5
Architect
Jun 24, 2026 · Artificial Intelligence

What Architects Should Focus on When Claude, Codex, and Mira Discuss Loop

The article examines Loop engineering for AI agents, arguing that beyond entry points like Claude, Codex, or Mira, architects must ensure reliable feedback, persistent state, clear stop conditions, and human hand‑off, drawing parallels to high‑reliability SRE practices and proposing concrete design and evaluation steps.

AI Agents · Automation · Loop Engineering
0 likes · 20 min read
TonyBai
Jun 15, 2026 · Operations

When AI Generates Code 10× Faster, Who Safeguards System Reliability?

The article analyzes Google’s SRE whitepaper on AI‑driven operations, detailing how generative AI accelerates code production 4‑10×, introduces five SRE AI autonomy levels, three core AI‑ops components, and a safety architecture that decouples decision‑making from execution to prevent catastrophic failures.

AI Ops · Automation · Google
0 likes · 12 min read
FunTester
Jun 12, 2026 · R&D Management

Why Removing QA Requires Building a New Quality Framework

Eliminating a dedicated QA function may look like cost savings, but without establishing a comprehensive quality system—including self‑testing, automation, release gates, monitoring, and post‑incident reviews—risk simply shifts to production, leading to hidden incidents, longer rollbacks, and ultimately higher total cost.

QA · Risk Management · SRE
0 likes · 18 min read
SuanNi
Jun 8, 2026 · Artificial Intelligence

First Enterprise IT Ops Agent Benchmark Shows Claude Leads with Just 47% Score

The ITBench-AA benchmark, the first evaluation specifically for enterprise IT operations agents, tests 59 SRE scenarios and reveals that even top models like Claude Opus 4.7 achieve only a 47% score, highlighting both the difficulty of the tasks and the cost‑effectiveness gap between proprietary and open‑source agents.

AI Agent · Claude · IT Operations
0 likes · 11 min read
Cloud Native Technology Community
May 18, 2026 · Operations

How to Cut Engineering Time on Kubernetes Upgrades

Kubernetes upgrades can consume 4‑6 weeks of engineering effort per minor release, delaying product roadmaps and inflating cloud costs, while reports show teams lose dozens of workdays to incidents and over‑provisioned resources, highlighting the need for dedicated SRE ownership to reclaim time for business‑impacting work.

Operational Cost · Platform Engineering · SRE
0 likes · 8 min read
dbaplus Community
May 1, 2026 · Operations

Why a Simple Nginx Change Made All Gateway Requests Return 400 (And How to Fix It)

A production incident caused by replacing two Nginx reverse proxies introduced an upstream name with an underscore, resulting in invalid Host headers and 400 Bad Request responses from Spring Cloud Gateway; the article details the step‑by‑step investigation, evidence from logs, tcpdump, and code, and presents configuration fixes to restore normal operation.

HTTP 400 · NGINX · SRE
0 likes · 15 min read
FunTester
Apr 28, 2026 · Operations

How Self‑Healing Automation Platforms Transform SRE Practices

The article explains how a self‑healing platform improves SRE reliability by reducing MTTR, preserving error‑budget, automating high‑impact incident remediation, enforcing safety guardrails, and shifting team focus from firefighting to sustainable reliability engineering.

Automation · Error Budget · MTTR
0 likes · 10 min read
FunTester
Apr 27, 2026 · Operations

Why Relying on Humans for Incident Recovery Fails and How Self‑Healing Automation Platforms Help

The article explains that large‑scale incidents overwhelm on‑call engineers who must manually piece together context from countless signals, and shows how a self‑healing automation platform can take over repetitive, known failure patterns, verify fixes, and reduce fatigue while keeping humans in the loop for oversight.

Automation · Operations · Platform Engineering
0 likes · 8 min read
Linyb Geek Road
Apr 25, 2026 · Operations

How to Build Stable SaaS Systems: Key Practices for Reliability

The article outlines practical methods for ensuring SaaS system stability, covering resource‑related issues, middleware reliability, pre‑release gray deployments, automated release procedures, comprehensive monitoring, load‑balancing strategies, degradation handling, rate limiting, chaos engineering, and SRE implementation.

Monitoring · SRE · SaaS
0 likes · 10 min read
DevOps Coach
Apr 22, 2026 · Operations

2026 AI DevOps Outlook: 10 Must‑Watch MCP Servers Transforming SRE

The article surveys the rapidly growing Model Context Protocol (MCP) ecosystem in 2026, detailing ten AI‑enabled DevOps servers, their core capabilities, real‑world impact on SRE workflows, and a practical framework for selecting the most valuable servers for a given team.

AI DevOps · MCP · Observability
0 likes · 16 min read
Raymond Ops
Apr 20, 2026 · Operations

How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates

This article presents a complete SRE on‑call handbook that defines alert severity levels, provides concrete Prometheus Alertmanager configurations, outlines a step‑by‑step response flow, details war‑room roles, escalation paths, handoff checklists, post‑mortem procedures, and dozens of ready‑to‑use templates to reduce MTTR and improve reliability.

Alert Management · On-Call · Operations
0 likes · 27 min read
Ray's Galactic Tech
Apr 19, 2026 · Cloud Native

Building a Production‑Ready Cloud‑Native Kubernetes Platform: From Zero to SRE Success

This article presents a step‑by‑step guide to designing and implementing a production‑grade Kubernetes platform with GitOps, observability, capacity governance, fault‑injection, and SRE practices, showing how to achieve unified delivery, reliability, and low‑cost operation for high‑concurrency business services.

Cloud Native · GitOps · Observability
0 likes · 37 min read
Alibaba Cloud Native
Apr 8, 2026 · Operations

How HiClaw Transforms SRE with Multi‑Agent Collaboration in Cloud‑Native Environments

The article details how the HiClaw distributed multi‑agent platform is built and organized for SRE teams, explains the roles of human users and digital bots, describes permission design, showcases fault‑diagnosis and release scenarios, and evaluates the efficiency and innovation gains of this cloud‑native automation approach.

AI Ops · Automation · Cloud Native
0 likes · 14 min read
DevOps Coach
Mar 31, 2026 · Operations

How AI‑Driven Observability Can Cut MTTR: A 12‑Step Investigation Framework

This article explains how modern SRE teams can combine AI‑assisted observability with structured critical thinking to build a 12‑step investigation model that accelerates fault detection, hypothesis generation, telemetry validation, root‑cause analysis, and automated remediation, ultimately reducing MTTR and improving reliability.

AI · Incident Management · Observability
0 likes · 9 min read
DevOps Coach
Mar 26, 2026 · Operations

Can an AI Agent Replace Your SRE Night‑Shift? Inside Google’s Remote MCP‑Powered Autonomous SRE Agent

The article examines the chronic pain points of on‑call SRE teams—alert fatigue, long MTTR, inconsistent RCA, and communication bottlenecks—and presents a detailed, four‑layer architecture that uses Google’s Remote MCP server and an AI‑driven autonomous SRE agent to automate log retrieval, knowledge lookup, root‑cause analysis, and stakeholder notifications, dramatically improving reliability and efficiency.

Google Cloud · MCP · Operations
0 likes · 21 min read
DevOps Coach
Mar 24, 2026 · Operations

Avoid the Top 10 Kubernetes Monitoring Mistakes Every SRE Team Makes

This article examines the ten most common Kubernetes monitoring errors that SRE teams encounter, explains why each mistake harms reliability, and provides concrete, actionable solutions—including the Golden Signals framework, pod‑restart analysis, alert‑fatigue reduction, application‑level observability, etcd health checks, network metrics, control‑plane monitoring, log‑metric correlation, resource request tracking, and end‑to‑end observability—to help teams build robust, scalable monitoring systems.

Cloud Native · Monitoring · Observability
0 likes · 11 min read
MaGe Linux Operations
Mar 16, 2026 · Operations

Kubernetes Pod Troubleshooting Guide: Diagnose CrashLoopBackOff, OOMKilled & More

A comprehensive, step‑by‑step guide for SREs and DevOps engineers to diagnose and resolve common Kubernetes pod issues—including CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, Evicted, and Terminating—by leveraging pod lifecycle knowledge, kubectl commands, logs, events, node inspection, scripts, real‑world case studies, and monitoring best practices.

SRE · devops · kubernetes
0 likes · 55 min read
Ops Development Stories
Mar 14, 2026 · Operations

OpenClaw AI Hype: An SRE’s Warning About Hidden Ops Risks

The article examines the rapid popularity of the open‑source AI agent OpenClaw, revealing how hype, cost misconceptions, and inadequate security practices create serious operational risks for both individual and enterprise users, and offers concrete SRE‑style safeguards to mitigate these dangers.

AI · OpenClaw · Risk Management
0 likes · 9 min read
Raymond Ops
Mar 10, 2026 · Operations

How to Master Service Avalanche Recovery: A Complete SRE Playbook from Alert to Restoration

This guide walks SRE and senior operations engineers through a real-world service‑avalanche incident, detailing alert hierarchy design, fault‑location commands, emergency SOPs, capacity‑baseline building, and post‑mortem best practices to dramatically reduce MTTR in distributed micro‑service environments.

SRE · Service Avalanche · capacity planning
0 likes · 19 min read
DevOps Coach
Mar 6, 2026 · Operations

SRE vs Platform Engineering vs DevOps: Key Differences, Roles, and Toolchains

An in‑depth comparison of Site Reliability Engineering (SRE), Platform Engineering, and DevOps explains their origins, core responsibilities, distinct tools, and how they complement each other in modern cloud‑native organizations, helping teams choose the right practices for reliable, scalable software delivery.

Cloud Native · Platform Engineering · SRE
0 likes · 9 min read
Architect-Kip
Mar 4, 2026 · Operations

Essential SRE Monitoring and Alerting Standards: From Metrics to Incident Response

This guide outlines comprehensive SRE monitoring and alerting standards, covering core principles, log instrumentation, health‑check requirements, baseline resource and application metrics, alarm severity tiers, response SLAs, on‑call rotation, continuous optimization, and noise‑reduction mechanisms to ensure reliable service operation.

Alerting · Metrics · Monitoring
0 likes · 14 min read
Raymond Ops
Mar 2, 2026 · Operations

Why Most Alerts Fail and How to Build a Night‑Quiet, High‑Signal Monitoring System

This article examines the root causes of alert fatigue—mis‑configured thresholds, noisy alerts, lack of context, and poor routing—then presents a step‑by‑step guide using golden signals, dynamic baselines, enriched alert payloads, severity‑based routing, and suppression techniques to create an effective, low‑noise monitoring system.

Alerting · Alertmanager · Monitoring
0 likes · 24 min read
Raymond Ops
Mar 1, 2026 · Operations

How I Transitioned from Traditional Ops to SRE/DevOps in 18 Months

This detailed guide shares a step‑by‑step 18‑month roadmap, covering self‑assessment, skill acquisition (Python, Kubernetes, monitoring), project execution, interview preparation, and real‑world outcomes for engineers moving from legacy operations to SRE/DevOps roles.

CI/CD · Monitoring · Python
0 likes · 35 min read
Ops Community
Feb 2, 2026 · Operations

How to Process 10GB Logs in 30 Seconds with Grep, Sed, and Awk

This comprehensive guide shows how to use the GNU tools grep, sed, and awk to quickly analyse massive Nginx access logs, covering their streaming design, optimal command parameters, real‑world examples, performance tricks, security safeguards and step‑by‑step scripts for fault isolation and reporting.

SRE · Shell Scripting · awk
0 likes · 38 min read
DevOps Coach
Jan 25, 2026 · Operations

Why Infra Companies Are Racing Into Observability and What It Means for 2026

The article examines how SRE and infrastructure teams are converging, why major infra vendors are acquiring observability assets, the rising cost pressures, and how OpenTelemetry combined with Apache Iceberg forms a new standard stack that AI‑driven incident response will rely on in the coming years.

AI incident response · Apache Iceberg · SRE
0 likes · 11 min read
MaGe Linux Operations
Jan 7, 2026 · Operations

How to Eliminate Alert Fatigue: 10 Proven Prometheus Alerting Techniques

This comprehensive guide walks you through the architecture of Prometheus and Alertmanager, shows how to design, write, and test robust alert rules, and shares ten practical techniques—including proper for‑durations, rate() usage, recording rules, multi‑level alerts, and inhibition—to dramatically reduce alert noise and improve SRE reliability.

Alerting · Alertmanager · Monitoring
0 likes · 40 min read
Ops Development Stories
Dec 31, 2025 · Operations

12 Major 2025 Internet Outages: What Every Ops Team Can Learn

This article compiles twelve high‑profile internet service failures from 2025, detailing each incident’s description, micro‑scenario, technical root cause, and risk perspective, and extracts actionable lessons on infrastructure resilience, change management, and security‑aware operations.

Internet Outages · Operations · Reliability
0 likes · 20 min read
MaGe Linux Operations
Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete, two‑year‑tested SRE on‑call process that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations
0 likes · 26 min read
Continuous Delivery 2.0
Dec 9, 2025 · Operations

How Tencent Interactive Entertainment Scaled SRE: From Traditional Ops to Modern Reliability Engineering

This article examines Tencent Interactive Entertainment's eight‑year journey from a traditional operations team to a 400‑person SRE organization, detailing timeline milestones, the shift in mindset and practices, management challenges, and the broader industry trends driving reliability engineering adoption.

Operations · Organizational Change · Reliability Engineering
0 likes · 13 min read
DevOps Coach
Dec 8, 2025 · Operations

How to Quantify SRE ROI: Turning Reliability Metrics into Business Value

This article explains how SRE leaders can bridge the gap between technical reliability metrics and business outcomes by defining core SRE concepts, applying a step‑by‑step ROI formula, illustrating code‑level impact, avoiding common pitfalls, and looking ahead to AI‑driven reliability forecasting.

BusinessValue · Metrics · Operations
0 likes · 10 min read
Raymond Ops
Dec 6, 2025 · Cloud Native

Master Helm: From Installation to Advanced Kubernetes Deployments

This comprehensive guide explains Helm’s core concepts, installation steps, basic commands, real‑world deployment examples for Nginx and WordPress, advanced features like hooks and sub‑charts, common pitfalls, and SRE‑focused best practices for reliable, automated Kubernetes package management.

CI/CD · SRE · devops
0 likes · 15 min read
DevOps Coach
Nov 10, 2025 · Operations

How to Use SRE Metrics for Data‑Driven Reliability and Faster Releases

This guide explains the SRE framework—SLA, SLO, SLI hierarchy, golden signals, error budgets, and DORA metrics—showing how to instrument a Python app with OpenTelemetry, query Prometheus, avoid common pitfalls, and adopt a cultural and technical process that balances feature velocity with system stability.

DORA · Error Budget · Golden Signals
0 likes · 18 min read
Efficient Ops
Nov 9, 2025 · Operations

How Tencent’s PCG Achieves Full‑Link Observability and AI‑Powered SRE

The talk details Tencent PCG’s end‑to‑end observability platform, its data‑standardization pipeline, client‑backend session linking, AI‑enhanced SRE Agent with large language models, and the roadmap toward a SaaS offering, illustrating how modern operations integrate AI for rapid fault localization.

AI · Large Language Model · Monitoring
0 likes · 17 min read
Continuous Delivery 2.0
Nov 4, 2025 · Operations

Google's STAMP Framework: Redefining SRE for AI‑Driven Systems

Google’s SRE team is shifting from traditional error‑budget approaches to the STAMP (Systems-Theoretic Accident Model and Processes) framework, applying control theory and system‑level analysis to manage the growing complexity of AI‑powered services, improve safety, and proactively prevent hazardous states.

AI · Control Theory · Reliability
0 likes · 12 min read
Instant Consumer Technology Team
Nov 3, 2025 · Artificial Intelligence

Large Language Models Power Big Data SRE Knowledge & Root‑Cause Automation

Facing the growing complexity of big‑data platforms, the SRE team adopted large‑language‑model agents to automate knowledge management and root‑cause analysis, employing Retrieval‑Augmented Generation, a vector store, and the Model Context Protocol to enable intelligent, scalable, and efficient incident diagnosis and resolution.

AI · Knowledge Management · MCP
0 likes · 12 min read
Baidu Intelligent Cloud Tech Hub
Oct 29, 2025 · Operations

How to Prevent Avalanche Failures in Large‑Scale Microservice Systems

This article explains how Baidu's SRE team identified the root causes of avalanche failures in massive microservice architectures, modeled system limits with Little’s Law, and implemented engineering practices such as retry budgets, queue throttling, and global TTL controls to achieve self‑healing and eliminate avalanche incidents.

Microservices · Reliability Engineering · SRE
0 likes · 9 min read
MaGe Linux Operations
Oct 16, 2025 · Operations

SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

Alerting · SRE · Service Avalanche
0 likes · 20 min read
Huawei Cloud Developer Alliance
Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability
0 likes · 17 min read
Continuous Delivery 2.0
Oct 11, 2025 · Operations

Mastering Enterprise SRE: From Core Concepts to Practical Implementation

This comprehensive guide explains the core principles of Site Reliability Engineering, outlines a phased roadmap for enterprise adoption, details essential monitoring, automation, and reliability platforms, and addresses team structure, talent development, common challenges, and real‑world success stories to help organizations build effective SRE practices.

Automation · SRE · Site Reliability Engineering
0 likes · 16 min read
Continuous Delivery 2.0
Oct 9, 2025 · Operations

Why SRE Is the Next Evolution for Enterprise Operations

This article introduces the first part of an SRE series, explaining why organizations need SRE, the incremental path for building SRE capabilities, and differentiated strategies for medium and large enterprises, emphasizing gradual, platform‑driven automation and cultural change for reliable digital transformation.

Enterprise · Platform Engineering · SRE
0 likes · 8 min read
Alibaba Cloud Developer
Oct 9, 2025 · Operations

How AI‑Powered Multi‑Agent Systems Turn Fault Postmortems into Proactive Risk Prevention

This article explains how an AI‑driven multi‑agent platform automates fault postmortem generation, enriches analysis with memory management, prompt engineering, and RAG techniques, and delivers actionable insights for SREs, developers, and non‑technical stakeholders, ultimately shifting incident handling from reactive to proactive.

AI · Automation · Incident Management
0 likes · 44 min read
DevOps Coach
Oct 6, 2025 · Interview Experience

Top 10 SRE Interview Questions with Expert Answers

This article presents ten essential SRE interview questions covering process priority, shell variable persistence, TTD/TTR metrics, system design for LinkedIn and Twitter, load‑balancing strategies, conflict handling, REST API usage, and log‑parsing code, each with detailed explanations and practical examples.

Linux commands · SRE · System Design
0 likes · 9 min read
MaGe Linux Operations
Oct 4, 2025 · Operations

How I Doubled My Salary by Switching from Traditional Ops to SRE in 18 Months

Over 18 months, the author details a step‑by‑step transformation from a fire‑fighting traditional operations role to a high‑paying SRE/DevOps career, covering motivations, skill gaps, learning plans, project implementations, interview preparation, and real‑world outcomes, offering a practical roadmap for engineers seeking similar growth.

CI/CD · Cloud Native · Monitoring
0 likes · 44 min read
Programmer DD
Oct 3, 2025 · Operations

How Netflix Turned Incident Management into a Scalable Engineer‑Owned Process

This article explains how Netflix’s engineering teams shifted incident handling from a centralized SRE function to a company‑wide, engineer‑driven practice by selecting the right tooling, standardizing processes, and reshaping culture, enabling rapid, reliable responses for hundreds of millions of viewers.

Incident Management · Netflix · Reliability Engineering
0 likes · 10 min read
DevOps Coach
Oct 2, 2025 · Interview Experience

Top 10 SRE Interview Questions & Answers to Ace Your Next Interview

This article compiles ten essential Site Reliability Engineering interview questions covering incident command systems, shell types, browser request flow, SSH, error budgets, toil reduction, Linux boot process, QUIC benefits, UDP VPN usage, and common enterprise network protocols, providing concise answers to help you prepare effectively.

Operations · Reliability · SRE
0 likes · 10 min read
MaGe Linux Operations
Sep 28, 2025 · Operations

What Core Skills Do SRE Engineers Need to Master?

This article outlines the essential technical, incident‑response, reliability‑management, collaboration, and systemic‑thinking abilities that Site Reliability Engineering (SRE) professionals must develop to ensure high‑availability, stable services in modern internet environments.

Incident Management · SRE · Site Reliability Engineering
0 likes · 5 min read
Architecture & Thinking
Sep 17, 2025 · Artificial Intelligence

How the 32B ‘Zhiyu’ Model is Revolutionizing Intelligent Operations

The Zhiyu model, a 32‑billion‑parameter SRE‑focused LLM, combines extensive domain knowledge, enhanced professional skills, and deterministic RAG to deliver precise, actionable insights for intelligent operations, backed by a robust multi‑source training pipeline, staged training, and flexible deployment options.

AI Operations · Model Training · RAG
0 likes · 7 min read
Ops Community
Sep 16, 2025 · Operations

Mastering SRE: Fast Incident Response and Prevention Strategies

This guide walks SRE engineers through a complete incident lifecycle—preventive multi‑layer monitoring, chaos‑testing drills, rapid 10‑minute response tactics, systematic root‑cause analysis, effective communication roles, post‑mortem reviews, and practical case studies—helping teams minimize downtime and business loss.

Incident Management · Root Cause Analysis · SRE
0 likes · 11 min read
Efficient Ops
Aug 25, 2025 · Operations

How SOMM Is Revolutionizing Intelligent Ops with AIOps, SRE & FinOps

The China Academy of Information and Communications Technology introduced the SOMM (System Operation Maturity Model) framework, which emphasizes tool intelligence, refined management, and robust operation, detailed its AIOps, SRE, and FinOps assessment modules, evaluation criteria, and maturity levels, and showcased leading enterprises that have achieved top‑tier certifications.

AIOps · FinOps · Maturity Model
0 likes · 8 min read
MaGe Linux Operations
Aug 24, 2025 · Operations

Master Production Incident Troubleshooting: SEAL Methodology & Essential Ops Toolbox

This comprehensive guide shares a veteran ops engineer's real‑world troubleshooting mindset, the SEAL framework, a curated toolbox of monitoring, logging, performance, and network utilities, detailed case studies, incident‑response grading, automation scripts, and future‑ready AIOps practices for keeping production systems stable.

Automation · Monitoring · SRE
0 likes · 19 min read
Alibaba Cloud Big Data AI Platform
Aug 5, 2025 · Operations

Inside Alibaba’s Tesla: Data‑Driven Ops for 100k+ Big Data Nodes

The article details how Alibaba’s Tesla SRE platform supports the massive offline and real‑time big‑data ecosystems through a layered, data‑driven operations framework—DataOps—integrating unified portals, configuration, job, workflow, and analytics platforms, enabling automated monitoring, intelligent decision‑making, and self‑healing capabilities across 100,000+ nodes.

AIOps · Big Data · DataOps
0 likes · 20 min read
Alibaba Cloud Big Data AI Platform
Aug 5, 2025 · Operations

How Alibaba’s Open‑Source SREWorks Transforms Cloud‑Native Data Operations

Alibaba's SREWorks platform, now open‑source, combines cloud‑native architecture, DataOps and AIOps to address the growing complexity of big‑data and AI operations, offering a layered SaaS/PaaS/IaaS solution that streamlines delivery, monitoring, management, control, operation, and service for modern enterprises.

AIOps · Cloud Native · DataOps
0 likes · 10 min read
MaGe Linux Operations
Jul 12, 2025 · Operations

Master Helm: The Ultimate Guide to Kubernetes Package Management and Deployment

This comprehensive article explains Helm’s core concepts, installation, basic commands, advanced features, real‑world case studies, common pitfalls, and SRE best practices, showing how Helm streamlines Kubernetes deployments, improves reliability, and enables automated, version‑controlled operations for modern cloud‑native environments.

Infrastructure Automation · SRE · kubernetes
0 likes · 16 min read
Tech Architecture Stories
Jun 14, 2025 · Operations

What Caused Google Cloud’s Massive June 2025 Outage and What We Can Learn

On June 12, 2025, a faulty policy update in Google’s Service Control triggered null‑pointer crashes across regions, causing a global outage that also impacted Cloudflare, Twitch, Discord, and others; the incident exposed missing feature flags, inadequate error handling, and lack of exponential backoff, prompting rapid SRE remediation.

Google Cloud · SRE · cloud operations
0 likes · 7 min read
DevOps Operations Practice
Jun 11, 2025 · Operations

Ops vs DevOps vs SRE: Which Role Matches Your Career Goals?

This article compares traditional Operations (Ops), DevOps, and Site Reliability Engineering (SRE) by outlining their definitions, core responsibilities, typical technology stacks, and career considerations, helping readers understand the distinct philosophies and choose the path that best fits their interests and market demand.

Career · SRE · technology stack
0 likes · 6 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Fault Injection · Operations · Reliability
0 likes · 9 min read
Baidu Geek Talk
Apr 23, 2025 · Operations

Baidu SRE Digital Immunity System: Construction, Evolution, and Practice

Baidu’s SRE digital‑immune system, evolved into an AI‑powered intelligent immunity platform, quantifies and mitigates risk across thousands of services by integrating data‑driven monitoring, rule‑based detection, and large‑model GraphRAG knowledge mining, cutting degradation cases by ~40% and shifting operations from reactive troubleshooting to proactive, data‑centric quality assurance.

AI · Cloud Native · Digital Immunity
0 likes · 14 min read
Baidu Intelligent Cloud Tech Hub
Apr 18, 2025 · Operations

How Baidu’s AI‑Powered Digital Immune System Reinvents SRE Risk Management

This article explains why modern SRE teams need a digital immune system, describes Baidu’s data‑driven approach to improve system resilience, outlines the three‑phase evolution from digital transformation to AI‑enhanced risk mining, and shares concrete results and future directions for sustainable operations.

AI · Cloud Native · Digital Immune System
0 likes · 15 min read
21CTO
Apr 9, 2025 · Operations

9 Must‑Have Container Monitoring Tools and Best Practices for Modern Cloud‑Native Environments

This article reviews nine practical container‑monitoring solutions—from Last9 and Prometheus to Dynatrace and Elastic Observability—detailing their key features, pricing, and why developers prefer them, and then offers comprehensive best‑practice guidance for metrics, tagging, alerts, and advanced observability strategies in Kubernetes‑driven cloud‑native deployments.

Alerting · Cloud Native · Metrics
0 likes · 25 min read
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability
0 likes · 13 min read
Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

Incident Management · Operations · SRE
0 likes · 6 min read
Ops Development Stories
Mar 24, 2025 · Operations

Why Do Some Ops Teams Face Value Challenges? Insights for CTOs

Operations leaders and CTOs often confront the question of the true value of their teams, and this article explores who asks it, why it matters, typical challenges, and practical ways to define and protect the operational role through unified platforms, processes, and strategic collaboration with development.

CTO · Operations · Platform Engineering
0 likes · 13 min read
FunTester
Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes, offering best‑practice guidelines and future trends for improving distributed system reliability.

Reliability · SRE · kubernetes
0 likes · 8 min read
Efficient Ops
Mar 18, 2025 · Operations

Is 24/7 On‑Call a Nightmare? Real Ops Insights from Zhihu Discussions

This article compiles diverse Zhihu comments on the reality of 24 × 7 on‑call duties, contrasting exaggerated myths with practical team‑based solutions, global shift models, backup strategies, and actionable tips for improving operations without sacrificing personal life.

On-Call · SRE · teamwork
0 likes · 7 min read
dbaplus Community
Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

JVM Optimization · Monitoring · SRE
0 likes · 20 min read
Efficient Ops
Feb 26, 2025 · Databases

Efficient Operations for Heterogeneous Databases: Insights from Guangdong Mobile

The article summarizes Lai Kunchi's presentation at the 24th GOPS Global Operations Conference, covering the current state and challenges of database development, Guangdong Mobile's database operation system, and future directions for managing heterogeneous databases in evolving business architectures.

AIOps · Database operations · SRE
0 likes · 3 min read
ITPUB
Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Data Engineering · Monitoring · Observability
0 likes · 12 min read
JD Tech Talk
Feb 6, 2025 · Operations

Stability Assurance Mechanisms and Practices for Site Reliability Engineering (SRE)

This article outlines comprehensive stability assurance mechanisms—including standards, process workflows, the distinction between developers and SREs, personal responsibilities, and practical construction directions—to guide teams in building resilient, high‑availability systems through proactive, daily, and incident‑response practices.

Reliability Engineering · SRE · Stability
0 likes · 10 min read
JD Cloud Developers
Feb 6, 2025 · Operations

How to Build a Robust Stability Framework: Key Mechanisms for SRE Success

This article outlines a comprehensive stability framework for SRE teams, detailing essential mechanisms such as review processes, coding standards, incident management, on‑call responsibilities, and daily operational practices, while also highlighting the cultural shift needed to achieve reliable, high‑availability systems.

Incident Management · Operations · SRE
0 likes · 11 min read
Ops Development Stories
Jan 23, 2025 · Operations

How SREs Can Boost Their Influence Within Teams

This article explores why influence matters for Site Reliability Engineers, outlines the challenges they face in gaining recognition, and provides practical strategies—enhancing technical expertise, improving communication, quantifying achievements, and sharing knowledge—to elevate their impact within organizations.

Operations · SRE · communication
0 likes · 19 min read
Efficient Ops
Jan 20, 2025 · Operations

Inside Qunar’s Pre‑Release Platform: Design, Practice, and Future Outlook

The article recaps Li Jingkang’s presentation at the 2024 GOPS Global Operations Conference, detailing the background, principles, design, and real‑world implementation of Qunar’s pre‑release platform, and outlines its future direction within DevOps, SRE, AIOps, and cloud‑native practices.

AIOps · Cloud Native · Operations
0 likes · 3 min read
Ops Development Stories
Jan 16, 2025 · Operations

How AI is Transforming Site Reliability Engineering (SRE)

This article examines how the rapid rise of AI reshapes Site Reliability Engineering by enhancing monitoring, automating operations, improving fault diagnosis, and presenting new challenges such as data security and model explainability, ultimately driving more efficient and reliable system management.

AI · Reliability · SRE
0 likes · 21 min read
Efficient Ops
Jan 14, 2025 · Operations

What Ops Professionals Learn from Real-World Incident Stories

This article compiles real‑world operations incidents—from accidental database deletions and faulty deployments to hidden data tampering and network device failures—highlighting how quick diagnosis, preventive maintenance, and SRE practices can mitigate impact on users, reputation, and revenue.

Case Studies · SRE · database backup
0 likes · 6 min read
Tencent Cloud Developer
Jan 7, 2025 · Operations

Designing High‑Availability Systems: Principles, Architecture, and Operations

This comprehensive guide explains how to design, build, and operate high‑availability systems by covering availability metrics, fault‑tolerance strategies, capacity planning, code and data layer architecture, automated testing, monitoring, and clear role responsibilities to ensure services stay reliable and resilient under load.

Cloud Native · High Availability · SRE
0 likes · 32 min read
Efficient Ops
Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing the major service disruptions—from Alibaba Cloud to OpenAI—this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Disaster Recovery · Operations · Outage Management
0 likes · 6 min read
Tech Architecture Stories
Dec 28, 2024 · Operations

Why Preventing Small Issues Is the Key to System Stability

The article explains how early detection and preventive measures—such as comprehensive monitoring, rate limiting, chaos testing, and proper SLOs—are essential for maintaining system stability and avoiding larger incidents, drawing on SRE principles and the incident triangle theory.

Error Budget · Incident Prevention · Operations
0 likes · 4 min read
Efficient Ops
Dec 8, 2024 · Operations

Unlocking BizDevOps: Key Insights from Shanghai’s Enterprise Summit

The article recaps Shanghai’s BizDevOps Enterprise Summit, highlighting five expert sessions on R&D‑operations integration in securities, platform engineering breakthroughs, large‑model agents in financial ops, Ctrip’s 10 PB JuiceFS practice, and core SRE stability strategies for financial firms.

AI Agents · BizDevOps · Cloud Native
0 likes · 4 min read
Efficient Ops
Nov 20, 2024 · Operations

How China’s Telecom Leaders Accelerate DevOps & AIOps Standards for Faster Delivery

The article outlines China’s 2024‑2027 information standard action plan, the rollout of ITU DevOps and AIOps assessments, and showcases dozens of telecom projects that achieved significant improvements in delivery speed, reliability, automation and observability through standardized DevOps, SRE and AI‑ops practices.

AIOps · SRE · Standardization
0 likes · 23 min read
Efficient Ops
Nov 19, 2024 · Operations

Mastering System Stability: Proven SRE Practices for Reliable, High‑Availability Services

This article explains how system stability depends on architecture and code details, defines SLA and the “nines” metric, outlines Google’s SRE hierarchy, and provides practical governance steps—including development and release processes, high‑availability design, capacity planning, monitoring, incident response, and team culture—to achieve reliable, high‑availability services.

High Availability · SRE · capacity planning
0 likes · 34 min read
Efficient Ops
Nov 14, 2024 · Operations

Why Alipay Crashed: Lessons on Backup and Disaster Recovery

The recent Alipay outage during Double‑11 was traced to a partial failure in its system message database, which left users facing payment errors, duplicate charges, and delayed withdrawals; the company’s response highlighted the importance of comprehensive backup, redundancy, disaster‑recovery planning, monitoring, and security measures to ensure service continuity.

Alipay · SRE · disaster-recovery
0 likes · 10 min read
Efficient Ops
Nov 14, 2024 · Operations

How SRE Standards Boost System Reliability in China’s Digital Era

Amid a surge of high‑profile outages, the CAICT introduces a comprehensive SRE framework that addresses large‑scale, high‑frequency changes, complex tech stacks, and massive traffic, outlining development and operational reliability practices, maturity levels, and actionable guidelines to enhance system stability.

Digital Governance · IT Management · SRE
0 likes · 12 min read