Tagged articles

SRE

414 articles · Page 1 of 5

Jun 24, 2026 · Artificial Intelligence

What Architects Should Focus on When Claude, Codex, and Mira Discuss Loop

The article examines Loop engineering for AI agents, arguing that beyond entry points like Claude, Codex, or Mira, architects must ensure reliable feedback, persistent state, clear stop conditions, and human hand‑off, drawing parallels to high‑reliability SRE practices and proposing concrete design and evaluation steps.

AI Agents · Automation · Loop Engineering

0 likes · 20 min read

Continuous Delivery 2.0

Jun 15, 2026 · Operations

Step‑by‑Step AIOps Rollout: How Tencent IEG Tech Ops Reinvented SRE Efficiency

Tencent IEG's tech operations team tackled six common SRE AI adoption bottlenecks with a three‑stage, layered framework, built a unified platform and metric system, and demonstrated measurable AI‑driven efficiency gains across multiple SRE scenarios.

AI · AIOps · Metrics

0 likes · 11 min read

TonyBai

Jun 15, 2026 · Operations

When AI Generates Code 10× Faster, Who Safeguards System Reliability?

The article analyzes Google’s SRE whitepaper on AI‑driven operations, detailing how generative AI accelerates code production 4‑10×, introduces five SRE AI autonomy levels, three core AI‑ops components, and a safety architecture that decouples decision‑making from execution to prevent catastrophic failures.

AI Ops · Automation · Google

0 likes · 12 min read

FunTester

Jun 12, 2026 · R&D Management

Why Removing QA Requires Building a New Quality Framework

Eliminating a dedicated QA function may look like cost savings, but without establishing a comprehensive quality system—including self‑testing, automation, release gates, monitoring, and post‑incident reviews—risk simply shifts to production, leading to hidden incidents, longer rollbacks, and ultimately higher total cost.

QA · Risk Management · SRE

0 likes · 18 min read

Continuous Delivery 2.0

Jun 11, 2026 · Operations

Step‑by‑Step AIOps Rollout at Tencent IEG: Reinventing SRE Efficiency

Tencent IEG’s tech‑operations team details a layered AIOps implementation that tackles six core SRE bottlenecks, builds a unified platform and metric system, and demonstrates measurable efficiency, quality, and cost‑saving gains across multiple operational scenarios.

AI · AIOps · Automation

0 likes · 11 min read

SuanNi

Jun 8, 2026 · Artificial Intelligence

First Enterprise IT Ops Agent Benchmark Shows Claude Leads with Just 47% Score

The ITBench-AA benchmark, the first evaluation specifically for enterprise IT operations agents, tests 59 SRE scenarios and reveals that even top models like Claude Opus 4.7 achieve only a 47% score, highlighting both the difficulty of the tasks and the cost‑effectiveness gap between proprietary and open‑source agents.

AI Agent · Claude · IT Operations

0 likes · 11 min read

MaGe Linux Operations

Jun 8, 2026 · Industry Insights

AI Reshapes the IT Industry: Which High-Value IT Jobs Will Dominate the Next Decade?

The article analyzes how AI is transforming IT hiring by favoring system‑level talent, ranks AI infrastructure, cloud‑native platform engineering and SRE as the most cost‑effective roles for the next 5‑10 years, and advises current ops staff to upskill accordingly.

AI · AI Infrastructure · Cloud Native

0 likes · 6 min read

DaTaobao Tech

May 27, 2026 · Artificial Intelligence

From Human‑AI Collaboration to AI‑Led Code Quality: Building a Digital SRE Agent for Automated Blocker Fixes

The article traces the evolution of AI in software development, explains why traditional code‑quality processes struggle with Blocker issues, and details how a browser‑automated AI SRE agent can discover, route, and fully repair these problems while keeping humans in the final review loop.

AI · Automation · SRE

0 likes · 18 min read

Cloud Native Technology Community

May 18, 2026 · Operations

How to Cut Engineering Time on Kubernetes Upgrades

Kubernetes upgrades can consume 4‑6 weeks of engineering effort per minor release, delaying product roadmaps and inflating cloud costs, while reports show teams lose dozens of workdays to incidents and over‑provisioned resources, highlighting the need for dedicated SRE ownership to reclaim time for business‑impacting work.

Operational Cost · Platform Engineering · SRE

0 likes · 8 min read

dbaplus Community

May 1, 2026 · Operations

Why a Simple Nginx Change Made All Gateway Requests Return 400 (And How to Fix It)

A production incident caused by replacing two Nginx reverse proxies introduced an upstream name with an underscore, resulting in invalid Host headers and 400 Bad Request responses from Spring Cloud Gateway; the article details the step‑by‑step investigation, evidence from logs, tcpdump, and code, and presents configuration fixes to restore normal operation.

HTTP 400 · NGINX · SRE

0 likes · 15 min read

FunTester

Apr 28, 2026 · Operations

How Self‑Healing Automation Platforms Transform SRE Practices

The article explains how a self‑healing platform improves SRE reliability by reducing MTTR, preserving error‑budget, automating high‑impact incident remediation, enforcing safety guardrails, and shifting team focus from firefighting to sustainable reliability engineering.

Automation · Error Budget · MTTR

0 likes · 10 min read

FunTester

Apr 27, 2026 · Operations

Why Relying on Humans for Incident Recovery Fails and How Self‑Healing Automation Platforms Help

The article explains that large‑scale incidents overwhelm on‑call engineers who must manually piece together context from countless signals, and shows how a self‑healing automation platform can take over repetitive, known failure patterns, verify fixes, and reduce fatigue while keeping humans in the loop for oversight.

Automation · Operations · Platform Engineering

0 likes · 8 min read

Linyb Geek Road

Apr 25, 2026 · Operations

How to Build Stable SaaS Systems: Key Practices for Reliability

The article outlines practical methods for ensuring SaaS system stability, covering resource‑related issues, middleware reliability, pre‑release gray deployments, automated release procedures, comprehensive monitoring, load‑balancing strategies, degradation handling, rate limiting, chaos engineering, and SRE implementation.

Monitoring · SRE · SaaS

0 likes · 10 min read

DevOps Coach

Apr 22, 2026 · Operations

2026 AI DevOps Outlook: 10 Must‑Watch MCP Servers Transforming SRE

The article surveys the rapidly growing Model Context Protocol (MCP) ecosystem in 2026, detailing ten AI‑enabled DevOps servers, their core capabilities, real‑world impact on SRE workflows, and a practical framework for selecting the most valuable servers for a given team.

AI DevOps · MCP · Observability

0 likes · 16 min read

Raymond Ops

Apr 20, 2026 · Operations

How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates

This article presents a complete SRE on‑call handbook that defines alert severity levels, provides concrete Prometheus Alertmanager configurations, outlines a step‑by‑step response flow, details war‑room roles, escalation paths, handoff checklists, post‑mortem procedures, and dozens of ready‑to‑use templates to reduce MTTR and improve reliability.

Alert Management · On-Call · Operations

0 likes · 27 min read

Ray's Galactic Tech

Apr 19, 2026 · Cloud Native

Building a Production‑Ready Cloud‑Native Kubernetes Platform: From Zero to SRE Success

This article presents a step‑by‑step guide to designing and implementing a production‑grade Kubernetes platform with GitOps, observability, capacity governance, fault‑injection, and SRE practices, showing how to achieve unified delivery, reliability, and low‑cost operation for high‑concurrency business services.

Cloud Native · GitOps · Observability

0 likes · 37 min read

Alibaba Cloud Native

Apr 8, 2026 · Operations

How HiClaw Transforms SRE with Multi‑Agent Collaboration in Cloud‑Native Environments

The article details how the HiClaw distributed multi‑agent platform is built and organized for SRE teams, explains the roles of human users and digital bots, describes permission design, showcases fault‑diagnosis and release scenarios, and evaluates the efficiency and innovation gains of this cloud‑native automation approach.

AI Ops · Automation · Cloud Native

0 likes · 14 min read

DevOps Coach

Mar 31, 2026 · Operations

How AI‑Driven Observability Can Cut MTTR: A 12‑Step Investigation Framework

This article explains how modern SRE teams can combine AI‑assisted observability with structured critical thinking to build a 12‑step investigation model that accelerates fault detection, hypothesis generation, telemetry validation, root‑cause analysis, and automated remediation, ultimately reducing MTTR and improving reliability.

AI · Incident Management · Observability

0 likes · 9 min read

DevOps Coach

Mar 26, 2026 · Operations

Can an AI Agent Replace Your SRE Night‑Shift? Inside Google’s Remote MCP‑Powered Autonomous SRE Agent

The article examines the chronic pain points of on‑call SRE teams—alert fatigue, long MTTR, inconsistent RCA, and communication bottlenecks—and presents a detailed, four‑layer architecture that uses Google’s Remote MCP server and an AI‑driven autonomous SRE agent to automate log retrieval, knowledge lookup, root‑cause analysis, and stakeholder notifications, dramatically improving reliability and efficiency.

Google Cloud · MCP · Operations

0 likes · 21 min read

DevOps Coach

Mar 24, 2026 · Operations

Avoid the Top 10 Kubernetes Monitoring Mistakes Every SRE Team Makes

This article examines the ten most common Kubernetes monitoring errors that SRE teams encounter, explains why each mistake harms reliability, and provides concrete, actionable solutions—including the Golden Signals framework, pod‑restart analysis, alert‑fatigue reduction, application‑level observability, etcd health checks, network metrics, control‑plane monitoring, log‑metric correlation, resource request tracking, and end‑to‑end observability—to help teams build robust, scalable monitoring systems.

Cloud Native · Monitoring · Observability

0 likes · 11 min read

MaGe Linux Operations

Mar 16, 2026 · Operations

Kubernetes Pod Troubleshooting Guide: Diagnose CrashLoopBackOff, OOMKilled & More

A comprehensive, step‑by‑step guide for SREs and DevOps engineers to diagnose and resolve common Kubernetes pod issues—including CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, Evicted, and Terminating—by leveraging pod lifecycle knowledge, kubectl commands, logs, events, node inspection, scripts, real‑world case studies, and monitoring best practices.

SRE · devops · kubernetes

0 likes · 55 min read

Ops Development Stories

Mar 14, 2026 · Operations

OpenClaw AI Hype: An SRE’s Warning About Hidden Ops Risks

The article examines the rapid popularity of the open‑source AI agent OpenClaw, revealing how hype, cost misconceptions, and inadequate security practices create serious operational risks for both individual and enterprise users, and offers concrete SRE‑style safeguards to mitigate these dangers.

AI · OpenClaw · Risk Management

0 likes · 9 min read

Raymond Ops

Mar 10, 2026 · Operations

How to Master Service Avalanche Recovery: A Complete SRE Playbook from Alert to Restoration

This guide walks SRE and senior operations engineers through a real-world service‑avalanche incident, detailing alert hierarchy design, fault‑location commands, emergency SOPs, capacity‑baseline building, and post‑mortem best practices to dramatically reduce MTTR in distributed micro‑service environments.

SRE · Service Avalanche · capacity planning

0 likes · 19 min read

DevOps Coach

Mar 6, 2026 · Operations

SRE vs Platform Engineering vs DevOps: Key Differences, Roles, and Toolchains

An in‑depth comparison of Site Reliability Engineering (SRE), Platform Engineering, and DevOps explains their origins, core responsibilities, distinct tools, and how they complement each other in modern cloud‑native organizations, helping teams choose the right practices for reliable, scalable software delivery.

Cloud Native · Platform Engineering · SRE

0 likes · 9 min read

Architect-Kip

Mar 4, 2026 · Operations

Essential SRE Monitoring and Alerting Standards: From Metrics to Incident Response

This guide outlines comprehensive SRE monitoring and alerting standards, covering core principles, log instrumentation, health‑check requirements, baseline resource and application metrics, alarm severity tiers, response SLAs, on‑call rotation, continuous optimization, and noise‑reduction mechanisms to ensure reliable service operation.

Alerting · Metrics · Monitoring

0 likes · 14 min read

Raymond Ops

Mar 2, 2026 · Operations

Why Most Alerts Fail and How to Build a Night‑Quiet, High‑Signal Monitoring System

This article examines the root causes of alert fatigue—mis‑configured thresholds, noisy alerts, lack of context, and poor routing—then presents a step‑by‑step guide using golden signals, dynamic baselines, enriched alert payloads, severity‑based routing, and suppression techniques to create an effective, low‑noise monitoring system.

Alerting · Alertmanager · Monitoring

0 likes · 24 min read

Raymond Ops

Mar 1, 2026 · Operations

How I Transitioned from Traditional Ops to SRE/DevOps in 18 Months

This detailed guide shares a step‑by‑step 18‑month roadmap, covering self‑assessment, skill acquisition (Python, Kubernetes, monitoring), project execution, interview preparation, and real‑world outcomes for engineers moving from legacy operations to SRE/DevOps roles.

CI/CD · Monitoring · Python

0 likes · 35 min read

Instant Consumer Technology Team

Feb 6, 2026 · Operations

How eBPF Transforms Modern SRE Practices and Cloud‑Native Operations

This article explores the strategic role of eBPF in cloud‑native operations, detailing its technical foundations, real‑world use cases from major tech companies, step‑by‑step troubleshooting methods, and a concrete implementation for TCP retransmission monitoring in a high‑traffic gateway system.

Cloud Native · Observability · Operations

0 likes · 21 min read

Ops Community

Feb 2, 2026 · Operations

How to Process 10GB Logs in 30 Seconds with Grep, Sed, and Awk

This comprehensive guide shows how to use the GNU tools grep, sed, and awk to quickly analyse massive Nginx access logs, covering their streaming design, optimal command parameters, real‑world examples, performance tricks, security safeguards and step‑by‑step scripts for fault isolation and reporting.

SRE · Shell Scripting · awk

0 likes · 38 min read

DevOps Coach

Jan 25, 2026 · Operations

Why Infra Companies Are Racing Into Observability and What It Means for 2026

The article examines how SRE and infrastructure teams are converging, why major infra vendors are acquiring observability assets, the rising cost pressures, and how OpenTelemetry combined with Apache Iceberg forms a new standard stack that AI‑driven incident response will rely on in the coming years.

AI incident response · Apache Iceberg · SRE

0 likes · 11 min read

Ops Development Stories

Jan 12, 2026 · Operations

Choosing the Best 2026 Observability Stack: From Collection to Alerts

This article reviews the 2026 observability landscape, outlines selection principles, compares open‑source and commercial solutions for data collection, storage, alerting and event management, and discusses how AI is reshaping monitoring and AIOps practices.

Alerting · Metrics · Monitoring

0 likes · 9 min read

Ubuntu

Jan 11, 2026 · Operations

Why Debian 13.3 Matters for Ubuntu 26.04 LTS: Deep Dive into the New Toolchain and Security Upgrades

Debian 13.3, released on January 10, 2026, brings GCC 15.2, GDB 17.1, Signed‑by repository policies and eBPF audit hardening, all of which shape the upcoming Ubuntu 26.04 LTS feature‑freeze, migration timeline, and security posture for backend, AI and SRE workloads.

Debian · GCC 15.2 · GDB 17.1

0 likes · 6 min read

MaGe Linux Operations

Jan 7, 2026 · Operations

How to Eliminate Alert Fatigue: 10 Proven Prometheus Alerting Techniques

This comprehensive guide walks you through the architecture of Prometheus and Alertmanager, shows how to design, write, and test robust alert rules, and shares ten practical techniques—including proper for‑durations, rate() usage, recording rules, multi‑level alerts, and inhibition—to dramatically reduce alert noise and improve SRE reliability.

Alerting · Alertmanager · Monitoring

0 likes · 40 min read

Ops Development Stories

Dec 31, 2025 · Operations

12 Major 2025 Internet Outages: What Every Ops Team Can Learn

This article compiles twelve high‑profile internet service failures from 2025, detailing each incident’s description, micro‑scenario, technical root cause, and risk perspective, and extracts actionable lessons on infrastructure resilience, change management, and security‑aware operations.

Internet Outages · Operations · Reliability

0 likes · 20 min read

MaGe Linux Operations

Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete, two‑year‑tested SRE on‑call process that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations

0 likes · 26 min read

Continuous Delivery 2.0

Dec 9, 2025 · Operations

How Tencent Interactive Entertainment Scaled SRE: From Traditional Ops to Modern Reliability Engineering

This article examines Tencent Interactive Entertainment's eight‑year journey from a traditional operations team to a 400‑person SRE organization, detailing timeline milestones, the shift in mindset and practices, management challenges, and the broader industry trends driving reliability engineering adoption.

Operations · Organizational Change · Reliability Engineering

0 likes · 13 min read

DevOps Coach

Dec 8, 2025 · Operations

How to Quantify SRE ROI: Turning Reliability Metrics into Business Value

This article explains how SRE leaders can bridge the gap between technical reliability metrics and business outcomes by defining core SRE concepts, applying a step‑by‑step ROI formula, illustrating code‑level impact, avoiding common pitfalls, and looking ahead to AI‑driven reliability forecasting.

BusinessValue · Metrics · Operations

0 likes · 10 min read

Raymond Ops

Dec 6, 2025 · Cloud Native

Master Helm: From Installation to Advanced Kubernetes Deployments

This comprehensive guide explains Helm’s core concepts, installation steps, basic commands, real‑world deployment examples for Nginx and WordPress, advanced features like hooks and sub‑charts, common pitfalls, and SRE‑focused best practices for reliable, automated Kubernetes package management.

CI/CD · SRE · devops

0 likes · 15 min read

Continuous Delivery 2.0

Nov 17, 2025 · Operations

How Tencent’s BlueKing Platform Evolved into a Full‑Stack SRE Solution

This article traces the evolution of Tencent’s BlueKing platform from its early automation phase to a data‑driven, AI‑enhanced SRE ecosystem, highlighting architectural milestones, open‑source contributions, and practical lessons for organizations adopting Site Reliability Engineering.

Automation · BlueKing · Platform Engineering

0 likes · 10 min read

DevOps Coach

Nov 10, 2025 · Operations

How to Use SRE Metrics for Data‑Driven Reliability and Faster Releases

This guide explains the SRE framework—SLA, SLO, SLI hierarchy, golden signals, error budgets, and DORA metrics—showing how to instrument a Python app with OpenTelemetry, query Prometheus, avoid common pitfalls, and adopt a cultural and technical process that balances feature velocity with system stability.

DORA · Error Budget · Golden Signals

0 likes · 18 min read

Efficient Ops

Nov 9, 2025 · Operations

How Tencent’s PCG Achieves Full‑Link Observability and AI‑Powered SRE

The talk details Tencent PCG’s end‑to‑end observability platform, its data‑standardization pipeline, client‑backend session linking, AI‑enhanced SRE Agent with large language models, and the roadmap toward a SaaS offering, illustrating how modern operations integrate AI for rapid fault localization.

AI · Large Language Model · Monitoring

0 likes · 17 min read

Continuous Delivery 2.0

Nov 4, 2025 · Operations

Google's STAMP Framework: Redefining SRE for AI‑Driven Systems

Google’s SRE team is shifting from traditional error‑budget approaches to the STAMP (Systems-Theoretic Accident Model and Processes) framework, applying control theory and system‑level analysis to manage the growing complexity of AI‑powered services, improve safety, and proactively prevent hazardous states.

AI · Control Theory · Reliability

0 likes · 12 min read

Instant Consumer Technology Team

Nov 3, 2025 · Artificial Intelligence

Large Language Models Power Big Data SRE Knowledge & Root‑Cause Automation

Facing the growing complexity of big‑data platforms, the SRE team adopted large‑language‑model agents to automate knowledge management and root‑cause analysis, employing Retrieval‑Augmented Generation, a vector store, and the Model Context Protocol to enable intelligent, scalable, and efficient incident diagnosis and resolution.

AI · Knowledge Management · MCP

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Oct 29, 2025 · Operations

How to Prevent Avalanche Failures in Large‑Scale Microservice Systems

This article explains how Baidu's SRE team identified the root causes of avalanche failures in massive microservice architectures, modeled system limits with Little’s Law, and implemented engineering practices such as retry budgets, queue throttling, and global TTL controls to achieve self‑healing and eliminate avalanche incidents.

Microservices · Reliability Engineering · SRE

0 likes · 9 min read

Efficient Ops

Oct 19, 2025 · Operations

How Ningbo Bank Boosted System Reliability with SRE: Lessons from a 3‑Level Assessment

Ningbo Bank’s personal mobile banking system passed the SRE Level‑3 assessment, showcasing how systematic SRE practices, metric‑driven reliability engineering, and cross‑team collaboration can dramatically improve system stability, reduce failures, and support digital transformation in the financial sector.

Banking Operations · IT stability · SRE

0 likes · 16 min read

MaGe Linux Operations

Oct 16, 2025 · Operations

SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

Alerting · SRE · Service Avalanche

0 likes · 20 min read

Huawei Cloud Developer Alliance

Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability

0 likes · 17 min read

Continuous Delivery 2.0

Oct 11, 2025 · Operations

Mastering Enterprise SRE: From Core Concepts to Practical Implementation

This comprehensive guide explains the core principles of Site Reliability Engineering, outlines a phased roadmap for enterprise adoption, details essential monitoring, automation, and reliability platforms, and addresses team structure, talent development, common challenges, and real‑world success stories to help organizations build effective SRE practices.

Automation · SRE · Site Reliability Engineering

0 likes · 16 min read

Continuous Delivery 2.0

Oct 9, 2025 · Operations

Why SRE Is the Next Evolution for Enterprise Operations

This article introduces the first part of an SRE series, explaining why organizations need SRE, the incremental path for building SRE capabilities, and differentiated strategies for medium and large enterprises, emphasizing gradual, platform‑driven automation and cultural change for reliable digital transformation.

Enterprise · Platform Engineering · SRE

0 likes · 8 min read

Alibaba Cloud Developer

Oct 9, 2025 · Operations

How AI‑Powered Multi‑Agent Systems Turn Fault Postmortems into Proactive Risk Prevention

This article explains how an AI‑driven multi‑agent platform automates fault postmortem generation, enriches analysis with memory management, prompt engineering, and RAG techniques, and delivers actionable insights for SREs, developers, and non‑technical stakeholders, ultimately shifting incident handling from reactive to proactive.

AI · Automation · Incident Management

0 likes · 44 min read

DevOps Coach

Oct 6, 2025 · Interview Experience

How I Doubled My Salary by Switching from Traditional Ops to SRE in 18 Months

Over 18 months, the author details a step‑by‑step transformation from a fire‑fighting traditional operations role to a high‑paying SRE/DevOps career, covering motivations, skill gaps, learning plans, project implementations, interview preparation, and real‑world outcomes, offering a practical roadmap for engineers seeking similar growth.

CI/CD · Cloud Native · Monitoring

0 likes · 44 min read

Programmer DD

Oct 3, 2025 · Operations

How Netflix Turned Incident Management into a Scalable Engineer‑Owned Process

This article explains how Netflix’s engineering teams shifted incident handling from a centralized SRE function to a company‑wide, engineer‑driven practice by selecting the right tooling, standardizing processes, and reshaping culture, enabling rapid, reliable responses for hundreds of millions of viewers.

Incident Management · Netflix · Reliability Engineering

0 likes · 10 min read

DevOps Coach

Oct 2, 2025 · Interview Experience

What Core Skills Do SRE Engineers Need to Master?

This article outlines the essential technical, incident‑response, reliability‑management, collaboration, and systemic‑thinking abilities that Site Reliability Engineering (SRE) professionals must develop to ensure high‑availability, stable services in modern internet environments.

Incident Management · SRE · Site Reliability Engineering

0 likes · 5 min read

Architecture & Thinking

Sep 17, 2025 · Artificial Intelligence

How the 32B ‘Zhiyu’ Model is Revolutionizing Intelligent Operations

The Zhiyu model, a 32‑billion‑parameter SRE‑focused LLM, combines extensive domain knowledge, enhanced professional skills, and deterministic RAG to deliver precise, actionable insights for intelligent operations, backed by a robust multi‑source training pipeline, staged training, and flexible deployment options.

AI Operations · Model Training · RAG

0 likes · 7 min read

Ops Community

Sep 16, 2025 · Operations

Mastering SRE: Fast Incident Response and Prevention Strategies

This guide walks SRE engineers through a complete incident lifecycle—preventive multi‑layer monitoring, chaos‑testing drills, rapid 10‑minute response tactics, systematic root‑cause analysis, effective communication roles, post‑mortem reviews, and practical case studies—helping teams minimize downtime and business loss.

Incident Management · Root Cause Analysis · SRE

0 likes · 11 min read

Efficient Ops

Aug 25, 2025 · Operations

How SOMM Is Revolutionizing Intelligent Ops with AIOps, SRE & FinOps

The China Academy of Information and Communications Technology introduced the SOMM (System Operation Maturity Model) framework, emphasizing tool intelligence, refined management, and robust operation, and detailed its AIOps, SRE, and FinOps assessment modules, evaluation criteria, maturity levels, and showcase of leading enterprises that have achieved top‑tier certifications.

AIOps · FinOps · Maturity Model

0 likes · 8 min read

MaGe Linux Operations

Aug 24, 2025 · Operations

Master Production Incident Troubleshooting: SEAL Methodology & Essential Ops Toolbox

This comprehensive guide shares a veteran ops engineer's real‑world troubleshooting mindset, the SEAL framework, a curated toolbox of monitoring, logging, performance, and network utilities, detailed case studies, incident‑response grading, automation scripts, and future‑ready AIOps practices for keeping production systems stable.

Automation · Monitoring · SRE

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Aug 6, 2025 · Cloud Native

How SREWorks Transforms Image Building with Cloud‑Native CI/CD and Kaniko

This article explains how SREWorks evolved its continuous delivery pipeline from traditional machine‑based builds to fully cloud‑native, elastic image construction using on‑demand Pods, Docker‑in‑Docker techniques, and the Kaniko builder to achieve secure, serverless, and fast deployments.

Kaniko · SRE · cloud-native

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Aug 5, 2025 · Operations

Inside Alibaba’s Tesla: Data‑Driven Ops for 100k+ Big Data Nodes

The article details how Alibaba’s Tesla SRE platform supports the massive offline and real‑time big‑data ecosystems through a layered, data‑driven operations framework—DataOps—integrating unified portals, configuration, job, workflow, and analytics platforms, enabling automated monitoring, intelligent decision‑making, and self‑healing capabilities across 100,000+ nodes.

AIOps · Big Data · DataOps

0 likes · 20 min read

Alibaba Cloud Big Data AI Platform

Aug 5, 2025 · Operations

How Alibaba’s Open‑Source SREWorks Transforms Cloud‑Native Data Operations

Alibaba's SREWorks platform, now open‑source, combines cloud‑native architecture, DataOps and AIOps to address the growing complexity of big‑data and AI operations, offering a layered SaaS/PaaS/IaaS solution that streamlines delivery, monitoring, management, control, operation, and service for modern enterprises.

AIOps · Cloud Native · DataOps

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Aug 4, 2025 · Operations

Unlocking Unmanned Ops: DataOps & SRE Strategies for Big Data Management

The article explains how DataOps and SRE practices enable large‑scale, data‑driven operations in big‑data environments, aiming for fully automated, intelligent, and ultimately unmanned management of complex systems.

AI Ops · Big Data · DataOps

0 likes · 6 min read

MaGe Linux Operations

Jul 12, 2025 · Operations

Master Helm: The Ultimate Guide to Kubernetes Package Management and Deployment

This comprehensive article explains Helm’s core concepts, installation, basic commands, advanced features, real‑world case studies, common pitfalls, and SRE best practices, showing how Helm streamlines Kubernetes deployments, improves reliability, and enables automated, version‑controlled operations for modern cloud‑native environments.

Infrastructure Automation · SRE · kubernetes

0 likes · 16 min read

Tech Architecture Stories

Jun 14, 2025 · Operations

What Caused Google Cloud’s Massive June 2025 Outage and What We Can Learn

On June 12, 2025, a faulty policy update in Google’s Service Control triggered null‑pointer crashes across regions, causing a global outage that also impacted Cloudflare, Twitch, Discord, and others; the incident exposed missing feature flags, inadequate error handling, and lack of exponential backoff, prompting rapid SRE remediation.

Google Cloud · SRE · cloud operations

0 likes · 7 min read

DevOps Operations Practice

Jun 11, 2025 · Operations

Ops vs DevOps vs SRE: Which Role Matches Your Career Goals?

This article compares traditional Operations (Ops), DevOps, and Site Reliability Engineering (SRE) by outlining their definitions, core responsibilities, typical technology stacks, and career considerations, helping readers understand the distinct philosophies and choose the path that best fits their interests and market demand.

Career · SRE · technology stack

0 likes · 6 min read

DevOps Operations Practice

Jun 7, 2025 · Operations

How Ops Professionals Can Reach a 300k Annual Salary: Real‑World Tips

This article compiles practical advice from experienced operations engineers on the challenges and strategies for achieving a 300,000 CNY yearly salary, covering skill development, career moves, company size, automation, and the evolving role of SRE/DevOps.

Career · Operations · SRE

0 likes · 6 min read

Efficient Ops

Jun 3, 2025 · Operations

What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure

This article shares Anthropic’s SRE engineer insights on 23 actionable practices—from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation—offering concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.

Cloud Native · SRE · devops

0 likes · 12 min read

Efficient Ops

May 28, 2025 · Operations

Unlocking Intelligent Operations: Inside China’s SOMM Maturity Model for AIOps, SRE, and FinOps

The article introduces China’s System Operation Maturity Model (SOMM), detailing its three pillars—AIOps, SRE, and FinOps—along with the underlying standards, assessment results, and how enterprises leverage these frameworks to achieve smarter, more reliable, and cost‑effective IT operations.

AIOps · FinOps · IT Operations

0 likes · 7 min read

FunTester

May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Fault Injection · Operations · Reliability

0 likes · 9 min read

dbaplus Community

May 11, 2025 · Operations

Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide

This guide explains the four SRE golden signals—Latency, Traffic, Errors, and Saturation—covers their definitions, how to measure them with Prometheus in Node.js, compares them to RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.

Golden Signals · Observability · SRE

0 likes · 12 min read

Baidu Geek Talk

Apr 23, 2025 · Operations

Baidu SRE Digital Immunity System: Construction, Evolution, and Practice

Baidu’s SRE digital‑immune system, evolved into an AI‑powered intelligent immunity platform, quantifies and mitigates risk across thousands of services by integrating data‑driven monitoring, rule‑based detection, and large‑model GraphRAG knowledge mining, cutting degradation cases by ~40% and shifting operations from reactive troubleshooting to proactive, data‑centric quality assurance.

AI · Cloud Native · Digital Immunity

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

Apr 18, 2025 · Operations

How Baidu’s AI‑Powered Digital Immune System Reinvents SRE Risk Management

This article explains why modern SRE teams need a digital immune system, describes Baidu’s data‑driven approach to improve system resilience, outlines the three‑phase evolution from digital transformation to AI‑enhanced risk mining, and shares concrete results and future directions for sustainable operations.

AI · Cloud Native · Digital Immune System

0 likes · 15 min read

21CTO

Apr 9, 2025 · Operations

9 Must‑Have Container Monitoring Tools and Best Practices for Modern Cloud‑Native Environments

This article reviews nine practical container‑monitoring solutions—from Last9 and Prometheus to Dynatrace and Elastic Observability—detailing their key features, pricing, and why developers prefer them, and then offers comprehensive best‑practice guidance for metrics, tagging, alerts, and advanced observability strategies in Kubernetes‑driven cloud‑native deployments.

Alerting · Cloud Native · Metrics

0 likes · 25 min read

Liangxu Linux

Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability

0 likes · 13 min read

Efficient Ops

Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

Incident Management · Operations · SRE

0 likes · 6 min read

Ops Development Stories

Mar 24, 2025 · Operations

Why Do Some Ops Teams Face Value Challenges? Insights for CTOs

Operations leaders and CTOs often confront the question of the true value of their teams, and this article explores who asks it, why it matters, typical challenges, and practical ways to define and protect the operational role through unified platforms, processes, and strategic collaboration with development.

CTO · Operations · Platform Engineering

0 likes · 13 min read

FunTester

Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes, offering best‑practice guidelines and future trends for improving distributed system reliability.

Reliability · SRE · kubernetes

0 likes · 8 min read

Efficient Ops

Mar 18, 2025 · Operations

Is 24/7 On‑Call a Nightmare? Real Ops Insights from Zhihu Discussions

This article compiles diverse Zhihu comments on the reality of 24 × 7 on‑call duties, contrasting exaggerated myths with practical team‑based solutions, global shift models, backup strategies, and actionable tips for improving operations without sacrificing personal life.

On-Call · SRE · teamwork

0 likes · 7 min read

MaGe Linux Operations

Mar 6, 2025 · Operations

How Large Language Models Are Revolutionizing SRE from Firefighting to Proactive Ops

This article explores how open‑source large language models like DeepSeek empower SRE teams to shift from reactive firefighting to proactive, predictive operations, detailing technical principles, real‑world case studies, essential skill sets, and future trends that reshape the operations landscape.

AI Ops · Automation · Observability

0 likes · 8 min read

dbaplus Community

Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

JVM Optimization · Monitoring · SRE

0 likes · 20 min read

Efficient Ops

Mar 4, 2025 · Operations

Mastering SRE: How to Define SLIs, SLOs, and Build Reliable Cloud‑Native Systems

This article explains how SRE teams should collaboratively define Service Level Indicators, Objectives, and Agreements, and then covers reliability, performance, observability signals, error budgeting, risk management, incident handling, and the engineering work needed to build robust cloud‑native platforms.

Error Budget · SLI · SLO

0 likes · 13 min read

Efficient Ops

Feb 26, 2025 · Databases

Efficient Operations for Heterogeneous Databases: Insights from Guangdong Mobile

The article summarizes Lai Kunchi's presentation at the 24th GOPS Global Operations Conference, covering the current state and challenges of database development, Guangdong Mobile's database operation system, and future directions for managing heterogeneous databases in evolving business architectures.

AIOps · Database operations · SRE

0 likes · 3 min read

Efficient Ops

Feb 17, 2025 · Operations

From Bronze to AI‑Powered Ops: Mastering the Operations Career Ladder

This article explores the hierarchy of operations roles, outlines five career stages from entry‑level to AI‑driven expert, and offers practical advice on building foundations, automation, high‑availability design, and embracing emerging technologies.

AI · Automation · Operations

0 likes · 6 min read

ITPUB

Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Data Engineering · Monitoring · Observability

0 likes · 12 min read

Efficient Ops

Feb 6, 2025 · Operations

Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

AIOps · Cloud Native · Monitoring

0 likes · 4 min read

JD Tech Talk

Feb 6, 2025 · Operations

Stability Assurance Mechanisms and Practices for Site Reliability Engineering (SRE)

This article outlines comprehensive stability assurance mechanisms—including standards, process workflows, the distinction between developers and SREs, personal responsibilities, and practical construction directions—to guide teams in building resilient, high‑availability systems through proactive, daily, and incident‑response practices.

Reliability Engineering · SRE · Stability

0 likes · 10 min read

JD Cloud Developers

Feb 6, 2025 · Operations

How to Build a Robust Stability Framework: Key Mechanisms for SRE Success

This article outlines a comprehensive stability framework for SRE teams, detailing essential mechanisms such as review processes, coding standards, incident management, on‑call responsibilities, and daily operational practices, while also highlighting the cultural shift needed to achieve reliable, high‑availability systems.

Incident Management · Operations · SRE

0 likes · 11 min read

Ops Development Stories

Jan 23, 2025 · Operations

How SREs Can Boost Their Influence Within Teams

This article explores why influence matters for Site Reliability Engineers, outlines the challenges they face in gaining recognition, and provides practical strategies—enhancing technical expertise, improving communication, quantifying achievements, and sharing knowledge—to elevate their impact within organizations.

Operations · SRE · communication

0 likes · 19 min read

Efficient Ops

Jan 20, 2025 · Operations

Inside Qunar’s Pre‑Release Platform: Design, Practice, and Future Outlook

The article recaps Li Jingkang’s presentation at the 2024 GOPS Global Operations Conference, detailing the background, principles, design, and real‑world implementation of Qunar’s pre‑release platform, and outlines its future direction within DevOps, SRE, AIOps, and cloud‑native practices.

AIOps · Cloud Native · Operations

0 likes · 3 min read

Ops Development Stories

Jan 16, 2025 · Operations

How AI is Transforming Site Reliability Engineering (SRE)

This article examines how the rapid rise of AI reshapes Site Reliability Engineering by enhancing monitoring, automating operations, improving fault diagnosis, and presenting new challenges such as data security and model explainability, ultimately driving more efficient and reliable system management.

AI · Reliability · SRE

0 likes · 21 min read

Efficient Ops

Jan 14, 2025 · Operations

What Ops Professionals Learn from Real-World Incident Stories

This article compiles real‑world operations incidents—from accidental database deletions and faulty deployments to hidden data tampering and network device failures—highlighting how quick diagnosis, preventive maintenance, and SRE practices can mitigate impact on users, reputation, and revenue.

Case Studies · SRE · database backup

0 likes · 6 min read

Tencent Cloud Developer

Jan 7, 2025 · Operations

Designing High‑Availability Systems: Principles, Architecture, and Operations

This comprehensive guide explains how to design, build, and operate high‑availability systems by covering availability metrics, fault‑tolerance strategies, capacity planning, code and data layer architecture, automated testing, monitoring, and clear role responsibilities to ensure services stay reliable and resilient under load.

Cloud Native · High Availability · SRE

0 likes · 32 min read

Efficient Ops

Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing the major service disruptions—from Alibaba Cloud to OpenAI—this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Disaster Recovery · Operations · Outage Management

0 likes · 6 min read

Tech Architecture Stories

Dec 28, 2024 · Operations

Why Preventing Small Issues Is the Key to System Stability

The article explains how early detection and preventive measures—such as comprehensive monitoring, rate limiting, chaos testing, and proper SLOs—are essential for maintaining system stability and avoiding larger incidents, drawing on SRE principles and the incident triangle theory.

Error Budget · Incident Prevention · Operations

0 likes · 4 min read

Efficient Ops

Dec 8, 2024 · Operations

Unlocking BizDevOps: Key Insights from Shanghai’s Enterprise Summit

The article recaps Shanghai’s BizDevOps Enterprise Summit, highlighting five expert sessions on R&D‑operations integration in securities, platform engineering breakthroughs, large‑model agents in financial ops, Ctrip’s 10 PB JuiceFS practice, and core SRE stability strategies for financial firms.

AI Agents · BizDevOps · Cloud Native

0 likes · 4 min read

Efficient Ops

Nov 20, 2024 · Operations

How China’s Telecom Leaders Accelerate DevOps & AIOps Standards for Faster Delivery

The article outlines China’s 2024‑2027 information standard action plan, the rollout of ITU DevOps and AIOps assessments, and showcases dozens of telecom projects that achieved significant improvements in delivery speed, reliability, automation and observability through standardized DevOps, SRE and AI‑ops practices.

AIOps · SRE · Standardization

0 likes · 23 min read

Efficient Ops

Nov 19, 2024 · Operations

Mastering System Stability: Proven SRE Practices for Reliable, High‑Availability Services

This article explains how system stability depends on architecture and code details, defines SLA and the “nines” metric, outlines Google’s SRE hierarchy, and provides practical governance steps—including development and release processes, high‑availability design, capacity planning, monitoring, incident response, and team culture—to achieve reliable, high‑availability services.

High Availability · SRE · capacity planning

0 likes · 34 min read

Efficient Ops

Nov 14, 2024 · Operations

Why Alipay Crashed: Lessons on Backup and Disaster Recovery

The recent Alipay outage during Double‑11 revealed a partial failure in its system message database that left users facing payment errors, duplicate charges, and delayed withdrawals, while the company’s response highlighted the importance of comprehensive backup, redundancy, disaster‑recovery planning, monitoring, and security measures to ensure service continuity.

Alipay · SRE · disaster-recovery

0 likes · 10 min read

Efficient Ops

Nov 14, 2024 · Operations

How SRE Standards Boost System Reliability in China’s Digital Era

Amid a surge of high‑profile outages, the CAICT introduces a comprehensive SRE framework that addresses large‑scale, high‑frequency changes, complex tech stacks, and massive traffic, outlining development and operational reliability practices, maturity levels, and actionable guidelines to enhance system stability.

Digital Governance · IT Management · SRE

0 likes · 12 min read
