Tagged articles

403 articles

Jun 30, 2024 · Operations

Effective Incident Mitigation and Recovery: Practical SRE Strategies

The article outlines SRE‑based incident mitigation and recovery practices, covering urgent mitigations, impact reduction, key metrics such as TTD, TTR, TBF, and detailed strategies for shortening detection and repair times, preventing fatigue, improving observability, and designing resilient systems.

Mitigation · Operations · Reliability

0 likes · 23 min read

DevOps Coach

Jun 27, 2024 · Operations

How to Run Effective Incident Response Drills for Resilient Systems

This article explains why regular disaster role‑playing, systematic testing, and focused responder preparation are essential for building robust incident response capabilities and reducing operational risk in production environments.

Operations · Resilience · SRE

0 likes · 7 min read

Efficient Ops

Jun 25, 2024 · Operations

Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

SRE · monitoring · system reliability

0 likes · 20 min read

G7 EasyFlow Tech Circle

Jun 13, 2024 · Operations

Boost Service Availability: MTBF, MTTR, and Practical High‑Availability Tactics

This article explores how service availability is quantified, explains the impact of MTBF and MTTR on reliability, and presents concrete operational practices—including redundancy, traffic control, and change‑management techniques—to move systems from basic uptime to true high‑availability levels.

Availability · MTBF · MTTR

0 likes · 13 min read

Practical DevOps Architecture

May 22, 2024 · Operations

SRE & Linux Operations Course Outline

This article presents a detailed curriculum covering fundamental infrastructure, cluster architecture, automation, log collection, Linux system administration, containerization, monitoring, security, and related DevOps tools across multiple phases and daily modules for comprehensive SRE training.

Automation · SRE · cloud

0 likes · 8 min read

Efficient Ops

May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

SRE · capacity planning · incident response

0 likes · 29 min read

Efficient Ops

May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Availability · Fault Injection · SRE

0 likes · 7 min read

Efficient Ops

May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google · SRE · incident management

0 likes · 12 min read

DevOps Engineer

Apr 22, 2024 · Operations

From Jenkins X Contributor to Jenkins Infrastructure SRE: A Career Journey

This interview recounts Hervé Le Meur’s path from a B2B marketing consultant to a Jenkins infrastructure SRE, detailing his work with Jenkins X, Kubernetes, CI/CD pipelines, and the lessons he shares for newcomers to open‑source contributions.

Jenkins · Kubernetes · SRE

0 likes · 12 min read

dbaplus Community

Apr 21, 2024 · Databases

Why Teams Hesitate to Upgrade Databases and How to Overcome the Risks

The article examines why many organizations avoid upgrading databases despite end‑of‑life warnings, highlighting cultural resistance, skill gaps, security and performance risks, and offers practical strategies for planning and executing safe database upgrades.

DBA · EOL · SRE

0 likes · 10 min read

Efficient Ops

Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article explains the definitions of stability and high availability, distinguishes their relationship, outlines key performance indicators, and provides a comprehensive framework—including fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices—to help teams build reliable, highly available systems.

SRE · capacity planning · high availability

0 likes · 10 min read

Bilibili Tech

Apr 9, 2024 · Operations

BCM – Building and Deploying Bilibili’s Chaos Engineering Platform

At the 2024 GOPS Global Operations Conference, Bilibili senior R&D engineer Gu Lintao will present BCM—Bilibili’s Chaos Engineering Platform—showcasing how its design and capabilities let developers, testers, and SREs safely inject faults, uncover hidden architectural risks, and improve service stability through real‑world drills and systematic reliability engineering.

Bilibili · DevOps · Reliability

0 likes · 3 min read

Efficient Ops

Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work—Infrastructure, Platform, and Business—covers hiring challenges, daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, user support, and offers practical interview and career advice.

Oncall · Operations · SRE

0 likes · 22 min read

Efficient Ops

Apr 2, 2024 · Operations

What Do Leading Tech Giants Expect from SREs? Job Posting Insights

Amid economic growth and frequent continuity incidents, major internet firms are redefining SRE roles, emphasizing cost reduction, system resilience, risk management, AI‑driven operations, and close collaboration with development teams, as revealed by a detailed analysis of recent job postings from Ant Group, Alibaba, ByteDance and others.

AI-native · Cost Optimization · SRE

0 likes · 9 min read

Efficient Ops

Mar 25, 2024 · Operations

Why SRE Exists and How It Solves Modern Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how SRE teams use SLOs, monitoring, and scenario drills to improve system reliability, performance, and observability in complex production environments.

DevOps · Operations · Reliability

0 likes · 12 min read

Efficient Ops

Mar 25, 2024 · Operations

How CAICT’s SRE Standards Strengthen System Reliability and Continuity

This article outlines the rising frequency of system outages, explains the key characteristics and challenges of modern large‑scale distributed systems, introduces China’s CAICT SRE framework and its two‑part reliability model, showcases a successful SRE case, and announces the 2024 SRE maturity assessment program.

Digital Governance · SRE · software reliability

0 likes · 12 min read

Efficient Ops

Mar 19, 2024 · Operations

Is a Career in Operations Worth It? SRE, Network, DBA & Real‑World Insights

This article compiles several Zhihu answers that explain the various operations roles (SRE, network, system, DBA, and development), describe their responsibilities, challenges, and career prospects, and address whether the work is boring and whether it is worth pursuing.

DevOps · IT · Operations

0 likes · 14 min read

Efficient Ops

Mar 13, 2024 · Operations

What Does an Operations Engineer Do? Skills, Tools, and Career Path

This article explains the role of an operations engineer, covering daily responsibilities, essential knowledge such as Linux and networking, common monitoring tools, and emerging career paths like DevOps, AIOps, and SRE, helping newcomers understand how to start and grow in the field.

DevOps · Linux · Operations

0 likes · 6 min read

Efficient Ops

Mar 13, 2024 · Operations

Why Traditional Ops Stalls and How AI‑Driven Solutions Can Revitalize It

The article examines common operational pain points such as cumbersome release processes, lack of standardization, and weak security controls, then explores how AI‑powered SRE tools and automation can address these challenges and guide teams toward more efficient, standardized, and resilient operations.

AI · LLM · SRE

0 likes · 9 min read

DevOps

Feb 21, 2024 · Operations

When Operations Are Stuck: Embrace Change or Remain Stagnant?

The article examines common operational pain points such as cumbersome release processes, lack of standardization, and weak security controls, then presents community suggestions emphasizing automation, standardized workflows, CMDB, and AI‑driven SRE tools, concluding that clear direction outweighs perfect execution.

AI · CMDB · SRE

0 likes · 6 min read

Efficient Ops

Jan 28, 2024 · Operations

Can One Person Really Manage 40,000 Servers? Real‑World Ops Insights

A collection of Zhihu contributors share practical experiences and opinions on whether a single operations engineer can handle the massive scale of 40,000 servers, covering workload, automation gaps, budgeting, hardware failure rates, and the necessity of team‑based high‑availability practices.

Infrastructure · SRE · Scale

0 likes · 9 min read

DevOps Operations Practice

Jan 26, 2024 · Operations

Career Prospects and Advancement Strategies for Operations Engineers

The article examines the wide range of opportunities for operations engineers, highlighting the field's low barrier to entry and high ceiling, and offers practical advice on city selection, education, in‑demand technologies, soft‑skill development, and health to help professionals climb from basic desktop support to high‑paying DevOps or SRE roles.

DevOps · IT · SRE

0 likes · 7 min read

Efficient Ops

Jan 23, 2024 · Operations

How Shenwan Hongyuan Securities Automated Operations: Key Takeaways from GOPS 2023

The 21st GOPS Global Operations Conference in Shanghai featured Shenwan Hongyuan Securities' Yusi Song presenting an in‑depth look at automated operations, covering achievements, experience summaries, and future plans, with slide images and a downloadable PPT for attendees.

Automation · DevOps · Operations

0 likes · 2 min read

Efficient Ops

Jan 23, 2024 · Operations

Why Building Truly High‑Availability Systems Is Harder Than You Think

The article examines why 2023 saw a surge in major online outages, linking layoffs and cost‑cutting to lost expertise, and explains how entropy and Murphy's Law make perpetual high availability impossible without continuous, systematic investment and cultural change.

SRE · Technical Debt · high availability

0 likes · 13 min read

Efficient Ops

Jan 22, 2024 · Operations

How New Oriental Standardized Its Observability System to Cut Costs and Boost Efficiency

At the 21st GOPS Global Operations Conference, New Oriental's senior operations manager Qi Chen detailed the demand, technical, and focus pressures that drove a phased, full‑process observability standardization, leveraging OpenTelemetry, Telegraf, Loki and CMDB tagging to achieve cost reduction and higher stability.

Cost reduction · DevOps · OpenTelemetry

0 likes · 8 min read

Efficient Ops

Jan 17, 2024 · Operations

How China’s Telecom Giants Accelerate IT Efficiency with DevOps Maturity Assessments

In the context of digital transformation, six leading Chinese telecom operators applied the CAICT DevOps Capability Maturity Model to evaluate dozens of projects, achieving significant improvements in continuous delivery, technical operations, security, and AIOps, providing valuable references for the industry.

Continuous Delivery · DevOps · IT Operations

0 likes · 18 min read

Efficient Ops

Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

SRE · fault handling · incident management

0 likes · 14 min read

DevOps

Jan 12, 2024 · Operations

Why Building a Never‑Failing System Is Impossible and How to Pursue Continuous High Availability

The article analyzes why truly never‑failing systems cannot exist, citing entropy and Murphy's Law, examines the organizational and technical obstacles to continuous high availability, and offers practical cultural and engineering practices such as testing, code review, monitoring, and regular system health checks to mitigate risk.

Murphy's Law · Operations · SRE

0 likes · 14 min read

Tencent Cloud Developer

Jan 10, 2024 · Operations

The Challenges of Building Continuously Available Systems: Entropy, Murphy's Law, and the 'Divine Doctor Paradox'

Building continuously available systems in 2023 is hampered by entropy‑driven technical debt and Murphy’s Law failures, and the “Divine Doctor Paradox” shows that successful availability work goes unnoticed while blame follows any outage, making cultural commitment—not just technology—the essential solution.

Murphy's Law · SRE · Technical Debt

0 likes · 14 min read

Bilibili Tech

Jan 5, 2024 · Cloud Native

ChangePilot: Bilibili’s Unified Change Management Platform and Practices

ChangePilot is Bilibili’s unified change‑management platform that standardizes change definition, lifecycle, and risk governance through a platform‑scenario model and five control levels (G0‑G4), offering built‑in checks, searchable records, subscription alerts, intelligent correlation, and emergency channels to boost production stability while maintaining operational efficiency.

SRE · change management · risk control

0 likes · 29 min read

21CTO

Dec 30, 2023 · Operations

How G Bank Turns Application Monitoring into Business‑Driven Visual Operations

This article examines how G Bank builds an application monitoring system based on ITIL and Google SRE principles, identifies its shortcomings, and evolves the platform into a visualized operations solution that aligns technical and business perspectives for faster incident resolution and improved customer experience.

Banking · ITIL · Operations

0 likes · 11 min read

AntTech

Dec 18, 2023 · Cloud Native

AlterShield Open‑Source Change Risk Control Platform: Architecture, Features, and Future Roadmap

AlterShield is an open‑source change‑risk prevention solution originally built by Ant Group that provides lifecycle‑aware change defense, cloud‑native operator integration, KDE‑based anomaly detection, and extensible plug‑in frameworks, with detailed module descriptions, recent v1.0 releases, and a roadmap for advanced monitoring and noise‑reduction capabilities.

Cloud Native · Kubernetes · SRE

0 likes · 13 min read

Bilibili Tech

Dec 15, 2023 · Operations

Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

Alert Monitoring · Bilibili · Root Cause Analysis

0 likes · 21 min read

DevOps Operations Practice

Dec 11, 2023 · Operations

Career Prospects for Operations Professionals: From Low‑End Tasks to High‑Paying Roles

This article examines the wide salary range in the operations field, explains why high‑level DevOps/SRE positions command premium pay, and offers practical advice—such as relocating to major cities, obtaining a bachelor's degree, mastering in‑demand technologies, and developing soft skills—to help operations engineers advance their careers.

DevOps · SRE · Skill development

0 likes · 8 min read

dbaplus Community

Dec 10, 2023 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability Engineering

Drawing on twenty years of Google SRE experience, this article outlines eleven practical lessons—from scaling mitigation to disaster‑resilience testing—that help teams design, operate, and evolve reliable large‑scale services.

SRE · canary releases · disaster recovery

0 likes · 12 min read

DeWu Technology

Dec 8, 2023 · Operations

SRE Secrets: How Alibaba, Tencent & Dewu Build Ultra-Stable Cloud‑Native Services

On November 25, Dewu Technology hosted an SRE Stability Engineering salon in Hangzhou where experts from Alibaba, Tencent, Ant Group and Dewu shared practical insights on C‑end link reliability, Alibaba’s system stability operations, Tencent Game’s cloud‑native SRE practices, and Ant Group’s chaos engineering, concluding with a Q&A and resource distribution.

Cloud Native · Operations · SRE

0 likes · 7 min read

Bilibili Tech

Dec 1, 2023 · Operations

Safe Production Practices: Change Management Platform Design and Implementation at Bilibili

After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework—including a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms—to dramatically cut emergency incidents and embed a reliability‑first culture.

Observability · Reliability · SRE

0 likes · 16 min read

Efficient Ops

Nov 26, 2023 · Operations

Beijing Mobile’s SRE Success: Automation, Cloud‑Native Ops & Reliability

The article details how Beijing Mobile’s SRE Smart Operations team applied SRE principles, automation, and cloud‑native tools to transform traditional DevOps into a reliable, scalable operation, highlighting their fault‑prevention, monitoring, incident response, and continuous improvement practices that earned them the 2023 IT Technology Leadership award.

Automation · Operations · SRE

0 likes · 7 min read

Efficient Ops

Nov 15, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its numerous operations systems, enabling standardized data ingestion, association, visualization and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑control for SRE teams.

Root Cause Analysis · SRE · graph database

0 likes · 13 min read

JD Tech

Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · Reliability

0 likes · 26 min read

Huya Tech Engineering

Nov 10, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its SRE systems, enabling standardized data ingestion, correlation, and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑efficiency for large‑scale live streaming services.

DevOps · Observability · SRE

0 likes · 13 min read

Efficient Ops

Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLIs, SLOs, and VALET‑based indicator selection guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR

0 likes · 14 min read

Efficient Ops

Nov 6, 2023 · Operations

How Beijing Mobile Achieved Leading SRE Maturity: Insights from a Level‑3 Assessment

Beijing Mobile’s Order Center project passed the CAICT’s Level‑3 System Reliability and Continuity Engineering assessment, showcasing how SRE practices, cultural shifts, and tool automation boosted system stability, reduced incidents by 77%, cut recovery time by 54%, and set a benchmark for large‑scale IT operations in China’s telecom sector.

China Mobile · DevOps · Digital Transformation

0 likes · 17 min read

Efficient Ops

Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

Digital Transformation · ICBC · Observability

0 likes · 6 min read

Efficient Ops

Oct 30, 2023 · Operations

How Beijing Mobile Achieved Leading SRE Maturity: Insights from the 2023 GOPS Conference

The article details Beijing Mobile's successful System Reliability Engineering (SRE) assessment at the 2023 GOPS Global Operations Conference, highlighting the company's SRE transformation, the benefits achieved, challenges faced, and future plans for scaling reliable IT operations across the enterprise.

DevOps · Digital Transformation · IT Operations

0 likes · 18 min read

Efficient Ops

Oct 26, 2023 · Operations

How China Agricultural Bank Achieved Advanced DevOps Maturity Across Two Core Projects

China Agricultural Bank’s two flagship projects—Distributed Core Open System Control and Online Payment Platform—successfully passed the CAICT DevOps Technical Operations Level‑2 assessment, showcasing advanced domestic maturity, detailed implementation practices, and the broader impact of standardized DevOps on banking digital transformation.

Banking · Continuous Delivery · DevOps

0 likes · 16 min read

Continuous Delivery 2.0

Oct 20, 2023 · Operations

Understanding Platform Engineering: Definition, Scope, and Its Relationship with DevOps and SRE

The article explains platform engineering as the evolution of DevOps into a productized internal infrastructure function, detailing its definition, target users, responsibilities, organizational placement, differences from DevOps and SRE, industry trends, implementation practices, and criteria for evaluating the need for a dedicated platform team.

IT Operations · Infrastructure · SRE

0 likes · 10 min read

Qunhe Technology Quality Tech

Oct 13, 2023 · Operations

How KuJiaLe Built a Scalable Stability System: Real‑World SRE Lessons

This article shares KuJiaLe's experience tackling stability challenges caused by rapid user growth and system complexity, detailing their organizational, process, cultural, and technical approaches—including goal setting, a stability committee, monitoring, incident response, change control, and regular drills—to achieve measurable improvements in reliability and performance.

DevOps · SRE · incident management

0 likes · 20 min read

Continuous Delivery 2.0

Sep 27, 2023 · Operations

Applying the VALET Pattern Language for SRE Transformation at Home Depot (THD)

The article explains how Home Depot (THD) adopted the VALET pattern language—Volume, Availability, Latency, Error, and Ticket—to unify service‑level objectives, automate data collection, build dashboards, and improve SRE practices across its massive retail and e‑commerce infrastructure.

Home Depot · Operations · SLO

0 likes · 9 min read

Continuous Delivery 2.0

Sep 25, 2023 · Operations

Understanding MTTR, MTBF, and MTTF: Fault Metrics for Reliability Engineering

This article explains the essential fault metrics MTTR, MTBF, and MTTF, their definitions, calculations, and practical importance for SRE and operations teams to improve system availability, guide maintenance strategies, and make data‑driven reliability decisions.

MTBF · MTTF · MTTR

0 likes · 11 min read

Efficient Ops

Sep 19, 2023 · Operations

Mastering the ‘Dao, Fa, Shu, Qi’ of IT Operations: From Philosophy to Tools

This article explores the four pillars of technical operations—Dao (philosophy), Fa (methodology), Shu (techniques), and Qi (tools)—detailing concepts from business continuity to SRE, FinOps, GitOps, CMDB design, and the underlying logic of modern ops platforms.

Automation · CMDB · DevOps

0 likes · 10 min read

JD Cloud Developers

Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability · Operations · Reliability

0 likes · 13 min read

Bilibili Tech

Sep 8, 2023 · Operations

Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili’s comprehensive alert‑management platform—its background, cloud‑vs‑self‑built solution comparison, closed‑loop design, distributed architecture, rule configuration, noise‑reduction, automated root‑cause analysis, and governance practices that cut weekly alerts from 1,000 to under 80, while outlining future enhancements.

Alert Management · DevOps · SRE

0 likes · 19 min read

Continuous Delivery 2.0

Sep 1, 2023 · Operations

Project Health Metrics and Practices in Google’s SRE and Development Process

The article explains how Google measures and improves software quality before release by separating development and operations responsibilities, using monorepo and trunk‑based development, daily release candidates, automated testing, performance benchmarks, and a comprehensive Project Health (pH) metric system that balances speed, reliability, and quality.

Google · Metrics · Operations

0 likes · 11 min read

dbaplus Community

Aug 28, 2023 · Operations

How to Define SLIs, SLOs, SLAs and Build Reliable, Observable Systems

This guide explains how SRE teams should define service level indicators, objectives, and agreements, design reliable and observable architectures, manage error budgets, assess risks, handle incidents, and integrate development practices to improve system stability and performance.

Error Budget · Reliability · SLI

0 likes · 15 min read

DeWu Technology

Aug 14, 2023 · Operations

Capital Loss Prevention Practices and Technical System

Dewu’s capital‑loss prevention framework embeds risk assessment and technical safeguards—such as idempotency, distributed consistency, and active‑active multi‑region design—into architecture, organizes three defensive lines (development, QA, SRE), and employs real‑time, near‑real‑time, and offline verification plus regular drills, while advancing automated analysis and intelligent scaling.

Data Consistency · SRE · financial loss prevention

0 likes · 10 min read

Tech Architecture Stories

Aug 14, 2023 · Operations

Why Governing Microservices Is Essential for Stability and Scalability

The article explains why microservice governance—through measurement, targeted remediation, and verification—is crucial for maintaining system stability, reducing complexity, and improving availability in large‑scale, rapidly evolving architectures.

Microservices · Observability · SLO

0 likes · 9 min read

dbaplus Community

Aug 13, 2023 · Operations

Mastering SRE: Key Questions on Monitoring, Capacity, and Change Management

This article provides a comprehensive SRE guide covering senior role definitions, monitoring objectives and implementation, core metric selection, link and event monitoring, capacity planning and mitigation strategies, a real‑world health‑code outage case, and change‑management best practices to improve reliability and efficiency.

SRE · capacity · change management

0 likes · 9 min read

Tech Architecture Stories

Aug 8, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.

GRIA · PDCA · Reliability

0 likes · 22 min read

Tech Architecture Stories

Aug 7, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This article explains the essence, purpose, and step‑by‑step process of fault postmortems—including preparation, root‑cause analysis, improvement actions, and decision making—while covering PDCA and GRIA methodologies, industry examples, MTTR/MTBF metrics, and practical templates for lasting reliability.

GRIA · MTTR · PDCA

0 likes · 24 min read

dbaplus Community

Jul 27, 2023 · Operations

How to Build Scalable Observability for Cloud‑Native Environments: Lessons from SRE

This article summarizes a technical talk on the challenges of cloud‑native transformation, the design of an application‑centric observability platform using CMDB, Prometheus, Thanos and VictoriaMetrics, practical solutions for high‑cardinality metrics and alerting, and future directions such as eBPF and AI‑driven fault detection.

CMDB · Metrics · Observability

0 likes · 14 min read

DevOps

Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations (including SLOs, SLIs, SLAs, and monitoring), presents real‑world case studies from Google and Kingsoft Office, and promotes an upcoming SRE‑DevOps live session.

Google · SLI · SLO

0 likes · 4 min read

Aikesheng Open Source Community

Jul 24, 2023 · Operations

Exploring On‑Call Duty Models and SRE‑Driven Operations Management

This article examines the challenges of traditional on‑call duty systems for operations teams, proposes an SRE‑inspired rotation model that involves developers, defines concrete KPI targets, and describes how automation and chat‑bot tools can streamline incident response and reduce internal friction.

Automation · KPI · On-Call

0 likes · 12 min read

Tech Architecture Stories

Jul 23, 2023 · Operations

Why Every Backend Engineer Should Read Google’s SRE Handbook

The article recommends two essential Google SRE books for backend developers, explains what SRE is and how it differs from traditional operations, and shows how concepts such as SLI/SLO, incident postmortems, and reliability engineering can be applied to improve system availability and stability.

Backend Development · Operations · SRE

0 likes · 4 min read

AntTech

Jul 20, 2023 · Operations

AlterShield: An Open‑Source Change Management Platform for Risk Control and Observability

AlterShield is an open‑source, end‑to‑end change‑control platform that systematizes change perception, risk analysis, and defense across distributed cloud‑native environments, enabling SRE teams to mitigate stability risks through standardized protocols, incremental rollout, and automated observability checks.

Cloud Native · SRE · change management

0 likes · 24 min read

Huolala Tech

Jul 13, 2023 · Operations

How HuoLaLa Built a 0‑to‑1 Stability Metric System in 2 Years

This article explains how HuoLaLa’s stability team tackled the challenge of proving their work’s value by designing and implementing a comprehensive stability metric system from scratch, detailing the motivations, principles, step‑by‑step construction, data platform, cultural adoption, measurable results, and future plans.

Data-driven · Metrics · Operations

0 likes · 18 min read

FunTester

Jul 12, 2023 · Operations

Mastering Load Testing: Practical Wrk, GoReplay, and SRE Strategies

This article explains why automation testing often lags behind product changes, outlines essential load‑testing concepts such as bottleneck analysis and capacity planning, and provides hands‑on guidance for using Wrk and GoReplay tools within an SRE‑driven operations workflow.

GoReplay · Load Testing · Performance Testing

0 likes · 8 min read

dbaplus Community

Jul 1, 2023 · Operations

How We Rebuilt a Private Cloud Platform to Supercharge Developer Efficiency

This article recounts a year‑long effort by a senior SRE engineer to redesign a private cloud platform, detailing the motivations, architectural choices, SSO and RBAC implementations, workflow automation, GitOps deployment, release engineering improvements, and the cultural shift toward metrics‑driven development.

DevOps · GitOps · Kubernetes

0 likes · 22 min read

Efficient Ops

Jun 25, 2023 · Operations

How to Build a Next‑Gen “Big Operations” System for Reliability and Observability

This article outlines the evolution from manual operations to DevOps and SRE‑driven “big operations,” detailing system reliability and continuity practices, observability concepts, and the development of AIOps maturity standards, offering a comprehensive guide for building stable, efficient, and secure operational frameworks.

DevOps · Observability · Operations

0 likes · 14 min read

dbaplus Community

Jun 24, 2023 · Operations

How Bilibili Scales Capacity: VPA, HPA, and Cost‑Saving Strategies

This article summarizes Zhang He’s Bilibili SRE talk on building a capacity‑management system that visualizes resource usage, reduces costs, improves stability, and leverages Kubernetes VPA, HPA, pooling, and quota management to support massive live‑stream events and rapid feature releases.

Cost Optimization · HPA · Kubernetes

0 likes · 21 min read

Efficient Ops

Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget · MTBF · Operations

0 likes · 14 min read

DevOps

Jun 16, 2023 · Operations

DevOps/SRE Best Practices: Hiding Provider Addresses, Minimal Dependencies, Service/Port Management, and Bastion Host Protection

This article presents a comprehensive set of DevOps/SRE best practices—including hiding service provider resource addresses, installing only required dependencies, running only necessary services and ports, and using bastion hosts—to improve system security, reliability, and operational efficiency.

SRE · Security

0 likes · 16 min read

Ops Development Stories

Jun 6, 2023 · Operations

When Fancy PPTs Meet Real Outages: Lessons from a Major E‑commerce Crash

The article examines Vipshop's massive March 2023 outage caused by an IDC cooling failure, critiques superficial PPT‑driven reliability claims, and offers practical SRE insights on fault drills, true multi‑active architectures, and how ops teams can gain influence despite budget constraints.

Operations · SRE · fault tolerance

0 likes · 7 min read

Efficient Ops

Jun 1, 2023 · Operations

How Tencent’s On‑Call System Transforms Incident Management and Quality Ops

This article explores how Tencent builds and practices its SRE quality operation system, focusing on On‑Call incident management, standardized channels, alert handling, data quality models, and the resulting improvements in reliability, MTTR reduction, and data‑driven decision making.

Observability · On-Call · Operations

0 likes · 14 min read

Efficient Ops

May 31, 2023 · Operations

How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end‑to‑end SRE quality‑operation framework built on Service Level Objectives (SLO) and On‑Call, detailing industry background, problem statements, SLO management, On‑Call benefits, product architecture, large‑scale deployment, and future plans for reliability engineering.

On-Call · Quality Operations · SLO

0 likes · 11 min read

dbaplus Community

May 29, 2023 · Operations

How Bilibili Built a High‑Availability Multi‑Active Architecture for SRE

This article details Bilibili's SRE team's design and implementation of a high‑availability multi‑active architecture, covering zone types, same‑city and cross‑region deployments, traffic routing, cache consistency, message handling, governance, and practical lessons learned from real‑world incidents.

Bilibili · Operations · SRE

0 likes · 20 min read

Ops Development Stories

May 25, 2023 · Operations

Why 100% Service Uptime Isn’t Worth the Cost: SRE Insights on Risk and ROI

The article explains why striving for perfect service availability is unnecessary, outlines the cost of high reliability, shows how to measure availability and SLOs, discusses who should set SLOs, and highlights the importance of ROI when improving reliability.

ROI · Reliability · SLO

0 likes · 8 min read

dbaplus Community

May 22, 2023 · Operations

Mastering SLOs: From Theory to Practical SRE Operations at Bilibili

This article outlines Bilibili's end‑to‑end SLO framework, covering metric selection, SLO definition, error‑budget calculation, alerting strategies, operational workflows, and lessons learned from real‑world deployments.

Error Budget · SLO · SRE

0 likes · 28 min read

Efficient Ops

May 21, 2023 · Operations

From Apollo to Google: How Margaret Hamilton Shaped Modern SRE

This article traces the origins of Site Reliability Engineering from Margaret Hamilton’s pioneering work on the Apollo program, through Google’s formal SRE team creation, and highlights the key differences between SRE and traditional operations practices.

Google · Margaret Hamilton · Operations

0 likes · 7 min read

DeWu Technology

May 19, 2023 · Operations

Investigation and Resolution of In‑flight Wi‑Fi Connectivity Issues for a Mobile E‑Commerce App

The SRE team diagnosed an in‑flight Wi‑Fi outage for the DeWu e‑commerce app by reproducing the problem, capturing packets with ping, traceroute and tcpdump, discovered a firewall rule misclassifying the domain as a download site, and resolved it through a vendor‑issued policy update, restoring connectivity on both ATG and SATCOM links.

SRE · TCP · WiFi

0 likes · 18 min read

iQIYI Technical Product Team

May 12, 2023 · Operations

Performance Troubleshooting and Optimization of Prometheus Monitoring Queries

The article explains that high metric cardinality in Prometheus causes long query times and timeouts, and demonstrates how using recording rules to pre‑compute aggregates dramatically reduces cardinality and latency, while recommending scrape interval tuning and metric design best practices to keep charts responsive.

Prometheus · Recording Rules · SRE

0 likes · 10 min read

Efficient Ops

May 10, 2023 · Operations

Mastering XOps: From DevOps to FinOps – A Comprehensive Guide

This article presents a systematic overview of the emerging XOps ecosystem—including DevOps, BizDevOps, AIOps, FinOps, and SRE—detailing their relationships, maturity models, standards, and practical guidance for enterprises seeking to achieve efficient, secure, and data‑driven digital transformation.

BizDevOps · DevOps · FinOps

0 likes · 13 min read

DevOps Cloud Academy

May 10, 2023 · Operations

Understanding the Role of Site Reliability Engineering (SRE) in DevOps

This article explains why Site Reliability Engineering (SRE) and DevOps are both essential for modern software development, compares their objectives, outlines their complementary roles, and highlights the fundamental differences that help organizations achieve faster releases with higher reliability.

DevOps · SRE · Site Reliability Engineering

0 likes · 8 min read

MaGe Linux Operations

May 7, 2023 · Operations

How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.

Meta · Observability · Reliability

0 likes · 13 min read

DataFunSummit

Apr 29, 2023 · Operations

Application Monitoring Principles and Non‑Intrusive Data Collection at Huya

This article explains the fundamentals of distributed application monitoring, describes Huya's non‑intrusive data‑collection techniques using SDKs and plugins, outlines the design and correlation of observable metrics, and demonstrates practical results and troubleshooting scenarios for backend services.

Distributed Tracing · Metrics Design · Observability

0 likes · 16 min read

Efficient Ops

Apr 26, 2023 · Operations

Building a Chaos Engineering Platform for Financial Services: Key Lessons

This talk outlines the challenges of maintaining system stability in fast‑moving, cloud‑native financial services, describes a risk‑identification model, high‑fidelity fault simulation, and a comprehensive stability engineering platform, and shares future plans for automated, data‑driven risk mitigation.

Financial Services · Operations · SRE

0 likes · 15 min read

Efficient Ops

Apr 16, 2023 · Operations

How Capability Platforms Empower Intelligent Container Cloud Operations

At the 20th GOPS Global Operations Conference, China Mobile Jiangsu showcased how its capability platform leverages AI, big data, and blockchain to automate health scoring and intelligent inspection, dramatically improving container‑cloud operational efficiency and paving the way for smarter, SRE‑driven DevOps practices.

Big Data · Capability Platform · Intelligent Operations

0 likes · 5 min read

Huolala Tech

Apr 7, 2023 · Operations

How Huolala Built a Scalable Tech Stability System – Key Lessons for Reliability

This article details Huolala's journey in establishing a comprehensive technical stability framework, covering organizational challenges, risk governance, incident response, cultural initiatives, and future automation to enhance system reliability at scale.

Operations · SRE · incident response

0 likes · 16 min read

DeWu Technology

Mar 30, 2023 · Cloud Computing

DeWu FinOps Practice: Cloud Cost Management and Optimization

Since 2021, DeWu has built a FinOps practice that combines finance and DevOps to allocate, forecast, and optimize multi‑cloud spend, delivering over ¥100 million in annual savings through cost‑center platforms, container‑utilization improvements, custom PaaS services, automated analysis, and a cross‑functional virtual team governed by visibility, optimization, and governance KPIs.

Cloud Cost Management · Cost Optimization · FinOps

0 likes · 27 min read

Efficient Ops

Mar 28, 2023 · Operations

Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

Observability · Operations · SRE

0 likes · 14 min read

Bilibili Tech

Mar 28, 2023 · Operations

Bilibili's Capacity Management Platform: Design, Implementation, and S12 Event Support

Bilibili's capacity management platform integrates foundational data, VPA/HPA scaling, quota control, and visual dashboards to streamline resource usage, cut costs, and boost stability, delivering event‑specific support such as for S12 that slashes release issues by 80% and online failures by 90%, while planning predictive scaling and risk control.

Bilibili · Resource Optimization · SRE

0 likes · 13 min read

DevOps

Mar 14, 2023 · Operations

15 Essential DevOps and SRE Tools to Watch in 2023

This guide outlines fifteen key DevOps and SRE tools for 2023—including monitoring, application platforms, chat‑ops, incident management, diagramming, and CI/CD solutions—explaining their core features, benefits, and how they help teams maintain reliable, observable, and automated software delivery pipelines.

SRE · Tooling

0 likes · 11 min read

NetEase Smart Enterprise Tech+

Mar 1, 2023 · Operations

Stability Quality Assurance: Definitions, Metrics, and Implementation Guide

This article explains the origins and meaning of software stability and stability testing, outlines key standards such as GB/T 16260 and industry definitions, and presents a comprehensive framework for stability quality assurance covering system elements, external disturbances, baseline setting, robust design, monitoring, and rapid incident response.

Operations · SRE · quality assurance

0 likes · 17 min read

dbaplus Community

Feb 28, 2023 · Operations

How Container SRE at DeWu Boosts Reliability: Practices, Metrics, and Incident Playbooks

This article details DeWu's container SRE approach, covering SRE fundamentals, on‑call response, SLO/SLA design, change management, capacity planning, kernel‑parameter monitoring, security safeguards, and a real‑world incident analysis, providing actionable insights for building resilient cloud‑native services.

CapacityPlanning · IncidentResponse · Kubernetes

0 likes · 24 min read

Efficient Ops

Feb 20, 2023 · Operations

How Tencent Scaled Health Code: Cloud‑Native Architecture, Monitoring, and Chaos Engineering

This article reviews how Tencent Health Code handled billions of daily scans by adopting cloud‑native architecture, comprehensive observability, capacity stress testing, chaos engineering, and disciplined change control to ensure high availability and resilience as pandemic demand waned.

SRE · capacity testing · cloud-native

0 likes · 16 min read

Bilibili Tech

Feb 17, 2023 · Backend Development

Design and Implementation of the Comet Workflow Engine at Bilibili

The article details Bilibili’s Comet workflow engine—a low‑code, plugin‑extensible platform built since 2019 that uses visual DAG templates, graph‑based legality checks, and asynchronous execution to automate diverse business processes such as SRE automation, permission requests, and push‑task approvals, improving operational efficiency across mobile and web services.

Automation · DAG · Go

0 likes · 18 min read

Bilibili Tech

Feb 14, 2023 · Cloud Native

Bilibili's Vertical Pod Autoscaler (VPA) Practice and Cluster Resource Governance

Bilibili extended Kubernetes with a custom in‑place Vertical Pod Autoscaler framework—including generator, recommender, updater, and webhook controllers plus a management platform for strategy tuning, avoidance, analysis, and anomaly detection—reducing over‑provisioned resources across its ten‑thousand‑node private cloud and achieving up to 60 % CPU and 30 % memory savings.

Kubernetes · SRE · vertical pod autoscaler

0 likes · 19 min read

DeWu Technology

Feb 8, 2023 · Operations

Container SRE Practices and Incident Management at DeWu

DeWu’s container SRE team combines software‑engineered reliability with routine operations, using defined on‑call roles, SLO/SLA targets, progressive change management, capacity forecasting, four‑metric monitoring, MTTR/MTTF tracking, kernel‑parameter tuning, and namespace‑protected security policies to swiftly resolve incidents such as Redis latency spikes.

Container · Performance Optimization · SRE

0 likes · 23 min read

Efficient Ops

Feb 7, 2023 · Operations

Why SRE Is Essential for Reliable Internet Services – Chinese Experts Share Insights

Site Reliability Engineering (SRE), introduced by Google in 2003, has become a cornerstone for ensuring the reliability and stability of large‑scale internet platforms, and Chinese experts now share home‑grown practices and a new book that distills two decades of SRE experience for building high‑availability applications.

Book · DevOps · Operations

0 likes · 3 min read

dbaplus Community

Jan 16, 2023 · Operations

Beyond Success‑Ratio: How User‑Uptime Reveals Real Product Availability

The article reviews traditional availability metrics such as Success‑Ratio, Error‑Budget, MTTR/MTTF, SLA/SLO, and highlights their limitations, then introduces Google’s User‑Uptime and Windowed User‑Uptime metrics, explains their definitions, challenges, experimental results, and why they provide a more user‑centric view of service reliability.

Availability · Metrics · SRE

0 likes · 27 min read
