Tagged articles
403 articles
DevOps Coach
Jun 30, 2024 · Operations

Effective Incident Mitigation and Recovery: Practical SRE Strategies

The article outlines SRE‑based incident mitigation and recovery practices, covering urgent mitigations, impact reduction, key metrics such as TTD, TTR, TBF, and detailed strategies for shortening detection and repair times, preventing fatigue, improving observability, and designing resilient systems.

Mitigation · Operations · Reliability
23 min read
Efficient Ops
Jun 25, 2024 · Operations

Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

SRE · monitoring · system reliability
20 min read
Practical DevOps Architecture
May 22, 2024 · Operations

SRE & Linux Operations Course Outline

This article presents a detailed curriculum covering fundamental infrastructure, cluster architecture, automation, log collection, Linux system administration, containerization, monitoring, security, and related DevOps tools across multiple phases and daily modules for comprehensive SRE training.

Automation · SRE · cloud
8 min read
Efficient Ops
May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

SRE · capacity planning · incident response
29 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Availability · Fault Injection · SRE
7 min read
Efficient Ops
May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google · SRE · incident management
12 min read
Efficient Ops
Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article explains the definitions of stability and high availability, distinguishes their relationship, outlines key performance indicators, and provides a comprehensive framework—including fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices—to help teams build reliable, highly available systems.

SRE · capacity planning · high availability
10 min read
Bilibili Tech
Apr 9, 2024 · Operations

BCM – Building and Deploying Bilibili’s Chaos Engineering Platform

At the 2024 GOPS Global Operations Conference, Bilibili senior R&D engineer Gu Lintao will present BCM—Bilibili’s Chaos Engineering Platform—showcasing how its design and capabilities let developers, testers, and SREs safely inject faults, uncover hidden architectural risks, and improve service stability through real‑world drills and systematic reliability engineering.

Bilibili · DevOps · Reliability
3 min read
Efficient Ops
Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work—Infrastructure, Platform, and Business—covers hiring challenges, daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, user support, and offers practical interview and career advice.

Oncall · Operations · SRE
22 min read
Efficient Ops
Apr 2, 2024 · Operations

What Do Leading Tech Giants Expect from SREs? Job Posting Insights

Amid economic growth and frequent continuity incidents, major internet firms are redefining SRE roles, emphasizing cost reduction, system resilience, risk management, AI‑driven operations, and close collaboration with development teams, as revealed by a detailed analysis of recent job postings from Ant Group, Alibaba, ByteDance and others.

AI-native · Cost Optimization · SRE
9 min read
Efficient Ops
Mar 25, 2024 · Operations

Why SRE Exists and How It Solves Modern Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how SRE teams use SLOs, monitoring, and scenario drills to improve system reliability, performance, and observability in complex production environments.

DevOps · Operations · Reliability
12 min read
Efficient Ops
Mar 25, 2024 · Operations

How CAICT’s SRE Standards Strengthen System Reliability and Continuity

This article outlines the rising frequency of system outages, explains the key characteristics and challenges of modern large‑scale distributed systems, introduces China’s CAICT SRE framework and its two‑part reliability model, showcases a successful SRE case, and announces the 2024 SRE maturity assessment program.

Digital Governance · SRE · software reliability
12 min read
Efficient Ops
Mar 13, 2024 · Operations

What Does an Operations Engineer Do? Skills, Tools, and Career Path

This article explains the role of an operations and maintenance (O&M) engineer, covering daily responsibilities, essential knowledge such as Linux and networking, common monitoring tools, and emerging career paths like DevOps, AIOps, and SRE, helping newcomers understand how to start and grow in the field.

DevOps · Linux · Operations
6 min read
Efficient Ops
Mar 13, 2024 · Operations

Why Traditional Ops Stalls and How AI‑Driven Solutions Can Revitalize It

The article examines common operational pain points such as cumbersome release processes, lack of standardization, and weak security controls, then explores how AI‑powered SRE tools and automation can address these challenges and guide teams toward more efficient, standardized, and resilient operations.

AI · LLM · SRE
9 min read
DevOps
Feb 21, 2024 · Operations

When Operations Are Stuck: Embrace Change or Remain Stagnant?

The article examines common operational pain points such as cumbersome release processes, lack of standardization, and weak security controls, then presents community suggestions emphasizing automation, standardized workflows, CMDB, and AI‑driven SRE tools, concluding that clear direction outweighs perfect execution.

AI · CMDB · SRE
6 min read
Efficient Ops
Jan 28, 2024 · Operations

Can One Person Really Manage 40,000 Servers? Real‑World Ops Insights

A collection of Zhihu contributors share practical experiences and opinions on whether a single operations engineer can handle the massive scale of 40,000 servers, covering workload, automation gaps, budgeting, hardware failure rates, and the necessity of team‑based high‑availability practices.

Infrastructure · SRE · Scale
9 min read
DevOps Operations Practice
Jan 26, 2024 · Operations

Career Prospects and Advancement Strategies for Operations Engineers

The article examines the wide range of opportunities for operations engineers, highlighting the low‑entry, high‑potential nature of the field, and offers practical advice on city selection, education, in‑demand technologies, soft‑skill development, and health to help professionals climb from basic desktop support to high‑paying DevOps or SRE roles.

DevOps · IT · SRE
7 min read
Efficient Ops
Jan 23, 2024 · Operations

Why Building Truly High‑Availability Systems Is Harder Than You Think

The article examines why 2023 saw a surge in major online outages, linking layoffs and cost‑cutting to lost expertise, and explores how entropy and Murphy's Law make perpetual high availability impossible without continuous, systematic investment and cultural change.

SRE · Technical Debt · high availability
13 min read
Efficient Ops
Jan 22, 2024 · Operations

How New Oriental Standardized Its Observability System to Cut Costs and Boost Efficiency

At the 21st GOPS Global Operations Conference, New Oriental's senior operations manager Qi Chen detailed the demand, technical, and focus pressures that drove a phased, full‑process observability standardization, leveraging OpenTelemetry, Telegraf, Loki and CMDB tagging to achieve cost reduction and higher stability.

Cost reduction · DevOps · OpenTelemetry
8 min read
Efficient Ops
Jan 17, 2024 · Operations

How China’s Telecom Giants Accelerate IT Efficiency with DevOps Maturity Assessments

In the context of digital transformation, six leading Chinese telecom operators applied the CAICT DevOps Capability Maturity Model to evaluate dozens of projects, achieving significant improvements in continuous delivery, technical operations, security, and AIOps, providing valuable references for the industry.

Continuous Delivery · DevOps · IT Operations
18 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

SRE · fault handling · incident management
14 min read
DevOps
Jan 12, 2024 · Operations

Why Building a Never‑Failing System Is Impossible and How to Pursue Continuous High Availability

The article analyses why truly never‑failing systems cannot exist (citing entropy and Murphy’s Law), examines the organizational and technical obstacles to continuous high availability, and offers practical cultural and engineering practices such as testing, code review, monitoring, and regular system health checks to mitigate risk.

Murphy's Law · Operations · SRE
14 min read
Tencent Cloud Developer
Jan 10, 2024 · Operations

The Challenges of Building Continuously Available Systems: Entropy, Murphy's Law, and the 'Divine Doctor Paradox'

Building continuously available systems in 2023 is hampered by entropy‑driven technical debt and Murphy’s Law failures, and the “Divine Doctor Paradox” shows that successful availability work goes unnoticed while blame follows any outage, making cultural commitment—not just technology—the essential solution.

Murphy's Law · SRE · Technical Debt
14 min read
Bilibili Tech
Jan 5, 2024 · Cloud Native

ChangePilot: Bilibili’s Unified Change Management Platform and Practices

ChangePilot is Bilibili’s unified change‑management platform that standardizes change definition, lifecycle, and risk governance through a platform‑scenario model and five control levels (G0‑G4), offering built‑in checks, searchable records, subscription alerts, intelligent correlation, and emergency channels to boost production stability while maintaining operational efficiency.

SRE · change management · risk control
29 min read
21CTO
Dec 30, 2023 · Operations

How G Bank Turns Application Monitoring into Business‑Driven Visual Operations

This article examines how G Bank builds an application monitoring system based on ITIL and Google SRE principles, identifies its shortcomings, and evolves the platform into a visualized operations solution that aligns technical and business perspectives for faster incident resolution and improved customer experience.

Banking · ITIL · Operations
11 min read
AntTech
Dec 18, 2023 · Cloud Native

AlterShield Open‑Source Change Risk Control Platform: Architecture, Features, and Future Roadmap

AlterShield is an open‑source change‑risk prevention solution originally built by Ant Group that provides lifecycle‑aware change defense, cloud‑native operator integration, KDE‑based anomaly detection, and extensible plug‑in frameworks, with detailed module descriptions, recent v1.0 releases, and a roadmap for advanced monitoring and noise‑reduction capabilities.

Cloud Native · Kubernetes · SRE
13 min read
Bilibili Tech
Dec 15, 2023 · Operations

Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

Alert Monitoring · Bilibili · Root Cause Analysis
21 min read
DevOps Operations Practice
Dec 11, 2023 · Operations

Career Prospects for Operations Professionals: From Low‑End Tasks to High‑Paying Roles

This article examines the wide salary range in the operations field, explains why high‑level DevOps/SRE positions command premium pay, and offers practical advice—such as relocating to major cities, obtaining a bachelor's degree, mastering in‑demand technologies, and developing soft skills—to help operations engineers advance their careers.

DevOps · SRE · Skill development
8 min read
DeWu Technology
Dec 8, 2023 · Operations

SRE Secrets: How Alibaba, Tencent & Dewu Build Ultra-Stable Cloud‑Native Services

On November 25, Dewu Technology hosted an SRE Stability Engineering salon in Hangzhou where experts from Alibaba, Tencent, Ant Group and Dewu shared practical insights on C‑end link reliability, Alibaba’s system stability operations, Tencent Game’s cloud‑native SRE practices, and Ant Group’s chaos engineering, concluding with a Q&A and resource distribution.

Cloud Native · Operations · SRE
7 min read
Bilibili Tech
Dec 1, 2023 · Operations

Safe Production Practices: Change Management Platform Design and Implementation at Bilibili

After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework—including a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms—to dramatically cut emergency incidents and embed a reliability‑first culture.

Observability · Reliability · SRE
16 min read
Efficient Ops
Nov 26, 2023 · Operations

Beijing Mobile’s SRE Success: Automation, Cloud‑Native Ops & Reliability

The article details how Beijing Mobile’s SRE Smart Operations team applied SRE principles, automation, and cloud‑native tools to transform traditional DevOps into a reliable, scalable operation, highlighting their fault‑prevention, monitoring, incident response, and continuous improvement practices that earned them the 2023 IT Technology Leadership award.

Automation · Operations · SRE
7 min read
Efficient Ops
Nov 15, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its numerous operations systems, enabling standardized data ingestion, association, visualization and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑control for SRE teams.

Root Cause Analysis · SRE · graph database
13 min read
JD Tech
Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · Reliability
26 min read
Huya Tech Engineering
Nov 10, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its SRE systems, enabling standardized data ingestion, correlation, and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑efficiency for large‑scale live streaming services.

DevOps · Observability · SRE
13 min read
Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLI/SLO and the VALET selection method guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
14 min read
Efficient Ops
Nov 6, 2023 · Operations

How Beijing Mobile Achieved Leading SRE Maturity: Insights from a Level‑3 Assessment

Beijing Mobile’s Order Center project passed the CAICT’s Level‑3 System Reliability and Continuity Engineering assessment, showcasing how SRE practices, cultural shifts, and tool automation boosted system stability, reduced incidents by 77%, cut recovery time by 54%, and set a benchmark for large‑scale IT operations in China’s telecom sector.

China Mobile · DevOps · Digital Transformation
17 min read
Efficient Ops
Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

Digital Transformation · ICBC · Observability
6 min read
Efficient Ops
Oct 26, 2023 · Operations

How China Agricultural Bank Achieved Advanced DevOps Maturity Across Two Core Projects

China Agricultural Bank’s two flagship projects—Distributed Core Open System Control and Online Payment Platform—successfully passed the CAICT DevOps Technical Operations Level‑2 assessment, showcasing advanced domestic maturity, detailed implementation practices, and the broader impact of standardized DevOps on banking digital transformation.

Banking · Continuous Delivery · DevOps
16 min read
Continuous Delivery 2.0
Oct 20, 2023 · Operations

Understanding Platform Engineering: Definition, Scope, and Its Relationship with DevOps and SRE

The article explains platform engineering as the evolution of DevOps into a productized internal infrastructure function, detailing its definition, target users, responsibilities, organizational placement, differences from DevOps and SRE, industry trends, implementation practices, and criteria for evaluating the need for a dedicated platform team.

IT Operations · Infrastructure · SRE
10 min read
Qunhe Technology Quality Tech
Oct 13, 2023 · Operations

How KuJiaLe Built a Scalable Stability System: Real‑World SRE Lessons

This article shares KuJiaLe's experience tackling stability challenges caused by rapid user growth and system complexity, detailing their organizational, process, cultural, and technical approaches—including goal setting, a stability committee, monitoring, incident response, change control, and regular drills—to achieve measurable improvements in reliability and performance.

DevOps · SRE · incident management
20 min read
JD Cloud Developers
Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability · Operations · Reliability
13 min read
Bilibili Tech
Sep 8, 2023 · Operations

Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili’s comprehensive alert‑management platform—its background, cloud‑vs‑self‑built solution comparison, closed‑loop design, distributed architecture, rule configuration, noise‑reduction, automated root‑cause analysis, and governance practices that cut weekly alerts from 1,000 to under 80, while outlining future enhancements.

Alert Management · DevOps · SRE
19 min read
Continuous Delivery 2.0
Sep 1, 2023 · Operations

Project Health Metrics and Practices in Google’s SRE and Development Process

The article explains how Google measures and improves software quality before release by separating development and operations responsibilities, using monorepo and trunk‑based development, daily release candidates, automated testing, performance benchmarks, and a comprehensive Project Health (pH) metric system that balances speed, reliability, and quality.

Google · Metrics · Operations
11 min read
dbaplus Community
Aug 28, 2023 · Operations

How to Define SLIs, SLOs, SLAs and Build Reliable, Observable Systems

This guide explains how SRE teams should define service level indicators, objectives, and agreements, design reliable and observable architectures, manage error budgets, assess risks, handle incidents, and integrate development practices to improve system stability and performance.

Error Budget · Reliability · SLI
15 min read
DeWu Technology
Aug 14, 2023 · Operations

Capital Loss Prevention Practices and Technical System

Dewu’s capital‑loss prevention framework embeds risk assessment and technical safeguards—such as idempotency, distributed consistency, and active‑active multi‑region design—into architecture, organizes three defensive lines (development, QA, SRE), and employs real‑time, near‑real‑time, and offline verification plus regular drills, while advancing automated analysis and intelligent scaling.

Data Consistency · SRE · financial loss prevention
10 min read
dbaplus Community
Aug 13, 2023 · Operations

Mastering SRE: Key Questions on Monitoring, Capacity, and Change Management

This article provides a comprehensive SRE guide covering senior role definitions, monitoring objectives and implementation, core metric selection, link and event monitoring, capacity planning and mitigation strategies, a real‑world health‑code outage case, and change‑management best practices to improve reliability and efficiency.

SRE · capacity · change management
9 min read
Tech Architecture Stories
Aug 8, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.

GRIA · PDCA · Reliability
22 min read
dbaplus Community
Jul 27, 2023 · Operations

How to Build Scalable Observability for Cloud‑Native Environments: Lessons from SRE

This article summarizes a technical talk on the challenges of cloud‑native transformation, the design of an application‑centric observability platform using CMDB, Prometheus, Thanos and VictoriaMetrics, practical solutions for high‑cardinality metrics and alerting, and future directions such as eBPF and AI‑driven fault detection.

CMDB · Metrics · Observability
14 min read
DevOps
Jul 27, 2023 · Operations

An Overview of the Google SRE Workbook and Core SRE Foundations

The article introduces the Google SRE Workbook as a practical supplement to the original SRE book, explains the five core SRE foundations—including SLO, SLI, SLA, monitoring, and real‑world case studies from Google and Kingsoft Office—while also promoting an upcoming SRE‑DevOps live session.

Google · SLI · SLO
4 min read
Tech Architecture Stories
Jul 23, 2023 · Operations

Why Every Backend Engineer Should Read Google’s SRE Handbook

The article recommends two essential Google SRE books for backend developers, explains what SRE is, how it differs from traditional operations, and shows how the concepts like SLI/SLO, incident postmortems, and reliability engineering can be applied to improve system availability and stability.

Backend Development · Operations · SRE
4 min read
AntTech
Jul 20, 2023 · Operations

AlterShield: An Open‑Source Change Management Platform for Risk Control and Observability

AlterShield is an open‑source, end‑to‑end change‑control platform that systematizes change perception, risk analysis, and defense across distributed cloud‑native environments, enabling SRE teams to mitigate stability risks through standardized protocols, incremental rollout, and automated observability checks.

Cloud Native · SRE · change management
24 min read
Huolala Tech
Jul 13, 2023 · Operations

How HuoLaLa Built a 0‑to‑1 Stability Metric System in 2 Years

This article explains how HuoLaLa’s stability team tackled the challenge of proving their work’s value by designing and implementing a comprehensive stability metric system from scratch, detailing the motivations, principles, step‑by‑step construction, data platform, cultural adoption, measurable results, and future plans.

Data-driven · Metrics · Operations
18 min read
FunTester
Jul 12, 2023 · Operations

Mastering Load Testing: Practical Wrk, GoReplay, and SRE Strategies

This article explains why automation testing often lags behind product changes, outlines essential load‑testing concepts such as bottleneck analysis and capacity planning, and provides hands‑on guidance for using Wrk and GoReplay tools within an SRE‑driven operations workflow.

GoReplay · Load Testing · Performance Testing
8 min read
dbaplus Community
Jul 1, 2023 · Operations

How We Rebuilt a Private Cloud Platform to Supercharge Developer Efficiency

This article recounts a year‑long effort by a senior SRE engineer to redesign a private cloud platform, detailing the motivations, architectural choices, SSO and RBAC implementations, workflow automation, GitOps deployment, release engineering improvements, and the cultural shift toward metrics‑driven development.

DevOps · GitOps · Kubernetes
22 min read
Efficient Ops
Jun 25, 2023 · Operations

How to Build a Next‑Gen “Big Operations” System for Reliability and Observability

This article outlines the evolution from manual operations to DevOps and SRE‑driven “big operations,” detailing system reliability and continuity practices, observability concepts, and the development of AIOps maturity standards, offering a comprehensive guide for building stable, efficient, and secure operational frameworks.

DevOps, Observability, Operations
0 likes · 14 min read
dbaplus Community
Jun 24, 2023 · Operations

How Bilibili Scales Capacity: VPA, HPA, and Cost‑Saving Strategies

This article summarizes Zhang He’s Bilibili SRE talk on building a capacity‑management system that visualizes resource usage, reduces costs, improves stability, and leverages Kubernetes VPA, HPA, pooling, and quota management to support massive live‑stream events and rapid feature releases.

Cost Optimization, HPA, Kubernetes
0 likes · 21 min read
Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget, MTBF, Operations
0 likes · 14 min read
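As a rough illustration of the error-budget idea summarized above, the sketch below derives a budget from an SLO and reports how much of it has been consumed. The class and function names (`ErrorBudget`, `availability`) and the sample numbers are hypothetical and do not come from the article. The availability formula MTBF / (MTBF + MTTR) is the standard steady-state definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative names)."""

    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Fraction of the budget already spent; above 1.0 means the SLO is breached.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / self.allowed_failures


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Made-up numbers: a 99.9% SLO over one million requests allows about 1000 failures.
    budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
    print(f"availability:     {availability(mtbf_hours=99.0, mttr_hours=1.0):.2%}")
```

A team might alert when `consumed_fraction` crosses a threshold such as 0.5. That threshold is a policy choice, not something the summary specifies.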
Efficient Ops
May 31, 2023 · Operations

How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end‑to‑end SRE quality‑operation framework built on Service Level Objectives (SLO) and On‑Call, detailing industry background, problem statements, SLO management, On‑Call benefits, product architecture, large‑scale deployment, and future plans for reliability engineering.

On-Call, Quality Operations, SLO
0 likes · 11 min read
dbaplus Community
May 29, 2023 · Operations

How Bilibili Built a High‑Availability Multi‑Active Architecture for SRE

This article details Bilibili's SRE team's design and implementation of a high‑availability multi‑active architecture, covering zone types, same‑city and cross‑region deployments, traffic routing, cache consistency, message handling, governance, and practical lessons learned from real‑world incidents.

Bilibili, Operations, SRE
0 likes · 20 min read
Efficient Ops
May 21, 2023 · Operations

From Apollo to Google: How Margaret Hamilton Shaped Modern SRE

This article traces the origins of Site Reliability Engineering from Margaret Hamilton’s pioneering work on the Apollo program, through Google’s formal SRE team creation, and highlights the key differences between SRE and traditional operations practices.

Google, Margaret Hamilton, Operations
0 likes · 7 min read
DeWu Technology
May 19, 2023 · Operations

Investigation and Resolution of In‑flight Wi‑Fi Connectivity Issues for a Mobile E‑Commerce App

The SRE team diagnosed an in‑flight Wi‑Fi outage for the DeWu e‑commerce app by reproducing the problem, probing the network path with ping and traceroute, and capturing packets with tcpdump. They found that a firewall rule was misclassifying the domain as a download site and resolved the issue through a vendor‑issued policy update, which restored connectivity on both ATG and SATCOM links.

SRE, TCP, WiFi
0 likes · 18 min read
iQIYI Technical Product Team
May 12, 2023 · Operations

Performance Troubleshooting and Optimization of Prometheus Monitoring Queries

The article explains that high metric cardinality in Prometheus causes long query times and timeouts, and demonstrates how using recording rules to pre‑compute aggregates dramatically reduces cardinality and latency, while recommending scrape interval tuning and metric design best practices to keep charts responsive.

Prometheus, Recording Rules, SRE
0 likes · 10 min read
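The cardinality argument in the summary above can be made concrete with a small simulation. This is plain Python, not Prometheus rule syntax, and it only mimics what a pre-computed aggregate does: it sums series while dropping high-cardinality labels. The label names (`service`, `pod`, `path`) and the helper `aggregate_away` are assumptions made for illustration.

```python
from collections import defaultdict

# A series is identified by its sorted label pairs, e.g. (("service", "cart"), ("pod", "pod-3")).
Labels = tuple[tuple[str, str], ...]


def aggregate_away(samples: dict[Labels, float], drop: set[str]) -> dict[Labels, float]:
    """Sum sample values across every label named in `drop`.

    This mirrors what a pre-computed aggregate stores: one series per
    remaining label combination instead of one per raw combination.
    """
    out: dict[Labels, float] = defaultdict(float)
    for labels, value in samples.items():
        kept = tuple((k, v) for k, v in labels if k not in drop)
        out[kept] += value
    return dict(out)


# Hypothetical raw metric: 2 services x 50 pods x 10 paths = 1000 series.
raw: dict[Labels, float] = {
    (("service", s), ("pod", f"pod-{p}"), ("path", f"/api/{r}")): 1.0
    for s in ("cart", "search")
    for p in range(50)
    for r in range(10)
}

# Dashboards that only need per-service totals can read the rolled-up series.
rolled = aggregate_away(raw, drop={"pod", "path"})

if __name__ == "__main__":
    print(f"raw series:    {len(raw)}")
    print(f"rolled series: {len(rolled)}")
```

A query over the rolled-up data touches one series per service rather than one per pod and path, which is the reduction the summary credits for the lower latency.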
Efficient Ops
May 10, 2023 · Operations

Mastering XOps: From DevOps to FinOps – A Comprehensive Guide

This article presents a systematic overview of the emerging XOps ecosystem—including DevOps, BizDevOps, AIOps, FinOps, and SRE—detailing their relationships, maturity models, standards, and practical guidance for enterprises seeking to achieve efficient, secure, and data‑driven digital transformation.

BizDevOps, DevOps, FinOps
0 likes · 13 min read
DevOps Cloud Academy
May 10, 2023 · Operations

Understanding the Role of Site Reliability Engineering (SRE) in DevOps

This article explains why Site Reliability Engineering (SRE) and DevOps are both essential for modern software development, compares their objectives, outlines their complementary roles, and highlights the fundamental differences that help organizations achieve faster releases with higher reliability.

DevOps, SRE, Site Reliability Engineering
0 likes · 8 min read
MaGe Linux Operations
May 7, 2023 · Operations

How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.

Meta, Observability, Reliability
0 likes · 13 min read
DataFunSummit
Apr 29, 2023 · Operations

Application Monitoring Principles and Non‑Intrusive Data Collection at Huya

This article explains the fundamentals of distributed application monitoring, describes Huya's non‑intrusive data‑collection techniques using SDKs and plugins, outlines the design and correlation of observable metrics, and demonstrates practical results and troubleshooting scenarios for backend services.

Distributed Tracing, Metrics Design, Observability
0 likes · 16 min read
Efficient Ops
Apr 26, 2023 · Operations

Building a Chaos Engineering Platform for Financial Services: Key Lessons

This talk outlines the challenges of maintaining system stability in fast‑moving, cloud‑native financial services, describes a risk‑identification model, high‑fidelity fault simulation, and a comprehensive stability engineering platform, and shares future plans for automated, data‑driven risk mitigation.

Financial Services, Operations, SRE
0 likes · 15 min read
Efficient Ops
Apr 16, 2023 · Operations

How Capability Platforms Empower Intelligent Container Cloud Operations

At the 20th GOPS Global Operations Conference, China Mobile Jiangsu showcased how its capability platform leverages AI, big data, and blockchain to automate health scoring and intelligent inspection, dramatically improving container‑cloud operational efficiency and paving the way for smarter, SRE‑driven DevOps practices.

Big Data, Capability Platform, Intelligent Operations
0 likes · 5 min read
DeWu Technology
Mar 30, 2023 · Cloud Computing

DeWu FinOps Practice: Cloud Cost Management and Optimization

Since 2021, DeWu has built a FinOps practice that combines finance and DevOps to allocate, forecast, and optimize multi‑cloud spend. The practice delivers over ¥100 million in annual savings through cost‑center platforms, container‑utilization improvements, custom PaaS services, automated analysis, and a cross‑functional virtual team tracked against visibility, optimization, and governance KPIs.

Cloud Cost Management, Cost Optimization, FinOps
0 likes · 27 min read
Efficient Ops
Mar 28, 2023 · Operations

Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

Observability, Operations, SRE
0 likes · 14 min read
Bilibili Tech
Mar 28, 2023 · Operations

Bilibili's Capacity Management Platform: Design, Implementation, and S12 Event Support

Bilibili's capacity management platform integrates foundational data, VPA/HPA scaling, quota control, and visual dashboards to streamline resource usage, cut costs, and boost stability, delivering event‑specific support such as for S12 that slashes release issues by 80% and online failures by 90%, while planning predictive scaling and risk control.

Bilibili, Resource Optimization, SRE
0 likes · 13 min read
DevOps
Mar 14, 2023 · Operations

15 Essential DevOps and SRE Tools to Watch in 2023

This guide outlines fifteen key DevOps and SRE tools for 2023—including monitoring, application platforms, chat‑ops, incident management, diagramming, and CI/CD solutions—explaining their core features, benefits, and how they help teams maintain reliable, observable, and automated software delivery pipelines.

SRE, Tooling
0 likes · 11 min read
NetEase Smart Enterprise Tech+
Mar 1, 2023 · Operations

Stability Quality Assurance: Definitions, Metrics, and Implementation Guide

This article explains the origins and meaning of software stability and stability testing, outlines key standards such as GB/T 16260 and industry definitions, and presents a comprehensive framework for stability quality assurance covering system elements, external disturbances, baseline setting, robust design, monitoring, and rapid incident response.

Operations, SRE, quality assurance
0 likes · 17 min read
dbaplus Community
Feb 28, 2023 · Operations

How Container SRE at DeWu Boosts Reliability: Practices, Metrics, and Incident Playbooks

This article details DeWu's container SRE approach, covering SRE fundamentals, on‑call response, SLO/SLA design, change management, capacity planning, kernel‑parameter monitoring, security safeguards, and a real‑world incident analysis, providing actionable insights for building resilient cloud‑native services.

CapacityPlanning, IncidentResponse, Kubernetes
0 likes · 24 min read
Bilibili Tech
Feb 17, 2023 · Backend Development

Design and Implementation of the Comet Workflow Engine at Bilibili

The article details Bilibili’s Comet workflow engine—a low‑code, plugin‑extensible platform built since 2019 that uses visual DAG templates, graph‑based legality checks, and asynchronous execution to automate diverse business processes such as SRE automation, permission requests, and push‑task approvals, improving operational efficiency across mobile and web services.

Automation, DAG, Go
0 likes · 18 min read
Bilibili Tech
Feb 14, 2023 · Cloud Native

Bilibili's Vertical Pod Autoscaler (VPA) Practice and Cluster Resource Governance

Bilibili extended Kubernetes with a custom in‑place Vertical Pod Autoscaler framework (generator, recommender, updater, and webhook controllers, plus a management platform for strategy tuning, avoidance, analysis, and anomaly detection), reducing over‑provisioned resources across its ten‑thousand‑node private cloud and achieving up to 60% CPU and 30% memory savings.

Kubernetes, SRE, vertical pod autoscaler
0 likes · 19 min read
DeWu Technology
Feb 8, 2023 · Operations

Container SRE Practices and Incident Management at DeWu

DeWu’s container SRE team combines software‑engineered reliability with routine operations, using defined on‑call roles, SLO/SLA targets, progressive change management, capacity forecasting, four‑metric monitoring, MTTR/MTTF tracking, kernel‑parameter tuning, and namespace‑protected security policies to swiftly resolve incidents such as Redis latency spikes.

Container, Performance Optimization, SRE
0 likes · 23 min read
dbaplus Community
Jan 16, 2023 · Operations

Beyond Success‑Ratio: How User‑Uptime Reveals Real Product Availability

The article reviews traditional availability metrics such as Success‑Ratio, Error‑Budget, MTTR/MTTF, SLA/SLO, and highlights their limitations, then introduces Google’s User‑Uptime and Windowed User‑Uptime metrics, explains their definitions, challenges, experimental results, and why they provide a more user‑centric view of service reliability.

Availability, Metrics, SRE
0 likes · 27 min read