Tag

Reliability Engineering

JD Tech Talk
Feb 6, 2025 · Operations

Stability Assurance Mechanisms and Practices for Site Reliability Engineering (SRE)

This article outlines comprehensive stability assurance mechanisms—including standards, process workflows, the distinction between developers and SREs, personal responsibilities, and practical construction directions—to guide teams in building resilient, high‑availability systems through proactive, daily, and incident‑response practices.

Operations · Process · Reliability Engineering
10 min read
Efficient Ops
Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing the major service disruptions—from Alibaba Cloud to OpenAI—this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Operations · Outage Management · Reliability Engineering
6 min read
Cognitive Technology Team
May 16, 2024 · Operations

Guide to Building Stability in Distributed Systems

This guide presents comprehensive principles, best practices, and techniques for designing, deploying, and maintaining stable distributed systems, covering fault tolerance, monitoring, capacity planning, incident response, and operational reliability to help engineers achieve high availability.

Monitoring · Operations · Reliability Engineering
1 min read
Efficient Ops
Dec 20, 2023 · Operations

How Bilibili Implements SLO Engineering to Boost Service Reliability

This article details Bilibili's practical SLO engineering approach, covering foundational components, SLI selection, application and business level SLIs, alerting strategies, SLO‑driven quality operations, and the GOC framework for rapid fault discovery, localization, and recovery, illustrating how reliability is systematically improved.

Alerting · Monitoring · Operations
16 min read
Efficient Ops
Nov 26, 2023 · Operations

Beijing Mobile’s SRE Success: Automation, Cloud‑Native Ops & Reliability

The article details how Beijing Mobile’s SRE Smart Operations team applied SRE principles, automation, and cloud‑native tools to transform traditional DevOps into a reliable, scalable operation, highlighting their fault‑prevention, monitoring, incident response, and continuous improvement practices that earned them the 2023 IT Technology Leadership award.

Automation · Operations · Reliability Engineering
7 min read
Bilibili Tech
Nov 17, 2023 · Operations

Bilibili CDN/SLB Outage Analysis and Cloud‑Edge Coordination Strategies

The August 4, 2023 Bilibili outage was triggered by automatic back‑origin and domain‑disable policies that flooded the BFS load balancer with traffic, causing widespread white screens. It was mitigated within the 1‑5‑10 framework through rapid CDN switching, rate‑limit enhancements, storage backup, and client‑side fallback, illustrating the need for tighter cloud‑edge coordination.

CDN · Edge Computing · Reliability Engineering
19 min read
Efficient Ops
Jun 20, 2023 · Operations

Mastering SRE: How Error Budgets and SLOs Drive System Reliability

This article explains the fundamentals of Site Reliability Engineering, detailing how SRE combines development and operations to improve stability through metrics like MTBF and MTTR, the roles of SLI/SLO, the VALET selection method, and the practical use of error budgets for quantifying work and guiding alerts.

Error Budget · MTBF · Operations
14 min read
Efficient Ops
May 31, 2023 · Operations

How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end‑to‑end SRE quality‑operation framework built on Service Level Objectives (SLO) and On‑Call, detailing industry background, problem statements, SLO management, On‑Call benefits, product architecture, large‑scale deployment, and future plans for reliability engineering.

Quality Operations · Reliability Engineering · SLO
11 min read
Efficient Ops
Mar 28, 2023 · Operations

Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

Observability · Operations · Reliability Engineering
14 min read
Bilibili Tech
Oct 29, 2022 · Operations

Stability Building and SLO Operations After the “713 Incident”

The deck outlines post‑incident stability enhancements and the adoption of Service Level Objectives after the “713” fault, detailing failure analysis, reliability upgrades, monitoring practices, and the definition and operation of SLOs to sustain system quality, illustrated through architecture diagrams and reliability metrics.

Operations · Reliability Engineering · SLO
1 min read
Bilibili Tech
Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Error Budget · Operations · Reliability Engineering
21 min read
Bilibili Tech
Jun 21, 2022 · Cloud Native

Evolution of SRE in the Cloud‑Native Era – Insights from Industry Experts

Industry experts from Zhejiang Mobile, Bilibili, and Xiaomi discuss how SRE has evolved in the cloud‑native era, sharing concrete frameworks, observability practices, and cost‑focused platforms while emphasizing stability, metrics, on‑call processes, and the need to adapt Google’s model to real‑world product and operational contexts.

DevOps · Observability · Reliability Engineering
31 min read
DataFunTalk
Jun 19, 2022 · Artificial Intelligence

FMEA Knowledge Graph: Integrating Failure Analysis with AI for Intelligent Manufacturing

This article explains how integrating FMEA with knowledge graph and AI technologies can enhance product quality and reliability across high‑end manufacturing sectors such as semiconductors, automotive, and medical devices, presenting case studies, standards, and a platform built by Daguan Data.

Artificial Intelligence · FMEA · Knowledge Graph
19 min read
Bilibili Tech
May 20, 2022 · Operations

Bilibili SRE Practices: Stability Operations, Incident Management, and Platform Enablement

Bilibili’s SRE team, confronting rapid growth and complex systems, built a systematic stability operation that includes emergency response, incident handling, on‑call scheduling, and an Event Operations Center platform, using metrics like MTTR, MTTI and AI‑assisted automation to reduce downtime and improve reliability.

Bilibili · Metrics · Oncall
27 min read
HelloTech
Jul 12, 2021 · Operations

Introduction to System Stability: Concepts, Metrics, and Practices

The article explains Haro’s approach to system stability—defining high‑availability, key metrics such as SLA, RPO/RTO, MTTR/MTBF, and the 5‑5‑10 rule—while outlining cultural and technical safeguards, full‑team participation, process integration, and incremental tooling to prevent faults and ensure rapid recovery.

High Availability · MTTR · Operations
11 min read
Efficient Ops
Jun 7, 2021 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System: Lessons for Large R&D Teams

This article summarizes Alibaba Cloud Elastic Compute Service's four‑year SRE journey, covering why ECS created its own SRE organization, the five‑layer SRE framework, standards, automation platforms, empowerment practices, and team‑building insights that can guide large development teams toward reliable, high‑availability operations.

Automation · Monitoring · Operations
24 min read
Efficient Ops
Jan 19, 2021 · Operations

How SRE Bridges Development and Operations to Boost System Reliability

This article explores the role of Site Reliability Engineering (SRE) as a bridge between product development and operations, detailing its responsibilities, core principles, lifecycle perspective, stability value, and practical frameworks for controllability, observability, and best‑practice implementation to enhance system reliability.

Observability · Operations · Reliability Engineering
13 min read
Efficient Ops
Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

Monitoring · Operations · Reliability Engineering
13 min read
Efficient Ops
May 6, 2019 · Operations

How Live Streaming Ops Ensure Real-Time Reliability at Scale

Zhang Guanshi, the operations director at Huya Live, shares how his team designs a hybrid‑cloud architecture, implements a six‑pillar reliability framework, and leverages real‑time monitoring, AIOps, and rapid‑recovery tools to maintain stable, low‑latency live video streams for millions of viewers.

Cloud Architecture · Live Streaming · Monitoring
22 min read
Efficient Ops
Dec 29, 2017 · Operations

How Alibaba’s Global Operations Center Guarantees Seamless Service During Double‑11

This article outlines Alibaba's Global Operations Center (GOC) practices for ensuring stable, high‑performance online services during massive traffic spikes like Double‑11, covering current challenges, the operational assurance framework, best‑practice implementations, and future directions such as automation and AI‑driven monitoring.

Automation · Monitoring · Operations
28 min read