Tag

site reliability


Bilibili Tech
Nov 19, 2024 · Operations

Building a Lightweight Disaster‑Recovery Drill System at Bilibili: Architecture, Practices, and Lessons

Bilibili’s infrastructure team built a lightweight, multi‑layered disaster‑recovery drill platform that combines an atomic fault library, scenario catalogs, chaos‑experiment orchestration, real‑time observation, and a product‑level interface. Backed by standardized governance and CI‑integrated automation, it cut drill preparation from weeks to days and boosted weekly resilience testing across the organization.

Automation, Chaos Engineering, High Availability
39 min read
Efficient Ops
May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google, SRE, incident management
12 min read
Bilibili Tech
Oct 29, 2022 · Operations

Stability Building and SLO Operations After the “713 Incident”

The deck outlines post‑incident stability enhancements and the adoption of Service Level Objectives after the “713” fault, detailing failure analysis, reliability upgrades, monitoring practices, and the definition and operation of SLOs to sustain system quality, illustrated through architecture diagrams and reliability metrics.

Reliability Engineering, SLO, incident management
1 min read
Architects Research Society
Sep 7, 2022 · Operations

An Introduction to Chaos Engineering: Principles, Practices, and Tools

Chaos engineering deliberately injects failures into distributed systems to measure resilience, using scientific experimentation to uncover hidden weaknesses, guide robust design, and improve reliability across development, testing, and production environments.

Chaos Engineering, distributed systems, fault injection
18 min read
DaTaobao Tech
Apr 20, 2022 · Operations

Understanding Wireless Operations and Maintenance: Origins, Challenges, and Future Directions

Wireless operations and maintenance (O&M) evolved from backend‑focused practices to address the stability and performance of mobile‑device services. The article covers how it tackles low issue‑detection rates and delayed responses through improved monitoring, gray‑release tagging, phased rollouts, AI‑driven diagnostics, and automated release gates, and invites collaborative development.

Incident Response, Monitoring, gray release
13 min read
Efficient Ops
Jan 24, 2022 · Operations

How Qunar Turned Chaos Engineering into Reliable Operations: A Deep Dive

This article explores Qunar's practical implementation of chaos engineering, detailing its value, its four strategic directions, shutdown and application drills, strong‑weak dependency handling, container support, and automated closed‑loop testing, which together boost system resilience, process robustness, and user experience.

Automation, Chaos Engineering, cloud native
20 min read
Efficient Ops
Sep 8, 2020 · Operations

From Firefighting to Arson: Mastering Ops Availability in Three Stages

The article outlines a three‑stage ops maturity model (firefighting, fire prevention, and arson) and explains how proactive fault‑injection drills, continuous availability improvements, and aligning technical metrics with business value can transform operations from reactive responders into strategic value creators.

DevOps, availability, fault injection
8 min read
Ctrip Technology
Jun 4, 2020 · Operations

Applying Chaos Engineering at Ctrip: Practices, Experiments, and Platform Evolution

This article describes the Ctrip SRE team's journey in adopting chaos engineering, outlining the motivations, roadmap, concrete experiments, platform maturity, and future automation goals to improve system resilience and operational reliability in a large‑scale microservice environment.

Chaos Engineering, Ctrip, Microservices
12 min read
Efficient Ops
Mar 10, 2019 · Operations

Why Operations Won’t Die: A Veteran’s Perspective

A seasoned operations professional argues that despite sensational claims, the ops function remains essential—driven by its core responsibilities of quality, cost, efficiency, and security, evolving with cloud computing, DevOps, and emerging IoT demands.

Cloud Computing, DevOps, IT infrastructure
11 min read
Efficient Ops
Dec 29, 2016 · Operations

Meet the Top Operations Experts of 2016: Profiles and Must‑Read Articles

This article introduces the standout operations professionals featured by the Efficient Ops community in 2016, summarizing each expert’s background, key achievements, and a curated list of their most influential technical articles for readers seeking deep insights into modern ops practices.

Automation, Cloud Computing, DevOps
12 min read
Efficient Ops
Oct 6, 2016 · Operations

How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Automation, Cloud Computing, capacity planning
21 min read
Efficient Ops
Dec 30, 2015 · Operations

E‑Commerce vs. General Internet Ops: Veteran Insights on Key Differences

A seasoned operations leader discusses how e‑commerce operational support differs from general internet applications, covering longer support chains, consistency models, seasonal traffic spikes, team role separation, mobile‑internet challenges, future planning, and the rise of enterprise‑level ops services.

e-commerce, mobile operations, operations
14 min read