Tagged articles
18 articles
Bilibili Tech
Nov 19, 2024 · Operations

Building a Lightweight Disaster‑Recovery Drill System at Bilibili: Architecture, Practices, and Lessons

Bilibili’s infrastructure team created a lightweight, multi‑layered disaster‑recovery drill platform—combining an atomic fault library, scenario catalogs, chaos‑experiment orchestration, real‑time observation, and a product‑level interface—backed by standardized governance and CI‑integrated automation, cutting drill preparation from weeks to days and boosting weekly resilience testing across the organization.

disaster recovery · high availability · site reliability
0 likes · 39 min read
Volcano Engine Developer Services
Sep 2, 2024 · Operations

How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active

This article details ByteDance’s disaster‑recovery evolution, from a single‑data‑center deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures, explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.

Infrastructure · Operations · disaster recovery
0 likes · 15 min read
Efficient Ops
May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google · SRE · incident management
0 likes · 12 min read
Bilibili Tech
Oct 29, 2022 · Operations

Stability Building and SLO Operations After the “713 Incident”

The deck outlines post‑incident stability enhancements and the adoption of Service Level Objectives after the “713” fault, detailing failure analysis, reliability upgrades, monitoring practices, and the definition and operation of SLOs to sustain system quality, illustrated through architecture diagrams and reliability metrics.

SLO · reliability engineering · site reliability
0 likes · 1 min read
FunTester
Jul 24, 2022 · Operations

Boost Service Reliability with Chaos Engineering: Practical Steps & Evaluation

Chaos engineering, a discipline for experimenting on distributed systems, helps teams identify hidden weaknesses, improve availability, and build confidence in production by defining steady states, injecting realistic failures, and measuring impact through observability metrics. The article covers practical steps, tool choices, maturity stages, and evaluation methods.

Distributed Systems · Fault Injection · Observability
0 likes · 11 min read
DaTaobao Tech
Apr 20, 2022 · Operations

Understanding Wireless Operations and Maintenance: Origins, Challenges, and Future Directions

Wireless operations and maintenance (O&M) evolved from backend‑focused practices to address the stability and performance of mobile‑device services. It tackles low issue‑detection rates and delayed responses through improved monitoring, gray‑release tagging, phased rollouts, AI‑driven diagnostics, and automated release gates, and the article closes by inviting collaborative development.

gray release · incident response · mobile maintenance
0 likes · 13 min read
Efficient Ops
Jan 24, 2022 · Operations

How Qunar Turned Chaos Engineering into Reliable Operations: A Deep Dive

This article explores Qunar's practical implementation of chaos engineering, detailing its value, the four strategic directions, shutdown and application drills, strong‑weak dependency handling, container support, and automated closed‑loop testing that together boost system resilience, process robustness, and user experience.

Automation · chaos engineering · site reliability
0 likes · 20 min read
Efficient Ops
Sep 8, 2020 · Operations

From Firefighting to Arson: Mastering Ops Availability in Three Stages

The article outlines a three‑stage ops maturity model—firefighting, fire prevention, and arson—explains how proactive fault‑injection drills, continuous availability improvements, and aligning technical metrics with business value can transform operations from reactive responders into strategic value creators.

Availability · Fault Injection · Operations
0 likes · 8 min read
Efficient Ops
Mar 10, 2019 · Operations

Why Operations Won’t Die: A Veteran’s Perspective

A seasoned operations professional argues that despite sensational claims, the ops function remains essential—driven by its core responsibilities of quality, cost, efficiency, and security, evolving with cloud computing, DevOps, and emerging IoT demands.

DevOps · IT infrastructure · Operations
0 likes · 11 min read
MaGe Linux Operations
Apr 18, 2018 · Operations

Essential Skills and Challenges for Large‑Scale Website Operations Engineers

This article outlines what large‑scale website operations entail, describes the full product lifecycle involvement of ops engineers, lists the technical skills and personal qualities required, examines current industry issues, and highlights key technologies such as cluster management, monitoring, fault handling, and automation.

large-scale systems · site reliability
0 likes · 19 min read
MaGe Linux Operations
Sep 2, 2017 · Operations

From Traditional Ops to DevOps: The One Step You’re Missing

This talk walks through the transition from classic application operations to a DevOps culture, highlighting common pain points, the need for standardization and automation, and practical steps for engineers to evolve their skills and boost organizational efficiency.

Automation · DevOps · IT Culture
0 likes · 14 min read
Efficient Ops
Dec 29, 2016 · Operations

Meet the Top Operations Experts of 2016: Profiles and Must‑Read Articles

This article introduces the standout operations professionals featured by the Efficient Ops community in 2016, summarizing each expert’s background, key achievements, and a curated list of their most influential technical articles for readers seeking deep insights into modern ops practices.

Automation · Operations · cloud computing
0 likes · 12 min read
Efficient Ops
Oct 6, 2016 · Operations

How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Automation · Operations · capacity planning
0 likes · 21 min read
Efficient Ops
Dec 30, 2015 · Operations

E‑Commerce vs. General Internet Ops: Veteran Insights on Key Differences

A seasoned operations leader discusses how e‑commerce operational support differs from general internet applications, covering longer support chains, consistency models, seasonal traffic spikes, team role separation, mobile‑internet challenges, future planning, and the rise of enterprise‑level ops services.

Operations · e‑commerce · mobile operations
0 likes · 14 min read