Tag

Oncall

Efficient Ops
Jun 9, 2025 · Operations

How OnCall Platforms Transform Incident Management and Reduce Manual Overhead

This article explains the purpose and key features of OnCall platforms, compares popular solutions like PagerDuty, Opsgenie, Grafana OnCall and Alibaba Cloud ARMS, clarifies webhooks with a simple analogy, and summarizes how centralized on‑call management boosts operational efficiency while minimizing manual intervention.

Incident Response · Oncall · monitoring
0 likes · 5 min read
Efficient Ops
Apr 8, 2024 · Operations

What Exactly Is SRE? A Deep Dive into Roles, Responsibilities, and Best Practices

This article explains what Site Reliability Engineering (SRE) is, outlines the three main layers of SRE work (Infrastructure, Platform, and Business), covers hiring challenges and daily duties such as deployment, on‑call, SLI/SLO management, capacity planning, and user support, and offers practical interview and career advice.

Oncall · SRE · Site Reliability Engineering
0 likes · 22 min read
DevOps
Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Oncall · SLI · SLO
0 likes · 21 min read
Bilibili Tech
May 20, 2022 · Operations

Bilibili SRE Practices: Stability Operations, Incident Management, and Platform Enablement

Bilibili’s SRE team, confronting rapid growth and complex systems, built a systematic stability operation that includes emergency response, incident handling, on‑call scheduling, and an Event Operations Center platform, using metrics like MTTR, MTTI and AI‑assisted automation to reduce downtime and improve reliability.

Bilibili · Metrics · Oncall
0 likes · 27 min read
Bilibili Tech
Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability · Incident Response · Metrics
0 likes · 23 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · SLI · SLO
0 likes · 21 min read
Architect
Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · SLI/SLO · SRE
0 likes · 22 min read