Tag

on-call


Efficient Ops
Mar 18, 2025 · Operations

Is 24/7 On‑Call a Nightmare? Real Ops Insights from Zhihu Discussions

This article compiles diverse Zhihu comments on the reality of 24 × 7 on‑call duties, contrasting exaggerated myths with practical team‑based solutions, global shift models, backup strategies, and actionable tips for improving operations without sacrificing personal life.

Automation · Operations · SRE
0 likes · 7 min read
Efficient Ops
Jan 2, 2024 · Operations

Is 24/7 On‑Call Ops Really Terrifying? Real Insights from Chinese Ops Professionals

A collection of Zhihu answers reveals how large tech firms use multi‑timezone teams for 24/7 on‑call coverage, while smaller companies rely on rotation, backup, and automation to keep operations manageable, showing that constant availability need not be a nightmare.

24x7 · Automation · Operations
0 likes · 8 min read
Aikesheng Open Source Community
Jul 24, 2023 · Operations

Exploring On‑Call Duty Models and SRE‑Driven Operations Management

This article examines the challenges of traditional on‑call duty systems for operations teams, proposes an SRE‑inspired rotation model that involves developers, defines concrete KPI targets, and describes how automation and chat‑bot tools can streamline incident response and reduce internal friction.

Automation · KPI · Operations
0 likes · 12 min read
Efficient Ops
Jun 1, 2023 · Operations

How Tencent’s On‑Call System Transforms Incident Management and Quality Ops

This article explores how Tencent builds and practices its SRE quality operation system, focusing on On‑Call incident management, standardized channels, alert handling, data quality models, and the resulting improvements in reliability, MTTR reduction, and data‑driven decision making.

Operations · SRE · incident management
0 likes · 14 min read
Efficient Ops
May 31, 2023 · Operations

How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end-to-end SRE quality-operation framework built on Service Level Objectives (SLOs) and On-Call, detailing industry background, problem statements, SLO management, On-Call benefits, product architecture, large-scale deployment, and future plans for reliability engineering.

Quality Operations · Reliability Engineering · SLO
0 likes · 11 min read
Efficient Ops
Feb 5, 2020 · Operations

Balancing Stability and Speed: Google SRE Lessons for Modern Ops Teams

This article examines the inherent tension between operations and development, explains Google’s error‑budget and SLO approach, and shares practical DevOps, on‑call, automation, and talent strategies that help ops teams improve efficiency while maintaining product reliability.

Automation · DevOps · Error Budget
0 likes · 9 min read
Sohu Tech Products
Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Operations · SRE · alert management
0 likes · 15 min read
Alibaba Cloud Infrastructure
Sep 28, 2018 · Operations

8 Practical Tips for Operations Teams to Manage the Golden Week Holiday

This article offers eight practical strategies for operations teams (inspection, monitoring alerts, capacity planning, network restrictions, risk pre-plans, data backup, on-call mechanisms, and staying connected) that keep systems stable so the team can enjoy the Golden Week holiday without incidents.

Capacity Planning · Operations · backup
0 likes · 4 min read
360 Zhihui Cloud Developer
Jul 11, 2017 · Operations

Mastering On-Call: Practical Lessons from Google SRE for Effective Ops

This article shares practical insights from Google SRE on on‑call duty, covering why on‑call is needed, common challenges, effective scheduling, evaluation methods, and actionable tips to improve team resilience and reduce stress for operations engineers.

Operations · SRE · incident management
0 likes · 9 min read
Efficient Ops
Nov 7, 2016 · Operations

How to Train New SREs Effectively: Proven Practices and Playbooks

This article outlines a systematic approach to onboarding and training new Site Reliability Engineers, covering trust building, readiness assessment, diverse learning methods, structured curricula, on‑call milestones, project‑focused work, reverse‑engineering skills, statistical thinking, and improvisation techniques to develop high‑performing SRE teams.

Operations · Reverse Engineering · SRE
0 likes · 17 min read