How Tencent Scales SRE: Building a SLO‑Based Quality Operations System

This article examines Tencent's end‑to‑end SRE quality‑operation framework built on Service Level Objectives (SLO) and On‑Call, detailing industry background, problem statements, SLO management, On‑Call benefits, product architecture, large‑scale deployment, and future plans for reliability engineering.

Efficient Ops

May 31, 2023

This article examines how Tencent constructs and practices an SRE quality‑operation system based on Service Level Objectives (SLO) and On‑Call, sharing experiences, insights, and future outlook.

01 Industry Background

Stability engineering is hard to do well. Inspired by Google's SRE practice, Tencent adopted a large‑scale quality‑operation system built on SLOs and On‑Call to quantify stability.

2.1 Problem Background

Product stability cannot be quantified: In massive‑scale environments, without a quantitative measure of stability it is hard to set team goals and OKRs or to justify resource investment.

Fault process is opaque and uncontrollable: Different teams follow divergent processes; a unified SLO and On‑Call system standardizes actions, improves incident handling, and generates data.

Traditional methods are outdated: Without a systematic solution or a DevOps mindset, individuals have little motivation to work on stability; SLO and On‑Call make stability improvements tangible.

2.2 SLO Management

SLOs set a measurable target for system stability. Rapid releases mean more code changes, which can reduce stability. SLOs give SRE and development teams a reasonable way to weigh feature velocity against reliability and align both on user‑centric quality goals.

A common language grounded in the user's perspective makes collaborative problem solving easier.

Applications include error‑budget‑based burn‑rate alerts and development strategy decisions.
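One way an error budget can feed a development strategy decision is a simple release policy keyed to how much budget remains. The sketch below is only an illustration under stated assumptions, not Tencent's policy: the `ReleasePolicy` levels, the `release_policy` function, and the 25% caution threshold are all hypothetical.

```python
from enum import Enum


class ReleasePolicy(Enum):
    """Hypothetical release postures a team might agree on."""
    NORMAL = "normal"      # ship features at the usual pace
    CAUTIOUS = "cautious"  # slow down, prioritize reliability work
    FREEZE = "freeze"      # budget exhausted: only reliability fixes ship


def release_policy(budget_remaining: float,
                   caution_below: float = 0.25) -> ReleasePolicy:
    """Map the remaining error-budget fraction (1.0 = untouched) to a posture.

    The 0.25 default is an assumed threshold chosen for illustration.
    """
    if budget_remaining <= 0.0:
        return ReleasePolicy.FREEZE
    if budget_remaining < caution_below:
        return ReleasePolicy.CAUTIOUS
    return ReleasePolicy.NORMAL
```

For example, a team with 60% of its budget left would stay on `ReleasePolicy.NORMAL`, while one that has overspent would get `ReleasePolicy.FREEZE`.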

2.3 On‑Call Management

On‑Call covers system events, cloud component events, big‑data events, and user feedback, managed via a centralized platform, delivering five key benefits:

Visibility: Real‑time observation of system incidents.

Orchestration: Enables alert governance and orchestration.

Automation: Acts as a hub to integrate various automation tools.

Teamwork: Enhances collaboration across R&D, SRE, and operations as the organization grows.

Analytics: Generates high‑quality operational data for stability analysis.

2.4 Product Architecture

Tencent built an internal On‑Call platform that centralizes incident management for hundreds of development teams.

Event channels (SLO, observability, risk detection, user feedback) feed into On‑Call, which integrates with third‑party tools such as TAPD, WeChat Work, and Tencent Meeting, solving many workflow issues.

2.5 Actual Deployment

The system now serves dozens of products (video, QQ, docs, news, middle‑platform) and hundreds of teams, offering a reference for other enterprises.

03 Large‑Scale Practice at Tencent

3.1 SLO Management

3.1.1 Core Scenarios and SLI Indicators

SLOs are user‑oriented; Tencent defines hierarchical scenarios (primary for external users, secondary for internal users) so each team can set responsibilities and use the On‑Call service.

3.1.2 SLO Targets and Error Budgets

Targets are calculated using Google's method: the past 28 days of error‑budget consumption are analyzed and an appropriate SLO value is recommended. SRE leads the process together with the development teams.
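The arithmetic behind an error budget can be sketched as follows, assuming a request‑based availability SLI measured over the 28‑day window. The function names and the sample numbers are illustrative assumptions; the article does not describe Tencent's actual calculation or its recommendation logic.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float,
                    total_requests: int,
                    failed_requests: int) -> float:
    """Fraction of the window's error budget already spent (1.0 = exhausted)."""
    return failed_requests / error_budget(slo_target, total_requests)


# Illustrative numbers: a 99.9% SLO over 10 million requests in 28 days
# allows roughly 10,000 failures; 2,500 failures spend about a quarter of that.
allowed = error_budget(0.999, 10_000_000)
spent = budget_consumed(0.999, 10_000_000, 2_500)
```

A service that routinely spends far less than its budget may be able to accept a tighter target, and one that routinely overspends may need a looser target or more reliability work; that judgment is what the review between SRE and development teams is for.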

3.1.3 Error‑Budget‑Based Burn‑Rate Alerts

After defining SLOs, Tencent applies Google‑style short‑ and long‑window burn‑rate alerts at scale, improving alert timeliness.
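A short‑ and long‑window burn‑rate check can be sketched as below. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The default threshold of 14.4 is the value the Google SRE Workbook pairs with a 1‑hour long window and a 5‑minute short window against a 30‑day SLO; it is used here only as a placeholder, and the thresholds and windows Tencent uses are not given in the article. Function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    allowed_error_ratio = 1.0 - slo_target
    if allowed_error_ratio <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed_error_ratio


def should_alert(long_window_error_ratio: float,
                 short_window_error_ratio: float,
                 slo_target: float,
                 threshold: float = 14.4) -> bool:
    """Fire only when both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

With a 99.9% SLO, a 2% error ratio in the long window and 3% in the short window would fire, while a 2% long‑window ratio with a short window that has already dropped to 0.05% would not.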

3.1.4 Establishing SLO Operations Mechanism

SRE organizes weekly SLO operation meetings, coordinating with product lines, platform teams, and middle‑platform to iterate and roll out features quickly across dozens of business lines.

3.1.5 Future Plans

Future work focuses on core scenarios and indicators, reducing SLO configuration cost, providing alert templates, and leveraging error budgets for development investment decisions.
