How Tencent Scales SRE: Building a SLO‑Based Quality Operations System
This article examines how Tencent built and operates an end‑to‑end SRE quality‑operation system on Service Level Objectives (SLOs) and On‑Call. It covers the industry background, the problems being solved, SLO management, the benefits of On‑Call, product architecture, large‑scale deployment, and future plans, and shares the experience and insights gained along the way.
01 Industry Background
Stability engineering is hard to do well at scale. Inspired by Google's SRE practice, Tencent adopted a large‑scale quality‑operation system built on SLOs and On‑Call to make stability quantifiable.
2.1 Problem Background
Product stability cannot be quantified: At massive scale, the lack of a quantitative measure of stability makes it hard to set team goals and OKRs or to justify resource investment.
Fault process is opaque and uncontrollable: Different teams follow divergent processes. A unified SLO and On‑Call system standardizes actions, improves incident handling, and generates data.
Traditional methods are outdated: Without a systematic solution or a DevOps mindset, individuals have little motivation to work on stability. SLOs and On‑Call make stability improvements tangible.
2.2 SLO Management
SLOs describe the current stability of a system. Rapid releases mean more code changes, and each change can reduce stability. SLOs give a reasonable way to evaluate the trade‑off between feature velocity and reliability, and they align SRE and development teams on user‑centric quality goals.
A common language grounded in the user's perspective makes collaborative problem solving easier.
Applications include error‑budget‑based burn‑rate alerts and development strategy decisions.
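The arithmetic behind an error budget and its burn rate can be worked through as follows. This is an illustrative sketch only: the 99.9% target and the 0.5% observed error ratio are assumed values, not Tencent figures, and the 28‑day window is borrowed from the target calculation described in section 3.1.2.

```latex
\begin{align*}
\text{error budget} &= 1 - \text{SLO} \\
\text{burn rate} &= \frac{\text{observed error ratio}}{1 - \text{SLO}} \\[4pt]
\text{SLO} = 99.9\%:\quad \text{error budget} &= 0.1\% \\
\text{budget over 28 days} &= 0.001 \times 28 \times 24 \times 60\ \text{min}
  = 40.32\ \text{min of full outage} \\
\text{observed error ratio} = 0.5\%:\quad \text{burn rate} &= \frac{0.005}{0.001} = 5 \\
\text{time to exhaust the budget} &= \frac{28\ \text{days}}{5} = 5.6\ \text{days}
\end{align*}
```

A burn rate of 1 spends the budget exactly over the full window. Anything above 1 exhausts it early, which is the signal that burn‑rate alerts and release decisions key on.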
2.3 On‑Call Management
On‑Call covers system events, cloud component events, big‑data events, and user feedback, managed via a centralized platform, delivering five key benefits:
Visibility: Real‑time observation of system incidents.
Orchestration: Enables alert governance and orchestration.
Automation: Acts as a hub that integrates various automation tools.
Teamwork: Enhances collaboration across R&D, SRE, and operations as the organization grows.
Analytics: Generates high‑quality operational data for stability analysis.
2.4 Product Architecture
Tencent built an internal On‑Call platform that centralizes incident management for hundreds of development teams.
Event channels (SLO, observability, risk detection, user feedback) feed into On‑Call, which integrates with third‑party tools such as TAPD, WeChat Work, and Tencent Meeting, solving many workflow issues.
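One way to picture the hub role described above is a small ingestion layer that normalizes events from every channel and groups them by owning team. The sketch below is hypothetical throughout: the class names, the ownership table, the "sre-default" fallback, and the grouping rule are assumptions for illustration, not the design of Tencent's platform.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    """The four event channels named in the article."""
    SLO = "slo"
    OBSERVABILITY = "observability"
    RISK_DETECTION = "risk_detection"
    USER_FEEDBACK = "user_feedback"


@dataclass(frozen=True)
class Event:
    channel: Channel
    service: str
    summary: str


class OnCallHub:
    """Hypothetical hub: routes events to an owning team and groups them per service."""

    def __init__(self, owners):
        self.owners = owners        # service name -> owning team (assumed lookup table)
        self.open_incidents = {}    # (team, service) -> list of events

    def ingest(self, event):
        # Unknown services fall back to a default rotation (an assumption).
        team = self.owners.get(event.service, "sre-default")
        key = (team, event.service)
        is_new = key not in self.open_incidents
        # Events for a service with an open incident are attached to it,
        # so the on-call engineer sees one incident instead of many alerts.
        self.open_incidents.setdefault(key, []).append(event)
        return team, is_new


hub = OnCallHub({"video-play": "video-sre"})
first = hub.ingest(Event(Channel.SLO, "video-play", "burn rate above threshold"))
second = hub.ingest(Event(Channel.USER_FEEDBACK, "video-play", "playback failures reported"))
```

In a real deployment the `is_new` flag is where integrations would hook in, for example opening a ticket or starting a meeting only for the first event of an incident.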
2.5 Actual Deployment
The system now serves dozens of products (video, QQ, docs, news, middle‑platform) and hundreds of teams, offering a reference for other enterprises.
03 Large‑Scale Practice at Tencent
3.1 SLO Management
3.1.1 Core Scenarios and SLI Indicators
SLOs are user‑oriented; Tencent defines hierarchical scenarios (primary for external users, secondary for internal users) so each team can set responsibilities and use the On‑Call service.
3.1.2 SLO Targets and Error Budgets
Targets follow Google's method: the system analyzes error consumption over the past 28 days and recommends an appropriate SLO value. SRE leads the process together with the development teams.
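A minimal sketch of such a recommendation step, under stated assumptions: the article does not give the exact rule, so this version simply picks the strictest target from a fixed candidate list that the service actually met over the window. The candidate list, the `DailyCount` type, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DailyCount:
    good: int    # requests that met the SLI criterion that day
    total: int   # all valid requests that day


# Assumed menu of targets to recommend from; not taken from the article.
CANDIDATE_TARGETS = [0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]


def recommend_slo(window, candidates=CANDIDATE_TARGETS):
    """Return (target, achieved, budget_consumed) for a list of DailyCount."""
    good = sum(d.good for d in window)
    total = sum(d.total for d in window)
    if total == 0:
        raise ValueError("no traffic in the window")
    achieved = good / total
    # Strictest candidate the service met; fall back to the loosest one otherwise.
    feasible = [t for t in candidates if t <= achieved]
    target = max(feasible) if feasible else min(candidates)
    # Fraction of the error budget that would have been spent at this target.
    budget_consumed = (1 - achieved) / (1 - target)
    return target, achieved, budget_consumed


# 28 days of identical traffic, 99.97% of requests good (illustrative numbers).
history = [DailyCount(good=99_970, total=100_000) for _ in range(28)]
target, achieved, consumed = recommend_slo(history)
```

With these numbers the sketch recommends 99.95% and reports that roughly 60% of the budget would have been consumed, leaving headroom rather than a target the service is already missing.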
3.1.3 Error‑Budget‑Based Burn‑Rate Alerts
After defining SLOs, Tencent applies Google‑style short‑ and long‑window burn‑rate alerts at scale, improving alert timeliness.
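The short‑ and long‑window check can be sketched as below. The window pairs and thresholds are the example values from the Google SRE Workbook for a 30‑day SLO period; they are assumptions here, and the article does not state which values Tencent uses. The policy class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRatePolicy:
    long_window: str
    short_window: str
    threshold: float
    action: str


# Example values from the Google SRE Workbook (30-day period); assumed, not Tencent's.
POLICIES = [
    BurnRatePolicy("1h", "5m", 14.4, "page"),
    BurnRatePolicy("6h", "30m", 6.0, "page"),
    BurnRatePolicy("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: dict[str, float], slo_target: float,
             policies=POLICIES) -> list[str]:
    """Return the alerts that fire, given the error ratio measured per window."""
    fired = []
    for p in policies:
        long_burn = burn_rate(error_ratios[p.long_window], slo_target)
        short_burn = burn_rate(error_ratios[p.short_window], slo_target)
        # The long window shows the burn is significant; the short window
        # confirms it is still happening, so the alert resets quickly after recovery.
        if long_burn >= p.threshold and short_burn >= p.threshold:
            fired.append(f"{p.action}:{p.long_window}/{p.short_window}")
    return fired


# Ongoing incident: 2% errors over the last hour against a 99.9% SLO.
ongoing = evaluate(
    {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}, 0.999)

# Same hour, but the last five minutes are healthy again: nothing fires.
recovered = evaluate(
    {"5m": 0.0001, "30m": 0.003, "1h": 0.02, "6h": 0.004, "3d": 0.0005}, 0.999)
```

Requiring both windows to exceed the threshold is what improves timeliness without paging on short blips or on incidents that have already recovered.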
3.1.4 Establishing SLO Operations Mechanism
SRE organizes weekly SLO operation meetings, coordinating with product lines, platform teams, and middle‑platform to iterate and roll out features quickly across dozens of business lines.
3.1.5 Future Plans
Future work focuses on core scenarios and indicators, reducing SLO configuration cost, providing alert templates, and leveraging error budgets for development investment decisions.