Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections
Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but ran into costly grading inconsistencies, errors hidden behind successful HTTP responses, and inaccurate business‑level metrics, leading them to conclude that SLOs are chiefly valuable for early alerting rather than exhaustive reporting.
01 Background
In recent years, Google SRE has become very popular in China. The Google SRE methodology emphasizes that SLO is the core of SRE practice; SLO sets a target level for service reliability and is a key factor for reliability decisions. This article discusses how Bilibili chooses and calculates SLI, sets SLO, and puts the practice into production, sharing the pitfalls and experience.
02 Definition of SLO in Google SRE
Service Level Objective (SLO) specifies the target level of service reliability. Because SLOs are the key for data‑driven reliability decisions, they are the core of SRE practice.
Google's "The Site Reliability Workbook" states that SLOs are needed because:
Engineers are scarce; time should be spent on core problems of important services.
SLOs are crucial for prioritising work and reliability‑related tasks.
SRE’s core responsibilities include automation and incident handling, but daily work must follow SLOs.
Without SLOs, there is no SRE.
For error‑budget‑based reliability engineering, the handbook also stresses:
Stakeholders must recognise the SLO.
The service can meet the SLO under normal conditions.
The organisation must accept the error budget and use it in decision‑making.
A complete SLO process must exist.
Otherwise, SLO compliance becomes a KPI or reporting metric rather than a decision‑making tool.
03 Implementation of SLO in Google SRE
The second chapter of the workbook, "Implementing SLOs", outlines the SLO implementation process:
1. SLI Selection
For request‑driven services, typical SLIs are availability (success‑response ratio), latency, and quality.
2. SLI Calculation
SLI can be calculated from application server logs, load‑balancer monitoring, black‑box monitoring, client plugins, etc.
Load‑balancer metrics are usually chosen because they represent the total request processing time across all modules and network hops, and they are cheaper to implement than client plugins.
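As an illustration, the two main SLIs can be computed from per‑request load‑balancer records roughly as follows. This is only a sketch; the record fields are assumptions, not any actual log schema:

```python
# Sketch: computing availability and latency SLIs from per-request
# load-balancer access records (hypothetical record format).
from dataclasses import dataclass

@dataclass
class AccessRecord:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # total request processing time in milliseconds

def availability_sli(records):
    """Fraction of requests that did not fail with a server error (5xx)."""
    total = len(records)
    good = sum(1 for r in records if r.status < 500)
    return good / total if total else 1.0

def latency_sli(records, threshold_ms):
    """Fraction of requests completing within the latency threshold."""
    total = len(records)
    fast = sum(1 for r in records if r.latency_ms <= threshold_ms)
    return fast / total if total else 1.0

records = [AccessRecord(200, 35.0), AccessRecord(200, 180.0),
           AccessRecord(502, 900.0), AccessRecord(200, 60.0)]
print(availability_sli(records))   # 0.75
print(latency_sli(records, 200))   # 0.75
```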
3. SLO Definition
Based on the calculated availability and latency data, define appropriate service SLOs.
Example: annual availability ≥ 99.99%.
Example: 99% of requests ≤ 200 ms, 90% ≤ 100 ms.
SLOs can be defined for different time windows (monthly, quarterly, etc.).
Obtain stakeholder approval.
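A latency SLO of the form above can be checked against a window of observed latencies with a simple nearest‑rank percentile. A minimal sketch with invented sample data:

```python
# Sketch: checking latency SLOs such as "99% of requests <= 200 ms,
# 90% <= 100 ms" against a window of observed latencies.
def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]

def meets_latency_slo(latencies_ms, slo):
    """slo: list of (percentile, max_ms) pairs, e.g. [(99, 200), (90, 100)]."""
    return all(percentile(latencies_ms, p) <= max_ms for p, max_ms in slo)

window = [20, 35, 50, 60, 80, 90, 95, 120, 150, 190]
# Here p99 is 190 ms (within 200 ms) but p90 is 150 ms (above 100 ms),
# so the window fails the SLO.
print(meets_latency_slo(window, [(99, 200), (90, 100)]))  # False
```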
4. Error Budget
With SLI and SLO, the allowed number of failures in a time window is known.
If the error budget is exhausted, adopt mitigation strategies such as freezing feature releases or redirecting engineering effort toward reliability work until the budget recovers.
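The error‑budget arithmetic itself is straightforward; a minimal sketch for a request‑based availability SLO:

```python
# Sketch: error-budget arithmetic for a request-based availability SLO.
def error_budget(slo_target, total_requests):
    """Allowed number of failed requests in the window."""
    return (1.0 - slo_target) * total_requests

def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget

# A 99.99% SLO over 10 million requests allows roughly 1,000 failures;
# after 250 failures, about 75% of the budget remains.
print(error_budget(0.9999, 10_000_000))
print(budget_remaining(0.9999, 10_000_000, 250))
```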
5. Record SLO and Error Budget
Document author, reviewer, approval date, next review date, background, etc.
Track platform, process, policy, and change events for traceability.
Detail SLI implementation, calculation, and error‑budget usage.
6. Dashboards and Reports
Provide published SLOs, error budgets, and visual dashboards/reports.
7. Continuous Improvement of SLOs
04 Our SLO Practice
From Google’s SLO description we extracted key information to guide our own construction.
Service Grading
Application (Technical View) – one appid per application, includes front‑end and back‑end, can be built and deployed independently.
Business (Product View) – a set of related product functions, relatively independent business modules, contains a group of related applications.
Grading Levels (L0‑L3)
The grading is applied first to the business, then to applications under the business, and finally to APIs, ensuring that an API’s grade never exceeds its application’s grade.
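The top‑down constraint can be sketched as a simple clamp, assuming grades are encoded as integers with L0 the most critical (the encoding is an assumption for illustration):

```python
# Sketch: propagating grades top-down so a child's grade is never more
# critical than its parent's. Lower number = more critical (L0 highest).
def effective_grade(parent_grade: int, own_grade: int) -> int:
    """Clamp a child's declared grade to be no more critical than its parent's."""
    return max(parent_grade, own_grade)

business = 0                          # an L0 business
app = effective_grade(business, 1)    # application declared L1 -> stays L1
api = effective_grade(app, 0)         # API declared L0 is clamped to L1
print(app, api)  # 1 1
```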
SLO System
1. SLI Selection
For online services we choose availability, latency, and throughput.
Availability is measured by error count and request success rate.
Latency is measured by p90 and p99.
Throughput is measured by total daily requests and request rate.
2. SLI Model
API metrics reflect business functionality; we measure API availability, latency, and throughput.
Only L0 and L1 APIs are measured.
Business SLI aggregates the selected API SLIs, focusing on availability.
3. SLI Calculation
Use load‑balancer (SLB) metrics for all public‑facing services.
Internal services are measured indirectly via public‑facing services.
API availability: per‑minute error count (HTTP 5XX), total requests, success rate, latency percentiles, throughput; aggregated daily.
Business availability: aggregate error count and success rate across L0 and L1 APIs, weighted by level.
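The level‑weighted aggregation might look like the following sketch. The per‑grade weights (L0 = 2, L1 = 1) are hypothetical, since the weighting scheme is not specified here:

```python
# Sketch: aggregating per-API counters into a business availability SLI,
# weighting APIs by grade. The weights below are hypothetical.
def business_availability(apis):
    """apis: list of (grade, total_requests, error_requests) tuples."""
    weights = {0: 2.0, 1: 1.0}  # hypothetical per-grade weights (L0, L1)
    weighted_total = weighted_good = 0.0
    for grade, total, errors in apis:
        w = weights[grade]
        weighted_total += w * total
        weighted_good += w * (total - errors)
    return weighted_good / weighted_total if weighted_total else 1.0

apis = [(0, 1_000_000, 100),    # an L0 API: 100 errors in 1M requests
        (1, 500_000, 5_000)]    # an L1 API: 5,000 errors in 500k requests
print(business_availability(apis))
```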
4. SLO Definition
Define SLOs only for availability and latency; throughput is shown in dashboards.
Default API SLOs based on grade (e.g., L0 API availability ≥ 99.99%, L0 API 99th‑percentile latency ≤ 200 ms).
Business‑level SLOs target annual availability ≥ 99.99%.
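Grade‑based defaults can be kept in a simple lookup table. Only the L0 values below come from the text; the L1–L3 entries are illustrative placeholders:

```python
# Sketch: default SLO targets keyed by grade. L0 values are from the
# article; L1-L3 values are illustrative placeholders, not real defaults.
DEFAULT_SLOS = {
    "L0": {"availability": 0.9999, "p99_latency_ms": 200},
    "L1": {"availability": 0.999,  "p99_latency_ms": 300},   # placeholder
    "L2": {"availability": 0.99,   "p99_latency_ms": 500},   # placeholder
    "L3": {"availability": 0.99,   "p99_latency_ms": 1000},  # placeholder
}

def default_slo(grade: str) -> dict:
    """Look up the default SLO for an API of the given grade."""
    return DEFAULT_SLOS[grade]

print(default_slo("L0"))
```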
5. Error Budget
Initially we did not focus on error‑budget construction; we only used SLO‑based alerts.
6. Record SLO and Error Budget
Provide a platform to support SLO definition, review, and recording.
7. Dashboards and Reports
API dashboards and reports.
05 Problems Encountered
1. Business Grading
Inconsistent business abstraction; product scope can be large or small.
L0 and core L1 grading is fairly accurate, but other business grades are uncertain.
2. Application Grading
Number of applications far exceeds number of businesses, making grading difficult.
Legacy applications lag behind business refactoring.
Grades for non‑core applications have low accuracy.
3. Interface Grading
Even more interfaces than applications; grading is time‑consuming and costly.
Interface grades lag behind business changes.
4. SLI Calculation
When upstream/downstream failures cause HTTP 200 responses, real errors are hidden in business error codes; SLB cannot parse the body.
This can lead to a half‑hour outage being reported as 100% availability.
Switching to application‑reported metrics (Prometheus) solves the issue, but when the app is down it cannot report metrics.
Changing business error codes to HTTP codes is costly and conflicts with micro‑service standards.
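To make these hidden failures visible, the availability SLI has to look inside the response rather than at the HTTP status alone. A sketch, assuming a hypothetical `{"code": ...}` response‑body convention:

```python
# Sketch: classifying requests as failed using both the HTTP status and
# a business error code carried in the body. The {"code": ...} body
# convention is hypothetical.
import json

def is_failed(http_status: int, body: str) -> bool:
    """A request fails on a 5xx, or on a non-zero business error code."""
    if http_status >= 500:
        return True
    try:
        return json.loads(body).get("code", 0) != 0
    except ValueError:
        return True  # an unparseable body also counts as a failure

responses = [
    (200, '{"code": 0, "data": {}}'),       # genuine success
    (200, '{"code": -500, "data": null}'),  # failure hidden behind HTTP 200
    (502, ''),                              # the only one the SLB would catch
]
failed = sum(is_failed(s, b) for s, b in responses)
print(1 - failed / len(responses))  # availability including hidden errors
```

An SLB‑only view of the same three responses would count a single failure; body‑aware classification counts two.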
5. Business SLI
Need to filter APIs that affect core business functions; other APIs should not impact business SLI.
API changes during product iteration can render previous SLI data irrelevant.
6. Summary of Problems
Grading model is idealistic and costly.
Business‑API SLI relationships and metadata updates are delayed, reducing data accuracy.
Single‑metric availability SLI cannot cover all failure scenarios.
Some departments only provide internal services, lacking SLI data.
SLI data is mainly used for reports and provides little other value.
We attempted to use business availability SLI as the annual availability report, but the calculation diverged significantly from fault‑time‑based availability, leading to a costly data‑compensation mechanism that required manual incident review and SLI correction. Ultimately, the obsession with perfect SLI accuracy caused the entire SLO system to stall.
06 Reflection
We asked ourselves where the problem lay and what the true value of an SLO is. An SLO's value does not come from mandating a number; its core value is enabling timely alerts when an SLI degrades. Without proper alerting, an SLO is merely a report.
Key reflections:
SLO is a decision‑making factor, not an absolute requirement.
The main value of an SLO is early detection of incidents that degrade the availability SLI.
An error‑budget policy should guide short‑term priorities after an incident exhausts a quarter's budget, but freezing product iteration for months is unrealistic.
SLI measurement should focus on business functions and applications, not on aggregated business SLI.
07 Conclusion
This article detailed our thought process, implementation steps, and the problems we encountered while building an SLO system without fully understanding its value. In the next article we will share the revised SLO construction approach based on our new insights.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.