How to Implement SLI/SLO Monitoring with Service Level Operator on Kubernetes
This article explains the concepts of SLI and SLO, shows how to select appropriate indicators, introduces Google’s VALET method, and provides step‑by‑step instructions for deploying the Service Level Operator on a Kubernetes cluster with Prometheus and Grafana for full SLI/SLO monitoring and alerting.
What is SLI/SLO
An SLI (Service Level Indicator) is a metric that measures how stable a system is; an SLO (Service Level Objective) is the target value for that metric, such as 99.99% ("four nines") or 99.999% ("five nines"). SRE teams use the two together to judge whether a system meets its availability goals.
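To make the "nines" concrete, the sketch below converts an availability target into the downtime it tolerates. The 30-day window and the function name `allowed_downtime_minutes` are assumptions chosen for illustration; the article does not prescribe them.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO tolerates over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO leaves over: 99.99% -> 0.01%.
    error_budget_ratio = 1 - slo_percent / 100
    return window_minutes * error_budget_ratio


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Under these assumptions, "four nines" leaves roughly 4.3 minutes of downtime per 30 days, which is why the target has to be chosen deliberately rather than by habit.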
How to Choose SLI
Common metrics include:
System level: CPU usage, memory usage, disk usage, etc.
Application server level: port health, JVM status, etc.
Application runtime level: status codes, latency, QPS, etc.
Middleware level: QPS, TPS, latency, etc.
Business level: success rate, growth speed, etc.
Selection principles:
Choose metrics that directly reflect the stability of the target entity; discard metrics that do not.
Prefer metrics that are strongly related to user experience or are easily perceived by users.
The VALET method, popularized through Google’s SRE literature, can be applied directly:
V – Volume: maximum capacity promised by the service.
A – Availability: whether the service is operational.
L – Latency: response time of the service.
E – Error: request error rate.
T – Ticket: need for manual intervention.
For deeper study, see "SRE: Google Site Reliability Engineering" and Zhao Cheng’s "SRE Practice Handbook".
service-level-operator
The Service Level Operator measures SLI/SLO for applications running in Kubernetes and can display the results in Grafana.
Example ServiceLevel custom resource:
apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://myprometheus:9090
          totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
          errorQuery: sum(increase(http_request_total{host="awesome_service_io",code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: a-team
            iteration: "3"

Key fields:
availabilityObjectivePercent – the SLO target.
totalQuery – total request count.
errorQuery – error request count.
The operator evaluates both queries, derives the SLI (the error ratio) from them, and compares it against the objective.
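The arithmetic behind those fields can be sketched as follows. This illustrates the general idea (an error ratio compared with the budget implied by the objective) and is not the operator's actual code; the function name `evaluate_slo` and the sample numbers are hypothetical.

```python
def evaluate_slo(total: float, errors: float, objective_percent: float) -> dict:
    """Compare an observed error ratio with the error budget implied by an SLO."""
    if total <= 0:
        raise ValueError("total must be positive")
    error_ratio = errors / total                 # errorQuery result / totalQuery result
    budget_ratio = 1 - objective_percent / 100   # share of errors the SLO tolerates
    return {
        "availability_percent": (1 - error_ratio) * 100,
        "error_ratio": error_ratio,
        "budget_ratio": budget_ratio,
        "within_objective": error_ratio <= budget_ratio,
    }


if __name__ == "__main__":
    # Hypothetical sample: 100,000 requests in the window, 5 of them returned 5xx.
    print(evaluate_slo(total=100_000, errors=5, objective_percent=99.99))
```

With a 99.99% objective the tolerated error ratio is 0.0001, so 5 errors in 100,000 requests stays within the objective while 20 errors would not.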
Deploying service-level-operator
1. Create RBAC
apiVersion: v1
kind: ServiceAccount
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: service-level-operator
  labels:
    app: service-level-operator
    component: app
rules:
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["*"]
  - apiGroups: ["monitoring.spotahome.com"]
    resources: ["servicelevels", "servicelevels/status"]
    verbs: ["*"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: service-level-operator
subjects:
  - kind: ServiceAccount
    name: service-level-operator
    namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: service-level-operator

2. Create Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-level-operator
      component: app
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: service-level-operator
        component: app
    spec:
      serviceAccountName: service-level-operator
      containers:
        - name: app
          imagePullPolicy: Always
          image: quay.io/spotahome/service-level-operator:latest
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: http
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: http
          resources:
            limits:
              cpu: 220m
              memory: 254Mi
            requests:
              cpu: 120m
              memory: 128Mi

3. Create Service
apiVersion: v1
kind: Service
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
spec:
  ports:
    - port: 80
      protocol: TCP
      name: http
      targetPort: http
  selector:
    app: service-level-operator
    component: app

4. Create Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
    prometheus: myprometheus
spec:
  selector:
    matchLabels:
      app: service-level-operator
      component: app
  namespaceSelector:
    matchNames:
      - monitoring
  endpoints:
    - port: http
      interval: 10s

After deployment, the operator appears as a target in Prometheus.
Defining ServiceLevel for an Application
Example for the Grafana service ("four nines" SLO):
apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: prometheus-grafana-service
  namespace: monitoring
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://prometheus-k8s.monitoring.svc:9090
          totalQuery: sum(increase(http_request_total{service="grafana"}[2m]))
          errorQuery: sum(increase(http_request_total{service="grafana",code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: prometheus-grafana
            iteration: "3"

Import Grafana dashboard ID 8793 to visualize SLI, error budget consumption, and remaining budget.
Alert Rules for SLO Breaches
groups:
  - name: slo.rules
    rules:
      - alert: SLOErrorRateTooFast1h
        expr: |
          (increase(service_level_sli_result_error_ratio_total[1h]) /
          increase(service_level_sli_result_count_total[1h])) > (1 - service_level_slo_objective_ratio) * 14.6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 1h is greater than 2%
          description: The error rate for 1h exceeds the 2% monthly budget.
      - alert: SLOErrorRateTooFast6h
        expr: |
          (increase(service_level_sli_result_error_ratio_total[6h]) /
          increase(service_level_sli_result_count_total[6h])) > (1 - service_level_slo_objective_ratio) * 6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 6h is greater than 5%
          description: The error rate for 6h exceeds the 5% monthly budget.

These rules trigger alerts when error budget consumption exceeds the thresholds derived from Google’s baseline.
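To see what these expressions compare, here is the same inequality written outside Prometheus. It is a sketch that assumes the two counters accumulate per-evaluation error ratios and the number of evaluations respectively, so that their quotient is the mean error ratio over the window; the function name `burns_too_fast` and the sample values are made up for illustration.

```python
def burns_too_fast(error_ratio_increase: float, count_increase: float,
                   objective_ratio: float, burn_rate_threshold: float) -> bool:
    """Mirror the alert expression: mean error ratio > (1 - objective) * threshold."""
    if count_increase <= 0:
        return False  # no SLI evaluations in the window, nothing to compare
    mean_error_ratio = error_ratio_increase / count_increase
    allowed_ratio = (1 - objective_ratio) * burn_rate_threshold
    return mean_error_ratio > allowed_ratio


if __name__ == "__main__":
    # Hypothetical 1h window: 360 evaluations whose error ratios sum to 0.72,
    # i.e. a mean error ratio of 0.2% against a 99.99% objective.
    print(burns_too_fast(0.72, 360, objective_ratio=0.9999, burn_rate_threshold=14.6))
```

For a 99.99% objective the 1h rule fires once the mean error ratio exceeds 0.0001 × 14.6 = 0.146%.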
Google’s Baseline for SLO Error Budgets
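A month is taken as 730 hours (or 30 days for the 3‑day window). Each factor is a burn rate: how many times faster than the exactly-on-budget pace the error budget is being consumed.
1h window – 2% of the monthly budget consumed → 730 × 2 / 100 = 14.6× burn rate.
6h window – 5% of the monthly budget consumed → 730 / 6 × 5 / 100 ≈ 6× burn rate.
3‑day window – 10% of the monthly budget consumed → 30 / 3 × 10 / 100 = 1× burn rate.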
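These factors can be reproduced with a few lines of arithmetic. The sketch below assumes a 730-hour month throughout, so the 3-day result comes out slightly above the rounded value of 1 obtained with a 30-day month; `burn_rate` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730  # average month length (8760 h / 12)


def burn_rate(window_hours: float, budget_fraction: float,
              period_hours: float = HOURS_PER_MONTH) -> float:
    """Burn rate at which `budget_fraction` of the period's budget is spent in one window."""
    if window_hours <= 0 or period_hours <= 0:
        raise ValueError("window_hours and period_hours must be positive")
    return period_hours / window_hours * budget_fraction


if __name__ == "__main__":
    for hours, fraction in ((1, 0.02), (6, 0.05), (72, 0.10)):
        print(f"{hours:>2}h window, {fraction:.0%} of budget -> burn rate {burn_rate(hours, fraction):.2f}")
```

The resulting multipliers are the constants that appear on the right-hand side of the alert expressions.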
Measuring System Availability
Two common approaches:
Time‑based: Availability = Service Time / (Service Time + Downtime).
Request‑based: Availability = Successful Requests / Total Requests.
SRE practice usually prefers the request‑based metric, but a comprehensive view should also consider latency, error rate, and finer‑grained SLI for critical components.
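The two formulas can be put side by side in a short sketch. The function names and the sample figures (a 30-day window with a 10-minute outage, and 2,000,000 requests with 150 failures) are hypothetical.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability = service time / (service time + downtime)."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("the observation window must be longer than zero")
    return uptime_seconds / total


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Availability = successful requests / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


if __name__ == "__main__":
    window = 30 * 24 * 3600   # 30 days in seconds
    downtime = 10 * 60        # one 10-minute outage
    print(f"time-based:    {time_based_availability(window - downtime, downtime):.5%}")
    print(f"request-based: {request_based_availability(2_000_000 - 150, 2_000_000):.5%}")
```

The same incident can score differently under the two measures: an outage during a quiet period costs a lot of time but few requests, which is one reason the request-based view tends to track user impact more closely.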
References
[1] "SRE Practice Handbook" – Zhao Cheng
[2] "SRE: Google Site Reliability Engineering"
[3] https://github.com/spotahome/service-level-operator
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.