
How to Implement SLI/SLO Monitoring with Service Level Operator on Kubernetes

This article explains the concepts of SLI and SLO, shows how to select appropriate indicators, introduces Google’s VALET method, and provides step‑by‑step instructions for deploying the Service Level Operator on a Kubernetes cluster with Prometheus and Grafana for full SLI/SLO monitoring and alerting.

Ops Development Stories

What is SLI/SLO

An SLI (Service Level Indicator) is a metric that measures an aspect of system stability, while an SLO (Service Level Objective) is the target value for that metric, such as "four nines" (99.99%) or "five nines" (99.999%). SRE teams use the two together to assess whether a system meets its availability goals.
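To make the "nines" concrete, the sketch below converts an SLO percentage into the amount of full downtime it tolerates. The function name and the 30-day window are assumptions chosen for illustration, not something the article prescribes.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of complete downtime a time-based SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for label, slo in (("four nines", 99.99), ("five nines", 99.999)):
        minutes = allowed_downtime_minutes(slo)
        print(f"{label} ({slo}%): {minutes:.2f} minutes per 30 days")
```

Over 30 days, four nines leaves roughly 4.32 minutes of downtime and five nines roughly 0.43 minutes, which is why each extra nine is expensive to promise.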

How to Choose SLI

Common metrics include:

System level: CPU usage, memory usage, disk usage, etc.

Application server level: port health, JVM status, etc.

Application runtime level: status codes, latency, QPS, etc.

Middleware level: QPS, TPS, latency, etc.

Business level: success rate, growth speed, etc.

Selection principles:

Choose metrics that directly reflect the stability of the target entity; discard metrics that do not.

Prefer metrics that are strongly related to user experience or are easily perceived by users.

The VALET method, described in Google’s "The Site Reliability Workbook" as The Home Depot’s approach, can be used directly:

V – Volume: maximum capacity promised by the service.

A – Availability: whether the service is operational.

L – Latency: response time of the service.

E – Error: request error rate.

T – Ticket: need for manual intervention.

For deeper study, see "SRE: Google Site Reliability Engineering" and Zhao Cheng’s "SRE Practice Handbook".

service-level-operator

The Service Level Operator measures SLI/SLO for applications running in Kubernetes and can display the results in Grafana.

Example ServiceLevel custom resource:

apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
  - name: "9999_http_request_lt_500"
    description: 99.99% of requests must be served with <500 status code.
    disable: false
    availabilityObjectivePercent: 99.99
    serviceLevelIndicator:
      prometheus:
        address: http://myprometheus:9090
        totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
        errorQuery: sum(increase(http_request_total{host="awesome_service_io",code=~"5.."}[2m]))
    output:
      prometheus:
        labels:
          team: a-team
          iteration: "3"

Key fields:

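availabilityObjectivePercent – the SLO target, as a percentage.

totalQuery – PromQL query that returns the total request count.

errorQuery – PromQL query that returns the failed request count.

The operator computes the SLI from the results of these two queries and exports it, together with the objective, as Prometheus metrics.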
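The arithmetic behind these fields can be sketched as follows. This is an illustration of the error-ratio idea, not the operator's actual code; the function names are hypothetical, and treating an interval with no traffic as error-free is an assumption.

```python
def error_ratio(total: float, errors: float) -> float:
    """Fraction of failed requests for one evaluation interval."""
    if total <= 0:
        # Assumption: an interval without traffic counts as error-free.
        return 0.0
    return errors / total


def meets_objective(total: float, errors: float, objective_percent: float) -> bool:
    """True when the observed error ratio fits within the SLO's allowance."""
    allowed_error_ratio = 1 - objective_percent / 100
    return error_ratio(total, errors) <= allowed_error_ratio


if __name__ == "__main__":
    # Hypothetical results of totalQuery and errorQuery for one interval.
    print(meets_objective(total=20_000, errors=1, objective_percent=99.99))
    print(meets_objective(total=10_000, errors=5, objective_percent=99.99))
```

With a 99.99% objective the allowance is one failed request in ten thousand, so the first call passes and the second does not.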

Deploying service-level-operator

1. Create RBAC

apiVersion: v1
kind: ServiceAccount
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: service-level-operator
  labels:
    app: service-level-operator
    component: app
rules:
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["*"]
  - apiGroups: ["monitoring.spotahome.com"]
    resources: ["servicelevels", "servicelevels/status"]
    verbs: ["*"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: service-level-operator
subjects:
  - kind: ServiceAccount
    name: service-level-operator
    namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: service-level-operator

2. Create Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-level-operator
      component: app
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: service-level-operator
        component: app
    spec:
      serviceAccountName: service-level-operator
      containers:
      - name: app
        imagePullPolicy: Always
        image: quay.io/spotahome/service-level-operator:latest
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: http
        livenessProbe:
          httpGet:
            path: /healthz/live
            port: http
        resources:
          limits:
            cpu: 220m
            memory: 254Mi
          requests:
            cpu: 120m
            memory: 128Mi

3. Create Service

apiVersion: v1
kind: Service
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
spec:
  ports:
  - port: 80
    protocol: TCP
    name: http
    targetPort: http
  selector:
    app: service-level-operator
    component: app

4. Create Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: service-level-operator
  namespace: monitoring
  labels:
    app: service-level-operator
    component: app
    prometheus: myprometheus
spec:
  selector:
    matchLabels:
      app: service-level-operator
      component: app
  namespaceSelector:
    matchNames:
    - monitoring
  endpoints:
  - port: http
    interval: 10s

After deployment, the operator appears as a target in Prometheus.
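Besides checking the Prometheus targets page, the `up` series can be queried through the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes the scrape job is named `service-level-operator`, which may differ in your setup, and that Prometheus is reachable at the address you pass in; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def targets_up(payload: dict) -> dict:
    """Map each instance to True/False from an instant-vector response for `up`."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("instance", "unknown"): sample["value"][1] == "1"
        for sample in payload["data"]["result"]
    }


def check_operator(base_url: str, job: str = "service-level-operator") -> dict:
    """Query Prometheus for the operator's `up` series (job name is an assumption)."""
    url = build_query_url(base_url, f'up{{job="{job}"}}')
    with urllib.request.urlopen(url, timeout=5) as response:
        return targets_up(json.load(response))
```

Calling `check_operator("http://prometheus-k8s.monitoring.svc:9090")` from inside the cluster should return one entry per operator pod; an empty dictionary suggests the ServiceMonitor was not picked up.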

Defining ServiceLevel for an Application

Example for the Grafana service ("four nines" SLO):

apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: prometheus-grafana-service
  namespace: monitoring
spec:
  serviceLevelObjectives:
  - name: "9999_http_request_lt_500"
    description: 99.99% of requests must be served with <500 status code.
    disable: false
    availabilityObjectivePercent: 99.99
    serviceLevelIndicator:
      prometheus:
        address: http://prometheus-k8s.monitoring.svc:9090
        totalQuery: sum(increase(http_request_total{service="grafana"}[2m]))
        errorQuery: sum(increase(http_request_total{service="grafana",code=~"5.."}[2m]))
    output:
      prometheus:
        labels:
          team: prometheus-grafana
          iteration: "3"

Import Grafana dashboard ID 8793 to visualize SLI, error budget consumption, and remaining budget.

Alert Rules for SLO Breaches

groups:
- name: slo.rules
  rules:
  - alert: SLOErrorRateTooFast1h
    expr: |
      (increase(service_level_sli_result_error_ratio_total[1h]) /
       increase(service_level_sli_result_count_total[1h])) > (1 - service_level_slo_objective_ratio) * 14.6
    labels:
      severity: critical
      team: a-team
    annotations:
      summary: The monthly SLO error budget consumed for 1h is greater than 2%
      description: The error rate for 1h exceeds the 2% monthly budget.
  - alert: SLOErrorRateTooFast6h
    expr: |
      (increase(service_level_sli_result_error_ratio_total[6h]) /
       increase(service_level_sli_result_count_total[6h])) > (1 - service_level_slo_objective_ratio) * 6
    labels:
      severity: critical
      team: a-team
    annotations:
      summary: The monthly SLO error budget consumed for 6h is greater than 5%
      description: The error rate for 6h exceeds the 5% monthly budget.

These rules fire when the error budget is being consumed faster than the burn-rate thresholds derived from Google’s baseline.

Google’s Baseline for SLO Error Budgets

1h window – 2% of the monthly error budget consumed → burn rate 730 × 2 / 100 = 14.6 (730 is the number of hours in an average month).

6h window – 5% of the monthly error budget consumed → burn rate 730 / 6 × 5 / 100 ≈ 6.

3‑day window – 10% of the monthly error budget consumed → burn rate 30 / 3 × 10 / 100 = 1.
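The factors 14.6, 6, and 1 follow from a single formula: the share of the budget consumed, multiplied by the period length and divided by the window length. The sketch below reproduces them and shows the error-ratio threshold they give for a 99.99% objective; the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # 365 * 24 / 12, the average month used for the 1h and 6h windows


def burn_rate(budget_fraction: float, window_hours: float,
              period_hours: float = HOURS_PER_MONTH) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return budget_fraction * period_hours / window_hours


def alert_threshold(objective_ratio: float, rate: float) -> float:
    """Error ratio above which a burn-rate alert with this factor fires."""
    return (1 - objective_ratio) * rate


if __name__ == "__main__":
    print(burn_rate(0.02, 1))    # 1h window, 2% of the monthly budget
    print(burn_rate(0.05, 6))    # 6h window, 5% of the monthly budget
    # The 3-day figure is computed against a 30-day month (720 hours).
    print(burn_rate(0.10, 72, period_hours=720))
    print(alert_threshold(0.9999, 14.6))
```

For a 99.99% objective, the 1h alert fires once the error ratio over the last hour exceeds about 0.146%, which is 14.6 times the 0.01% the objective allows.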

Measuring System Availability

Two common approaches:

Time‑based: Availability = Service Time / (Service Time + Downtime).

Request‑based: Availability = Successful Requests / Total Requests.

SRE practice usually prefers the request‑based metric, but a comprehensive view should also consider latency, error rate, and finer‑grained SLIs for critical components.
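The two definitions can give very different numbers for the same incident. The sketch below uses made-up figures (a 10-minute outage in a 30-day month that happens to hit 5% of the month's requests) to illustrate the difference; the function names are hypothetical.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability = Service Time / (Service Time + Downtime)."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be longer than zero")
    return uptime_seconds / total


def request_based_availability(successful: int, total: int) -> float:
    """Availability = Successful Requests / Total Requests."""
    if total <= 0:
        raise ValueError("at least one request is required")
    return successful / total


if __name__ == "__main__":
    month = 30 * 24 * 3600   # seconds in a 30-day window
    outage = 10 * 60         # hypothetical 10-minute outage at peak traffic
    print(f"time-based:    {time_based_availability(month - outage, outage):.5%}")
    # Hypothetical: the outage hit 50,000 of 1,000,000 requests.
    print(f"request-based: {request_based_availability(950_000, 1_000_000):.5%}")
```

The time-based figure stays above 99.97%, while the request-based figure drops to 95%, which is closer to what users experienced.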

References

[1] "SRE Practice Handbook" – Zhao Cheng

[2] "SRE: Google Site Reliability Engineering"

[3] https://github.com/spotahome/service-level-operator
