How to Design Effective SLOs and SLAs: A Technical Deep Dive
This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.
Preface
SLO and SLA are familiar terms: they stand for Service Level Objective and Service Level Agreement.
In the cloud era, major providers publish SLA clauses (e.g., Amazon EC2 and S3). This article examines how SLAs are defined from a technical perspective.
An SLA cannot be separated from its SLOs, and behind both sits a third, less well-known concept: the SLI (Service Level Indicator). Good SLOs and SLIs are essential for an enforceable SLA.
SLI/SLO/SLA only make sense when tied to a concrete service.
Service
What is a service?
Put simply, a service is any useful functionality offered to a customer.
A service is provided by a service provider, typically a combination of people and software that runs on compute resources and may depend on other systems.
The customer is the person or organization that consumes the service.
SLI
An SLI is a carefully defined metric that reflects a specific characteristic of the system; defining one is a complex process.
Key questions for defining an SLI:
What metric should be measured?
What is the system state during measurement?
How should the metric be aggregated?
Does the metric accurately describe service quality?
How trustworthy is the metric?
Common measurement dimensions
Performance
Latency
Throughput
QPS
Freshness
Availability
Uptime
Failure time/frequency
Reliability
Quality
Accuracy
Correctness
Completeness
Coverage
Relevance
Internal metrics
Queue length
RAM usage
Human factors
Time to respond
Time to fix
Fix rate
Example: Hotmail downtime SLI
The error rate is computed from all errors returned to users.
When the error rate exceeds X %, the service is considered down and the downtime clock starts.
The period counts as downtime only if the error rate stays above the threshold for more than Y minutes.
Intermittent outages shorter than Y minutes are ignored.
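The rule above can be sketched in a few lines. This is a minimal illustration, not Hotmail's actual implementation: it assumes the error rate is sampled once per minute, and the function name, parameter names, and sample data are all hypothetical. A run of exactly Y minutes is treated as ignored, which is one possible reading of the rule.

```python
from typing import List

def downtime_minutes(error_rates: List[float], threshold: float, min_minutes: int) -> int:
    """Count downtime minutes from per-minute error rates.

    A run of consecutive minutes above `threshold` counts as downtime only
    if it lasts longer than `min_minutes`; shorter runs are ignored.
    """
    total = 0
    run = 0  # length of the current run of minutes above the threshold
    for rate in error_rates:
        if rate > threshold:
            run += 1
        else:
            if run > min_minutes:
                total += run
            run = 0
    if run > min_minutes:  # close a run that reaches the end of the data
        total += run
    return total

# Hypothetical data: one 1-minute blip (ignored) and one 4-minute outage (counted).
rates = [0.01, 0.20, 0.01, 0.30, 0.30, 0.30, 0.30, 0.01]
print(downtime_minutes(rates, threshold=0.05, min_minutes=3))  # prints 4
```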
System state during measurement
How malformed, failed, or timed-out requests are treated.
System load at measurement time (e.g., peak load).
Origin of measurement (client vs. server).
Time window (working days only, 24/7, inclusion of planned maintenance).
Aggregating the metric
Define the time interval (rolling window vs. monthly).
Choose an average or a percentile.
Example: 95th-percentile resolution time for tickets
Metric definition: the time from ticket creation to resolution.
Measurement method: use the timestamps recorded on the tickets.
Scope: count business hours only and exclude holidays.
Data: the 95th percentile of resolution time over a one-week sliding window.
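As a rough sketch, the ticket example could be computed as follows. The function names and sample tickets are hypothetical, the percentile uses the nearest-rank method (one of several common definitions), and the business-hours and holiday scoping is deliberately left out to keep the example short.

```python
import math
from datetime import datetime, timedelta
from typing import List, Tuple

def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no data in window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def resolution_p95(tickets: List[Tuple[datetime, datetime]], now: datetime) -> float:
    """95th-percentile resolution time (hours) for tickets resolved in the last 7 days."""
    window_start = now - timedelta(days=7)
    hours = [
        (resolved - created).total_seconds() / 3600
        for created, resolved in tickets
        if window_start <= resolved <= now  # one-week sliding window
    ]
    return percentile_nearest_rank(hours, 95)

# Hypothetical data: 20 tickets resolved yesterday, taking 1 to 20 hours each.
now = datetime(2024, 1, 8)
tickets = [(now - timedelta(days=1, hours=h), now - timedelta(days=1)) for h in range(1, 21)]
print(resolution_p95(tickets, now))  # prints 19.0
```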
Does the metric accurately describe service quality?
Performance: timeliness, bias.
Accuracy: precision, coverage, data stability.
Completeness: data loss, invalid or outlier data.
Metric trustworthiness
Both provider and customer must accept it.
It should be independently verifiable (e.g., third‑party audit).
Client‑side vs. server‑side measurement and sampling interval.
How error requests are counted.
SLO
SLO (Service Level Objective) defines the expected state of a service and contains all information describing the desired functionality.
Providers use SLOs to set system expectations; developers implement code to meet them; customers rely on SLOs for business decisions. SLOs do not specify consequences for missed targets.
SLOs are expressed using SLIs, for example:
Average QPS over each minute > 100 k
99 % of requests have latency < 500 ms
99 % of minutes have bandwidth > 200 MB/s
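To make the link between SLI and SLO concrete, the latency objective above can be checked against a batch of measurements. The following is only a sketch: the function name and defaults are hypothetical, and treating an empty window as compliant is an assumption that a real service would need to decide for itself.

```python
from typing import Sequence

def latency_slo_met(latencies_ms: Sequence[float], limit_ms: float = 500.0,
                    target: float = 0.99) -> bool:
    """True if at least `target` of the requests finished in under `limit_ms`."""
    if not latencies_ms:
        return True  # assumption: no traffic means no violation
    fast = sum(1 for ms in latencies_ms if ms < limit_ms)
    return fast / len(latencies_ms) >= target

# Hypothetical window: 990 fast requests and 10 slow ones, exactly 99 % under 500 ms.
window = [120.0] * 990 + [900.0] * 10
print(latency_slo_met(window))  # prints True
```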
Best practices for setting SLOs:
Specify the calculation time window.
Use consistent windows (e.g., X‑hour rolling, quarterly).
Include an exemption clause, such as “achieve the SLO 95 % of the time.”
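An exemption clause of this kind can be evaluated by checking the SLO once per window and then asking in what fraction of windows it held. The sketch below assumes the per-window results are already available as booleans; the function name and the sample data are hypothetical.

```python
from typing import Sequence

def slo_with_exemption(window_results: Sequence[bool],
                       required_fraction: float = 0.95) -> bool:
    """True if the SLO was met in at least `required_fraction` of the windows."""
    if not window_results:
        raise ValueError("no windows to evaluate")
    met = sum(window_results)  # True counts as 1
    return met / len(window_results) >= required_fraction

# Hypothetical quarter: the SLO held in 19 of 20 windows, i.e. 95 % of the time.
print(slo_with_exemption([True] * 19 + [False]))  # prints True
```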
If a service is defining its first SLO, follow these principles:
Measure the current system state.
Set expectations, not guarantees.
Early SLOs should not be used as strict quality enforcement tools.
Iteratively improve SLOs (lower response time, higher throughput, etc.).
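Maintain a safety buffer (internal SLOs stricter than the external promises).
Avoid over‑achieving: if the service consistently beats its SLO by a wide margin, customers come to depend on the higher level, so schedule planned downtime to keep the surplus in check.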
The overall SLO can be expressed as a weighted sum of the individual service SLOs:
Total SLO = service1.SLO1 × weight1 + service2.SLO2 × weight2 + …
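The weighted sum is straightforward to compute. In the sketch below, the requirement that the weights sum to 1 is an assumption added here so that the result stays on the same scale as the inputs; the article's formula does not state it. The function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple

def total_slo(parts: Sequence[Tuple[float, float]]) -> float:
    """Weighted sum of (slo_value, weight) pairs.

    Assumption: weights are normalized so that they sum to 1.
    """
    weight_sum = sum(weight for _, weight in parts)
    if abs(weight_sum - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(value * weight for value, weight in parts)

# Hypothetical services: 99.9 % and 99.0 % availability, weighted equally.
print(round(total_slo([(99.9, 0.5), (99.0, 0.5)]), 2))  # prints 99.45
```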
Benefits of SLOs for customers and providers include predictable quality, better cost/benefit trade‑offs, improved risk control, and faster incident response.
To ensure SLOs are met, a control loop is needed to monitor SLIs, compare them against targets, adjust objectives or systems, and repeat the process.
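A bare-bones version of such a control loop might look like this. It is only a sketch: `read_sli` and `on_miss` are hypothetical callbacks standing in for a real monitoring source and a real remediation or alerting step, the SLI is assumed to be "higher is better", and a production loop would wait between iterations instead of running back to back.

```python
from typing import Callable

def control_loop(read_sli: Callable[[], float], target: float,
                 on_miss: Callable[[float], None], iterations: int) -> int:
    """Run `iterations` monitor/compare/act cycles; return the number of misses."""
    misses = 0
    for _ in range(iterations):
        value = read_sli()   # monitor the SLI
        if value < target:   # compare against the objective
            on_miss(value)   # act: alert, scale up, or revisit the objective
            misses += 1
    return misses

# Hypothetical availability samples; one of them falls below the 0.99 target.
samples = iter([0.999, 0.98, 0.995])
alerts = []
misses = control_loop(lambda: next(samples), target=0.99,
                      on_miss=alerts.append, iterations=3)
print(misses, alerts)  # prints 1 [0.98]
```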
SLA
SLA is a two‑party contract that both sides must agree to and honor; it is a critical signal of service quality and typically involves product and legal teams.
Simple formula: SLA = SLO + Consequences
Actions taken when SLOs are not met (partial failures, etc.).
Implementation of consequences, often monetary rewards/penalties.
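One common way to implement monetary consequences is a tiered service-credit schedule keyed on measured availability. The tiers below are invented purely for illustration and do not reflect any provider's actual SLA; the function name is likewise hypothetical.

```python
def service_credit_percent(measured_availability: float) -> int:
    """Map measured monthly availability (in percent) to a credit percentage.

    The tiers are hypothetical and exist only to illustrate SLA = SLO + consequences.
    """
    tiers = [(99.9, 0), (99.0, 10), (95.0, 25)]  # (minimum availability %, credit %)
    for minimum, credit in tiers:
        if measured_availability >= minimum:
            return credit
    return 100  # below every tier: full credit

print(service_credit_percent(99.95))  # prints 0, SLO met
print(service_credit_percent(99.5))   # prints 10
print(service_credit_percent(90.0))   # prints 100
```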
An SLA helps allocate resources rationally: keep investing in a service only while the marginal benefit of doing so exceeds the benefit of spending the same resources elsewhere, and stop once it falls below that.
For example, improving availability from 99.9 % to 99.99 % must be weighed against the resources required, to decide whether the fourth nine is justified.
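It helps to translate availability targets into a downtime budget before weighing the cost. The sketch below assumes a 30-day month; the function name is hypothetical. Under that assumption, 99.9 % allows about 43.2 minutes of downtime per month, while 99.99 % allows only about 4.3 minutes.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given availability target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100

for target in (99.9, 99.99):
    print(f"{target} %: {allowed_downtime_minutes(target):.1f} minutes per 30 days")
```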