
How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Efficient Ops

Preface

SLO and SLA are familiar terms: they stand for Service Level Objective and Service Level Agreement.

In the cloud era, major providers publish SLA clauses (e.g., Amazon EC2 and S3). This article examines how SLAs are defined from a technical perspective.

SLAs are inseparable from SLOs, and both rest on a lesser‑known concept, the SLI (Service Level Indicator). Good SLOs and SLIs are essential for an enforceable SLA.

SLI/SLO/SLA only make sense when tied to a concrete service.

Service

What is a service?

Simply put, a service is any useful functionality offered to a customer.

A service is provided by a service provider, typically a combination of people and software that runs on compute resources and may depend on other systems.

The customer is the person or organization that consumes the service.

SLI

An SLI is a carefully defined metric that reflects specific system characteristics. Defining one is a complex process.

Key questions for defining an SLI:

What metric should be measured?

What is the system state during measurement?

How should the metric be aggregated?

Does the metric accurately describe service quality?

How trustworthy is the metric?

Common measurement dimensions

Performance: latency, throughput, QPS, freshness.

Availability: uptime, failure time/frequency, reliability.

Quality: accuracy, correctness, completeness, coverage, relevance.

Internal metrics: queue length, RAM usage.

Human factors: time to response, time to fix, fix rate.

Example: Hotmail downtime SLI

The error rate is computed over all errors returned to users.

When the error rate exceeds X %, the service is considered down and the downtime clock starts.

The period counts as downtime only if the error rate stays above the threshold for more than Y minutes.

Intermittent outages shorter than Y minutes are ignored.
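The Hotmail rule above can be sketched in a few lines. This is only an illustration: it assumes the error rate is sampled once per minute, treats "more than Y minutes" as strictly greater than Y, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def downtime_minutes(error_rates: Sequence[float],
                     threshold: float,
                     min_duration: int) -> int:
    """Count downtime minutes from per-minute error rates.

    A run of consecutive minutes whose error rate exceeds `threshold`
    counts as downtime only if it lasts longer than `min_duration` minutes.
    """
    total = 0
    run = 0  # length of the current run of bad minutes
    for rate in error_rates:
        if rate > threshold:
            run += 1
        else:
            if run > min_duration:
                total += run
            run = 0
    if run > min_duration:  # a bad run that extends to the end of the data
        total += run
    return total


# Hypothetical values: X = 5 % error rate, Y = 2 minutes.
rates = [0.01, 0.10, 0.12, 0.01, 0.20, 0.30, 0.25, 0.02]
print(downtime_minutes(rates, threshold=0.05, min_duration=2))  # prints 3
```

The two-minute blip is ignored, while the three-minute run is counted in full.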

System state during measurement

Whether the request is malformed, failed, or timed out.

System load at measurement time (e.g., peak load).

Origin of measurement (client vs. server).

Time window (working days only, 24/7, inclusion of planned maintenance).

Aggregating the metric

Define the aggregation interval (a rolling window vs. a calendar month).

Choose between an average and a percentile. For example, a 95th‑percentile resolution time for tickets could be specified as follows:

Metric definition: measure the time from ticket creation to resolution.

Measurement method: use the timestamps recorded on the tickets.

Scope: include only business hours and exclude holidays.

Data: use a one‑week sliding window and take the 95th percentile of resolution time.
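A minimal sketch of the ticket aggregation, under stated assumptions: tickets are `(created, resolved)` datetime pairs, the percentile uses the nearest-rank method, and the business-hours and holiday filter is deliberately left out to keep the example short. All names here are hypothetical.

```python
import math
from datetime import datetime, timedelta


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct percent of the data at or below it."""
    if not values:
        raise ValueError("no data in window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_resolution_hours(tickets, now, window=timedelta(days=7)):
    """95th-percentile resolution time, in hours, for tickets resolved
    inside the sliding window that ends at `now`.

    Note: this measures wall-clock time; restricting it to business
    hours would need a working-calendar helper that is not shown here.
    """
    start = now - window
    durations = [
        (resolved - created).total_seconds() / 3600
        for created, resolved in tickets
        if start <= resolved <= now
    ]
    return percentile(durations, 95)


# Hypothetical data: 20 recent tickets taking 1..20 hours, plus one old ticket
# that falls outside the one-week window.
now = datetime(2024, 1, 8)
recent = datetime(2024, 1, 5)
tickets = [(recent - timedelta(hours=h), recent) for h in range(1, 21)]
tickets.append((datetime(2023, 12, 1), datetime(2023, 12, 5)))
print(p95_resolution_hours(tickets, now))
```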

Does the metric accurately describe service quality?

Performance: timeliness, bias.

Accuracy: precision, coverage, data stability.

Completeness: data loss, invalid or outlier data.

Metric trustworthiness

Both provider and customer must accept it.

It should be independently verifiable (e.g., third‑party audit).

Client‑side vs. server‑side measurement and sampling interval.

How error requests are counted.

SLO

SLO (Service Level Objective) defines the expected state of a service and contains all information describing the desired functionality.

Providers use SLOs to set system expectations; developers implement code to meet them; customers rely on SLOs for business decisions. SLOs do not specify consequences for missed targets.

SLOs are expressed using SLIs, for example:

Per‑minute average QPS > 100 k

99 % of requests have latency < 500 ms

99 % of minutes have bandwidth > 200 MB/s
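Objectives of this shape reduce to "at least some fraction of samples satisfies a condition". The sketch below shows one way to evaluate them; the function names and sample data are hypothetical, and the thresholds are the ones from the examples above.

```python
def fraction_meeting(values, predicate):
    """Fraction of samples for which `predicate` holds."""
    values = list(values)
    if not values:
        raise ValueError("no samples")
    return sum(1 for v in values if predicate(v)) / len(values)


def latency_slo_met(latencies_ms, limit_ms=500, target=0.99):
    """True if at least `target` of requests finished in under `limit_ms`."""
    return fraction_meeting(latencies_ms, lambda v: v < limit_ms) >= target


def bandwidth_slo_met(per_minute_mb_s, floor_mb_s=200, target=0.99):
    """True if at least `target` of minutes had bandwidth above `floor_mb_s`."""
    return fraction_meeting(per_minute_mb_s, lambda v: v > floor_mb_s) >= target


# Hypothetical samples: 99 fast requests and 1 slow one.
latencies = [100] * 99 + [900]
print(latency_slo_met(latencies))
```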

Best practices for setting SLOs:

Specify the calculation time window.

Use consistent windows (e.g., X‑hour rolling, quarterly).

Include an exemption clause, such as “achieve the SLO 95 % of the time.”
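One possible reading of these practices combines a rolling window with an exemption clause: compute the SLI over every full rolling window, then require that the target is met in at least 95 % of those windows. The sketch assumes each interval (for example, one minute) has already been classified as good or bad; the names are hypothetical.

```python
from collections import deque


def rolling_fractions(good, window):
    """Fraction of good intervals in each full rolling window."""
    recent = deque(maxlen=window)
    good_count = 0
    fractions = []
    for flag in good:
        if len(recent) == window:
            good_count -= recent[0]  # this value is evicted by the append below
        recent.append(1 if flag else 0)
        good_count += recent[-1]
        if len(recent) == window:
            fractions.append(good_count / window)
    return fractions


def slo_with_exemption(good, window, target, exemption=0.95):
    """True if the rolling SLI reaches `target` in at least `exemption`
    of all full windows."""
    fractions = rolling_fractions(good, window)
    if not fractions:
        raise ValueError("not enough data for one full window")
    met = sum(1 for f in fractions if f >= target)
    return met / len(fractions) >= exemption


print(rolling_fractions([1, 1, 0, 1, 1], window=2))
```

Keeping the window length fixed across services makes the resulting numbers comparable, which is the point of using consistent windows.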

If a service is defining its first SLO, follow these principles:

Measure the current system state.

Set expectations, not guarantees.

Early SLOs should not be used as strict quality enforcement tools.

Iteratively improve SLOs (lower response time, higher throughput, etc.).

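Maintain a safety buffer (internal SLOs stricter than external promises).

Avoid over‑achieving: if a service consistently beats its SLO, customers come to depend on the surplus, so schedule regular downtime to keep expectations in line with the objective.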

Overall SLO can be a weighted sum of individual service SLOs:

Total SLO = service1.SLO1 × weight1 + service2.SLO2 × weight2 + …
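The weighted sum can be computed directly. In this sketch each component is an `(attainment, weight)` pair; requiring the weights to sum to 1 is an assumption added here so the result stays on the same scale as the inputs, and the numbers are hypothetical.

```python
def total_slo(components):
    """Weighted sum of per-service SLO values.

    components: list of (slo_value, weight) pairs.
    Assumption: weights must sum to 1.
    """
    total_weight = sum(weight for _, weight in components)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(value * weight for value, weight in components)


# Hypothetical example: two services weighted equally.
print(total_slo([(0.999, 0.5), (0.99, 0.5)]))
```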

Benefits of SLOs for customers and providers include predictable quality, better cost/benefit trade‑offs, improved risk control, and faster incident response.

To ensure SLOs are met, a control loop is needed to monitor SLIs, compare them against targets, adjust objectives or systems, and repeat the process.
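The control loop can be pictured as a measure, compare, act cycle. The sketch below is schematic: `measure` and `act` are hypothetical callables standing in for a monitoring query and for whatever response the team chooses (alerting, scaling, or revisiting the objective).

```python
def control_step(measure, target, act):
    """One iteration: read the SLI, compare it with the target,
    and call `act` if the target is missed."""
    value = measure()
    ok = value >= target
    if not ok:
        act(value, target)
    return value, ok


def run_loop(measure, target, act, iterations):
    """Repeat the control step a fixed number of times."""
    return [control_step(measure, target, act) for _ in range(iterations)]


# Hypothetical readings of an availability SLI.
readings = iter([0.999, 0.98, 0.995])
missed = []
history = run_loop(lambda: next(readings), 0.99,
                   lambda value, target: missed.append(value), 3)
print(history, missed)
```

In practice the loop would run on a schedule rather than for a fixed number of iterations.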

SLA

An SLA is a two‑party contract that both sides must agree to and honor; it is a critical signal of service quality and typically involves product and legal teams.

Simple formula: SLA = SLO + Consequences

The consequences specify:

The actions taken when SLOs are not met (including partial failures).

How the consequences are implemented, most often as monetary rewards or penalties.
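To make "SLO + Consequences" concrete, here is a sketch of a tiered service-credit lookup. The tiers and percentages are invented for illustration and are not taken from any provider's actual SLA.

```python
def service_credit_percent(availability, tiers):
    """Return the credit for the first tier whose floor is met.

    tiers: list of (min_availability_percent, credit_percent),
    sorted from the highest floor to the lowest.
    """
    for floor, credit in tiers:
        if availability >= floor:
            return credit
    raise ValueError("no tier covers this availability")


# Hypothetical tiers: no credit at or above 99.9 %, 10 % credit down to 99.0 %,
# and 30 % credit below that.
TIERS = [(99.9, 0), (99.0, 10), (0.0, 30)]
print(service_credit_percent(99.5, TIERS))  # prints 10
```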

An SLA helps allocate resources rationally: investment in a service should stop once the marginal benefit of adding more resources to it falls below the benefit of allocating them elsewhere.

For example, improving availability from 99.9 % to 99.99 % must be weighed against the required resources to decide whether “four‑nines” are justified.
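The trade-off is easier to judge once availability is converted into allowed downtime. The sketch assumes a 30-day measurement period, which is a choice made here rather than something the article specifies.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Downtime budget, in minutes, implied by an availability target."""
    period_minutes = period_days * 24 * 60
    return (1 - availability_percent / 100) * period_minutes


for target in (99.9, 99.99):
    print(target, round(allowed_downtime_minutes(target), 2))
```

Over 30 days, 99.9 % leaves roughly 43 minutes of downtime while 99.99 % leaves roughly 4 minutes, so the extra nine removes about 39 minutes of slack that must be paid for with redundancy, automation, or faster response.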

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
