
How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long-term insights, integrated workflows, and a scalable architecture. It also shares real-world examples and lessons learned from onboarding more than a thousand services.

MaGe Linux Operations

May 7, 2023


Defining SLIs and SLOs for each service and making them visible across the company turns SRE practice into something teams can act on. This article introduces SLICK, the implementation of that idea at Meta (Facebook).

Meta continuously engages with users and communities to provide reliable support. In a fast-moving, large-scale environment where thousands of engineers deploy code frequently, setting clear expectations for each product, feature, and service is essential for visualizing the user experience and analyzing system bottlenecks.

To address this, Meta studied Service‑Level Indicators (SLIs) and Service‑Level Objectives (SLOs) and built SLICK, an “SLO store” that centralizes SLI/SLO definitions, offers high‑retention granular data, and integrates SLOs into company workflows.

Before SLICK, SLOs were scattered across custom dashboards, documents, and other tools. Finding them was time-consuming, and limited data retention made long-term analysis impossible. With SLICK, teams can:

Define SLOs for services consistently.

Collect minute‑level metrics retained for up to two years.

Provide standard visualizations and insights for SLI/SLO metrics.

Send periodic reliability reports to internal teams for regular reliability checks.

Discoverability

SLICK defines a standard model that lets everyone discuss reliability using the same terminology, enabling new service teams to adopt company‑wide standards early in design.

Given only a service name, anyone can find that service's reliability metrics and performance data through SLICK's built-in service index, which links to standardized dashboards. A single click shows whether the service meets user expectations and gives a starting point for troubleshooting.

Above: SLICK SLO index search example.

Long‑term Insights

Reliability issues can stem from a single faulty deployment or accumulate gradually. SLICK stores up to two years of full‑granularity metric data in sharded MySQL databases, refreshed hourly, enabling engineers, TPMs, and leaders to see degradation trends that would otherwise be missed.
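To give a sense of scale, the arithmetic behind this retention policy can be sketched as follows. The sketch assumes one data point per minute per SLI series (the minute-level granularity mentioned elsewhere in this article) and a 365-day year; the function names are illustrative and not part of SLICK.

```python
MINUTES_PER_DAY = 24 * 60


def points_per_series(retention_days: int) -> int:
    """Data points kept for one SLI series at one point per minute."""
    return retention_days * MINUTES_PER_DAY


def points_per_hourly_refresh(num_series: int) -> int:
    """New data points written by one hourly refresh across all series."""
    return num_series * 60


two_years = points_per_series(2 * 365)
print(f"{two_years:,} points per series over two years")
# prints "1,051,200 points per series over two years"
```

At roughly a million points per series, keeping full granularity for every SLI is what makes a sharded store a natural fit.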

Workflows

SLOs and SLIs are integrated into common workflows so that anyone can plan, evaluate, and act on reliability impact. During large‑scale incidents, teams can assess user‑experience impact via real‑time SLOs, and major incidents can trigger SLO‑driven response processes.

SLICK Onboarding

Service teams add themselves to SLICK via a UI or a simple DSL configuration file that specifies the service name and queries for SLI time series and associated SLO.

After submitting the configuration, SLICK automatically indexes the service, creates a dedicated dashboard, and begins data collection for long‑term observation.
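The article does not show SLICK's DSL syntax, so the following is only a sketch of the kind of information such a configuration carries: a service name, one or more SLIs with the query that produces their time series, and the SLO attached to each. Every class name, field, and the placeholder query string are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    target: float      # fraction of good events required, e.g. 0.999
    window_days: int   # evaluation window (hypothetical field)


@dataclass(frozen=True)
class SLI:
    name: str
    query: str         # placeholder string; not a real SLICK query
    slo: SLO


@dataclass(frozen=True)
class ServiceConfig:
    service_name: str
    slis: tuple = ()

    def validate(self) -> list:
        """Return a list of problems; an empty list means the config looks sane."""
        problems = []
        if not self.service_name.strip():
            problems.append("service_name must not be empty")
        if not self.slis:
            problems.append("at least one SLI is required")
        for sli in self.slis:
            if not sli.query.strip():
                problems.append(f"{sli.name}: query must not be empty")
            if not 0.0 < sli.slo.target <= 1.0:
                problems.append(f"{sli.name}: target must be in (0, 1]")
            if sli.slo.window_days <= 0:
                problems.append(f"{sli.name}: window_days must be positive")
        return problems


example = ServiceConfig(
    service_name="example_service",
    slis=(SLI("read_availability", "placeholder_query", SLO(0.999, 28)),),
)
print(example.validate())  # prints []
```

Validating the configuration before it is accepted mirrors the idea of testing config changes from the CLI, although how SLICK itself validates configs is not described here.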

Using SLICK

1) Dashboard

The dashboard shows real‑time SLI data and historical trends based on high‑retention metrics.

Left: full‑granularity SLI time series; Right: weekly aggregated SLI values versus SLO target.
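A weekly view like the right-hand panel can be derived from minute-level data. The sketch below assumes each data point carries a count of good events and total events, and that the weekly SLI is the ratio of the summed counts; the article does not say how SLICK aggregates, so treat both the input shape and the function name as hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def weekly_sli(points, slo_target):
    """Aggregate (timestamp, good, total) points into ISO weeks.

    Returns a list of ((iso_year, iso_week), sli_value, met_slo), oldest first.
    """
    buckets = defaultdict(lambda: [0, 0])
    for ts, good, total in points:
        key = tuple(ts.isocalendar())[:2]  # (ISO year, ISO week number)
        buckets[key][0] += good
        buckets[key][1] += total

    rows = []
    for key in sorted(buckets):
        good, total = buckets[key]
        value = good / total if total else None  # no traffic: SLI undefined
        rows.append((key, value, value is not None and value >= slo_target))
    return rows


# Two weeks of sample data: a healthy week followed by a degraded one.
start = datetime(2023, 5, 1)
sample = [(start + timedelta(minutes=i), 999, 1000) for i in range(3)]
sample += [(start + timedelta(days=7, minutes=i), 990, 1000) for i in range(3)]
for week, value, met in weekly_sli(sample, slo_target=0.995):
    print(week, f"{value:.3%}", "met" if met else "missed")
```

Summing counts before dividing weights each minute by its traffic; averaging per-minute ratios instead would give quiet minutes the same weight as busy ones.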

2) Periodic Reports

SLICK generates regular SLO performance summary reports for internal teams, helping them focus on regressions and conduct post‑mortems.
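As a rough illustration of what such a summary might contain, the sketch below builds one line per service from its two most recent weekly SLI values and lists services that missed their SLO first. The report format, the inputs, and the function name are all assumptions; the article does not describe the real report layout.

```python
def build_report(weekly_values, targets):
    """Build report lines from weekly SLI values.

    weekly_values: {service: [..., previous_week, latest_week]}
    targets:       {service: slo_target}
    """
    entries = []
    for service, values in weekly_values.items():
        if len(values) < 2:
            continue  # need two weeks to show a week-over-week change
        latest, previous = values[-1], values[-2]
        target = targets[service]
        met = latest >= target
        status = "MET" if met else "MISSED"
        text = (
            f"{service}: {latest:.2%} vs target {target:.2%} [{status}], "
            f"week-over-week {latest - previous:+.2%}"
        )
        entries.append((met, service, text))
    # False sorts before True, so services that missed their SLO come first.
    entries.sort(key=lambda entry: (entry[0], entry[1]))
    return [text for _, _, text in entries]


report = build_report(
    {"svc_a": [0.9995, 0.9996], "svc_b": [0.9992, 0.9981]},
    {"svc_a": 0.999, "svc_b": 0.999},
)
for line in report:
    print(line)
```

Putting missed SLOs at the top keeps the reader's attention on regressions, which is the stated purpose of the reports.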

3) CLI

SLICK provides a command‑line interface for tasks such as back‑filling data, generating reports on demand, or testing configuration changes.

SLICK Architecture

Overall Design

SLICK Configs: DSL configuration files submitted by users.

SLICK Syncer: Syncs config changes to metadata storage.

SLICK UI: Generates per‑service dashboards and provides the service index.

SLICK Service: API server that answers queries such as “how to compute SLO for a visualization?” and abstracts storage details.

SLICK Data Pipelines: Periodic pipelines that ingest SLI data.

Data Ingestion Details

The pipelines run hourly, query SLICK metadata to discover all SLIs, fetch minute‑level raw time‑series data, and batch‑insert it into the appropriate sharded MySQL partitions.
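The steps in that paragraph can be sketched with in-memory stand-ins. Plain Python lists play the role of the MySQL shards, and a CRC32 hash of the SLI identifier picks the shard; the article does not say how SLICK keys its shards or sizes its batches, so the hashing scheme, `batch_size`, and all function names are assumptions.

```python
import zlib


def shard_for(sli_id: str, num_shards: int) -> int:
    """Pick a shard with a stable hash so one SLI always lands in one place."""
    return zlib.crc32(sli_id.encode("utf-8")) % num_shards


def run_hourly_ingest(sli_ids, fetch_minutes, shards, hour_start, batch_size=500):
    """Fetch one hour of minute-level points per SLI and insert them in batches.

    sli_ids:       SLI identifiers discovered from metadata
    fetch_minutes: callable(sli_id, hour_start) -> list of (timestamp, value)
    shards:        list of lists standing in for sharded database tables
    Returns the number of rows inserted.
    """
    inserted = 0
    for sli_id in sli_ids:
        rows = [(sli_id, ts, value) for ts, value in fetch_minutes(sli_id, hour_start)]
        shard = shards[shard_for(sli_id, len(shards))]
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            shard.extend(batch)  # stands in for one multi-row INSERT
            inserted += len(batch)
    return inserted


def fake_fetch(sli_id, hour_start):
    """Hypothetical data source: 60 points, one per minute of the hour."""
    return [(hour_start + 60 * i, 1.0) for i in range(60)]


shards = [[] for _ in range(4)]
count = run_hourly_ingest(
    ["a.latency", "a.availability", "b.availability"],
    fake_fetch, shards, hour_start=1_683_417_600, batch_size=25,
)
print(count)  # prints 180
```

Hashing on the SLI identifier keeps each series on a single shard, so reading one SLI's history never needs to fan out across shards.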

Data quality checks run deterministic test series through the same pipeline and compare inserted rows with expected values to ensure pipeline correctness.
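A minimal version of such a check is sketched below: generate a series whose values are known in advance, push it through the ingest path, read it back, and report any difference. The in-memory store, the canary identifier, and the shape of the series are assumptions made for the example.

```python
class InMemoryStore:
    """Stand-in for the real storage layer."""

    def __init__(self):
        self.rows = {}

    def insert(self, sli_id, points):
        self.rows.setdefault(sli_id, []).extend(points)

    def read(self, sli_id):
        return list(self.rows.get(sli_id, []))


def deterministic_series(hour_start, minutes=60):
    """A test series whose values depend only on the minute offset."""
    return [(hour_start + 60 * i, float(i % 10)) for i in range(minutes)]


def check_pipeline(ingest, read, hour_start, canary_id="__canary__"):
    """Run the test series through ingest and compare stored rows to expectations.

    Returns (missing, unexpected); both empty means the pipeline is correct.
    """
    expected = deterministic_series(hour_start)
    ingest(canary_id, expected)
    actual = read(canary_id)
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    return missing, unexpected


store = InMemoryStore()
print(check_pipeline(store.insert, store.read, hour_start=0))  # prints ([], [])
```

Because the expected values are fully determined by the inputs, any dropped, duplicated-with-changes, or corrupted row shows up as a concrete difference rather than as a vague drift in a real metric.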

Current State of SLOs Using SLICK at Meta

SLICK launched in 2019, and by 2021 more than 1,000 services had been onboarded. The following (simulated) charts illustrate typical reliability improvements.

LogDevice: Regression Detection and Fix

LogDevice, Meta’s distributed log storage, uses SLICK to monitor read availability, detect regressions, and verify that fixes restore the SLO.

Backend ML Service Reliability Example

In 2020 a critical backend ML system showed significant reliability degradation affecting end‑user applications. SLICK revealed that the service consistently missed its SLO, prompting a reliability investigation that identified and fixed the root cause.

Key Takeaways

Long‑term tracking provides valuable trend data for planning reliability work.

SLOs must be central to engineering culture, influencing both strategic planning and daily communication.

Introducing SLOs strengthens overall service reliability.

Future investments will focus on:

Aligning service SLOs with those of their dependencies to understand cross‑service impact and prevent cascading failures.

Providing actionable feedback to service teams on how to improve reliability and meet SLOs.

Expanding SLICK’s coverage while maintaining its own reliability and scalability.
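The first item in the list above can be motivated with simple arithmetic. If a request succeeds only when every dependency succeeds, and dependency failures are independent, then the service's availability cannot exceed the product of its dependencies' availabilities. Both conditions are simplifying assumptions, and the function names below are illustrative rather than anything SLICK provides.

```python
import math


def serial_availability_bound(dependency_targets):
    """Upper bound on availability when every dependency must succeed."""
    return math.prod(dependency_targets)


def is_attainable(service_target, dependency_targets):
    """True if the service target does not exceed the dependency bound."""
    return serial_availability_bound(dependency_targets) >= service_target


deps = [0.999, 0.999]
print(f"{serial_availability_bound(deps):.6f}")
print(is_attainable(0.999, deps))  # prints False
print(is_attainable(0.99, deps))   # prints True
```

Two dependencies at 99.9% each cap the service just below 99.9%, so a 99.9% target on top of them is out of reach without retries, fallbacks, or tighter dependency SLOs.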




