How Meta’s SLICK Transforms SLO Management for Reliable Services
This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.
Defining SLIs and SLOs for each service and making them visible company-wide turns SRE practice into something teams can act on. This article introduces SLICK, the implementation built at Meta (formerly Facebook).
Meta needs to deliver a reliable experience to the people and communities that use its products. In a fast-moving, large-scale environment where thousands of engineers deploy code frequently, clear expectations for each product, feature, and service are essential: they make the intended user experience visible and help engineers find system bottlenecks.
To address this, Meta studied Service‑Level Indicators (SLIs) and Service‑Level Objectives (SLOs) and built SLICK, an “SLO store” that centralizes SLI/SLO definitions, offers high‑retention granular data, and integrates SLOs into company workflows.
Before SLICK, SLOs were scattered across custom dashboards, documents, and other tools, so finding them was time-consuming, and limited data retention made long-term analysis impossible. With SLICK, teams can:
Define SLOs for services consistently.
Collect minute‑level metrics retained for up to two years.
Provide standard visualizations and insights for SLI/SLO metrics.
Send periodic reliability reports to internal teams for regular reliability checks.
Discoverability
SLICK defines a standard model that lets everyone discuss reliability using the same terminology, enabling new service teams to adopt company‑wide standards early in design.
Given a service name, anyone can find its reliability metrics and performance data through SLICK's built-in service index, which links to standardized dashboards. A single click shows whether the service is meeting user expectations and gives a starting point for troubleshooting.
Figure: SLICK SLO index search example.
Long‑term Insights
Reliability issues can stem from a single faulty deployment or accumulate gradually. SLICK stores up to two years of full‑granularity metric data in sharded MySQL databases, refreshed hourly, enabling engineers, TPMs, and leaders to see degradation trends that would otherwise be missed.
Workflows
SLOs and SLIs are integrated into common workflows so that anyone can plan, evaluate, and act on reliability impact. During large‑scale incidents, teams can assess user‑experience impact via real‑time SLOs, and major incidents can trigger SLO‑driven response processes.
SLICK Onboarding
Service teams onboard to SLICK through a UI or a simple DSL configuration file that specifies the service name, the queries for its SLI time series, and the associated SLOs.
After submitting the configuration, SLICK automatically indexes the service, creates a dedicated dashboard, and begins data collection for long‑term observation.
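The article does not show SLICK's DSL, so the following is only a minimal Python sketch of the information such a configuration carries: a service name, one or more SLI queries, and an SLO for each. Every name here (`ServiceConfig`, `SLIConfig`, `SLOConfig`, `validate`, the `window_days` field) is hypothetical and is not SLICK's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SLOConfig:
    """Hypothetical SLO: the fraction of good events the SLI should reach."""
    target: float          # e.g. 0.999 means 99.9%
    window_days: int = 28  # evaluation window; this default is an assumption


@dataclass(frozen=True)
class SLIConfig:
    """Hypothetical SLI: a named time-series query plus its SLO."""
    name: str
    query: str             # opaque query string for some metrics backend
    slo: SLOConfig


@dataclass(frozen=True)
class ServiceConfig:
    service: str
    slis: Tuple[SLIConfig, ...] = ()


def validate(config: ServiceConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if not config.service:
        problems.append("service name is empty")
    if not config.slis:
        problems.append("no SLIs defined")
    seen = set()
    for sli in config.slis:
        if sli.name in seen:
            problems.append(f"duplicate SLI name: {sli.name}")
        seen.add(sli.name)
        if not sli.query.strip():
            problems.append(f"{sli.name}: query is empty")
        if not 0.0 < sli.slo.target <= 1.0:
            problems.append(f"{sli.name}: SLO target must be in (0, 1]")
    return problems
```

Validating the configuration before it is accepted is one plausible way to keep the service index consistent; whether SLICK does this at submission time is not stated in the article.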
Using SLICK
1) Dashboard
The dashboard shows real‑time SLI data and historical trends based on high‑retention metrics.
Figure: the left panel shows the full-granularity SLI time series; the right panel shows weekly aggregated SLI values against the SLO target.
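The weekly view can be thought of as a roll-up of minute-level samples compared against the SLO target. The sketch below assumes each sample is a `(unix_ts, good, total)` tuple and groups by ISO week in UTC; the sample shape, the function names, and the choice of ISO weeks are assumptions for illustration, not details from SLICK.

```python
from collections import defaultdict
from datetime import datetime, timezone


def weekly_sli(samples):
    """Aggregate (unix_ts, good, total) minute samples into weekly ratios.

    Returns {(iso_year, iso_week): good / total}. Counts are summed before
    dividing, so busy minutes weigh more than idle ones.
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for ts, g, t in samples:
        iso = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        good[key] += g
        total[key] += t
    return {k: good[k] / total[k] for k in sorted(total) if total[k] > 0}


def weeks_below_target(weekly, target):
    """List the weeks whose aggregated SLI is under the SLO target."""
    return [week for week, value in weekly.items() if value < target]
```

Summing counts rather than averaging per-minute ratios is a design choice: it prevents a quiet minute with a handful of failed requests from dominating the weekly number.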
2) Periodic Reports
SLICK generates regular SLO performance summary reports for internal teams, helping them focus on regressions and conduct post‑mortems.
3) CLI
SLICK provides a command‑line interface for tasks such as back‑filling data, generating reports on demand, or testing configuration changes.
SLICK Architecture
Overall Design
SLICK Configs: DSL configuration files submitted by users.
SLICK Syncer: Syncs config changes to metadata storage.
SLICK UI: Generates per‑service dashboards and provides the service index.
SLICK Service: API server that answers queries such as “how to compute SLO for a visualization?” and abstracts storage details.
SLICK Data Pipelines: Periodic pipelines that ingest SLI data.
Data Ingestion Details
The pipelines run hourly, query SLICK metadata to discover all SLIs, fetch minute‑level raw time‑series data, and batch‑insert it into the appropriate sharded MySQL partitions.
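The ingestion loop described above can be sketched as follows. This is an illustration under stated assumptions, not SLICK's code: plain dictionaries stand in for the sharded MySQL tables, the shard is chosen with a CRC32 hash of a hypothetical SLI identifier, and `fetch_points` is a caller-supplied placeholder for the metrics backend. The article does not describe the shard count or the real sharding key.

```python
import zlib

NUM_SHARDS = 4  # assumption; the article does not give a shard count


def shard_for(sli_id, num_shards=NUM_SHARDS):
    """Pick a shard deterministically.

    crc32 is used because Python's built-in hash() of a str can differ
    between processes, which would scatter one SLI across shards.
    """
    return zlib.crc32(sli_id.encode("utf-8")) % num_shards


def run_hourly_ingest(sli_ids, fetch_points, shards, hour_start, batch_size=500):
    """One pipeline run: fetch an hour of minute points per SLI, insert in batches.

    fetch_points(sli_id, start, end) -> iterable of (unix_ts, value)
    shards: list of dicts mapping (sli_id, ts) -> value (stand-in for MySQL)
    Returns the number of rows written.
    """
    hour_end = hour_start + 3600
    written = 0
    for sli_id in sli_ids:
        shard = shards[shard_for(sli_id, len(shards))]
        batch = []
        for ts, value in fetch_points(sli_id, hour_start, hour_end):
            batch.append(((sli_id, ts), value))
            if len(batch) >= batch_size:
                shard.update(batch)  # keyed write, so re-runs overwrite
                written += len(batch)
                batch = []
        if batch:
            shard.update(batch)
            written += len(batch)
    return written
```

Keying rows by `(sli_id, ts)` makes a repeated run overwrite rather than duplicate, which is a convenient property for back-fills; in a real MySQL table the equivalent would be a primary key on those columns with an upsert.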
Data quality checks run deterministic test series through the same pipeline and compare inserted rows with expected values to ensure pipeline correctness.
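A check of this kind can be sketched as a small harness: generate a series whose values are a pure function of the timestamp, push it through the pipeline, read it back, and diff. The two callables `ingest` and `read_back` are hypothetical stand-ins for the real pipeline's write and read paths, and the value formula is arbitrary.

```python
def deterministic_series(start, minutes):
    """Synthetic minute-level series; each value depends only on its timestamp."""
    return [(start + 60 * i, float((start // 60 + i) % 100)) for i in range(minutes)]


def check_pipeline(ingest, read_back, start=0, minutes=60):
    """Run a known series through `ingest`, then diff what `read_back` returns.

    ingest(points) writes a list of (ts, value) rows.
    read_back(start, end) returns {ts: value} for that time range.
    Returns a list of (ts, expected, actual) mismatches; an empty list
    means every expected row was stored with the expected value.
    """
    expected = dict(deterministic_series(start, minutes))
    ingest(list(expected.items()))
    actual = read_back(start, start + 60 * minutes)
    mismatches = []
    for ts in sorted(set(expected) | set(actual)):
        if expected.get(ts) != actual.get(ts):
            mismatches.append((ts, expected.get(ts), actual.get(ts)))
    return mismatches
```

Because the expected values can be recomputed from timestamps alone, the check needs no stored fixture and can run alongside every production pipeline run.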
Current State of SLOs Using SLICK at Meta
SLICK launched in 2019, and by 2021 more than 1,000 services had been onboarded. The examples below illustrate typical reliability improvements; the accompanying charts use simulated data.
LogDevice: Regression Detection and Fix
LogDevice, Meta’s distributed log storage, uses SLICK to monitor read availability, detect regressions, and verify that fixes restore the SLO.
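One simple way to flag a regression in a weekly SLI series, and to confirm that a fix ended it, is to look for runs of consecutive weeks below the SLO target. The function below is a hypothetical heuristic for illustration; the article does not describe how SLICK or the LogDevice team actually detect regressions, and the `min_weeks` threshold is an assumption.

```python
def find_regressions(weekly_values, target, min_weeks=2):
    """Find runs of at least `min_weeks` consecutive weeks below `target`.

    Returns a list of (start_index, end_index) pairs with end_index
    exclusive. A run whose end_index equals len(weekly_values) has not
    recovered yet.
    """
    spans = []
    start = None
    for i, value in enumerate(weekly_values):
        if value < target:
            if start is None:
                start = i  # first week of a possible regression
        else:
            if start is not None and i - start >= min_weeks:
                spans.append((start, i))  # SLI is back at or above target
            start = None
    if start is not None and len(weekly_values) - start >= min_weeks:
        spans.append((start, len(weekly_values)))
    return spans
```

Requiring more than one bad week filters out isolated dips, at the cost of reacting a week later; the right threshold depends on how noisy the SLI is.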
Backend ML Service Reliability Example
In 2020 a critical backend ML system showed significant reliability degradation affecting end‑user applications. SLICK revealed that the service consistently missed its SLO, prompting a reliability investigation that identified and fixed the root cause.
Key Takeaways
Long‑term tracking provides valuable trend data for planning reliability work.
SLOs must be central to engineering culture, influencing both strategic planning and daily communication.
Introducing SLOs strengthens overall service reliability.
Future investments will focus on:
Aligning service SLOs with those of their dependencies to understand cross‑service impact and prevent cascading failures.
Providing actionable feedback to service teams on how to improve reliability and meet SLOs.
Expanding SLICK’s coverage while maintaining its own reliability and scalability.
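To make the first item concrete: under the simplifying assumptions that a service needs all of its dependencies on every request and that dependencies fail independently, its availability cannot exceed the product of their availabilities. The sketch below uses that bound to flag an SLO that is stricter than its dependencies can support. The function names are hypothetical, and real systems with retries, caches, or fallbacks can do better than this bound, so it is a starting point for discussion rather than a rule.

```python
import math


def max_achievable_availability(dependency_targets):
    """Upper bound on availability given hard, independent dependencies."""
    return math.prod(dependency_targets)


def slo_is_consistent(service_target, dependency_targets):
    """True if the service's SLO does not exceed what its dependencies allow."""
    return service_target <= max_achievable_availability(dependency_targets)
```

For example, a service targeting 99.9% that depends on two services targeting 99.9% and 99.95% is bounded at roughly 99.85%, so the check reports the target as inconsistent.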
MaGe Linux Operations