From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting
The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.
The author recounts three stages of development-process maturity: Stage 1 at a tiny startup with manual FTP uploads; Stage 2 at ByteDance, where testing amounted to local development, code could be merged without review, and incidents surfaced through a long chain of users → operations → product → R&D; and Stage 3, where a formal release pipeline, environment segregation, code review, and monitoring were introduced.
An incident example shows a front‑end bundle injected into a container becoming unavailable due to a disabled cloud‑function, highlighting the lack of monitoring and delayed detection.
Stage 3 establishes a standardized release flow, emphasizing separate dev, pre‑release, and production environments, mandatory code review, deployment permissions, and proactive monitoring and alerting.
SRE (Site Reliability Engineering) is introduced via the Google book "Site Reliability Engineering: How Google Runs Production Systems", described as a collection of methods developed by Google engineers to improve system reliability, evolving from simple scripts to sophisticated time-series monitoring (e.g., Borgmon).
The SRE team’s responsibilities include availability improvement, latency reduction, performance and efficiency optimization, as well as change management.
A conflict between product teams (pushing frequent releases) and SRE (maintaining stability) is resolved by the error‑budget concept: set a monthly SLO (e.g., 99.9 % availability), measure actual availability, and only release new versions while the error budget remains positive.
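The error-budget gate described above can be sketched in a few lines. This is an illustrative sketch only; the names (`SLO_TARGET`, `can_release`) and the request-count interface are assumptions, not from the article.

```python
SLO_TARGET = 0.999  # monthly availability objective (99.9 %)

def error_budget_remaining(successful: int, total: int) -> float:
    """Fraction of the monthly error budget still unspent."""
    availability = successful / total
    budget = 1.0 - SLO_TARGET      # failure fraction the SLO permits
    consumed = 1.0 - availability  # failure fraction actually observed
    return budget - consumed

def can_release(successful: int, total: int) -> bool:
    # New versions ship only while the error budget is still positive.
    return error_budget_remaining(successful, total) > 0
```

For example, with 999,500 successful requests out of 1,000,000 the budget is only half spent and releases may proceed; at 998,000 successes the budget is exhausted and releases are frozen until the next window.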
SRE operational work is broken down into three pillars: monitoring & alerting, emergency incident handling, and capacity planning/resource management.
Key Service Level Indicators (SLIs) are listed: request latency (PCT50, PCT95, PCT99), error rate, QPS, resource usage (CPU, memory, disk), throughput, and overall availability (uptime or request success rate).
The significance of PCT99 is explained: it captures tail latency that average values hide, especially under high QPS load.
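A small nearest-rank percentile sketch makes the point concrete: one slow outlier barely moves the median but dominates PCT99 (the sample data and `pct` helper below are illustrative, not from the article).

```python
import math

def pct(latencies, p):
    """Nearest-rank p-th percentile of a list of latency samples (ms)."""
    ordered = sorted(latencies)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

samples = [10, 12, 11, 13, 12, 11, 950, 12, 13, 11]  # one slow outlier
median = pct(samples, 50)   # 12 ms: the typical request looks healthy
tail = pct(samples, 99)     # 950 ms: the tail the average would hide
```

Here the mean (105.5 ms) already looks suspicious, but only PCT99 pinpoints how bad the worst requests are; at high QPS even a 1 % tail affects many users.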
Service Level Objectives (SLOs) are defined as target values for SLIs, with practical guidance (e.g., availability > 99 %, average latency < 100 ms) and cautions against setting 100 % targets.
Service Level Agreements (SLAs) describe the contractual consequences of meeting or missing SLOs, with examples from Alibaba Cloud ECS and AWS.
Monitoring implementation is illustrated using OpenTSDB: time‑series data with tags, and a concrete query result is shown below.
```json
{
  "dps": {
    "1605110400": 1.1,
    "1605110430": 1.9,
    "1605110460": 1.0,
    "1605110490": 1.3,
    "1605110520": 1.4,
    ...
  },
  "tags": {
    "method": "home",
    ...
  }
}
```
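A response in this shape is straightforward to consume: `dps` maps Unix timestamps to values, `tags` identifies the series. The parsing sketch below assumes a single series object like the one shown (OpenTSDB's `/api/query` actually returns a JSON array of such objects).

```python
import json

# Illustrative payload matching the structure shown above.
response_body = """
{
  "dps": {"1605110400": 1.1, "1605110430": 1.9, "1605110460": 1.0},
  "tags": {"method": "home"}
}
"""
series = json.loads(response_body)

# Timestamps arrive as string keys; sort them numerically to get the series in order.
points = sorted((int(ts), value) for ts, value in series["dps"].items())
latest_ts, latest_value = points[-1]
```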
Alerting rules should set meaningful thresholds, be actionable (preferably automated), and aggregate repeated alerts; the article concludes with a link to the OpenTSDB documentation.
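The aggregation rule can be sketched as a cool-down filter that suppresses repeats of the same alert within a window (the class name, key scheme, and 300-second window are assumptions for illustration).

```python
class AlertAggregator:
    """Suppress duplicate notifications for the same alert within a cool-down window."""

    def __init__(self, cooldown_seconds: float = 300.0):
        self.cooldown = cooldown_seconds
        self.last_sent = {}  # alert key -> timestamp of last notification

    def should_notify(self, key: str, now: float) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # same alert fired recently: aggregate, don't re-page
        self.last_sent[key] = now
        return True
```

Keying alerts by rule and target (e.g. `"latency-high:home"`) keeps distinct problems visible while collapsing a storm of identical pages into one actionable notification.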
ByteDance ADFE Team
Official account of ByteDance Advertising Frontend Team