14 Expert Q&A on Building an Effective SRE System for Fault Management
In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.
Relationship between DevOps and SRE
Answer: DevOps and SRE overlap in practice, but DevOps emphasizes delivery efficiency while SRE’s primary organizational goal is system stability.
Practicing SRE on Low‑Quality Services
Answer: SRE should first increase its own control capability to compensate for service quality deficiencies. Second, SRE should advise developers, testers, and product owners on design improvements—e.g., applying flexible design patterns that increase resilience.
SRE Team Composition
Answer: The staffing ratio between development/operations engineers and SREs depends on business complexity and team skill level. High‑performing organizations (e.g., Netflix) keep a very small core SRE group that supports global services.
Dynamic Fault‑Module Diagram in Grafana
The diagram is built with Grafana, the FlowCharting plugin, the Grafana image renderer (grafana-image-renderer), draw.io, and an enterprise IM bot (e.g., WeChat, DingTalk, or Feishu). The implementation consists of two main parts.
1. Dynamic Diagram Data
Create the diagram in draw.io (https://app.diagrams.net/).
Copy the *.drawio source into Grafana → Visualization → FlowCharting → Source Content.
Configure a data source for the diagram.
Bind each diagram element (blocks, lines) to the corresponding monitoring metrics.
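The binding step can be pictured as a small lookup: each diagram cell is tied to a metric, and thresholds decide which color the cell takes. The sketch below only illustrates that idea in Python; it is not how the FlowCharting plugin is configured (that is done in the panel editor), and the cell IDs, metric names, and threshold values are hypothetical.

```python
from bisect import bisect_right

# Hypothetical bindings: diagram cell ID -> (metric name, ascending thresholds).
BINDINGS = {
    "api-gateway": ("http_5xx_ratio_percent", [1.0, 5.0]),
    "mysql-primary": ("replication_lag_seconds", [10.0, 60.0]),
}

# One more color than thresholds: below the first, between, above the last.
COLORS = ["green", "orange", "red"]


def color_for(value, thresholds, colors=COLORS):
    """Return the color of the threshold band that `value` falls into."""
    if len(colors) != len(thresholds) + 1:
        raise ValueError("need exactly one more color than thresholds")
    # bisect_right counts how many thresholds are <= value.
    return colors[bisect_right(thresholds, value)]


def render_states(samples):
    """Map the latest metric samples to a color per diagram cell."""
    states = {}
    for cell_id, (metric, thresholds) in BINDINGS.items():
        if metric in samples:
            states[cell_id] = color_for(samples[metric], thresholds)
    return states
```

With these assumed thresholds, a cell turns orange once its metric reaches the first threshold and red once it reaches the second, which is the behavior a reader of the fault-module diagram relies on during an incident.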
2. Alert‑Sending Functionality
Obtain the rendering URL of the Grafana chart and use the Grafana image renderer (grafana-image-renderer) to generate a publicly accessible image.
Write a script that posts the chart image to enterprise IM tools.
Configure Grafana alerts to invoke the script when a rule fires, delivering the image to alert recipients.
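A minimal sketch of such a script follows. It makes several assumptions that should be checked against your own setup: the render path follows the commonly used /render/d-solo/ pattern, the Grafana API token is passed as a Bearer header, and the message body follows the image format of a WeChat Work (WeCom) group bot webhook, which carries the image inline as base64 plus MD5 rather than as a public link. DingTalk and Feishu expect different payloads. All host names, UIDs, and function names are hypothetical.

```python
import base64
import hashlib
import json
import urllib.parse
import urllib.request


def build_render_url(base_url, dashboard_uid, slug, panel_id,
                     width=1000, height=500, org_id=1):
    """Build the URL that asks Grafana to render one panel as a PNG."""
    query = urllib.parse.urlencode({
        "orgId": org_id,
        "panelId": panel_id,
        "width": width,
        "height": height,
    })
    return f"{base_url.rstrip('/')}/render/d-solo/{dashboard_uid}/{slug}?{query}"


def build_image_message(png_bytes):
    """Wrap raw image bytes in the assumed WeCom bot image payload."""
    return {
        "msgtype": "image",
        "image": {
            "base64": base64.b64encode(png_bytes).decode("ascii"),
            "md5": hashlib.md5(png_bytes).hexdigest(),
        },
    }


def fetch_panel_png(render_url, api_token):
    """Download the rendered panel; requires the image renderer to be installed."""
    request = urllib.request.Request(
        render_url, headers={"Authorization": f"Bearer {api_token}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def post_json(webhook_url, payload):
    """POST a JSON payload to the IM bot webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def send_alert_image(base_url, api_token, webhook_url, dashboard_uid, slug, panel_id):
    """End-to-end flow: render the panel, then push the image to the chat group."""
    url = build_render_url(base_url, dashboard_uid, slug, panel_id)
    png = fetch_panel_png(url, api_token)
    return post_json(webhook_url, build_image_message(png))
```

Keeping URL and payload construction separate from the network calls makes the script easy to test offline and easy to adapt when the target IM tool changes.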
Reference material:
Plugin tutorial: https://algenty.github.io/flowcharting-repository/STARTED.html
Online demo: https://play.grafana.org/d/_J1UvKjWk/flowcharting-aws-cloud?orgId=1
Plugin source: https://github.com/algenty/grafana-flowcharting
Plugin homepage: https://algenty.github.io/flowcharting-repository/
Plugin installation: https://grafana.com/grafana/plugins/agenty-flowcharting-panel
Accelerating Growth of a New SRE Team
Focus on the three IT‑management pillars: people, process, and technology.
People: Align goals and mindset across the team.
Process: Define common workflows, communication standards, and tooling; iterate quickly with small steps and regular retrospectives.
Technology: Raise core technical competence to reliably operate services, then plan longer‑term technical evolution.
Tool Development Strategy
Whether to build tools in‑house depends on the scenario; some requirements need external collaboration or co‑development because no single team can be all‑knowing.
Tracing System Selection and Practice
Meitu wraps and adapts open‑source tracing products. For small teams, directly adopting mature open‑source solutions is recommended.
Service Degradation and Circuit Breaking Techniques
Implementation can occur at three layers:
Business code (e.g., explicit fallback logic).
Business framework (e.g., middleware that provides degradation hooks).
Underlying infrastructure (e.g., load balancer or service mesh policies).
Industry‑standard libraries are commonly used, such as Hystrix (now in maintenance mode) or successors such as Resilience4j.
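To make the business-code layer concrete, here is a minimal circuit breaker with a fallback, written in plain Python. It is a teaching sketch under simple assumptions (consecutive-failure counting, a single trial call after the timeout, no thread safety) and does not reproduce the behavior of Hystrix or any other library; the class and parameter names are hypothetical.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock        # injectable so tests can control time
        self._failures = 0
        self._opened_at = None     # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, fallback):
        """Run func unless the circuit is open; degrade to fallback on failure."""
        if self.state == "open":
            return fallback()      # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each risky dependency call, for example breaker.call(fetch_recommendations, lambda: []), so that a failing downstream service yields a degraded but fast response instead of piling up timeouts. The same policy can be moved into a framework or a service mesh once it is proven in code.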
SRE Organization Size at Meitu
The SRE function belongs to the Operations department, which includes DBA, big‑data SRE, security, infrastructure, and product SRE teams. The product SRE team consists of about 7‑8 members; a separate infrastructure team also exists.
Future Risks for Business‑Focused SRE Teams
Automation (NoOps) may reduce some tasks, but the larger risk is being superseded by AI. Continuous learning and improvement are essential.
Recommended SRE Knowledge Stack
Adopt a cloud‑native perspective when building the SRE technology stack, covering observability, reliability engineering, and automation tools.
Monitoring Strategy Across Product Lines
A unified monitoring solution covers most scenarios. Teams can create custom dashboards on top of it and perform selective adaptations when needed.
Middleware Ownership
Responsibility for middleware varies; some components are managed by other teams, not always by SRE.
Service‑Level Objectives (SLO) for Microservices
In theory, each microservice should have its own SLO because services differ in criticality and required availability targets.
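One way to see why targets should differ per service is to convert each availability SLO into an error budget, that is, the downtime the target permits over a window. The sketch below does that arithmetic; the 30-day window and the example targets are assumptions chosen for illustration, not figures from Meitu.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Downtime in minutes that an availability SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget
```

Under these assumptions a 99.9% target allows about 43.2 minutes of downtime per 30 days, while 99.99% allows about 4.32 minutes. A login service and an internal reporting service can reasonably sit on different sides of that gap.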
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
