
14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

dbaplus Community

Relationship between DevOps and SRE

Answer: DevOps and SRE overlap in practice, but DevOps emphasizes delivery efficiency while SRE’s primary organizational goal is system stability.

Practicing SRE on Low‑Quality Services

Answer: SRE should first strengthen its own ability to control the service in production, which compensates for deficiencies in service quality. Second, SRE should advise developers, testers, and product owners on design improvements, for example applying flexible design patterns that let a service degrade gracefully instead of failing outright.

SRE Team Composition

Answer: The DevOps-to-SRE staffing ratio depends on business complexity and team skill level. High‑performing organizations (e.g., Netflix) keep a very small core SRE group that supports global services.

Dynamic Fault‑Module Diagram in Grafana

The diagram is built with Grafana, the FlowCharting plugin, the Grafana Image Renderer plugin (grafana-image-renderer), draw.io, and an enterprise IM bot (WeChat, DingTalk, Feishu). The implementation consists of two main parts.

1. Dynamic Diagram Data

Create the diagram in draw.io (https://app.diagrams.net/).

Copy the *.drawio source into Grafana → Visualization → FlowCharting → Source Content.

Configure a data source for the diagram.

Bind each diagram element (blocks, lines) to the corresponding monitoring metrics.
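
Conceptually, the binding step maps each diagram element to a metric and turns the latest metric value into a visual state. In the plugin this is configured through its rule settings rather than written as code; the sketch below only illustrates the idea in Python. The element IDs, metric names, thresholds, and colors are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical binding of one diagram element to one metric."""
    metric: str   # name of the monitoring metric
    warn: float   # value at or above which the element is shown as a warning
    crit: float   # value at or above which the element is shown as critical


# Illustrative colors only; any palette would do.
COLORS = {"ok": "#2ecc71", "warn": "#f1c40f", "crit": "#e74c3c"}
NO_DATA_COLOR = "#95a5a6"


def state_for(value: float, rule: Rule) -> str:
    """Classify a metric value against the rule's thresholds."""
    if value >= rule.crit:
        return "crit"
    if value >= rule.warn:
        return "warn"
    return "ok"


def color_elements(bindings: dict[str, Rule], samples: dict[str, float]) -> dict[str, str]:
    """Return {element_id: color}; elements without a sample are shown as no-data."""
    result = {}
    for element_id, rule in bindings.items():
        value = samples.get(rule.metric)
        if value is None:
            result[element_id] = NO_DATA_COLOR
        else:
            result[element_id] = COLORS[state_for(value, rule)]
    return result
```

With this shape, a block for a gateway bound to an error-rate metric turns yellow at the warning threshold and red at the critical one, which is what makes the faulty module stand out in the diagram.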

2. Alert‑Sending Functionality

Obtain the rendering URL of the Grafana chart and use the Grafana Image Renderer plugin (grafana-image-renderer) to generate a publicly accessible image.

Write a script that posts the chart image to enterprise IM tools.

Configure Grafana alerts to invoke the script when a rule fires, delivering the image to alert recipients.
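
The alert script described in the steps above could look roughly like the following Python sketch. It is not the author's script. It assumes Grafana's `/render/d-solo/...` panel rendering path with a bearer token, and it assumes an IM webhook that accepts an image as base64 plus an MD5 checksum (modeled on the enterprise WeChat group bot format; DingTalk and Feishu use different payloads). It also embeds the image bytes directly instead of publishing a public image URL. The host names, dashboard UID, slug, and panel ID in the usage note are hypothetical.

```python
import base64
import hashlib
import json
import urllib.request
from urllib.parse import urlencode


def build_render_url(base_url, dashboard_uid, slug, panel_id,
                     width=1000, height=500, org_id=1):
    """Build the URL that asks Grafana to render a single panel as a PNG."""
    query = urlencode({"orgId": org_id, "panelId": panel_id,
                       "width": width, "height": height})
    return f"{base_url.rstrip('/')}/render/d-solo/{dashboard_uid}/{slug}?{query}"


def build_image_message(png_bytes):
    """Build an image message body (assumed webhook format, see lead-in)."""
    return {
        "msgtype": "image",
        "image": {
            "base64": base64.b64encode(png_bytes).decode("ascii"),
            "md5": hashlib.md5(png_bytes).hexdigest(),
        },
    }


def fetch_png(render_url, api_token):
    """Download the rendered panel using a Grafana API token."""
    request = urllib.request.Request(
        render_url, headers={"Authorization": f"Bearer {api_token}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def post_json(webhook_url, payload):
    """POST a JSON payload to the IM bot webhook and return the HTTP status."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A script triggered by an alert would chain these calls: `post_json(webhook_url, build_image_message(fetch_png(build_render_url("https://grafana.example.com", "abc123", "fault-map", 2), token)))`. Keeping the URL and payload builders free of network access makes them easy to check in isolation.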

Fault Management – Monitoring and Alerting Diagram

Reference material (keep URLs for further reading):

Plugin tutorial: https://algenty.github.io/flowcharting-repository/STARTED.html

Online demo: https://play.grafana.org/d/_J1UvKjWk/flowcharting-aws-cloud?orgId=1

Plugin source: https://github.com/algenty/grafana-flowcharting

Plugin homepage: https://algenty.github.io/flowcharting-repository/

Plugin installation: https://grafana.com/grafana/plugins/agenty-flowcharting-panel

Accelerating Growth of a New SRE Team

Focus on the three IT‑management pillars: people, process, and technology.

People: Align goals and mindset across the team.

Process: Define common workflows, communication standards, and tooling; iterate quickly with small steps and regular retrospectives.

Technology: Raise core technical competence to reliably operate services, then plan longer‑term technical evolution.

Tool Development Strategy

Whether to build tools in‑house depends on the scenario; some requirements call for external collaboration or co‑development because no single team has all the necessary expertise.

Tracing System Selection and Practice

Meitu wraps and adapts open‑source tracing products. For small teams, directly adopting mature open‑source solutions is recommended.

Service Degradation and Circuit Breaking Techniques

Implementation can occur at three layers:

Business code (e.g., explicit fallback logic).

Business framework (e.g., middleware that provides degradation hooks).

Underlying infrastructure (e.g., load balancer or service mesh policies).

Industry‑standard libraries such as Hystrix (or its successors) are commonly used.
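
To make the business-code layer concrete, here is a minimal circuit breaker with a fallback, written in Python. It is a simplified sketch of the general pattern, not Hystrix's API and not Meitu's implementation; the class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after repeated failures,
    then a single trial call once the reset timeout has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        """Run func; on failure, or while the circuit is open, return fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()   # open: degrade without calling the dependency
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap a risky dependency as `breaker.call(fetch_live_data, serve_cached_data)`, where both function names are hypothetical. The same idea moves down to the framework or infrastructure layer when the wrapping is done by middleware or a service mesh rather than by the business code itself.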

SRE Organization Size at Meitu

The SRE function belongs to the Operations department, which includes DBA, big‑data SRE, security, infrastructure, and product SRE teams. The product SRE team consists of about 7‑8 members; a separate infrastructure team also exists.

Future Risks for Business‑Focused SRE Teams

Automation (NoOps) may reduce some tasks, but the larger risk is being superseded by AI. Continuous learning and improvement are essential.

Recommended SRE Knowledge Stack

Adopt a cloud‑native perspective when building the SRE technology stack, covering observability, reliability engineering, and automation tools.

Monitoring Strategy Across Product Lines

A unified monitoring solution covers most scenarios. Teams can create custom dashboards on top of it and perform selective adaptations when needed.

Middleware Ownership

Middleware ownership varies: some components are managed by other teams rather than by SRE.

Service‑Level Objectives (SLO) for Microservices

In theory, each microservice should have its own SLO because services differ in criticality and required availability targets.
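
One way to see why per-service targets matter is to convert each availability SLO into an error budget, that is, the downtime the service may accumulate in a window before it breaches its target. The Python sketch below does that arithmetic; the 30-day window and the function names are assumptions for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of error budget left; a negative value means the SLO is breached."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

Over 30 days, a 99.9% target allows about 43 minutes of downtime while a 99% target allows about 432 minutes, so a critical service and a best-effort one can react very differently to the same incident.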
