Evolution of SRE in the Cloud‑Native Era – Insights from Industry Experts

Industry experts from Zhejiang Mobile, Bilibili, and Xiaomi discuss how SRE has evolved in the cloud‑native era, sharing concrete frameworks, observability practices, and cost‑focused platforms while emphasizing stability, metrics, on‑call processes, and the need to adapt Google’s model to real‑world product and operational contexts.

Bilibili Tech

Jun 21, 2022

This article summarizes an online panel session, “The Evolution Path of SRE under Cloud‑Native,” organized by the dbaplus community and featuring three senior SRE practitioners: Shi Junting (Zhejiang Mobile), Liu Hao (Bilibili Infrastructure), and Liu Zhijie (Xiaomi Storage‑Compute).

The session opens with a brief history of Site Reliability Engineering (SRE), noting that Google introduced the concept over a decade ago and that adoption has accelerated with the rise of large‑scale internet systems and cloud‑native technologies. The speakers emphasize that SRE differs from traditional operations in demanding higher standards of stability, performance, and engineering rigor.

Each expert shares concrete practices from their companies:

Shi Junting (Zhejiang Mobile) outlines a comprehensive SRE framework that covers architecture design, integrated delivery, testing, release, change control, unified monitoring, online governance, post‑measurement, and retrospectives. He describes eight engineering projects, including the “Architecture Box,” “Operation Command Center,” “Dream Platform” (chaos engineering), “Multi‑Active Platform,” “Traffic Replay Platform,” and “Consistent Delivery System,” along with three major systems (fault‑resistance, release, and delivery‑guard) that together span the entire lifecycle.

Liu Hao (Bilibili) highlights Bilibili’s mature observability stack (logs, tracing, metrics), a service tree that holds full‑lifecycle metadata, an on‑call scheduling system, and chaos‑engineering foundations built on Kubernetes. He stresses the need to adapt Google’s SRE methodology to local contexts rather than over‑idealizing it.

Liu Zhijie (Xiaomi) explains Xiaomi’s SRE focus on quality, cost, and efficiency, detailing platforms such as the lightweight “Canoe” storage‑compute middle‑platform, the Aifault monitoring system, the MiGOC quality‑operation center, the Horus rapid‑troubleshooting system, and Mife for unified access management. He also discusses reducing cost through cloud‑billing integration and the importance of product thinking in SRE work.
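
The summary does not describe how the on‑call scheduling system mentioned in Liu Hao's section works internally. As a rough illustration of the core idea only, the sketch below builds a weekly round‑robin rotation with a primary and a secondary responder. The function names, the one‑week shift length, and the engineer names are all hypothetical and are not taken from Bilibili's system.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Assign a primary and secondary on-call engineer per week, round-robin.

    Assumption: shifts last exactly seven days and the secondary is simply
    the next engineer in the list. Real systems also handle swaps, holidays,
    and escalation policies, which are omitted here.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append({
            "start": shift_start,
            "end": shift_start + timedelta(days=6),  # inclusive last day
            "primary": engineers[week % n],
            "secondary": engineers[(week + 1) % n],
        })
    return schedule


def on_call_for(schedule, day):
    """Return (primary, secondary) for a given day, or None if uncovered."""
    for shift in schedule:
        if shift["start"] <= day <= shift["end"]:
            return shift["primary"], shift["secondary"]
    return None


# Hypothetical team and start date, for illustration only.
schedule = build_rotation(["alice", "bob", "carol"], date(2022, 6, 20), weeks=4)
responders = on_call_for(schedule, date(2022, 6, 29))
print(responders)
```

Returning `None` for an uncovered day makes gaps in coverage explicit, so a caller can alert on them instead of silently paging nobody.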

The session includes a Q&A segment where the speakers discuss common misconceptions (e.g., “SRE is just a senior developer who knows ops”), the challenges of copying Google’s model directly, and the importance of combining development, operations, and product‑management skills.

Key takeaways include:

Stability is the foundation; cost and efficiency are optimized around it.

Metrics such as MTTR, MTBF, SLO/SLA, and resource‑usage ratios guide decision‑making.

Standardized on‑call, incident‑response, and change‑management procedures are essential.

Collaboration between business, development, and SRE teams must be codified through shared guidelines and tooling.
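
To make the metrics takeaway concrete, here is a minimal sketch of how MTTR, MTBF, availability, and the remaining error budget could be computed from a list of incidents. It assumes the common conventions MTTR = total downtime / number of incidents and MTBF = total uptime / number of incidents. The `Incident` type, the function names, and the sample data are hypothetical and do not come from any of the speakers' platforms.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    start: datetime  # when the outage began
    end: datetime    # when service was restored


def reliability_metrics(incidents, window_start, window_end):
    """Return (MTTR, MTBF, availability) for one reporting window."""
    window = window_end - window_start
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    uptime = window - downtime
    count = len(incidents)
    mttr = downtime / count if count else timedelta()
    mtbf = uptime / count if count else window
    availability = uptime / window  # timedelta / timedelta yields a float
    return mttr, mtbf, availability


def error_budget_remaining(availability, slo):
    """Fraction of the error budget left; negative means the SLO is breached."""
    budget = 1.0 - slo
    spent = 1.0 - availability
    return (budget - spent) / budget


# Made-up 30-day window with two incidents (30 minutes and 90 minutes).
window_start = datetime(2022, 6, 1)
window_end = window_start + timedelta(days=30)
incidents = [
    Incident(datetime(2022, 6, 5, 10, 0), datetime(2022, 6, 5, 10, 30)),
    Incident(datetime(2022, 6, 18, 22, 0), datetime(2022, 6, 18, 23, 30)),
]

mttr, mtbf, availability = reliability_metrics(incidents, window_start, window_end)
remaining = error_budget_remaining(availability, slo=0.995)
print(mttr, mtbf, f"{availability:.4%}", f"{remaining:.1%}")
```

With this sample data, two hours of downtime in a 720‑hour window gives an MTTR of 1 hour and an MTBF of 359 hours. Against a hypothetical 99.5% SLO, roughly 44% of the error budget remains, which is the kind of number a team can use to decide whether to keep shipping changes or to slow down and invest in stability.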

The article concludes with a call to watch the full replay of the session and provides additional reading links.

