Evolution of SRE in the Cloud‑Native Era – Insights from Industry Experts
Practitioners from Zhejiang Mobile, Bilibili, and Xiaomi discuss how SRE has evolved in the cloud‑native era. They share concrete frameworks, observability practices, and cost‑focused platforms, and they emphasize stability, metrics, on‑call processes, and the need to adapt Google’s model to real‑world product and operational contexts.
This article compiles the content of an online sharing session titled “The Evolution Path of SRE under Cloud‑Native” organized by the dbaplus community, featuring three senior SRE practitioners: Shi Junting (Zhejiang Mobile), Liu Hao (Bilibili Infrastructure), and Liu Zhijie (Xiaomi Storage‑Compute).
It begins with a brief history of Site Reliability Engineering (SRE), noting that Google introduced the concept over a decade ago and that its adoption has accelerated with the rise of large‑scale internet systems and cloud‑native technologies. The speakers emphasize that SRE differs from traditional operations by demanding higher stability, performance, and engineering rigor.
Each expert shares concrete practices from their companies:
Shi Junting (Zhejiang Mobile) outlines a comprehensive SRE framework that includes architecture design, integrated delivery, testing, release, change control, unified monitoring, online governance, post‑measurement, and retrospectives. He describes eight engineering projects—such as the “Architecture Box,” “Operation Command Center,” “Dream Platform” (chaos engineering), “Multi‑Active Platform,” “Traffic Replay Platform,” and “Consistent Delivery System”—and three major systems (fault‑resistance, release, and delivery‑guard) that span the entire lifecycle.
Liu Hao (Bilibili) highlights Bilibili’s mature observability stack (logs, tracing, metrics), a service tree that holds full‑lifecycle metadata, an on‑call scheduling system, and chaos‑engineering foundations built on Kubernetes. He stresses the need to adapt Google’s SRE methodology to local contexts and to avoid over‑idealizing it.
Liu Zhijie (Xiaomi) explains Xiaomi’s SRE focus on quality, cost, and efficiency, detailing platforms such as the lightweight “Canoe” storage‑compute middle‑platform, the Aifault monitoring system, the MiGOC quality‑operation center, the Horus rapid‑troubleshooting system, and Mife for unified access management. He also discusses cost reduction through cloud‑billing integration and the importance of product thinking in SRE work.
The session includes a Q&A segment where the speakers discuss common misconceptions (e.g., “SRE is just a senior developer who knows ops”), the challenges of copying Google’s model directly, and the importance of combining development, operations, and product‑management skills.
Key takeaways include:
Stability is the foundation; cost and efficiency are optimized around it.
Metrics such as MTTR, MTBF, SLO/SLA, and resource‑usage ratios guide decision‑making.
Standardized on‑call, incident‑response, and change‑management procedures are essential.
Collaboration between business, development, and SRE teams must be codified through shared guidelines and tooling.
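To make the metrics in the takeaways concrete, here is a minimal sketch of how MTTR, MTBF, availability, and an SLO error budget could be derived from a list of incidents. It is an illustration only, not a method any of the speakers described. The `Incident` type, the `reliability_summary` function, and the dictionary keys are hypothetical names, and the sketch assumes that incidents do not overlap and fall entirely inside the reporting window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record: one contiguous outage."""
    start: datetime  # when the service became unavailable
    end: datetime    # when the service was restored


def reliability_summary(incidents, window_start, window_end, slo_target):
    """Summarize reliability for one reporting window.

    Assumes incidents are non-overlapping and lie inside the window.
    slo_target is an availability fraction, e.g. 0.999 for 99.9%.
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")

    downtime = sum((i.end - i.start).total_seconds() for i in incidents)
    uptime = window - downtime
    count = len(incidents)

    # MTTR: average time to restore service per incident.
    mttr = downtime / count if count else 0.0
    # MTBF: average uptime per failure; infinite if nothing failed.
    mtbf = uptime / count if count else float("inf")
    # Error budget: downtime the SLO tolerates within this window.
    budget = (1.0 - slo_target) * window

    return {
        "mttr_seconds": mttr,
        "mtbf_seconds": mtbf,
        "availability": uptime / window,
        "error_budget_seconds": budget,
        "budget_remaining_seconds": budget - downtime,
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    incidents = [
        Incident(t0 + timedelta(days=3), t0 + timedelta(days=3, minutes=30)),
        Incident(t0 + timedelta(days=17), t0 + timedelta(days=17, minutes=10)),
    ]
    summary = reliability_summary(incidents, t0, t0 + timedelta(days=30), 0.999)
    for name, value in summary.items():
        print(f"{name}: {value}")
```

In the sample data, two outages of 30 and 10 minutes in a 30‑day window give an MTTR of 20 minutes. A 99.9% SLO over the same window allows roughly 43 minutes of downtime, so about 3 minutes of budget remain. A negative remaining budget would be a signal to slow down changes until stability recovers, which matches the takeaway that cost and efficiency are optimized around stability.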
The article concludes with a call to watch the full replay of the session and provides additional reading links.
Bilibili Tech