
How Zhejiang Mobile Transformed SRE for Telecom: A Practical Operations Blueprint

This article details Zhejiang Mobile's adaptation of Google‑originated Site Reliability Engineering to a telecom environment, outlining a three‑layer capability framework, standardized processes, integrated platforms, and measurable outcomes that demonstrate how agile SRE practices can boost reliability and scalability in traditional industries.

Efficient Ops
Introduction: SRE originated at Google and is widely adopted by internet companies, but reference material for traditional industries is scarce. The author, who works as an SRE at a telecom operator, adapts Google's SRE concepts to the distinct organizational structure of a telecom company and summarizes practical lessons for peers.

Increasing market competition among operators drives frequent agile development and iterative releases, while micro-service adoption and new open-source components add operational complexity. The need to balance system iteration, technology upgrades, and stability led Zhejiang Mobile to transform its continuity assurance system.

1. Main Ideas

Analyze key factors affecting SLA and make application continuity the core guarantee.

Through minor organizational changes, integrate project onboarding, release, acceptance, and daily operations into a centralized Application SRE team.

Following a strategy of reducing the impact that system iteration, technology change, and platform instability have on operations, design and implement an agile SRE support capability framework.

2. Overview of the Agile SRE Support Capability Framework

The framework centers on a capability‑construction layer, relies on platform support below, and serves SRE work above. Standardized specifications translate capabilities into concrete practices, driven by three processes that coordinate overall work and control.

3. Research and Practice of the Framework

To improve the SLA of Zhejiang Mobile's IT support systems, the framework was built from on-site operations practice.

(1) Three‑Layer Capability System

Capability Application Layer

Entry-Control: Emphasize risk assessment and control for system changes, shifting SRE involvement earlier so that review, implementation, and acceptance are handled before deployment.

Non‑Functional Governance: Predict and manage architectural decay risks, evaluate robustness, performance, and capacity through drills and daily monitoring, and address weak points.

Daily Assurance: Provide 24×7 emergency response via a rotating frontline and second‑line team, maximizing fault detection and resolution efficiency.
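As an illustration of the entry-control idea above, here is a minimal sketch of a pre-deployment gate. The `ChangeRequest` fields, the `assess_change` function, and the specific checks are hypothetical and chosen only for illustration; the article does not describe Zhejiang Mobile's actual review criteria.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChangeRequest:
    """A hypothetical release request submitted for entry-control review."""
    system: str
    has_rollback_plan: bool
    passed_acceptance_tests: bool
    touches_core_service: bool
    scheduled_in_peak_hours: bool


def assess_change(change: ChangeRequest) -> Tuple[str, List[str]]:
    """Return (decision, reasons); decision is 'approve', 'review', or 'reject'."""
    blockers: List[str] = []
    warnings: List[str] = []

    # Hard requirements (assumed): missing any of these blocks the release.
    if not change.has_rollback_plan:
        blockers.append("no rollback plan")
    if not change.passed_acceptance_tests:
        blockers.append("acceptance tests not passed")

    # Risk factors (assumed): these do not block, but require a manual review.
    if change.touches_core_service:
        warnings.append("touches a core service")
    if change.scheduled_in_peak_hours:
        warnings.append("scheduled during peak hours")

    if blockers:
        return "reject", blockers
    if warnings:
        return "review", warnings
    return "approve", []
```

Encoding the checks as data-driven rules rather than leaving them to a reviewer's memory is one way to make the gate repeatable across many systems.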

Capability Construction Layer

Automated capabilities for acceptance, full-link stress testing, monitoring and alerting, fault localization, and fault handling.

A steady-state operation and self-healing system that provides rapid warning, localization, and handling of common anomalies, achieving a basic level of autonomy.

Support Platform Layer

Integrated Diagnosis Platform: Zhejiang Mobile’s "Wujian" platform aggregates data from the operations middle‑platform to build diagnostic models, reducing manual intervention and enabling data‑driven self‑healing.

Diagnosis library: Trains anomaly detection models from operational data, continuously improving accuracy.

Self‑healing capability set: Implements routing switches, domain switches, service isolation, etc., and evolves with fault scenarios.

Integrated platform: Unifies collection, aggregation, analysis, and processing, with rule matching, decision engine, and control core to automate fault handling.
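The flow described above (detect an anomaly, match a rule, choose a self-healing action) can be sketched roughly as follows. This is not the "Wujian" platform's implementation: the z-score detector stands in for the trained diagnostic models, and the metric names, the `RULES` table, and the function names are all assumptions made for illustration. Only the three action types (route switch, domain switch, service isolation) come from the article.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def is_anomalous(history: List[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple z-score check)."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A flat history: any deviation at all counts as anomalous.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical rule table mapping a metric to a self-healing action.
RULES: Dict[str, str] = {
    "gateway_error_rate": "switch_route",
    "dns_failure_rate": "switch_domain",
    "service_latency_ms": "isolate_service",
}


def decide(metric: str, history: List[float], latest: float) -> Optional[str]:
    """Decision step: return an action name if a rule exists for the metric
    and the latest sample is anomalous, otherwise None."""
    if metric in RULES and is_anomalous(history, latest):
        return RULES[metric]
    return None


def execute(action: str, audit_log: List[str]) -> None:
    """Control step (stub): record the action instead of calling real systems."""
    audit_log.append(f"executed {action}")
```

A usage example: with a latency history of `[100, 102, 98, 101, 99]` and a new sample of `180`, `decide("service_latency_ms", ...)` selects `isolate_service`, which `execute` then records in the audit log. Keeping detection, decision, and execution as separate steps lets each evolve independently as new fault scenarios are added.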

(2) Two Supporting Lines for SRE Implementation

Standard Guidance: Standardization drives overall delivery quality.

Define review standards to ensure concrete control rather than formalities.

Clarify implementation standards to reduce reliance on individual experience.

Define standardized metrics as a foundation for automation.
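To show why standardized metrics make automation easier, here is a small sketch in which a metric definition carries its own unit and thresholds, so any tool can classify a reading the same way. The `MetricStandard` class, its field names, and the example thresholds are hypothetical and not taken from Zhejiang Mobile's specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricStandard:
    """A hypothetical standardized metric definition shared by all systems."""
    name: str
    unit: str
    warn_at: float      # value at or above which a warning is raised
    critical_at: float  # value at or above which the state is critical

    def classify(self, value: float) -> str:
        """Map a raw reading to 'ok', 'warning', or 'critical'."""
        if value >= self.critical_at:
            return "critical"
        if value >= self.warn_at:
            return "warning"
        return "ok"


# Example definition with assumed thresholds.
API_LATENCY = MetricStandard("api_latency_p99", "ms", warn_at=500, critical_at=1000)
```

Because every system reports against the same definition, alerting and acceptance checks can reuse `classify` instead of re-deciding thresholds per team.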

Process‑Based Control: Combine management and quality verification to fully govern change quality and fault operation.

One‑stop service from release review to next‑day emergency support.

One‑stop service from construction demand to entry‑control acceptance.

Closed‑loop fault operation covering detection, diagnosis, review, governance, and drills.
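The closed-loop fault process can be pictured as a cycle of stages. The sketch below models the five stages named above as a simple state machine; treating drills as feeding back into detection is an interpretation of "closed-loop", and the `Stage` and `next_stage` names are hypothetical.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    DETECTION = "detection"
    DIAGNOSIS = "diagnosis"
    REVIEW = "review"
    GOVERNANCE = "governance"
    DRILL = "drill"


# Enum members iterate in definition order, which is the order of the loop.
ORDER: List[Stage] = list(Stage)


def next_stage(stage: Stage) -> Stage:
    """Advance to the following stage, wrapping from DRILL back to DETECTION."""
    idx = ORDER.index(stage)
    return ORDER[(idx + 1) % len(ORDER)]
```

Tracking each fault through such a cycle makes it easy to check that no incident stops at diagnosis without a review and a follow-up drill.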

4. Practice Outcomes

The SRE capability has been rapidly replicated, providing strong organizational scalability. Coverage grew from a dozen systems to over a hundred, and personnel count is no longer a bottleneck. Standardized capabilities enable fast integration of new functions, expanding responsibilities and fostering professional growth for operations staff.

The agile SRE culture first blurred and then redrew the boundary between development and operations, eliminating lengthy hand-over cycles. All systems now achieve "no hand-over" status, a milestone toward agile operations.

Higher demands on operations personnel have opened new technical career paths and improved retention.

Drawing on industry maturity models, the team defined an SRE capability maturity model for traditional industries. Starting in 2017 at maturity level 1-2, Zhejiang Mobile's SRE team reached a mature state after three years of refinement.

5. Future Outlook

Since 2019, the team has experimented with gray‑release, chaos engineering, and other techniques. The SRE team now leads overall technology and capability research, enriching DevOps practices and non‑functional governance, and will continue to share practical experience to support dual‑state IT transformation in traditional industries.

Source: Article originally published by the "San Dun IT" public account.
