Beijing Mobile’s SRE Success: Automation, Cloud‑Native Ops & Reliability

The article details how Beijing Mobile’s SRE Smart Operations team applied SRE principles, automation, and cloud‑native tools to transform traditional DevOps into a reliable, scalable operation, highlighting their fault‑prevention, monitoring, incident response, and continuous improvement practices that earned them the 2023 IT Technology Leadership award.

Efficient Ops

Nov 26, 2023

2023 IT Technology Leadership Awards (GOITI)

On October 26, the 2023 IT Technology Leadership Annual Awards (GOITI) ceremony was held at the GOPS Global Operations Conference 2023 in Shanghai. Launched in 2016, GOITI has now run seven editions. Its scope has widened from operations to the entire IT field, and it recognizes enterprises, practitioners, teams, and products across the industry.

The 2023 selection aims to encourage continuous technical innovation, motivate practitioners to keep exploring, and inspire the industry to stay true to its original mission while driving technology forward.

Among the finalists, Beijing Mobile’s SRE Smart Operations team was honored as the "Telecom Industry IT Operations Leading Team".

Beijing Mobile SRE Smart Operations Team Overview

The team has adopted SRE principles, shifting from traditional development and operations work toward system optimization, tool development, and supporting mechanisms that together form a complete SRE working system. It fosters a growth‑oriented culture: members believe that skills and perspectives can always improve, use tooling to solve operational problems, and keep a learning mindset.

Guided by the core SRE idea of simplifying operations (eliminating routine tasks and unnecessary system complexity), the team builds automated, intelligent operations tools that cut repetitive work and raise efficiency. It uses capacity scaling and reliability assessments to optimize production systems and reduce complexity, and it has established mechanisms for participation, on‑call rotation, incident response, post‑mortem analysis, and change management to improve reliability and maintainability.

With the centralization and decoupling of business production systems and widespread cloud‑native container adoption, operational complexity has risen, increasing the risk of service interruptions that can damage corporate reputation, trigger policy scrutiny, and cause revenue loss. The team’s location in the capital further amplifies the importance of stability.

Focusing on continuous SRE assurance, the team adopts a "four‑in‑one" operational guarantee covering architecture stability, disaster‑recovery reliability, and platform tool support. Using Service Level Objectives (SLOs) as a benchmark, they aim to reduce fault count and duration, driving fault prevention, detection, handling, and continuous improvement.
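To make the SLO benchmark concrete, here is a minimal sketch of how an availability SLO translates into an error budget. The article does not publish the team's actual targets, so the 99.9% target, the 30‑day window, and the names `Slo`, `error_budget_minutes`, and `budget_remaining` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An availability target over a fixed window (hypothetical values)."""
    target: float          # e.g. 0.999 means 99.9% availability
    window_minutes: int    # length of the evaluation window


def error_budget_minutes(slo: Slo) -> float:
    """Total downtime the SLO tolerates within the window."""
    return (1.0 - slo.target) * slo.window_minutes


def budget_remaining(slo: Slo, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo)
    return (budget - downtime_minutes) / budget


# Assumed example: 99.9% over 30 days, with 10.8 minutes of downtime so far.
monthly = Slo(target=0.999, window_minutes=30 * 24 * 60)
print(round(error_budget_minutes(monthly), 1))    # 43.2
print(round(budget_remaining(monthly, 10.8), 2))  # 0.75
```

Tracking the remaining budget ties both goals named above, fewer faults and shorter faults, to a single number: every incident spends budget in proportion to its duration.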

Fault Prevention: A change‑management platform enforces comprehensive change procedures; an automated operations platform provides three‑layer health checks (IaaS, PaaS, SaaS) and rapid emergency support; the CMChaos platform offers extensive chaos experiments and fault scenarios; private‑cloud resource management handles full lifecycle management of resources.
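The three‑layer health check can be pictured as a set of probes grouped by layer and evaluated from the infrastructure upward. The sketch below is only an illustration: the check names and the functions `run_health_checks` and `lowest_failing_layer` are hypothetical and do not describe the team's automated operations platform.

```python
from typing import Callable, Dict, List, Optional, Tuple

Check = Callable[[], bool]
LAYERS = ("IaaS", "PaaS", "SaaS")  # evaluated bottom-up: infrastructure first


def run_health_checks(
    checks: Dict[str, List[Tuple[str, Check]]]
) -> Dict[str, List[str]]:
    """Run every check in every layer; return failing check names by layer."""
    failures: Dict[str, List[str]] = {}
    for layer in LAYERS:
        failed = []
        for name, check in checks.get(layer, []):
            try:
                ok = check()
            except Exception:
                ok = False  # a probe that crashes counts as a failure
            if not ok:
                failed.append(name)
        failures[layer] = failed
    return failures


def lowest_failing_layer(failures: Dict[str, List[str]]) -> Optional[str]:
    """Return the lowest layer with a failure, the likeliest place to start."""
    for layer in LAYERS:
        if failures[layer]:
            return layer
    return None


# Hypothetical probes; real ones would query hosts, middleware, and APIs.
checks = {
    "IaaS": [("disk_space", lambda: True), ("host_ping", lambda: True)],
    "PaaS": [("redis_ping", lambda: False)],
    "SaaS": [("login_api", lambda: False)],
}
failures = run_health_checks(checks)
print(lowest_failing_layer(failures))  # PaaS
```

Reporting the lowest failing layer first reflects a common assumption that a SaaS‑level failure is often a symptom of a problem underneath it.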

Fault Observation: A unified monitoring platform supports custom charts, refresh rates, and data models, aggregating logs, traces, business databases, and system data for visualization. The alarm platform supports rule configuration, hierarchical management, alarm convergence, and escalation, and integrates with the CMDB for full resource coverage and alarm governance.
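Alarm convergence and escalation can be sketched as follows: repeated alarms for the same resource and rule are collapsed into one group, and a group that grows past a threshold is escalated. The article does not describe the alarm platform's real rules, so the 300‑second window, the threshold of three, and all names here are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    resource: str   # e.g. a CMDB resource identifier (hypothetical)
    rule: str       # the alarm rule that fired
    timestamp: int  # seconds since some epoch


def converge(alarms: List[Alarm], window_s: int = 300) -> List[Tuple[Alarm, int]]:
    """Collapse repeats of the same (resource, rule) that arrive within
    window_s of the first alarm in their group.

    Returns (first alarm of the group, number of alarms merged into it).
    """
    groups: List[list] = []
    open_group: Dict[Tuple[str, str], int] = {}  # key -> index into groups
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.resource, alarm.rule)
        idx = open_group.get(key)
        if idx is not None and alarm.timestamp - groups[idx][0].timestamp <= window_s:
            groups[idx][1] += 1
        else:
            open_group[key] = len(groups)
            groups.append([alarm, 1])
    return [(first, count) for first, count in groups]


def level(count: int, escalate_at: int = 3) -> str:
    """Escalate a converged group once it has fired escalate_at times."""
    return "escalated" if count >= escalate_at else "normal"


raw = [
    Alarm("db-01", "cpu_high", 0),
    Alarm("web-02", "http_5xx", 30),
    Alarm("db-01", "cpu_high", 60),
    Alarm("db-01", "cpu_high", 120),
    Alarm("db-01", "cpu_high", 1000),  # outside the window: new group
]
for first, count in converge(raw):
    print(first.resource, first.rule, count, level(count))
```

Five raw alarms become three notifications here, and only the burst on `db-01` is escalated.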

Fault Handling: Incident response uses a unified online dispatch tool that supports reporting, escalation, and collaboration. Root‑cause analysis draws on an end‑to‑end observability platform, using AIM, log centers, and operational data. Fault mitigation follows a "restore first, repair later" principle, relying on restarts, rollbacks, rate limiting, circuit breaking, automated emergency plans, and dual‑center switching.
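Of the mitigation techniques listed, circuit breaking is the easiest to show in a few lines. The following is a generic, minimal circuit breaker, not the team's implementation; the class name, thresholds, and injectable clock are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of piling load onto a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so one more failure reopens it.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # a success closes the circuit fully
        return result
```

A caller wraps each request to a downstream dependency in `breaker.call(...)`. While the circuit is open, requests are rejected immediately, which matches the "restore first, repair later" idea: the service sheds load and recovers before anyone diagnoses the root cause.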

Optimization and Improvement: A tiered post‑mortem mechanism clarifies root causes, responsibility attribution, and corrective actions, linking incident tickets with remediation tickets to close the improvement loop. The "2022‑2024 SRE Implementation Plan" guides ongoing development of SRE capabilities.
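Linking incident tickets to remediation tickets amounts to a simple rule: an incident counts as closed only when its root cause is recorded and every linked remediation is finished. The data model below is a hypothetical sketch of that rule, not the team's ticketing schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RemediationTicket:
    ticket_id: str
    done: bool = False


@dataclass
class IncidentTicket:
    ticket_id: str
    root_cause: str = ""
    remediations: List[RemediationTicket] = field(default_factory=list)

    def loop_closed(self) -> bool:
        """True only when a root cause is recorded, at least one remediation
        is linked, and every linked remediation ticket is done."""
        return (
            bool(self.root_cause)
            and bool(self.remediations)
            and all(r.done for r in self.remediations)
        )


# Hypothetical incident with two follow-up actions, one still open.
incident = IncidentTicket(
    ticket_id="INC-1",
    root_cause="connection pool exhausted",
    remediations=[
        RemediationTicket("REM-1", done=True),
        RemediationTicket("REM-2", done=False),
    ],
)
print(incident.loop_closed())  # False
incident.remediations[1].done = True
print(incident.loop_closed())  # True
```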
