How to Build a Next‑Gen “Big Operations” System for Reliability and Observability
This article outlines the evolution from manual operations to DevOps and SRE‑driven “big operations,” detailing system reliability and continuity practices, observability concepts, and the development of AIOps maturity standards, offering a comprehensive guide for building stable, efficient, and secure operational frameworks.
Constructing a New “Big Operations” System
“Big Operations” refers to a stage in the evolution of operations where the focus shifts from manual, developer‑assisted tasks to automated, collaborative DevOps practices and eventually to SRE‑driven large‑scale operations that emphasize system stability, maintainability, and scalability.
In the new era, the big‑operations system must be built jointly across the five lifecycle stages of requirements, design, development, testing, and operations, targeting four goals: stability, efficiency, precision, and security. These goals are supported by five capabilities: operational objective management, organizational management, team management, service capability, and tool capability.
The four goals map to four engineering practices: stability assurance, efficient operations, refined operations, and security operations, each realized through shared responsibility across the entire lifecycle.
System Reliability and Continuity Practice
Recent high‑profile outages (e.g., at the Federal Reserve, Twitter, Fastly, and Bilibili) highlight the growing importance of system stability. In 2021, China’s “Regulations on the Security Protection of Critical Information Infrastructure” required operators to ensure the stable and secure operation of critical infrastructure.
The China Academy of Information and Communications Technology (CAICT) has published the “Research and Development Operations System Reliability and Continuity Engineering (SRE)” standard, which divides reliability work into two parts: the development process and technical operations.
Reliability and Continuity in the Development Process
Design and Development : stability admission criteria and architecture design reviews, with capacity planning and performance metrics considered from the production evaluation stage onward.
Quality Assurance : testing (unit, integration, functional) and code review to ensure code stability.
Deployment and Release : release strategy, quality and change management, automation tools, and deployment frequency to reduce manual intervention and improve success rates.
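Deployment frequency and release success rate are easy to track once each release is recorded. The following is a minimal sketch, not part of the CAICT standard: the `Release` record, its fields, and both function names are hypothetical, and a release is assumed to count as failed if it was rolled back or caused an incident.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Release:
    """Hypothetical record of one production release."""
    day: date
    succeeded: bool  # assumed False if rolled back or incident-causing


def deployment_frequency(releases, window_days):
    """Average number of releases per day over the observation window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(releases) / window_days


def change_success_rate(releases):
    """Fraction of releases that succeeded, or None if there were none."""
    if not releases:
        return None
    return sum(r.succeeded for r in releases) / len(releases)


# Illustrative data only: four releases in one week, one of them failed.
releases = [
    Release(date(2023, 1, 2), True),
    Release(date(2023, 1, 3), True),
    Release(date(2023, 1, 5), False),
    Release(date(2023, 1, 6), True),
]

print(deployment_frequency(releases, window_days=7))
print(change_success_rate(releases))  # 0.75
```

Tracking both numbers together matters: a rising deployment frequency is only an improvement if the success rate does not fall with it.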
Reliability and Continuity in Technical Operations
Fault Prevention : health checks, capacity planning, and chaos engineering, leveraging FinOps principles for proactive resource sizing.
Fault Observation : observability of metrics, logs, and traces, with intelligent operations for alarm convergence and storm control.
Fault Handling : rapid detection, response, localization, and mitigation of incidents.
Optimization and Improvement : post‑incident reviews and steady‑state operational enhancements.
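Alarm convergence and storm control, mentioned under Fault Observation, can be illustrated with a small sketch. It assumes one simple strategy among many: alerts with the same (service, check) fingerprint that arrive within a fixed time window are merged into one group, and only the largest groups are notified when too many fire at once. The `Alert` fields, the 300-second window, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical raw alert as emitted by a monitoring system."""
    timestamp: float  # seconds since some reference time
    service: str
    check: str


def converge(alerts, window_seconds=300.0):
    """Merge alerts sharing a (service, check) fingerprint.

    An alert joins the open group for its fingerprint if it arrives within
    window_seconds of that group's first alert; otherwise it opens a new group.
    """
    groups = []
    open_groups = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        group = open_groups.get(key)
        if group is None or alert.timestamp - group["first_seen"] > window_seconds:
            group = {"fingerprint": key, "first_seen": alert.timestamp, "count": 0}
            open_groups[key] = group
            groups.append(group)
        group["count"] += 1
        group["last_seen"] = alert.timestamp
    return groups


def storm_control(groups, max_notifications):
    """Notify only the largest groups; return them plus the suppressed count."""
    ranked = sorted(groups, key=lambda g: g["count"], reverse=True)
    return ranked[:max_notifications], max(0, len(ranked) - max_notifications)


# Illustrative data only.
alerts = [
    Alert(0, "api", "latency"),
    Alert(30, "api", "latency"),
    Alert(60, "api", "latency"),
    Alert(90, "db", "disk"),
    Alert(400, "api", "latency"),  # outside the first window, so a new group
]

groups = converge(alerts)
to_notify, suppressed = storm_control(groups, max_notifications=2)
print(len(alerts), "alerts ->", len(groups), "groups;", suppressed, "suppressed")
```

Production systems usually add richer fingerprints (labels, topology, root-cause hints), but the principle is the same: responders see a handful of grouped incidents rather than every raw alert.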
Observability Capability Practice
Observability means that a system’s internal state, behavior, and performance can be reliably monitored, analyzed, and traced, enabling quick problem discovery and resolution to improve availability and stability.
Timeline: 2016 – Google’s SRE book highlighted observability as a means of rapid fault elimination; 2017 – Peter Bourgon outlined the roles of metrics, tracing, and logging; 2019 – the CNCF’s OpenTelemetry project made rapid deployment of observability feasible.
Gartner predicts that by 2024, 30% of enterprises will use observability technologies to boost digital business operations, a three‑fold increase from 2020.
Observability goes beyond monitoring: monitoring reacts to known events, while observability reveals the underlying reasons for those events.
AIOps builds on observability by applying intelligent analytics to operational data, enhancing automated decision‑making and overall operational efficiency.
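As a very small illustration of analytics applied to operational data, the sketch below flags anomalous points in a metric series with a rolling z-score. This is only one basic technique, chosen here for brevity, and is not a description of any particular AIOps product; the function name, window size, and threshold are assumptions.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window.

    Each point is compared with the mean and population standard deviation
    of the `window` points before it. If that history is perfectly flat,
    any different value is treated as anomalous.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative latency samples in milliseconds, with one spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
print(zscore_anomalies(latency_ms))  # [6]
```

Note that the spike inflates the standard deviation of the windows that follow it, so the detector is briefly less sensitive afterward. More robust statistics or learned models are the usual next step.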
Observability Practice Path
Data source monitoring – covering applications, middleware, network devices, servers, etc.
Data management – collection, storage, transmission, and processing.
Data observation – modeling, multidimensional analysis, and topology mapping.
Observation scenarios – business‑centric views such as infrastructure, container performance, user experience, and business performance.
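The topology mapping step under data observation can be sketched from trace data alone. The example below assumes spans carry a trace ID, a span ID, an optional parent span ID, and the name of the emitting service; the `Span` type and `service_topology` function are hypothetical and do not correspond to any specific tracing library's API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified trace span."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def service_topology(spans):
    """Count caller -> callee edges between services.

    An edge is recorded whenever a span's parent (within the same trace)
    belongs to a different service. Spans whose parent is missing, and
    spans that stay inside one service, contribute no edge.
    """
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = Counter()
    for span in spans:
        if span.parent_id is None:
            continue
        parent = by_id.get((span.trace_id, span.parent_id))
        if parent is None or parent.service == span.service:
            continue
        edges[(parent.service, span.service)] += 1
    return edges


# Illustrative data only: two traces through a gateway.
spans = [
    Span("t1", "a", None, "gateway"),
    Span("t1", "b", "a", "orders"),
    Span("t1", "c", "b", "db"),
    Span("t1", "d", "b", "orders"),  # internal span, no cross-service edge
    Span("t2", "a", None, "gateway"),
    Span("t2", "b", "a", "orders"),
]

topology = service_topology(spans)
for (caller, callee), calls in sorted(topology.items()):
    print(f"{caller} -> {callee}: {calls}")
```

The resulting edge counts can be rendered as a dependency graph, which is often the starting point for the business-centric views listed above.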
Intelligent Operations (AIOps) Maturity Standards
National policies encourage enterprises to adopt intelligent operations. The AIOps industry has expanded across internet, finance, and technology sectors, moving from monitoring and visualization to event prediction and AI‑driven decision support.
Case studies report that AIOps improves efficiency in telecom complaint handling, finance automation, and internet services, with cited gains exceeding 70%, while also reducing operational costs and enhancing user experience.
2022 surveys reveal that while leadership values AIOps, challenges remain in scenario development, data integration, and standardization.
The maturity model defines five levels of capability, guiding enterprises in building comprehensive AIOps functions.
In July 2021, CAICT led work on the first international AIOps standard in ITU‑T SG13 (Study Group 13), promoting global collaboration and the healthy development of the industry.
Future Outlook
The CAICT Distributed System Stability Laboratory will continue to conduct stability assessment projects, providing guidance for reliability and continuity across industries and supporting China’s rapid yet stable digital transformation.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.