Unlocking the Core Value of Operations: How Risk Control Drives DevOps Success
The article explores the fundamental role of operations engineers in software lifecycle risk control, comparing their work to automotive maintenance, categorizing risks, and outlining strategies such as automation, standardization, and DevOps practices to enhance stability, efficiency, and long‑term value.
Preface
Three months ago I changed jobs and joined the application operations team of a large financial group. Moving from a startup‑style internet finance company to a traditional, large‑scale financial institution revealed a huge difference in work culture and practices.
Concepts that dominate internet companies—micro‑services, DevOps, continuous integration and delivery, containerized deployment—are almost never used in this environment, and are rarely even mentioned.
This dramatic shift in work style prompted me to reflect on the core value of the operations engineer role. I believe that understanding the core value of a job is essential for evaluating whether one fits the role and can sustain a long‑term career.
Core Value of Operations
To clarify the core value of operations, we must first understand where operations fits in the software industry. Software production shares many fundamentals with production in traditional industry.
If we compare the software lifecycle to a car’s lifecycle, developers are the designers and manufacturers who turn requirements into a tangible product, focusing mainly on the early stages. Operations engineers, however, intervene in the middle and later stages—akin to drivers and mechanics—while users are the passengers.
Operations engineers must use every technical and managerial tool available to keep the "car" running safely, stably, efficiently, and comfortably, delivering value to the company. In other words, development delivers a tangible product; operations deliver an intangible service that eliminates and controls runtime risks, much like risk‑control in finance.
Today, operations work includes hardware procurement, data‑center construction, network planning, system installation, application deployment, middleware maintenance, monitoring, and automation. Many view it as repetitive, low‑value work, but when understood as risk‑control, the heavy workload makes sense.
Developers build the car; operations engineers drive it toward its destination while avoiding hazards, and carry out the regular inspections and maintenance that keep it running smoothly.
“Life is full of unpredictable risks; only by fully recognizing them can we control them.”
Risk Classification
Software faces many types of risk, which can be classified by stage, layer, impact, or cause. This article focuses on the causes of risk.
Causes of Risk
Inherent (Design‑Time) Risks
These arise from systemic flaws introduced during the design phase, whether in business logic, frameworks, middleware, and databases or in hardware, OS, and network architecture. Developers often ignore the operational environment, leading to hidden risks that surface only in production.
Such risks are hard to detect in simple functional tests because they only show up under the conditions of a real production environment. Operations is the nearest "fire hydrant" to these risks, yet many operations engineers lack design experience, which makes mitigation difficult.
Change‑Induced Risks
As former Google SRE Sun Yucong noted, development’s goal is change, while operations’ goal is stability; change itself is a form of disruption. Version releases, hardware upgrades, OS updates, framework changes, code changes, middleware updates, network re‑architectures, and configuration modifications all introduce risk.
Controlling this risk means reducing the frequency, scope, and magnitude of changes. Operations engineers should be involved early—during requirement analysis and design—to have a say in risky proposals and to suggest mitigations.
Human Risks
Traditional operations rely heavily on manual, human‑driven tasks, which are unpredictable due to varying knowledge, fatigue, emotions, and personality. While policies can mitigate some issues, over‑restrictive rules stifle creativity and value.
The ideal goal is to hand repetitive tasks to machines, allowing engineers to focus on higher‑value, creative work.
Systemic Design Defects
These stem from insufficient consideration of network and application architecture, leading to risks that only appear in specific scenarios.
Procedural Risks
Unclear or weak processes create responsibility gaps, making rapid response and standardized handling difficult.
Risk‑Control Measures
The essence of risk control is to eliminate root causes during design and to limit unavoidable factors through technical and managerial means.
Reduce Human Intervention
Standardize and automate daily operations.
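One way to picture a standardized, automated operation is an idempotent task: running it once or ten times leaves the system in the same state, so it is safe to hand to a scheduler instead of a person. The sketch below is only an illustration under that assumption; `ensure_line` and the file it edits are hypothetical names, not part of any specific tool.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the text file at `path`.

    Returns True if the file had to be changed, False if it already
    contained the line. Re-running the function is therefore harmless.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Because the function reports whether it changed anything, a wrapper script can log real changes separately from no-op runs, which keeps the audit trail readable.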
Reduce Disruption
Minimize the amount and frequency of changes, especially during version releases.
Standardization, Automation, and Monitoring
Adopt micro‑service architectures that “split” large applications into smaller, more manageable units, thereby reducing change impact and improving resource utilization.
Containerization (e.g., Docker) isolates resources and packages dependencies, enabling consistent deployment across environments and dramatically lowering environment‑related risk.
Configuration centers decouple applications from environments, allowing centralized management of configuration items and reducing manual errors during large releases.
Implement the “three‑rights” separation: operation rights, allocation rights, and audit rights, assigning them to different individuals to provide mutual checks.
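As a rough illustration of the three-rights separation, a change ticket can be rejected automatically when one person holds more than one of the three rights. The `ChangeTicket` structure and its field names are assumptions made for this sketch, not a description of any particular ticketing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeTicket:
    operator: str   # person who executes the change (operation right)
    allocator: str  # person who grants the permission (allocation right)
    auditor: str    # person who reviews the record afterwards (audit right)


def violates_separation(ticket: ChangeTicket) -> bool:
    """Return True if any two of the three rights are held by the same person."""
    holders = {ticket.operator, ticket.allocator, ticket.auditor}
    return len(holders) < 3
```

For example, `violates_separation(ChangeTicket("li", "li", "wang"))` returns True, because the operator also granted their own permission.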
Apply “divide‑and‑conquer” by assigning responsibility for specific business lines, machines, clusters, or middleware to distinct teams, reducing the impact of a single failure.
Limit external entry points (e.g., bastion hosts, API gateways) to improve traffic control and security auditing.
Balance traffic flow rather than merely reducing total volume, preventing spikes that exceed system capacity.
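A token bucket is one common technique for smoothing traffic in this way: short bursts are allowed up to a fixed capacity, while the sustained rate stays bounded. The sketch below assumes a single process and illustrative parameter names (`rate`, `capacity`); the article itself does not prescribe a specific algorithm.

```python
import time


class TokenBucket:
    """Admit requests at a sustained `rate` per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the behaviour can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller may queue, delay, or degrade the request
```

What happens to a rejected request (queueing, delaying, or serving a degraded response) is a separate design decision; the bucket only decides whether the request fits within the current budget.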
Reduce external dependencies by establishing reliable collaboration mechanisms with other teams or third‑party services.
Build comprehensive monitoring that not only collects data but also defines actionable thresholds and response plans.
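To make "actionable thresholds and response plans" concrete, each monitored metric can carry its own warning and critical levels plus the action to take when the critical level is reached. The following is a minimal sketch; the `Rule` fields, the metric name, and the runbook text are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str      # name of the collected metric
    warn: float      # value at which someone should take a look
    critical: float  # value at which the response plan must start
    runbook: str     # what to do when the critical level is reached


def evaluate(rule: Rule, value: float) -> str:
    """Map a raw measurement to a status line that states the next action."""
    if value >= rule.critical:
        return f"CRITICAL {rule.metric}={value}: {rule.runbook}"
    if value >= rule.warn:
        return f"WARN {rule.metric}={value}"
    return "OK"


disk_rule = Rule("disk_used_pct", warn=80, critical=90, runbook="expand volume or purge old logs")
```

Calling `evaluate(disk_rule, 95)` yields a CRITICAL line that already names the response, so the alert is tied to an action rather than to a bare number.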
Continuously improve performance through capacity planning, hardware upgrades, and eliminating bottlenecks.
Accelerate incident response with fast scaling, graceful degradation, rollback, and automated remediation.
Maintain thorough audit logs and conduct regular audits to close gaps and refine processes.
Implement robust backup strategies (e.g., multi‑site, multi‑center) and regularly test restoration procedures.
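A restoration test can be as simple as restoring a backup into a scratch location and comparing its checksum with the one recorded when the backup was taken. The sketch below assumes file-based backups; `restore_and_verify` is a hypothetical helper, and the file copy merely stands in for whatever real restore procedure is in use.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup: Path, scratch_dir: Path, expected_sha256: str) -> bool:
    """Restore `backup` into `scratch_dir` and check it against the recorded checksum."""
    scratch_dir.mkdir(parents=True, exist_ok=True)
    restored = scratch_dir / backup.name
    shutil.copy2(backup, restored)  # stand-in for the real restore step
    return sha256_of(restored) == expected_sha256
```

A checksum match only shows that the bytes survived; a fuller drill would also load the restored data into the application or database it belongs to.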
Strengthen policies by focusing on clarity and enforceability rather than sheer quantity.
DevOps and Risk Control
DevOps is not merely “operations development”; it covers the automation of requirements management, development, build, testing, and delivery. Its goal is to break down the invisible wall between development and operations, improving efficiency and software quality.
Small, agile teams can achieve full‑process automation without a dedicated operations‑development team by fostering a culture of collaboration and shared responsibility.
Automation’s Role in Risk Control
Automation reduces uncontrolled human actions, cuts material, labor, and time costs, and provides a foundation for standardization, a unified CMDB, and modular, component‑based processes.
Practicing DevOps
By abstracting and optimizing risk‑control processes, we can productize them, turning risk management into a service that delivers measurable value.
Engineers, regardless of role, should get involved early in the software lifecycle so that risk is controlled before it reaches the later stages.
While ITIL emphasizes strict process control and role separation, DevOps focuses on rapid, continuous improvement. Both aim to eliminate risk, but DevOps is better suited for fast‑changing internet environments, whereas ITIL fits stable traditional industries.
Operations leaders must maintain a global view of risk, plan long‑term control systems, nurture talent, and shift the focus from endless 24/7 firefighting to achieving a truly risk‑free operation.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.