Designing an Effective DevOps Operations System: Principles and Practices
This article outlines a comprehensive DevOps operations framework, tracing its evolution from traditional ops to modern automation, detailing business standards, work policies, system integration, and best‑practice norms to achieve high SLA, low cost, and a one‑stop operational platform.
Preface
"System" is used here as an umbrella term. This article is a condensed summary of DevOps operations practice, shared in the hope that it gives you some ideas.
Operations has evolved from traditional ops, which was manual and rule-driven, to automated DevOps and now AIOps, with the aim of higher efficiency, lower error rates, and optimized processes.
Initially, once business functions had been standardized, a set of regulations and guidelines grew into a framework for coping with rapid internet growth. That framework has been iterated on continuously, always with two goals in view: high SLA and low cost.
Tools provide the underlying support; on that foundation, these goals can be pursued more systematically and efficiently.
1. Define Business Standards
Standardization is essential for batch, high‑efficiency work, similar to how American farms use standardized processes and tools to achieve high yield with low cost.
In DevOps, we manage three main categories: Resources (servers, network devices, load balancers, certificates, domains, code, containers), Services (monitoring, CI/CD, log analysis, incident plans, configuration management), and Standards (processes, resource and service standards).
Key norms include:
Change Management: code release, rollback, and scaling; configuration changes; network changes; and other changes such as traffic routing and service switching.
Principles: establish review processes, notifications, and rollback strategies; roll out in stages (test, then gray release, then full release); clean up dependencies when decommissioning.
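The staged rollout rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stage names mirror the test, gray, full order from the text, while `deploy`, `healthy`, and `rollback` are hypothetical callbacks that a real platform would wire to its own release tooling.

```python
from typing import Callable, Sequence

# Stage order taken from the test -> gray -> full rule in the text.
STAGES = ("test", "gray", "full")


def staged_rollout(
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[str] = STAGES,
) -> bool:
    """Deploy stage by stage; if a health check fails, roll back
    every stage deployed so far, most recent first."""
    done = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not healthy(stage):
            for finished in reversed(done):
                rollback(finished)
            return False
    return True
```

The function returns `True` only when every stage passed its health check, so a caller can gate notifications or approvals on the result.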
Disaster Recovery: multi-machine, multi-data-center for services; multi-backup, off-site for data; multi-line, multi-device for network.
Principles: prefer automatic over manual switching, stateless over stateful, hot standby over cold, multi‑data‑center over single.
Capacity Management: calculate system, module, data-center, and single-machine capacities using the bucket principle (the weakest component caps the whole); set metrics such as QPS, connections, and online users; consider read/write balance and storage growth.
Principles: define capacity indicators, consider both upstream and downstream, compare usage with capacity, identify bottlenecks.
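As a rough sketch of the bucket principle and the usage-versus-capacity comparison, assume a simple serial chain in which every request passes through each module once. The `Module` type, the QPS figures, and the 0.7 threshold are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    capacity_qps: float  # measured or estimated maximum QPS
    current_qps: float   # observed peak QPS


def bottleneck(modules):
    """Bucket principle: the lowest-capacity module caps the whole chain."""
    return min(modules, key=lambda m: m.capacity_qps)


def utilization(module):
    """Fraction of a module's capacity that is currently in use."""
    return module.current_qps / module.capacity_qps


def over_threshold(modules, threshold=0.7):
    """Names of modules whose usage has reached the given share of capacity."""
    return [m.name for m in modules if utilization(m) >= threshold]
```

Note that the module with the lowest absolute capacity and the module closest to saturation can differ, which is why both checks are kept separate.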
Inspection: monitor core user and service metrics, basic resources, and dependencies; automate inspection reports; arrange on-call duties.
Principles: converge metrics onto a small number of dashboards, and automate anomaly detection so that problems are caught before they become failures.
Alerting: monitor CPU, memory, network, and I/O; processes and ports; logs and business events; and dependencies such as databases and APIs.
Principles: consolidate alerts, grade them, note impact, build dashboards for troubleshooting, value real‑time fault detection.
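Consolidating and grading alerts might look like the following sketch. The three severity grades, the `Alert` fields, and the grouping key `(service, check)` are assumptions made for the example, not a scheme prescribed by the article.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical grading scheme; a higher number means more severe.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str     # e.g. "cpu", "port", "db_latency"
    severity: str  # one of the keys in SEVERITY_ORDER
    host: str


def consolidate(alerts):
    """Merge alerts sharing (service, check) into one summary each,
    keep the worst severity, and list the most severe summaries first."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)

    summaries = []
    for (service, check), items in groups.items():
        worst = max(items, key=lambda a: SEVERITY_ORDER[a.severity])
        summaries.append({
            "service": service,
            "check": check,
            "severity": worst.severity,
            "hosts": sorted({a.host for a in items}),  # impact scope
        })
    summaries.sort(key=lambda s: SEVERITY_ORDER[s["severity"]], reverse=True)
    return summaries
```

Each summary carries the affected hosts, which is one simple way to note impact alongside the grade.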
Contingency Plans: carrier line switching (China Mobile, China Telecom, China Unicom), data-center switching, machine removal, service degradation, DB master-slave and read-write switching, network primary-backup switching.
Principles: prefer switching domains over changing IPs, automate machine removal and switching, and account for avalanche (cascading failure) effects.
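One way to account for avalanche effects before an automated machine removal is to check whether the remaining hosts can absorb the current load. The sketch below assumes identical hosts and an evenly balanced load; the function name, its parameters, and the 0.8 headroom factor are hypothetical.

```python
def safe_to_remove(hosts_in_service, per_host_capacity_qps, current_qps, headroom=0.8):
    """Return True if, after removing one host, the current load still
    fits within `headroom` times the capacity of the remaining hosts."""
    remaining = hosts_in_service - 1
    if remaining <= 0:
        # Removing the last host would take the service down entirely.
        return False
    return current_qps <= headroom * remaining * per_host_capacity_qps
```

An automated removal step could call this first and fall back to scaling out or degrading the service when it returns `False`.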
Fault Management: service impact grading, fault level definition, notification and handling procedures, post-mortem and improvement tracking.
Principle: embrace faults and prevent recurrence.
Permission & Security: development, operations, and temporary permissions; comply with security audit standards.
Documentation & Tools: unified knowledge sharing, shared scripts and tools.
Principle: aim for a “one‑stop operations platform” covering all tool operations.
Standardization: host naming, log storage and format, domain usage, software installation paths.
Principles: hostnames should convey service, module, data‑center info; logs must be standardized for manual and automated analysis.
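A hostname convention of this kind can be validated and parsed mechanically. The format used below, `<service>-<module>-<datacenter>-<index>` (for example `pay-api-bj01-003`), is a made-up convention for illustration; the article does not prescribe a specific pattern.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical convention: <service>-<module>-<datacenter>-<3-digit index>
HOSTNAME_RE = re.compile(
    r"(?P<service>[a-z][a-z0-9]*)-"
    r"(?P<module>[a-z][a-z0-9]*)-"
    r"(?P<datacenter>[a-z]+\d+)-"
    r"(?P<index>\d{3})"
)


class HostInfo(NamedTuple):
    service: str
    module: str
    datacenter: str
    index: int


def parse_hostname(hostname: str) -> Optional[HostInfo]:
    """Return the parts encoded in a hostname, or None if it breaks the convention."""
    match = HOSTNAME_RE.fullmatch(hostname)
    if match is None:
        return None
    return HostInfo(
        match["service"],
        match["module"],
        match["datacenter"],
        int(match["index"]),
    )
```

A check like this could run when a host is registered, so that non-conforming names are rejected before they reach monitoring or deployment tooling.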
Resource Management: servers, VIPs, domains, certificates, code.
Principle: manage resources with awareness of how they relate to one another (for example, which domains, certificates, and VIPs depend on which servers).
These are common business norms; many more are defined based on actual problems, and they represent best practices crucial to DevOps construction.
2. Build Work Policies
Policies shape workflows and culture; good policies are systematic, tool‑enabled, executable, and quantifiable, enabling DevOps to enforce them technically.
Policies should solve a class of problems, not just single cases, and must be enforced through technology rather than relying solely on personal discipline.
Release approval policy
Compliance deployment policy
Log cleaning policy
Capacity management policy
On‑call management policy
Service inspection policy
Fault management policy
Security management policy
…
Effective policies reflect a long‑term vision, scientific attitude, and DevOps mindset.
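To make "enforced through technology" and "quantifiable" concrete, here is a minimal sketch of a log cleaning policy expressed as code. The `LogFile` type and the 7-day default retention are assumptions for the example; the function only selects candidates and leaves the actual deletion to the caller.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class LogFile:
    path: str
    mtime: float  # last-modified time, in seconds since the epoch


def expired_logs(files, now, retention_days=7):
    """Return the paths of log files last modified before the retention cutoff."""
    cutoff = now - retention_days * SECONDS_PER_DAY
    return [f.path for f in files if f.mtime < cutoff]
```

Because the retention period is a parameter, the same rule covers a whole class of services rather than a single case, and its effect can be measured and audited.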
3. Build the DevOps System
Implement the previous concepts with technology, aiming for a “one‑stop operations” platform where engineers need not switch systems.
While many single‑purpose tools (e.g., Zabbix, Jenkins) exist, they often require jumping between multiple systems. Integrating them as modular “wheels” under a unified account and permission system creates a two‑layer architecture: a bottom layer of specialized tools and an upper application layer (SRE‑oriented) that manages resources, standards, and services.
Leverage open‑source components, avoid reinventing the wheel, and expose their APIs to build an elegant, simple user experience.
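The two-layer idea can be sketched as a thin application layer that routes every request through one permission check before delegating to a tool adapter. Everything here is hypothetical: `ToolAdapter`, `OpsPlatform`, and the two fake tools are stand-ins, and no real Zabbix or Jenkins API is called.

```python
from typing import Dict, Protocol, Set


class ToolAdapter(Protocol):
    """Bottom layer: each specialized tool is wrapped behind the same interface."""

    def run(self, action: str, target: str) -> str: ...


class FakeMonitoring:
    # Stand-in for a monitoring tool; a real adapter would call that tool's API.
    def run(self, action: str, target: str) -> str:
        return f"monitoring:{action}:{target}"


class FakeCI:
    # Stand-in for a CI/CD tool.
    def run(self, action: str, target: str) -> str:
        return f"ci:{action}:{target}"


class OpsPlatform:
    """Upper layer: one entry point with a unified account and permission check."""

    def __init__(self, tools: Dict[str, ToolAdapter], permissions: Dict[str, Set[str]]):
        self._tools = tools
        self._permissions = permissions  # user -> names of tools they may use

    def execute(self, user: str, tool: str, action: str, target: str) -> str:
        if tool not in self._permissions.get(user, set()):
            raise PermissionError(f"{user} may not use {tool}")
        return self._tools[tool].run(action, target)
```

Adding another tool then means writing one more adapter and registering it, without changing how engineers log in or how permissions are checked.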
Hope this inspires your practice.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
