
Designing an Effective DevOps Operations System: Principles and Practices

This article outlines a comprehensive DevOps operations framework, tracing its evolution from traditional ops to modern automation, detailing business standards, work policies, system integration, and best‑practice norms to achieve high SLA, low cost, and a one‑stop operational platform.

Efficient Ops

Oct 19, 2020

Preface

The system is like a hat. What follows is a deep summary of DevOps operations, shared in the hope that some of it inspires you.

Operations work has evolved from its original form, which was manual and rule‑driven, to automated DevOps and now AIOps, each step aiming for higher efficiency, lower error rates, and better‑optimized processes.

Initially, once business functions had been standardized, a set of regulations and guidelines formed a framework for coping with rapid internet growth. That framework has been iterated on continuously, always with a focus on high SLA and low cost.

Tools provide the underlying support; with that foundation in place, goals can be pursued more systematically and efficiently.

1. Define Business Standards

Standardization is essential for efficient work at scale, much as American farms use standardized processes and tools to achieve high yield at low cost.

In DevOps, we manage three main categories: Resources (servers, network devices, load balancers, certificates, domains, code, containers), Services (monitoring, CI/CD, log analysis, incident plans, configuration management), and Standards (processes, resource and service standards).

Key norms include:

Change Management: code release, rollback, and scaling; configuration changes; network changes; and other changes such as traffic routing and service switching.

Principles: establish review processes, notifications, and rollback strategies; roll out in stages (test, then gray release, then full release); clean up dependencies when decommissioning a service.
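The test, gray, full rollout rule can be sketched as a small loop that stops and rolls back on the first failed check. This is only an illustration under stated assumptions: the stage names, the `rollout` function, and the `health_check` callback are hypothetical and do not come from any particular tool.

```python
from typing import Callable, List

# Stages in the order the rollout rule prescribes; the names are illustrative.
STAGES = ["test", "gray", "full"]


def rollout(version: str, health_check: Callable[[str, str], bool]) -> List[str]:
    """Advance a release through each stage, rolling back on the first failure.

    Returns the log of actions taken so the change can be reviewed afterwards.
    """
    log: List[str] = []
    for stage in STAGES:
        log.append(f"deploy {version} to {stage}")
        if not health_check(version, stage):
            # A failed check stops the rollout and triggers the rollback strategy.
            log.append(f"rollback {version} from {stage}")
            return log
    log.append(f"release {version} complete")
    return log
```

A real pipeline would also send notifications and record the reviewer at each step; the point here is only that a later stage is never reached unless the earlier one passed.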

Disaster Recovery: for services, multiple machines and multiple data centers; for data, multiple backups and off‑site copies; for the network, multiple lines and multiple devices.

Principles: prefer automatic over manual switching, stateless over stateful, hot standby over cold, multi‑data‑center over single.
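The preference for automatic switching and hot standby can be sketched as a selection function that picks a replacement without human input. The `Replica` fields and the `pick_failover` name are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    data_center: str
    healthy: bool
    hot: bool  # True for hot standby (already running), False for cold standby


def pick_failover(replicas: List[Replica], failed_dc: str) -> Optional[Replica]:
    """Choose a replacement automatically, preferring hot standbys in another data center."""
    candidates = [r for r in replicas if r.healthy and r.data_center != failed_dc]
    # sorted() is stable, so ties keep their original order; hot standbys sort first.
    candidates = sorted(candidates, key=lambda r: not r.hot)
    return candidates[0] if candidates else None
```

Returning `None` when no candidate exists makes the single‑data‑center case visible: with only one site there is nothing to switch to.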

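Capacity Management: calculate system, module, data‑center, and single‑machine capacities using the bucket principle (overall capacity is limited by the weakest component, just as a bucket holds only as much as its shortest stave allows); set metrics such as QPS, connections, and online users; consider read/write balance and storage growth.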

Principles: define capacity indicators, consider both upstream and downstream, compare usage with capacity, identify bottlenecks.
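The bucket principle reduces to taking the minimum across modules and comparing current usage against it. The sketch below assumes capacity is measured in QPS; the function names and module names are hypothetical.

```python
from typing import Dict, Tuple


def system_capacity(module_qps: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck module and the capacity it imposes on the whole system."""
    bottleneck = min(module_qps, key=module_qps.get)
    return bottleneck, module_qps[bottleneck]


def utilization(current_qps: float, capacity_qps: float) -> float:
    """Compare usage with capacity as a ratio; values above 1 mean overload."""
    return current_qps / capacity_qps
```

With hypothetical capacities of 12000, 8000, and 5000 QPS for a gateway, an application tier, and a database, `system_capacity` returns `("database", 5000)`: the database is the short stave, and adding gateway capacity would not raise the system limit.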

Inspection: monitor core user and service metrics, basic resources, and dependencies; automate inspection reports; arrange on‑call duties.

Principles: converge metrics onto a small number of dashboards, and automate anomaly detection so that problems are caught before they become failures.
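Automated anomaly detection can start very simply. The sketch below flags a metric that sits far from its historical mean; it is one possible technique chosen for illustration, not the method the article prescribes, and the `is_anomalous` name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` when it is more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold
```

An inspection job could run a check like this over each core metric and include only the flagged ones in its report, which keeps the report short enough to be read.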

Alarm: monitor CPU, memory, network, and I/O; processes and ports; logs and business events; and dependencies such as databases and APIs.

Principles: consolidate alerts, grade them by severity, state the impact of each, build dashboards for troubleshooting, and prioritize real‑time fault detection.
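Consolidating and grading alerts can be sketched as grouping by service and keeping the most severe grade. The `Alert` fields and the scale (1 is most severe) are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    service: str
    check: str      # e.g. "cpu", "port", "db_latency"
    severity: int   # 1 = most severe; the grading scale is an assumption


def consolidate(alerts: List[Alert]) -> List[Tuple[str, int, List[str]]]:
    """Merge alerts per service and report the most severe grade plus the distinct checks."""
    grouped: Dict[str, List[Alert]] = defaultdict(list)
    for alert in alerts:
        grouped[alert.service].append(alert)
    summary = []
    for service, items in grouped.items():
        worst = min(a.severity for a in items)
        checks = sorted({a.check for a in items})
        summary.append((service, worst, checks))
    # Most severe services first so on-call staff see them at the top.
    return sorted(summary, key=lambda row: (row[1], row[0]))
```

Four raw alerts for two services collapse into two rows, each carrying its worst grade, which is the kind of reduction that keeps an on‑call engineer from being flooded during an incident.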

Plan: carrier line switching (China Mobile, China Telecom, China Unicom), data‑center switching, machine removal, service degradation, DB master‑slave and read‑write switching, and network primary‑backup switching.

Principles: prefer switching by domain name over changing IPs, automate machine removal and switching, and consider avalanche effects before shifting load.
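Considering avalanche effects before an automated removal can be sketched as a capacity check on the machines that remain. The function name, the QPS unit, and the 80% headroom default are assumptions for illustration.

```python
def safe_to_remove(total_qps: float, machines: int, per_machine_capacity: float,
                   remove: int = 1, headroom: float = 0.8) -> bool:
    """Return True if the remaining machines can absorb the traffic within `headroom`.

    Removing a machine shifts its load onto the others; if that pushes them past
    capacity they can fail in turn, which is the avalanche effect.
    """
    remaining = machines - remove
    if remaining <= 0:
        return False
    load_per_machine = total_qps / remaining
    return load_per_machine <= per_machine_capacity * headroom
```

For example, at 3000 QPS with machines rated at 1000 QPS each, removing one of five leaves 750 QPS per machine and passes, while removing one of four leaves 1000 QPS per machine and is refused.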

Fault Management: service impact grading, fault level definition, notification and handling procedures, post‑mortem and improvement tracking.

Principle: embrace faults and prevent recurrence.

Permission & Security: manage development, operations, and temporary permissions; comply with security audit standards.

Documentation & Tools: unified knowledge sharing, shared scripts and tools.

Principle: aim for a “one‑stop operations platform” covering all tool operations.

Standardization: host naming, log storage and format, domain usage, software installation paths.

Principles: hostnames should convey service, module, data‑center info; logs must be standardized for manual and automated analysis.
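A hostname that conveys service, module, and data‑center information can be validated and parsed mechanically. The naming pattern below (`<service>-<module>-<datacenter>-<index>`) is a hypothetical convention invented for this sketch; the article does not specify one.

```python
import re
from typing import Dict, Optional

# Hypothetical convention: <service>-<module>-<datacenter>-<index>, e.g. "pay-api-bj01-003".
HOSTNAME_RE = re.compile(
    r"^(?P<service>[a-z0-9]+)-(?P<module>[a-z0-9]+)-(?P<data_center>[a-z0-9]+)-(?P<index>\d+)$"
)


def parse_hostname(hostname: str) -> Optional[Dict[str, str]]:
    """Split a hostname into its fields, or return None if it breaks the convention."""
    match = HOSTNAME_RE.match(hostname)
    return match.groupdict() if match else None
```

Because non‑conforming names return `None`, the same function can double as a check at provisioning time, so that the standard is enforced when a host is created rather than discovered to be broken later.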

Resource Management: servers, VIPs, domains, certificates, code.

Principle: manage resources together with the relationships between them, so that a change to one resource shows which others it affects.
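Managing resources relationally can be sketched as a small dependency graph that answers the question "what is affected if this resource changes?". The resource identifiers and sample data are hypothetical.

```python
from typing import Dict, List, Set

# Each resource lists the resources that depend on it (hypothetical sample data).
DEPENDENTS: Dict[str, List[str]] = {
    "server:10.0.0.1": ["vip:10.0.1.1"],
    "server:10.0.0.2": ["vip:10.0.1.1"],
    "vip:10.0.1.1": ["domain:pay.example.com"],
    "cert:pay.example.com": ["domain:pay.example.com"],
}


def impacted(resource: str, dependents: Dict[str, List[str]] = DEPENDENTS) -> Set[str]:
    """Walk the dependency graph to find everything affected when `resource` changes."""
    seen: Set[str] = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

A flat inventory would list the server, the VIP, and the domain as three unrelated rows; storing the edges is what lets a change review see that taking down one server touches a public domain.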

These are common business norms; many more are defined in response to actual problems. Together they represent best practices that are crucial to building a DevOps system.

2. Build Work Policies

Policies shape workflows and culture; good policies are systematic, tool‑enabled, executable, and quantifiable, enabling DevOps to enforce them technically.

Policies should solve a class of problems, not just single cases, and must be enforced through technology rather than relying solely on personal discipline.

Release approval policy

Compliance deployment policy

Log cleaning policy

Capacity management policy

On‑call management policy

Service inspection policy

Fault management policy

Security management policy

…

Effective policies reflect a long‑term vision, scientific attitude, and DevOps mindset.
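Enforcing a policy through technology rather than personal discipline can be sketched with the release approval policy as an example: the release tool itself refuses to proceed while any rule is unmet. The `ReleaseRequest` fields and the specific rules are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseRequest:
    service: str
    approvers: List[str] = field(default_factory=list)
    has_rollback_plan: bool = False
    passed_gray_stage: bool = False


def policy_violations(request: ReleaseRequest, min_approvers: int = 1) -> List[str]:
    """Return every rule the request breaks; an empty list means the release may proceed."""
    violations = []
    # Duplicate names count once, so one person cannot approve twice.
    if len(set(request.approvers)) < min_approvers:
        violations.append("missing approval")
    if not request.has_rollback_plan:
        violations.append("missing rollback plan")
    if not request.passed_gray_stage:
        violations.append("gray release not completed")
    return violations
```

Returning the full list of violations, instead of stopping at the first, also makes the policy quantifiable: the counts can be tracked over time to see which rules are broken most often.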

3. Build the DevOps System

Implement the previous concepts with technology, aiming for a “one‑stop operations” platform where engineers need not switch systems.

While many single‑purpose tools (e.g., Zabbix, Jenkins) exist, they often require jumping between multiple systems. Integrating them as modular “wheels” under a unified account and permission system creates a two‑layer architecture: a bottom layer of specialized tools and an upper application layer (SRE‑oriented) that manages resources, standards, and services.

Leverage open‑source components, avoid reinventing the wheel, and expose their APIs to build an elegant, simple user experience.
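The two‑layer architecture can be sketched as a thin upper layer that owns accounts and permissions and delegates the actual work to registered tools. The `OpsPlatform` class and its methods are hypothetical, and the handlers are stubs standing in for calls to each tool's real API.

```python
from typing import Callable, Dict, Set


class OpsPlatform:
    """Upper layer: one entry point, one permission model, many underlying tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}
        self._grants: Dict[str, Set[str]] = {}

    def register_tool(self, name: str, handler: Callable[[str], str]) -> None:
        # In practice the handler would call the tool's own API; here it is a stub.
        self._tools[name] = handler

    def grant(self, user: str, tool: str) -> None:
        self._grants.setdefault(user, set()).add(tool)

    def run(self, user: str, tool: str, action: str) -> str:
        if tool not in self._tools:
            raise KeyError(f"unknown tool: {tool}")
        if tool not in self._grants.get(user, set()):
            raise PermissionError(f"{user} may not use {tool}")
        return self._tools[tool](action)


platform = OpsPlatform()
platform.register_tool("ci", lambda action: f"ci handled {action}")
platform.grant("alice", "ci")
result = platform.run("alice", "ci", "build")
```

The bottom‑layer tools stay replaceable because the upper layer only knows them by name and handler; swapping one monitoring or CI tool for another changes a registration, not the engineer's workflow.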

Hope this inspires your practice.
