
Mastering DevOps: 36 Operational Strategies to Prevent Disasters and Boost Efficiency

This article shares practical DevOps tactics—including disaster‑recovery drills, SET architecture, automated self‑healing, and disciplined change management—to help operations teams reduce errors, improve reliability, and free time for strategic work.

Efficient Ops

Author: Liang Ding'an, Tencent Cloud product lead and DevOps expert with over a decade of operations experience.

Overview

Every company's operations team faces its own challenges, but the daily work is similar: keeping services available 24/7, preparing resources for releases, and responding quickly to incidents. The mission is to keep the business healthy by delivering quality, efficiency, cost control, and security.

Yet the same human errors keep causing major incidents, which is why systematic practices are needed.

DevOps 36 Strategies – Daily Operations (First Strategy)

The goal is to establish correct daily operational rules so teams spend less time fixing avoidable problems.

Drawing on more than ten years of experience, operations work can be divided into planned tasks and unplanned tasks.

Strategy 7: Disaster‑Response Plans Must Include Regular Drills

In 2009 Tencent’s rapid growth outpaced data‑center capacity, leading to the adoption of a multi‑region, multi‑active SET (Service‑Entity‑Template) architecture. SET groups related business modules to limit cross‑IDC traffic and simplify disaster‑relocation.

Key SET characteristics:

Each SET contains ≤50 modules and ≤500 devices.

SET acts as the smallest unit for disaster scheduling.

SET management reduces decision‑making time during emergencies.
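The size limits above lend themselves to an automated check. The following is a minimal sketch, assuming SET inventory is available as simple records; the `SetInfo` type and `set_violations` function are hypothetical names for illustration, not part of any Tencent tooling.

```python
from dataclasses import dataclass

MAX_MODULES = 50   # per-SET module limit stated in the article
MAX_DEVICES = 500  # per-SET device limit stated in the article


@dataclass
class SetInfo:
    """Hypothetical inventory record for one SET."""
    name: str
    modules: int
    devices: int


def set_violations(s: SetInfo) -> list[str]:
    """Return a human-readable message for each limit the SET exceeds."""
    problems = []
    if s.modules > MAX_MODULES:
        problems.append(f"{s.name}: {s.modules} modules exceeds {MAX_MODULES}")
    if s.devices > MAX_DEVICES:
        problems.append(f"{s.name}: {s.devices} devices exceeds {MAX_DEVICES}")
    return problems
```

Running a check like this on every inventory change would flag a SET that has outgrown its role as the smallest scheduling unit before a real emergency exposes it.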

Challenges in SET‑based disaster scheduling include deciding whether to shift, which SET to shift, how to shift given dependencies, how much traffic to shift, and who performs the shift. Repeated drills refine these decisions.
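One of those questions, how to shift given dependencies, can be rehearsed in code. The sketch below is an illustration rather than the author's method: it assumes that a module should be shifted only after the modules it depends on, and it uses Python's standard `graphlib` to derive such an order. The module names are invented.

```python
from graphlib import TopologicalSorter


def shift_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order modules so each one comes after everything it depends on.

    depends_on maps a module to the set of modules it needs.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical three-tier SET: web needs logic, logic needs storage.
deps = {"web": {"logic"}, "logic": {"storage"}, "storage": set()}
print(shift_order(deps))  # ['storage', 'logic', 'web']
```

A drill can compare the order operators actually followed against the computed one, which surfaces undocumented dependencies.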

Strategy 23: Every Incident Hides an Underlying Cause – Find and Eliminate It

Operations work can be split into predictable (planned) and unpredictable (unplanned) tasks. A project to cut the volume of on-call alert calls showed that standardizing configurations and enabling self-healing eliminated many basic alerts (ping failures, agent timeouts, disk issues).

Three phases were applied:

Phase 1 – Configuration Standardization & Self‑Healing: Store key configuration in CMDB and automate safe restarts/replacements.

Phase 2 – Common Rule Extraction: Apply module‑level policies (e.g., disk‑cleanup) to all devices in a cluster once any device shows the issue.

Phase 3 – Correlation Analysis & Root‑Cause Tracing: Use CMDB relationships to aggregate network‑level alerts, reducing noise and enabling faster response.
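The Phase 2 idea, treating the whole cluster once any one device shows an issue, can be sketched as follows. This is a minimal illustration under stated assumptions: the device-to-cluster mapping stands in for a CMDB lookup, and the function name and data shapes are hypothetical.

```python
from collections import defaultdict


def devices_to_remediate(alerts, cluster_of):
    """Expand per-device alerts into cluster-wide remediation targets.

    alerts:     list of (device, issue) tuples
    cluster_of: dict mapping device -> cluster name (assumed CMDB data)
    Returns {issue: sorted list of every device in an affected cluster}.
    """
    members = defaultdict(set)
    for device, cluster in cluster_of.items():
        members[cluster].add(device)

    targets = defaultdict(set)
    for device, issue in alerts:
        cluster = cluster_of.get(device)
        if cluster is None:
            continue  # device missing from the mapping: skip, do not guess
        targets[issue] |= members[cluster]
    return {issue: sorted(devs) for issue, devs in targets.items()}


cluster_of = {"a1": "A", "a2": "A", "b1": "B"}
print(devices_to_remediate([("a1", "disk_full")], cluster_of))
# {'disk_full': ['a1', 'a2']}
```

Only cluster A is targeted in the example, so a single noisy device does not trigger cleanup across unrelated clusters.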

Result: >90% of basic alerts now self‑heal.
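Phase 3's noise reduction can be illustrated with a small grouping routine. This sketch assumes the CMDB can map each host to its upstream switch; the `correlate` function, the `min_hosts` threshold, and the host names are hypothetical and not taken from the original system.

```python
from collections import defaultdict


def correlate(failing_hosts, upstream_of, min_hosts=2):
    """Fold host-level alerts into one incident per shared upstream switch.

    failing_hosts: hosts currently raising a basic alert (e.g. ping failure)
    upstream_of:   dict host -> switch (assumed CMDB relationship)
    min_hosts:     how many hosts must share a switch to suspect the switch
    Returns (incidents, leftovers): incidents maps switch -> hosts,
    leftovers lists hosts that still need individual handling.
    """
    by_switch = defaultdict(list)
    leftovers = []
    for host in failing_hosts:
        switch = upstream_of.get(host)
        if switch is None:
            leftovers.append(host)  # no relationship recorded
        else:
            by_switch[switch].append(host)

    incidents = {}
    for switch, hosts in by_switch.items():
        if len(hosts) >= min_hosts:
            incidents[switch] = sorted(hosts)
        else:
            leftovers.extend(hosts)
    return incidents, sorted(leftovers)


upstream = {"h1": "sw1", "h2": "sw1", "h3": "sw2"}
print(correlate(["h1", "h2", "h3", "h4"], upstream))
# ({'sw1': ['h1', 'h2']}, ['h3', 'h4'])
```

In the example, two host alerts collapse into one suspected switch incident, while the isolated hosts remain individual alerts.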

Strategy 11: Delay or Throttle Irreversible Delete/Modify Operations

Typical emergency actions—restart, reinstall, rollback—must be complemented by strict records and rules. Irreversible actions (deleting databases, removing services) often cause severe incidents.

To mitigate risk, a seven‑step shutdown workflow is enforced:

1. Verify module and IP to avoid copy-paste errors.

2. Remove service from name-service routing.

3. Stop processes and ports, confirming traffic is cleared.

4. Apply iptables isolation to block external access.

5. Run automated packet capture to detect lingering traffic.

6. Enforce isolation periods based on service criticality (2 days to 1 month).

7. Automate OS reinstall or VM destruction.
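The final gate in this workflow, refusing the irreversible step until the isolation period has passed and no traffic has been seen, can be sketched as below. The article gives only the range (2 days to 1 month), so the tier names, the 7-day middle tier, and the 30-day reading of "1 month" are assumptions, as is the `may_destroy` function itself.

```python
from datetime import datetime, timedelta

# Isolation period per criticality tier. Only the 2-day and 1-month
# bounds come from the article; the tiers themselves are hypothetical.
ISOLATION = {
    "low": timedelta(days=2),
    "medium": timedelta(days=7),   # assumed middle tier
    "high": timedelta(days=30),    # "1 month" approximated as 30 days
}


def may_destroy(criticality, isolated_at, now, packets_seen):
    """Return True only when the irreversible step is allowed.

    criticality:  key into ISOLATION
    isolated_at:  when iptables isolation was applied
    now:          current time (passed in so the check is testable)
    packets_seen: packets captured on the host since isolation
    """
    if packets_seen > 0:
        return False  # lingering traffic: something still depends on it
    return now - isolated_at >= ISOLATION[criticality]


t0 = datetime(2024, 1, 1)
print(may_destroy("low", t0, t0 + timedelta(days=1), 0))  # False
print(may_destroy("low", t0, t0 + timedelta(days=2), 0))  # True
```

Passing `now` as a parameter keeps the rule deterministic, so the automation that performs the reinstall or VM destruction can be tested without waiting out a real isolation window.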

These disciplined steps, combined with automation, help keep high‑risk operations safe and repeatable.
