
How to Build Reliable Operations: From BCM to Google SRE Practices

This article examines the growing challenge of keeping systems available in modern operations. It explains availability and the N‑nine metric, introduces Business Continuity Management (BCM) and Google's SRE approach, and describes concrete technical and managerial methods for improving operational reliability: architecture standardization, scaling strategies, tooling, emergency drills, and centralized incident management.

dbaplus Community

May 8, 2018


1. Availability Fundamentals

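Availability measures the ability of a system to perform its required functions under specified conditions over a specified time interval. It is commonly expressed as a number of nines (“N 9’s”); for example, 99.9 % uptime is three 9’s. The basic calculation is:

Availability = MTBF / (MTBF + MTTR)

where MTBF is the mean time between failures and MTTR is the mean time to repair. In practice, the nominal availability figure counts only unplanned downtime; planned maintenance windows are excluded. Typical targets are: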

Core network and data‑center services: 99.9999 % (six 9’s), effectively 100 %

Customer‑facing transaction systems: five 9’s or four 9’s

Internal management systems: three 9’s
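
To make the N‑nine targets concrete, the sketch below computes availability from MTBF and MTTR and converts a number of nines into a yearly downtime budget. The function names and the 365‑day year are assumptions made for illustration; they do not come from the original article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly unplanned-downtime budget for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} min/year")
```

Under these assumptions, three 9’s leave about 525.6 minutes (roughly 8.8 hours) of unplanned downtime per year, while five 9’s leave about 5.3 minutes.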

Key recovery objectives are:

RPO (Recovery Point Objective): maximum tolerable data loss.

RTO (Recovery Time Objective): maximum tolerable downtime.
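
One way to reason about these two objectives is to compare them with what a backup and recovery setup actually delivers. The sketch below is a simplified model, not a method from the article: it assumes periodic backups, so the worst‑case data loss equals the backup interval, and it takes the recovery time from a drill measurement. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rpo_minutes: float  # maximum tolerable data loss, expressed as time
    rto_minutes: float  # maximum tolerable downtime


def meets_objectives(objectives, backup_interval_minutes, measured_recovery_minutes):
    """Return (rpo_ok, rto_ok) for a hypothetical backup and recovery setup.

    Worst-case data loss equals the backup interval: a failure just before
    the next backup loses everything written since the previous one.
    """
    rpo_ok = backup_interval_minutes <= objectives.rpo_minutes
    rto_ok = measured_recovery_minutes <= objectives.rto_minutes
    return rpo_ok, rto_ok
```

For example, with an RPO of 15 minutes and an RTO of 60 minutes, a 30‑minute backup interval violates the RPO even if the last drill restored service in 45 minutes.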

The Chinese “Information Security Technology – Disaster‑Recovery Specification” defines six DR levels based on RPO and RTO and identifies seven essential DR elements (facility, network, server, OS, middleware, database, application).

2. Business Continuity Management (BCM)

BCM is a systematic process that helps enterprises identify potential crises, devise response and recovery plans, and improve risk‑mitigation capability. It covers disaster recovery, risk management, emergency response, and related industry guidelines.

3. Google Site Reliability Engineering (SRE) Perspective

Google SRE focuses on ensuring service availability through deep system knowledge, automation, and continuous improvement. Responsibilities are split into three skill areas:

System architecture & runtime awareness (hardware, OS, network, containers, programming languages, performance tuning).

Operations process expertise (fault handling, release, availability management, post‑mortem analysis).

Operations development & product management (building and maintaining automation, monitoring, and deployment tools).

Google caps operational work (on‑call, tickets, manual tasks) at about 50 % of an SRE's time, so that at least half remains for engineering project work.

4. Technical Measures for Availability

4.1 Architecture Standardization

Define high‑availability standards as admission gates for production environments. Common patterns include:

Dual‑machine backup (cold or hot standby with heartbeat monitoring, data synchronization, and automated failover scripts).

Cluster, load‑balancing, and distributed architectures (load‑balancing algorithms, health checks, session persistence; distributed systems such as Hadoop MapReduce and HDFS designed for node failures).
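
The heartbeat‑driven failover in the dual‑machine pattern can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the node labels, and the single timeout rule are hypothetical, and a real failover script would also handle data synchronization and the actual switch‑over steps.

```python
import time


class FailoverMonitor:
    """Promotes the standby when the primary's heartbeat goes stale."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.active = "primary"
        self.last_heartbeat = clock()

    def record_heartbeat(self) -> None:
        """Called whenever the primary reports that it is alive."""
        self.last_heartbeat = self.clock()

    def check(self) -> str:
        """Switch to the standby if the heartbeat is stale; return the active node."""
        stale = self.clock() - self.last_heartbeat > self.timeout_seconds
        if self.active == "primary" and stale:
            # Assumption: a real script would run its switch-over steps here.
            self.active = "standby"
        return self.active
```

The monitor never switches back on its own. Failing back to the primary is left as a deliberate manual step in this sketch, which avoids flapping between nodes when the heartbeat is intermittent.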

Standardization recommendations:

Specify a preferred architecture per business module and enforce it as an admission gate.

Develop reusable components (hardware load‑balancers, software load‑balancing modules, PaaS cluster templates, standardized scripts).

Provide developers with standardized interfaces for exception reporting and centralized event handling.
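
A standardized reporting interface could look like the sketch below. The field names, severity levels, and the in‑memory list standing in for the central event platform are all assumptions for illustration; the article does not specify a format.

```python
import json
from dataclasses import dataclass, asdict
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class OpsEvent:
    source: str         # reporting application or module
    severity: Severity
    code: str           # stable, machine-readable error code
    message: str        # human-readable description


def report_event(event: OpsEvent, sink: list) -> str:
    """Serialize an event in one agreed format and hand it to the central sink."""
    record = asdict(event)
    record["severity"] = event.severity.name  # store the level by name
    payload = json.dumps(record, sort_keys=True)
    sink.append(payload)  # stand-in for sending to the event platform
    return payload
```

Because every application emits the same fields, the central platform can aggregate and classify events without per‑system parsing rules.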

4.2 Architecture Optimization

Typical optimization tactics:

Vertical scaling: move to more powerful servers (more CPU, memory, or faster storage per node).

Horizontal scaling: add nodes, split databases, increase load‑balancer capacity, or deploy region‑based instances.

Service‑oriented decomposition: split business functions into independent services, enabling separate read/write databases.

Read/write splitting: separate read‑heavy and write‑heavy workloads, use cache layers, or adopt sharded distributed databases.

Service‑logic grouping: group related services to reduce operational overhead.

Workflow optimization: simplify or eliminate low‑value business steps.

Asynchronous conversion: move suitable synchronous processes to asynchronous queues to improve concurrency.

Peak‑load limiting: identify peak usage and apply throttling or capacity planning.

Database monitoring: analyze slow‑query logs and feed findings to developers/DBAs for tuning.
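
As an illustration of peak‑load limiting, the sketch below implements a token bucket, one common throttling technique. The article only says "throttling", so the choice of algorithm, the class name, and the parameters are assumptions.

```python
import time


class TokenBucket:
    """Admits `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise signal that the request is over limit."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or queue the request
```

The bucket capacity sets how large a burst is tolerated, while the refill rate sets the sustained throughput, so the two can be tuned separately from observed peak traffic.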

4.3 Tooling to Support Availability

An effective centralized incident management platform should provide:

Event aggregation across layers and domains.

Event convergence to suppress duplicate alerts.

Event classification and escalation based on severity.

Event correlation analysis (vertical: infrastructure → application → transaction; horizontal: peer services).
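
Event convergence and severity‑based escalation can be sketched as below. The event fields, the 60‑second window, and the routing targets are hypothetical; a real platform would also perform the correlation analysis described above, which this sketch leaves out.

```python
def converge(events, window_seconds=60):
    """Collapse repeats of the same (source, code) inside a time window.

    `events` is an iterable of dicts with keys: time, source, code, severity.
    The window is measured from the first event of each burst. Returns one
    record per burst, with a `count` of how many alerts it absorbed.
    """
    merged = []
    open_bursts = {}  # (source, code) -> record currently absorbing repeats
    for ev in sorted(events, key=lambda e: e["time"]):
        key = (ev["source"], ev["code"])
        burst = open_bursts.get(key)
        if burst is not None and ev["time"] - burst["time"] <= window_seconds:
            burst["count"] += 1
            # Keep the highest severity seen in the burst.
            burst["severity"] = max(burst["severity"], ev["severity"])
        else:
            burst = {**ev, "count": 1}
            open_bursts[key] = burst
            merged.append(burst)
    return merged


def escalation_target(record, page_threshold=3):
    """Route by severity: page on-call at or above the threshold, else open a ticket."""
    return "page-on-call" if record["severity"] >= page_threshold else "ticket-queue"
```

Converging before escalating means an alert storm from one failing component produces a single page instead of one page per duplicate alert.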

5. Management Measures

5.1 Emergency Drills

Drills validate and improve availability by simulating failures. Typical types include:

Routine availability drills (planned shutdowns, maintenance windows).

High‑availability failover drills (primary‑secondary switch).

Pre‑production drills (joint exercises with development/testing for major changes).

Table‑top drills (scenario discussion and decision‑making).

Real‑world failover (switch to the backup outside transaction hours and run on it for an extended period).

Destructive testing (e.g., Netflix “Chaos Monkey”).
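
The idea behind destructive testing can be illustrated with a toy drill that disables one random instance and checks whether enough healthy capacity remains. This is a sketch in the spirit of Chaos Monkey, not Netflix's tool or its API; the function name, the dictionary model of the pool, and the pass criterion are assumptions.

```python
import random


def chaos_drill(instances, min_healthy, rng=None):
    """Terminate one random healthy instance and report whether capacity holds.

    `instances` maps instance name -> healthy flag and is modified in place.
    Returns (victim, passed).
    """
    rng = rng or random.Random()
    healthy = [name for name, ok in instances.items() if ok]
    if not healthy:
        raise ValueError("no healthy instance to terminate")
    victim = rng.choice(healthy)
    instances[victim] = False  # stand-in for actually stopping the instance
    remaining = sum(1 for ok in instances.values() if ok)
    return victim, remaining >= min_healthy
```

Passing a seeded `random.Random` makes a drill reproducible, which helps when the same scenario needs to be replayed after a fix.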

5.2 Emergency Measures

Effective emergency plans should be concise and focus on the 80 % of incidents they can resolve. They must contain:

System‑level information: role in transaction flow, configuration, scaling actions.

Service‑level details: affected business, log locations, restart procedures, parameter tuning.

Transaction‑level checks: impact assessment, query scripts, critical batch jobs.

Tool usage guidance: automation scripts, monitoring integration.

Communication plan: contact lists for upstream/downstream systems, third‑party vendors, and business owners.
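
A plan's completeness against these five sections can be checked mechanically. The sketch below assumes plans are stored as dictionaries keyed by section; the key names are hypothetical shorthand for the items listed above.

```python
REQUIRED_SECTIONS = (
    "system",         # role in transaction flow, configuration, scaling actions
    "service",        # affected business, log locations, restart procedures
    "transaction",    # impact assessment, query scripts, critical batch jobs
    "tools",          # automation scripts, monitoring integration
    "communication",  # upstream/downstream, vendor, and business contacts
)


def missing_sections(plan: dict) -> list:
    """Return the required sections that are absent or empty in a plan."""
    return [name for name in REQUIRED_SECTIONS if not plan.get(name)]
```

Running such a check as part of plan review keeps plans concise while ensuring that no section is silently left blank.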

5.3 Prioritized Emergency Approaches (“Three‑Axe”)

Keep a short list of go‑to remedial actions for each layer, covering the most frequent incidents, so that responders try them first:

Application‑level: restart, rollback, switch.

Database‑level: kill locks, add indexes, clean data.


