Practical Strategies for Building High‑Availability Systems

This article presents a comprehensive, step‑by‑step guide on improving system reliability through early fault detection, scope reduction, frequency reduction, and rapid incident handling, using real‑world practices from Baidu's commercial hosting platform.

Baidu Geek Talk

Oct 20, 2021

In fast‑changing internet companies, system complexity grows quickly, making stability a critical bottleneck; rebuilding from scratch is costly, so continuous architectural optimization is essential.

Overall High‑Availability Approach

The strategy works on four levers: detecting faults earlier, narrowing their impact, reducing how often they occur, and handling them faster. It combines standards, monitoring, redundancy, degradation, and pre‑planned responses.

2.1 Early Fault Detection

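Log Standardization : Enforce a unified log format: carry request context through MDC, set additivity=false so that events are not also written by ancestor loggers' appenders, keep messages concise, and log through the SLF4J façade. Define the log levels (TRACE, DEBUG, INFO, WARN, ERROR, in increasing severity) and when each one is used. Example log pattern: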

<property name="ENCODER_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] [%X{reqid}] [%X{ip}] [%X{baiduid}] [%X{cuid}] %-5level %logger{5}: %msg%n"/>

Alarm Standardization : Align alarm severity with log levels, assign monitoring tasks, and define escalation processes for generic services, critical dependencies, error codes, and third‑party latency.

On‑Call Procedures : Define clear steps for reporting, damage control, localization, and resolution during incidents.

System Monitoring : Build automated detection from business metrics, functional checks, stability indicators, and data correctness. Use logs to feed monitoring pipelines.

Capacity Assessment : Combine static analysis of dependency topology with dynamic load testing. Static analysis estimates maximum load; dynamic testing validates with real traffic, balancing risk and accuracy.
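
The static half of the assessment can be sketched as simple arithmetic over the dependency topology. The sketch below is illustrative only: the class name, the dependency names, and every instance count and QPS figure are hypothetical, and it assumes each dependency's capacity scales linearly with its instance count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical static capacity estimate. All names and numbers are illustrative.
class CapacityEstimate {

    // Entry-level QPS that one dependency can sustain, given how many calls
    // each entry request makes to it (its fan-out).
    static double supportedEntryQps(int instances, double qpsPerInstance, double callsPerRequest) {
        return instances * qpsPerInstance / callsPerRequest;
    }

    // The ceiling of the whole call chain is set by its weakest dependency.
    // Returns positive infinity for an empty map.
    static double bottleneck(Map<String, Double> supportedQpsByDependency) {
        double min = Double.POSITIVE_INFINITY;
        for (double qps : supportedQpsByDependency.values()) {
            min = Math.min(min, qps);
        }
        return min;
    }

    public static void main(String[] args) {
        Map<String, Double> limits = new LinkedHashMap<>();
        limits.put("api", supportedEntryQps(10, 500, 1));     // 10 instances, 1 call per request
        limits.put("mysql", supportedEntryQps(2, 3000, 2));   // 2 queries per request
        limits.put("redis", supportedEntryQps(3, 20000, 5));  // 5 lookups per request
        System.out.println("Estimated ceiling (entry QPS): " + bottleneck(limits));
    }
}
```

An estimate like this only bounds the load from above; the dynamic load test is what confirms whether the real system reaches it.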

2.2 Reducing Fault Scope

Storage Isolation : Split shared MySQL clusters by business domain to avoid cross‑impact; migrate to separate physical clusters with proper capacity planning and dual‑write synchronization.

Service Isolation : Separate frequently changing services and high‑concurrency services from the rest, and align service boundaries with organizational structure (Conway's law). Apply multi‑region redundancy for stateful components.

Permission Isolation : Restrict database read/write IPs, limit deployment permissions, and enforce code repository access controls.

2.3 Reducing Fault Frequency

Rate Limiting : Implement token‑bucket or leaky‑bucket algorithms at gateway (e.g., BFE) for web/API traffic and use Resilience4j RateLimiter for RPC services.
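
To make the token‑bucket idea concrete, here is a minimal plain‑Java sketch. It is not how BFE or Resilience4j implement rate limiting; the class name, the injected clock, and the parameters are assumptions made for illustration.

```java
import java.util.function.LongSupplier;

// Minimal token bucket (illustrative; not BFE's or Resilience4j's implementation).
class TokenBucket {
    private final long capacity;            // maximum burst size
    private final double refillPerSecond;   // sustained rate
    private final LongSupplier nanoClock;   // injectable so the bucket can be tested
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity;             // start full, so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    // Returns true if the request may proceed, false if it should be rejected.
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Add the tokens earned since the last call, never exceeding capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(5, 10.0, System::nanoTime);
        int admitted = 0;
        for (int i = 0; i < 20; i++) {
            if (bucket.tryAcquire()) admitted++;
        }
        System.out.println("Admitted " + admitted + " of 20 burst requests");
    }
}
```

The capacity bounds the burst a caller can send at once, while the refill rate bounds the sustained throughput. A leaky bucket differs mainly in smoothing output to a constant rate instead of permitting bursts.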

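Circuit Breaking : Classify each dependency as strong or weak, then protect calls with Resilience4j or Sentinel circuit breakers. A Resilience4j breaker moves between three normal states (CLOSED, OPEN, HALF_OPEN) and also offers special states such as DISABLED and FORCED_OPEN. It records call outcomes in a ring bit buffer; newer Resilience4j releases use a sliding window instead.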
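
The state transitions can be illustrated with a small plain‑Java state machine. This is a simplified sketch, not Resilience4j's or Sentinel's code: it covers only the three normal states, uses a count‑based window, admits every call while half‑open, and all names and thresholds are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Simplified count-based circuit breaker (illustrative only).
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;               // number of recent calls considered
    private final double failureRateThreshold;  // 0.0 to 1.0
    private final long openMillis;              // how long to stay OPEN before probing
    private final LongSupplier clockMillis;     // injectable for testing
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long openMillis, LongSupplier clockMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    // Call before invoking the dependency; false means fail fast or degrade.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clockMillis.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // wait time elapsed: let trial calls through
        }
        return state != State.OPEN;
    }

    // Call after the dependency returns or throws.
    synchronized void record(boolean success) {
        if (state == State.HALF_OPEN) {
            if (success) {
                state = State.CLOSED;
                outcomes.clear();
            } else {
                trip();
            }
            return;
        }
        if (state == State.OPEN) return; // late result from a call admitted earlier
        outcomes.addLast(!success);
        if (outcomes.size() > windowSize) outcomes.removeFirst();
        if (outcomes.size() == windowSize) {
            long failures = outcomes.stream().filter(f -> f).count();
            if ((double) failures / windowSize >= failureRateThreshold) trip();
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clockMillis.getAsLong();
        outcomes.clear();
    }

    synchronized State state() { return state; }

    public static void main(String[] args) {
        SimpleCircuitBreaker cb =
            new SimpleCircuitBreaker(4, 0.5, 1_000, System::currentTimeMillis);
        boolean[] results = {true, false, false, true};
        for (boolean ok : results) {
            if (cb.allowRequest()) cb.record(ok);
        }
        System.out.println("State after 2 failures in 4 calls: " + cb.state());
    }
}
```

For a weak dependency, a rejected call would typically return a degraded default; for a strong one, failing fast at least protects the caller's threads from piling up behind a slow downstream.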

Timeout & Retry Settings : Set RPC/HTTP timeouts 30‑50% above the 99th‑percentile latency; adjust framework defaults (Redis 2000 ms, OkHttp 10 s, Hikari 30 s) based on load.
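
The 30‑50% rule can be worked through on a set of latency samples. The sketch below assumes a nearest‑rank percentile and synthetic samples; the class and method names are hypothetical, and production values would come from the monitoring system instead.

```java
import java.util.Arrays;

// Derives a timeout from observed latencies (illustrative sketch).
class TimeoutBudget {

    // Nearest-rank percentile: the smallest sample such that at least
    // pct percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, int pct) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (pct * sorted.length + 99) / 100;   // integer ceiling of pct * n / 100
        return sorted[Math.max(rank, 1) - 1];
    }

    // Timeout = p99 plus headroom, rounded up to a whole millisecond.
    static long timeoutMs(long[] samplesMs, int headroomPercent) {
        long p99 = percentile(samplesMs, 99);
        return (p99 * (100 + headroomPercent) + 99) / 100;
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) samples[i] = i + 1; // synthetic: 1..100 ms
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
        System.out.println("timeout range = " + timeoutMs(samples, 30)
            + " to " + timeoutMs(samples, 50) + " ms");
    }
}
```

With these synthetic samples the p99 is 99 ms, so the rule yields a timeout between 129 ms and 149 ms, far below a 2000 ms or 10 s framework default.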

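Pool Configuration : Construct pools with ThreadPoolExecutor directly rather than the Executors factory methods, whose defaults include unbounded queues (newFixedThreadPool) or unbounded thread counts (newCachedThreadPool). Name pools so they can be monitored, size them from QPS × processing time, and keep the result within available CPU, memory, and disk resources.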
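
A minimal sketch of these pool guidelines follows. The pool name, queue capacity, and traffic figures are hypothetical, and the sizing formula is only a starting point that still has to be checked against the machine's resources.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Named, bounded thread pools built directly on ThreadPoolExecutor (illustrative).
class NamedPools {

    // QPS x processing time: the average number of tasks in flight at once.
    static int threadsFor(int qps, int avgProcessingMillis) {
        long inFlight = ((long) qps * avgProcessingMillis + 999) / 1000; // ceiling
        return (int) Math.max(1, inFlight);
    }

    static ThreadPoolExecutor newPool(String name, int threads, int queueCapacity) {
        AtomicInteger seq = new AtomicInteger();
        // Thread names carry the pool name so dumps and metrics can be attributed.
        ThreadFactory factory = r -> new Thread(r, name + "-" + seq.incrementAndGet());
        return new ThreadPoolExecutor(
            threads, threads,                          // fixed size
            60L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(queueCapacity),   // bounded queue: no unbounded backlog
            factory,
            new ThreadPoolExecutor.AbortPolicy());     // reject explicitly when saturated
    }

    public static void main(String[] args) {
        int threads = threadsFor(200, 50);             // 200 QPS at 50 ms per task
        ThreadPoolExecutor pool = newPool("order-query", threads, 100);
        pool.execute(() -> System.out.println("running on " + Thread.currentThread().getName()));
        pool.shutdown();                               // queued tasks still finish
    }
}
```

Rejecting work with AbortPolicy when the queue is full is one possible choice; it surfaces overload immediately instead of letting latency grow silently.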

2.4 Fast Fault Handling

Rapid Scaling : Leverage container‑orchestrated auto‑scaling; migrate legacy services to PaaS for on‑demand capacity.

Pre‑planned Runbooks : Archive past incidents and turn them into triggerable playbooks for traffic throttling, data backup, and rollback.

Data Backup : Perform daily MySQL backups with binlog replay capability; coordinate with DBAs for rapid restoration.

Conclusion

System stability requires coordinated efforts across development, testing, and operations. By applying early detection, isolation, rate limiting, circuit breaking, proper timeout/pool settings, and well‑defined runbooks, organizations can significantly improve availability while minimizing downtime and operational risk.
