
10 Proven Strategies to Achieve 99.99% System Availability

This article presents ten practical techniques for raising production system availability from 99% to 99.99%: redundant deployment, circuit breaking, traffic shaping, auto-scaling, gray releases, degrade switches, full-link stress testing, data sharding, chaos engineering, and three-dimensional monitoring.

Su San Talks Tech

Introduction

High availability is a classic problem that comes up both in interviews and in day-to-day engineering work.

This article shares ten practical rules to ensure high availability.

1 Redundant Deployment

Scenario: During a major e‑commerce promotion, the primary database node crashes, causing the whole site to stall.

Problem: A single‑node deployment means that failure of a critical component (e.g., database, message queue) brings the business down.

Solution: Use master‑slave replication or clustering for redundancy, such as MySQL master‑slave sync or Redis Sentinel.

MySQL master‑slave configuration:

-- run on the slave: point it at the master's binlog position
CHANGE MASTER TO
MASTER_HOST='master_host',
MASTER_USER='replica_user',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=154;

-- start slave
START SLAVE;

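Effect: Every write is replicated to the slave, so a current copy of the data survives a master failure. Replication alone does not promote the slave; paired with a failover manager such as MHA or Orchestrator, the slave is promoted to read-write and the business continues with minimal interruption.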
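
To show what redundancy buys on the calling side, here is a minimal sketch that is not from the original setup: a reader that tries nodes in priority order and moves on when one fails. The class name FailoverReader and the use of plain Supplier objects to stand in for database nodes are assumptions for illustration; in production this job usually belongs to a proxy, a driver, or a cluster manager.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: try each redundant node in order until one answers.
final class FailoverReader {
    private final List<Supplier<String>> nodes; // index 0 is the preferred node (master)

    FailoverReader(List<Supplier<String>> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = List.copyOf(nodes);
    }

    String read() {
        RuntimeException last = null;
        for (Supplier<String> node : nodes) {
            try {
                return node.get();   // first healthy node wins
            } catch (RuntimeException e) {
                last = e;            // remember the failure and try the next node
            }
        }
        throw new IllegalStateException("all nodes failed", last);
    }

    public static void main(String[] args) {
        Supplier<String> master = () -> { throw new IllegalStateException("master down"); };
        Supplier<String> replica = () -> "row from replica";
        System.out.println(new FailoverReader(List.of(master, replica)).read());
    }
}
```

With a single node in the list, the same failure would surface to the caller, which is exactly the single-point-of-failure problem described above.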

2 Service Circuit Breaker

Scenario: Payment service latency exhausts the order service thread pool, causing a cascade failure.

Problem: An exception in any link of a service dependency chain can drag down the entire system like dominoes.

Solution: Introduce a circuit‑breaker pattern, e.g., Hystrix or Resilience4j.

Resilience4j circuit‑breaker configuration:

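import io.github.resilience4j.circuitbreaker.*;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // open when failure rate >= 50%
    .waitDurationInOpenState(Duration.ofMillis(1000)) // stay open for 1 s before trial calls
    .build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

// call payment service through the breaker
Supplier<String> supplier = () -> paymentService.call();
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, supplier);
String result;
try {
    result = decoratedSupplier.get();
} catch (CallNotPermittedException e) {
    result = "System busy, try later"; // degraded response while the circuit is open
}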

Effect: When the payment service failure rate spikes, the circuit opens and returns a degraded response (e.g., "System busy, try later").
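
The mechanics behind a breaker can be sketched without any library. The following is a simplified count-based state machine, not Resilience4j's actual implementation: the class SimpleCircuitBreaker, its fixed call window, and the injectable clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified count-based circuit breaker. Illustrative only; this is not
// how Resilience4j is implemented internally.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // calls evaluated per window
    private final double failureRateThreshold; // percent, e.g. 50
    private final long waitMillis;             // how long to stay OPEN
    private final LongSupplier clock;          // millisecond clock, injectable for tests

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long waitMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitMillis = waitMillis;
        this.clock = clock;
    }

    // Current state; an OPEN breaker moves to HALF_OPEN once the wait has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs the action unless the circuit is open; failures return the fallback.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast, the dependency is not touched
        }
        try {
            T result = action.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();
        }
    }

    private synchronized void record(boolean success) {
        if (state == State.HALF_OPEN) {
            // a single trial call decides whether to close or re-open
            if (success) { close(); } else { open(); }
            return;
        }
        calls++;
        if (!success) {
            failures++;
        }
        if (calls >= windowSize) {
            if (failures * 100.0 / calls >= failureRateThreshold) { open(); } else { close(); }
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        calls = 0;
        failures = 0;
    }

    private void close() {
        state = State.CLOSED;
        calls = 0;
        failures = 0;
    }
}
```

While the breaker is open, callers get the fallback immediately and the order service's threads are never parked waiting on the slow payment service, which is what stops the cascade.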

3 Traffic Shaping (Peak Cutting)

Scenario: At the start of a flash‑sale, 100k QPS instantly overwhelms the database connection pool.

Problem: Sudden traffic exceeds system capacity, exhausting resources.

Solution: Introduce a message queue (e.g., Kafka, RocketMQ) for asynchronous buffering.

[Figure: user order flow diagram]

RocketMQ producer example:

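import java.nio.charset.StandardCharsets;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

DefaultMQProducer producer = new DefaultMQProducer("seckill_producer");
producer.setNamesrvAddr("127.0.0.1:9876");
producer.start();
Message msg = new Message("seckill_topic", "order data".getBytes(StandardCharsets.UTF_8));
producer.send(msg);   // synchronous send; consumers write to the database at their own pace
producer.shutdown();  // release client resources when done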

Effect: The instantaneous 100k QPS request is smoothed to a database‑friendly 2k TPS.
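
The smoothing effect itself can be shown with a small simulation. This sketch uses an in-memory queue as a stand-in for the message broker; the class PeakShaver, its capacity limit, and the per-tick drain rate are illustrative assumptions, not RocketMQ behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// In-memory stand-in for a message queue, used only to illustrate peak shaving.
final class PeakShaver {
    private final Queue<String> buffer = new ArrayDeque<>();
    private final int capacity;     // maximum number of buffered orders
    private final int drainPerTick; // orders the database can absorb per tick

    PeakShaver(int capacity, int drainPerTick) {
        this.capacity = capacity;
        this.drainPerTick = drainPerTick;
    }

    // Producer side: accept the order if there is room, otherwise reject it.
    boolean offer(String order) {
        if (buffer.size() >= capacity) {
            return false;
        }
        buffer.add(order);
        return true;
    }

    // Consumer side: hand at most drainPerTick orders to the database.
    int drain(Consumer<String> database) {
        int written = 0;
        while (written < drainPerTick && !buffer.isEmpty()) {
            database.accept(buffer.poll());
            written++;
        }
        return written;
    }

    int backlog() {
        return buffer.size();
    }

    public static void main(String[] args) {
        PeakShaver shaver = new PeakShaver(8, 3);
        int rejected = 0;
        for (int i = 0; i < 10; i++) {   // a burst of 10 orders arrives at once
            if (!shaver.offer("order-" + i)) {
                rejected++;
            }
        }
        List<String> written = new ArrayList<>();
        int tick = 0;
        while (shaver.backlog() > 0) {
            int n = shaver.drain(written::add);
            System.out.println("tick " + (++tick) + ": wrote " + n + " orders");
        }
        System.out.println("rejected " + rejected);
    }
}
```

In this run the burst of 10 is written as 3, 3, and 2 orders over three ticks, and 2 orders are rejected because the buffer is full. The database only ever sees the drain rate, never the burst.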

4 Dynamic Scaling

Scenario: Normal traffic fits 100 servers, but a promotion requires rapid scaling to 500 servers.

Problem: Fixed resources cannot cope with traffic spikes; manual scaling is inefficient.

Solution: Use Kubernetes Horizontal Pod Autoscaler (HPA).

K8s HPA configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

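Effect: The HPA keeps average CPU utilization near the 60% target. Pods scale out when utilization rises above the target and scale in when it falls below it, always within the 2 to 10 replica bounds.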
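
The replica count follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that arithmetic; the class HpaMath is a hypothetical name, and the scale-down stabilization window and pod readiness handling of the real controller are not modeled.

```java
// Arithmetic behind the HPA replica calculation; not the controller itself.
final class HpaMath {
    static final double TOLERANCE = 0.1; // Kubernetes default tolerance

    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        double ratio = currentUtilization / targetUtilization;
        // within tolerance of the target: keep the current replica count
        int desired = Math.abs(ratio - 1.0) <= TOLERANCE
                ? currentReplicas
                : (int) Math.ceil(currentReplicas * ratio);
        // never go outside the configured bounds
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 4 pods at 90% CPU with a 60% target and bounds 2..10
        System.out.println(desiredReplicas(4, 90, 60, 2, 10)); // prints 6
        // 4 pods at 30% CPU: scale in
        System.out.println(desiredReplicas(4, 30, 60, 2, 10)); // prints 2
    }
}
```

Note that there is no separate scale-in threshold: the same 60% target drives both directions.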

5 Gray Release

Scenario: New version contains a memory leak; full rollout crashes the service.

Problem: One‑time full release carries high risk of global failure.

Solution: Adopt traffic‑percentage based gray release.

Istio traffic‑splitting configuration:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - bookinfo.com
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90 # 90% traffic to old version
    - destination:
        host: reviews
        subset: v2
      weight: 10 # 10% traffic to new version

Effect: If the new version fails, only about 10% of requests are affected, and setting the v2 weight back to 0 rolls the change back quickly.
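
Weight-based splitting boils down to picking a destination from cumulative weights. The following sketch shows that selection logic only; the class WeightedSplit is a hypothetical name and this is not how Istio's proxy is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Weighted destination picker, illustrating the idea behind percentage-based routing.
final class WeightedSplit {
    private final Map<String, Integer> weights = new LinkedHashMap<>();
    private int total;

    WeightedSplit add(String subset, int weight) {
        if (weight < 0 || weights.containsKey(subset)) {
            throw new IllegalArgumentException("invalid or duplicate subset: " + subset);
        }
        weights.put(subset, weight);
        total += weight;
        return this;
    }

    // Maps a roll in [0, total) to a subset using cumulative weights.
    String pick(int roll) {
        int cumulative = 0;
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            cumulative += entry.getValue();
            if (roll < cumulative) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException("roll out of range: " + roll);
    }

    // Per-request random choice, proportional to the weights.
    String pickRandom() {
        return pick(ThreadLocalRandom.current().nextInt(total));
    }

    public static void main(String[] args) {
        WeightedSplit split = new WeightedSplit().add("v1", 90).add("v2", 10);
        int v2 = 0;
        for (int i = 0; i < 10_000; i++) {
            if (split.pickRandom().equals("v2")) {
                v2++;
            }
        }
        System.out.println("v2 share of 10000 requests: " + v2);
    }
}
```

Because the choice is made per request, a single user may hit both versions across requests; pinning a user to one version needs a stable key such as a hashed user ID instead of a random roll.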

6 Degrade Switch

Scenario: Recommendation service timeout inflates product detail page load time from 200 ms to 5 s.

Problem: Non‑core service failures degrade the core user experience.

Solution: Add a degradation switch in the configuration center to dynamically disable non‑critical services.

Apollo configuration example:

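import com.ctrip.framework.apollo.Config;
import com.ctrip.framework.apollo.spring.annotation.ApolloConfig;

@ApolloConfig
private Config config;

public ProductDetail getDetail(String productId) {
    // productService and recommendService are illustrative collaborators
    ProductDetail detail = productService.getBasicInfo(productId);
    // the switch defaults to on; set recommend.switch=false in Apollo to degrade
    if (config.getBooleanProperty("recommend.switch", true)) {
        detail.setRecommendations(recommendService.recommend(productId));
    }
    return detail; // core product info is always returned
}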

Effect: Disabling the recommendation service brings page response time back under 200 ms.

7 Full‑Link Stress Test

Scenario: A financial system reveals a database deadlock under real traffic.

Problem: Test environments cannot reproduce real traffic characteristics, so production risks stay hidden until launch.

Solution: Conduct full‑link stress testing based on recorded traffic.

Implementation Steps:

Online traffic recording (e.g., JMeter + TCPCopy)

Shadow database isolation (store test data separately)

Test data desensitization

Execute stress test and monitor bottlenecks

Effect: Early detection of connection‑pool shortages, cache penetration, and other issues.
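
Two of the steps above, shadow isolation and data desensitization, can be sketched in a few lines. This is a minimal illustration under stated assumptions: stress-test requests carry a boolean flag, shadow tables share the production name with a shadow_ prefix, and phone numbers are the sensitive field. The class ShadowRouting and both method names are hypothetical.

```java
// Hypothetical helpers for stress-test isolation and data masking.
final class ShadowRouting {
    static final String SHADOW_PREFIX = "shadow_"; // assumed naming convention

    // Stress-test traffic is written to a shadow table so production data stays clean.
    static String tableFor(String table, boolean stressTraffic) {
        return stressTraffic ? SHADOW_PREFIX + table : table;
    }

    // Keeps the first 3 and last 4 characters of a phone number and masks the rest.
    static String maskPhone(String phone) {
        if (phone == null || phone.length() < 8) {
            return "****"; // too short to mask partially, hide it all
        }
        int masked = phone.length() - 7;
        return phone.substring(0, 3) + "*".repeat(masked) + phone.substring(phone.length() - 4);
    }

    public static void main(String[] args) {
        System.out.println(tableFor("orders", true));  // prints shadow_orders
        System.out.println(tableFor("orders", false)); // prints orders
        System.out.println(maskPhone("13812345678"));  // prints 138****5678
    }
}
```

The stress flag has to travel with the request through every hop (HTTP headers, message properties, thread context), otherwise a downstream service will write test data into production tables.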

8 Data Sharding

Scenario: User table reaches 1 billion rows, causing query performance to collapse.

Problem: A single database/table becomes a performance bottleneck.

Solution: Use ShardingSphere for database and table sharding.

ShardingSphere configuration:

# ShardingSphere 5.x YAML syntax (assumed version; older releases use different keys)
rules:
- !SHARDING
  tables:
    user:
      actualDataNodes: ds_${0..1}.user_${0..15}
      tableStrategy:
        standard:
          shardingColumn: user_id
          shardingAlgorithmName: user_hash_mod
  shardingAlgorithms:
    user_hash_mod:
      type: HASH_MOD
      props:
        sharding-count: 16

Effect: One billion records are spread across 16 user tables in each data source, so every query touches a far smaller table; in this scenario query speed improves by about 20×.
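
The routing rule behind hash-mod sharding is short enough to write out. This sketch assumes numeric user IDs and 16 tables named user_0 to user_15 as in the configuration above; the class UserShardRouter is a hypothetical name and the code is not ShardingSphere's implementation.

```java
// Maps a numeric user ID to one of 16 physical tables by modulo.
final class UserShardRouter {
    static final int TABLE_COUNT = 16;

    static String tableFor(long userId) {
        // floorMod keeps the index in [0, 15] even for negative IDs
        long index = Math.floorMod(userId, (long) TABLE_COUNT);
        return "user_" + index;
    }

    public static void main(String[] args) {
        System.out.println(tableFor(17L));          // prints user_1
        System.out.println(tableFor(1000000031L));  // prints user_15
    }
}
```

Queries that filter on user_id hit exactly one table. Queries without the sharding column must fan out to every table, so the sharding key should match the dominant access pattern.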

9 Chaos Engineering

Scenario: A data‑center network glitch caused a 3‑hour service outage.

Problem: Insufficient system robustness and weak fault‑recovery capability.

Solution: Use ChaosBlade to simulate failures.

Example commands:

# simulate network latency
blade create network delay --time 3000 --interface eth0

# simulate database node crash
blade create docker kill --container-id mysql-node-1

Effect: Fault drills expose weaknesses early, for example cache penetration overloading the database, so cache protection can be strengthened before a real outage occurs.

10 Three‑Dimensional Monitoring

Scenario: A sudden IOPS spike causes order timeouts, but the ops team notices only two hours later.

Problem: Single‑dimensional monitoring cannot quickly locate root causes.

Solution: Build a Metrics‑Log‑Trace integrated monitoring system.

Tech Stack:

Metrics: Prometheus + Grafana (resource metrics)

Log: ELK (log analysis)

Trace: SkyWalking (call‑chain tracing)

Problem‑location workflow:

CPU utilization > 80% → correlate logs → find frequent GC → trace call chain → discover DAO SQL without index

Effect: Fault‑location time reduced from hours to minutes.

Conclusion

Building high availability is like constructing a deep‑sea vessel: redundant deployment acts as twin engines, circuit breaking as lifeboats, and monitoring as radar.

The key lies in proactive fault prevention, automation (e.g., K8s auto‑scaling), and data‑driven optimization (full‑link testing + three‑dimensional monitoring).

While 100% availability is impossible, applying these ten practical techniques can raise system uptime from 99% to 99.99%, cutting annual downtime from about 87.6 hours to about 53 minutes.
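
The downtime behind each availability figure is simple arithmetic: a non-leap year has 8,760 hours, so the allowed downtime is (100 − availability) / 100 × 8,760 hours. The short sketch below works this out; the class name DowntimeBudget is only illustrative.

```java
// Converts an availability percentage into allowed downtime per year.
final class DowntimeBudget {
    static final double HOURS_PER_YEAR = 365 * 24; // 8760, non-leap year

    static double downtimeHoursPerYear(double availabilityPercent) {
        return (100.0 - availabilityPercent) / 100.0 * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        // 99% allows about 87.6 hours per year
        System.out.printf("99%%    -> %.1f hours%n", downtimeHoursPerYear(99.0));
        // 99.99% allows about 52.6 minutes per year
        System.out.printf("99.99%% -> %.1f minutes%n", downtimeHoursPerYear(99.99) * 60);
    }
}
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the last step to 99.99% leans so heavily on automation rather than manual response.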

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
