
How to Build Highly Available Systems: Fault Tolerance and Scalability Strategies

This article explains why high availability is critical for internet services, outlines key techniques such as stateless design, service discovery, heartbeat checks, idempotent operations, load balancing, throttling, caching, and micro‑service architecture, and discusses the operational challenges and monitoring tools needed to maintain resilient, scalable systems.

21CTO

For high-traffic internet services, availability is essential. Downtime means poor user experience, brand damage, and direct financial loss: the 2015 Ctrip outage, for example, cost about $1.06 million per hour.

How to Achieve High Availability?

Key principles include:

Design services to be stateless so that any instance can be replaced without data loss.

Implement service discovery and registration so that clients can locate healthy instances dynamically.

Use regular heartbeat checks to detect failed machines, services, or network partitions.

Ensure operations are idempotent and support retries to handle transient failures without duplicate effects.
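The registration and heartbeat principles above can be sketched together as a small in-memory registry. This is an illustration only, not a production design: the class name `HeartbeatRegistry`, the 3-second TTL, and the addresses are assumptions, and real systems typically delegate this to a dedicated coordination service.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Hypothetical in-memory registry: instances that stop sending
    heartbeats are no longer returned to clients."""

    def __init__(self, ttl_seconds: float = 3.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, address: str) -> None:
        """Register an instance, or refresh its liveness timestamp."""
        self._last_seen[address] = self._clock()

    def healthy_instances(self) -> List[str]:
        """Return the addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            address
            for address, seen in self._last_seen.items()
            if now - seen <= self.ttl_seconds
        )
```

Each instance would call `heartbeat()` on a timer, and clients would call `healthy_instances()` before choosing a target. Because the instances are stateless, a client can switch to any address in the returned list.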

When traffic spikes, scaling out by adding machines increases throughput, but only if the load is actually spread across them. That requires a load-balancing strategy, usually combined with blacklists and whitelists to control who may call the service, and with rate limiting.
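As a rough sketch of how load balancing and a blacklist fit together, the following round-robin balancer rejects blocked callers before choosing a backend. The class name, the client identifiers, and the choice of round-robin are assumptions made for illustration; the article does not prescribe a particular strategy.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hypothetical balancer: rotates through backends in order and
    refuses requests from blacklisted clients."""

    def __init__(self, backends: Iterable[str],
                 blocked_clients: Iterable[str] = ()) -> None:
        self._backends: List[str] = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._blocked_clients: Set[str] = set(blocked_clients)
        self._next = 0

    def route(self, client_id: str) -> str:
        """Return the backend that should serve this client's request."""
        if client_id in self._blocked_clients:
            raise PermissionError(f"client {client_id!r} is blacklisted")
        backend = self._backends[self._next]
        # Advance the cursor so the next request goes to the next backend.
        self._next = (self._next + 1) % len(self._backends)
        return backend
```

Round-robin assumes the backends have similar capacity. Weighted or least-connections strategies are common alternatives when they do not.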

Adding resources alone is insufficient; bottlenecks such as blocking calls must be addressed, and service timeouts should be configured to prevent cascading delays.
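One way to keep a blocking call from stalling its caller is to wait on it with a deadline and fall back when the deadline passes. The sketch below uses the standard `concurrent.futures` module; `call_with_timeout`, `slow_dependency`, and the pool size are hypothetical names and values chosen for the example.

```python
import concurrent.futures
import time
from typing import Any, Callable

# Shared worker pool; the size of 4 is an arbitrary value for this sketch.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func: Callable[[], Any], timeout_seconds: float,
                      fallback: Any) -> Any:
    """Run func in a worker thread and wait at most timeout_seconds.

    The timeout only bounds how long the caller waits. Python cannot
    kill the worker thread, so a slow call still runs to completion
    in the background.
    """
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the task has not started yet
        return fallback


def slow_dependency() -> str:
    """Hypothetical downstream call that takes too long to answer."""
    time.sleep(0.3)
    return "late answer"
```

With this in place, `call_with_timeout(slow_dependency, 0.05, "default")` returns `"default"` after roughly 50 ms instead of holding the caller for the full 300 ms.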

Whether to process a request synchronously or asynchronously depends on business requirements. Asynchronous designs usually rely on a message queue, which is expected to decouple producer and consumer completely and to deliver every message without loss or duplication; meeting those guarantees adds complexity.
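Queues that guard against loss commonly do so by redelivering unacknowledged messages, so a common way to approximate "no duplication" is to make the consumer idempotent. The sketch below assumes each message carries a unique `id` field and keeps processed IDs in memory; both are simplifications for illustration, and a real consumer would persist the IDs.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Hypothetical consumer that applies each message ID at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._processed: Set[str] = set()

    def consume(self, message: dict) -> bool:
        """Handle the message; return False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self._processed:
            return False  # redelivery: skip so the effect is not repeated
        self._handler(message)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried when the queue redelivers the message.
        self._processed.add(message_id)
        return True
```

If the handler raises, the ID is not recorded and the same message is processed again on retry, which is the behavior the idempotency principle asks for.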

During extreme load, rate limiting and service degradation protect the system by rejecting or simplifying requests, while caching reduces read pressure because most internet workloads are read‑heavy.
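Rate limiting, degradation, and caching can be combined in one request path. The sketch below uses a token bucket, which is one of several possible limiting algorithms and is an assumption here. When the bucket is empty, the handler degrades to a cached copy, or to a short busy message if nothing is cached. All names are hypothetical.

```python
import time
from typing import Callable, Dict

BUSY_MESSAGE = "service busy, please retry"


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self._clock = clock  # injectable for deterministic tests
        self._tokens = float(capacity)
        self._last_refill = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self._clock()
        elapsed = now - self._last_refill
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.rate_per_second)
        self._last_refill = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def handle_request(key: str, bucket: TokenBucket,
                   load_fresh: Callable[[str], str],
                   cache: Dict[str, str]) -> str:
    """Serve fresh data when under the limit, otherwise degrade."""
    if bucket.allow():
        value = load_fresh(key)
        cache[key] = value  # keep a copy for later degraded responses
        return value
    # Over the limit: return a possibly stale cached value, or a busy reply.
    return cache.get(key, BUSY_MESSAGE)
```

Serving stale data under load is a business decision. For read-heavy pages it is often acceptable, while for writes the degraded path would more likely reject the request outright.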

Adopting a micro-service architecture allows functionality to be decomposed both vertically and horizontally, but it introduces challenges of its own: maintaining eventual consistency, monitoring call chains, alerting, distributed logging, and low-level diagnostics with tools such as jstack (JVM thread dumps) and BTrace (dynamic tracing for Java).

By applying these techniques, a system can tolerate machine failures, service crashes, network issues, and traffic bursts, achieving high availability while remaining scalable.

