How to Build Highly Available and Scalable Distributed Systems

This article explains the key challenges of high availability and scalability in distributed architectures and provides practical solutions for each layer—entry, business, cache, and database—using techniques such as heartbeat IPs, stateless services, consistent hashing, and sharding.

21CTO
21CTO
21CTO
How to Build Highly Available and Scalable Distributed Systems

What Are High Availability and Scalability

High availability means users experience no disruption when a single machine, network card, switch, or cable fails; it also includes handling service‑level failures without user impact. Scalability means the system can grow by adding machines and deploying software so capacity increases automatically.

High Availability Solutions

Entry Layer

Use a heartbeat mechanism such as Keepalived with a virtual IP that can be moved between two machines in the same subnet. When the primary machine fails, the secondary grabs the virtual IP, and DNS points to that IP so users always reach a live node. Limitations include the need for the same subnet and the fact that services must listen on all IPs; otherwise the standby cannot start.

A common mistake is pointing a domain to two separate public IPs via DNS; when one node fails, roughly half of the users will still be directed to the dead IP, causing noticeable outages.

Business Layer

Make services stateless and store state in the cache or database. Stateless services can be load‑balanced by Nginx; if a node goes down or is updated, traffic is seamlessly redirected to the remaining nodes.

Avoid storing session data in cookies for security‑sensitive scenarios, as it can be replayed or tampered with; prefer centralized cache or database storage.

Cache Layer

High‑availability caches are achieved by fine‑grained partitioning and master‑slave replication. If a cache node fails, traffic falls back to the database, so the cache should be sized to avoid overwhelming the DB.

Use consistent hashing with virtual nodes to minimize cache miss spikes when adding or removing nodes; this limits the impact on the database to a small percentage of keys.

Handle consistency types: strong consistency (no stale data, e.g., balances), weak consistency (eventual, e.g., social counts), and immutable caches (values never change).

Database Layer

Most databases already provide HA via master‑slave or master‑master replication (MySQL, MongoDB ReplicaSet, etc.).

Scalability Solutions

Entry Layer

Scale by adding more machines and using DNS round‑robin or large‑capacity servers. For non‑HTTP traffic, clients can query a small service for a list of healthy entry IPs and pick one randomly.

Business Layer

Stateless services scale horizontally; Nginx can proxy to dozens or hundreds of backend instances without significant load.

Cache Layer

Redis 3.0 or Codis can be expanded by adding nodes during low‑traffic periods, then bringing new caches online. Consistent hashing reduces cache invalidation when nodes are added.

When expanding, avoid removing nodes; instead, only increase node count or ensure the node‑removal interval exceeds the data’s TTL to prevent dirty reads.

Database

Use horizontal sharding (e.g., split users into 100 tables) to prepare for future scaling to many machines. Vertical splitting separates frequently accessed columns from heavy detail columns, improving index efficiency. Periodic rolling of old data to separate storage (e.g., SSD vs. slower disks) also helps.

Conclusion

Divide a system into entry, business, cache, and database layers, each with specific HA and scalability techniques: heartbeat IPs and parallel deployment for entry, stateless design for business, fine‑grained caching with consistent hashing, and replication or sharding for databases. Even small services should adopt these patterns early, as the cost is modest and prevents catastrophic failures when traffic grows. Modern platforms (GAE, Heroku, Cloud Foundry, Docker, Kubernetes, Mesos) further abstract machines, making HA and scalability inherent to the deployment environment.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsBackend ArchitectureScalabilityload balancing
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.