How to Build Highly Available and Scalable Distributed Systems
This article explains the key challenges of high availability and scalability in distributed architectures and provides practical solutions for each layer—entry, business, cache, and database—using techniques such as heartbeat IPs, stateless services, consistent hashing, and sharding.
What Are High Availability and Scalability
High availability means users experience no disruption when a single machine, network card, switch, or cable fails; it also includes handling service‑level failures without user impact. Scalability means the system's capacity grows simply by adding machines and deploying the software on them, with no changes to the system itself.
High Availability Solutions
Entry Layer
Use a heartbeat mechanism such as Keepalived with a virtual IP that can move between two machines in the same subnet. DNS points to the virtual IP; when the primary machine fails, the secondary takes over the virtual IP, so users always reach a live node. Two limitations apply: both machines must be in the same subnet, and services must listen on all addresses rather than only on the virtual IP. Otherwise the service on the standby cannot start, because the standby does not hold the virtual IP until it takes over.
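The standby side of the failover can be sketched as follows. This is a minimal illustration of the timeout idea only, not Keepalived's actual VRRP protocol; the class name `StandbyMonitor`, the 3-second timeout, and the choice to hand the virtual IP back when the primary returns are all assumptions made for the example.

```python
import time


class StandbyMonitor:
    """Standby-side logic: claim the virtual IP when heartbeats stop arriving."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock              # injectable so the logic can be tested
        self.last_heartbeat = clock()
        self.holds_vip = False

    def on_heartbeat(self):
        # The primary is alive: record the time and give the VIP back
        # (assumption: the primary always preempts the standby).
        self.last_heartbeat = self.clock()
        self.holds_vip = False

    def tick(self):
        # Called periodically; take over once the primary has been silent too long.
        if self.clock() - self.last_heartbeat > self.timeout_s:
            self.holds_vip = True
        return self.holds_vip


# Demo with a fake clock instead of real waiting.
now = [0.0]
monitor = StandbyMonitor(timeout_s=3.0, clock=lambda: now[0])
now[0] = 2.0
within_timeout = monitor.tick()   # primary silent for 2 s: no takeover yet
now[0] = 3.5
after_timeout = monitor.tick()    # primary silent for 3.5 s: standby takes the VIP
```

In a real deployment the takeover step would also have to announce the new owner of the address on the network, which a tool like Keepalived handles for you.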
A common mistake is pointing a domain to two separate public IPs via DNS; when one node fails, roughly half of the users will still be directed to the dead IP, causing noticeable outages.
Business Layer
Make services stateless and store state in the cache or database. Stateless services can be load‑balanced by Nginx; if a node goes down or is updated, traffic is seamlessly redirected to the remaining nodes.
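A small sketch of what "stateless" means in practice: the handlers below keep nothing in their own memory between requests, so any node can serve any request. The `SessionStore` class is a hypothetical in-memory stand-in for the centralized cache or database, and the round-robin iterator stands in for the load balancer.

```python
import itertools


class SessionStore:
    """Hypothetical stand-in for a shared cache or database."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, state):
        self._data[session_id] = state


class AppNode:
    """A service instance that holds no per-user state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        # Load state, change it, and write it back on every request.
        state = self.store.get(session_id)
        cart = state.get("cart", []) + [item]
        self.store.put(session_id, {**state, "cart": cart})
        return cart


store = SessionStore()
nodes = [AppNode("node-a", store), AppNode("node-b", store)]
round_robin = itertools.cycle(nodes)

next(round_robin).add_to_cart("session-1", "book")        # served by node-a
cart = next(round_robin).add_to_cart("session-1", "pen")  # served by node-b
# cart now holds both items even though two different nodes handled the requests.
```

Because the second request sees the first request's state, taking either node out of rotation loses nothing.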
Avoid storing session data in cookies for security‑sensitive scenarios, as it can be replayed or tampered with; prefer centralized cache or database storage.
Cache Layer
High‑availability caches are achieved by fine‑grained partitioning and master‑slave replication. If a cache node fails, its traffic falls back to the database, so each partition should be small enough that losing one does not overwhelm the DB.
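The effect of partitioning on a node failure can be illustrated with a small simulation. Everything here is an assumption made for the example: the `CacheNode` class, the simple `crc32 % N` placement, and the dictionary used as the database. With ten partitions, losing one sends only that partition's keys to the database.

```python
import zlib


class CacheNode:
    """In-memory stand-in for one cache partition."""

    def __init__(self):
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError("cache node down")
        return self.data.get(key)

    def set(self, key, value):
        if not self.alive:
            raise ConnectionError("cache node down")
        self.data[key] = value


def read(key, nodes, db, stats):
    """Try the key's cache partition first; fall back to the database."""
    node = nodes[zlib.crc32(key.encode()) % len(nodes)]
    try:
        value = node.get(key)
        if value is not None:
            return value
    except ConnectionError:
        node = None                      # partition is down: skip the refill
    stats["db_reads"] += 1
    value = db[key]
    if node is not None:
        node.set(key, value)             # refill the cache for the next reader
    return value


db = {f"user:{i}": i for i in range(1000)}
nodes = [CacheNode() for _ in range(10)]
stats = {"db_reads": 0}
for key in db:                           # warm the cache
    read(key, nodes, db, stats)

nodes[0].alive = False                   # one partition fails
stats["db_reads"] = 0
for key in db:
    read(key, nodes, db, stats)
# stats["db_reads"] now counts only the keys that lived on the failed partition.
```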
Use consistent hashing with virtual nodes to minimize cache miss spikes when adding or removing nodes; this limits the impact on the database to a small percentage of keys.
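A minimal sketch of a consistent hash ring with virtual nodes follows. The class name `HashRing`, the use of MD5 for placement, and the default of 100 virtual nodes per cache are choices made for this example rather than anything prescribed by the article.

```python
import bisect
import hashlib


def _hash(value):
    # Take 64 bits of MD5 as the position on the ring.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []    # sorted positions on the ring
        self._owner = {}     # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node is placed at many positions (virtual nodes).
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # The first position clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"key-{i}" for i in range(5000)]
ring = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
before = {k: ring.lookup(k) for k in keys}
ring.add("cache-e")
moved = [k for k in keys if ring.lookup(k) != before[k]]
# Only the keys claimed by cache-e move; all other keys stay on their old node.
```

With plain `hash(key) % N` placement, going from four to five nodes would remap most keys. Here only the keys that the new node takes over are remapped, roughly one fifth of them in this setup.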
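Cached data falls into three consistency classes, each handled differently: strong consistency (stale data is never acceptable, e.g., account balances), weak consistency (eventual consistency is acceptable, e.g., social counts), and immutable data (values never change once written, so they can be cached indefinitely).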
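One way to treat the three kinds of data differently is sketched below. The policies are assumptions chosen for illustration: invalidate on write for balances, a 60-second TTL for counts, and no expiry for immutable values. Invalidate-on-write narrows the stale window but does not by itself guarantee strict consistency under concurrent writers.

```python
import time


class TTLCache:
    """Tiny in-memory cache with optional per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}     # key -> (value, expires_at or None)

    def set(self, key, value, ttl_s=None):
        expires_at = None if ttl_s is None else self._clock() + ttl_s
        self._items[key] = (value, expires_at)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]         # expired: behave like a miss
            return None
        return value

    def delete(self, key):
        self._items.pop(key, None)


def write_balance(db, cache, user_id, amount):
    # Strong: update the source of truth, then drop the cached copy.
    db[f"balance:{user_id}"] = amount
    cache.delete(f"balance:{user_id}")


def read_balance(db, cache, user_id):
    key = f"balance:{user_id}"
    value = cache.get(key)
    if value is None:
        value = db[key]
        cache.set(key, value)
    return value


now = [0.0]
cache = TTLCache(clock=lambda: now[0])
db = {"balance:7": 100}

cache.set("likes:post-1", 10, ttl_s=60)   # weak: may be stale for up to 60 s
cache.set("avatar:abc123", b"png-bytes")  # immutable: no expiry needed
first = read_balance(db, cache, 7)        # fills the cache from the database
write_balance(db, cache, 7, 80)
second = read_balance(db, cache, 7)       # sees the new value after invalidation
```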
Database Layer
Most databases already provide HA via master‑slave or master‑master replication (MySQL, MongoDB ReplicaSet, etc.).
Scalability Solutions
Entry Layer
Scale by adding more machines and using DNS round‑robin or large‑capacity servers. For non‑HTTP traffic, clients can query a small service for a list of healthy entry IPs and pick one randomly.
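For the non-HTTP case, the client side might look like the sketch below. The function name `connect`, the two callbacks, and the example addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical; a real client would call the small directory service and open an actual socket.

```python
import random


def connect(fetch_healthy_ips, try_connect, rng=random):
    """Pick entry nodes in random order until one accepts the connection."""
    candidates = list(fetch_healthy_ips())
    rng.shuffle(candidates)              # random order spreads load across nodes
    for ip in candidates:
        if try_connect(ip):
            return ip
    raise ConnectionError("no entry node reachable")


# Demo: the directory lists two nodes, but one has just gone down.
def fake_directory():
    return ["192.0.2.10", "192.0.2.11"]


def fake_try_connect(ip):
    return ip == "192.0.2.11"


chosen = connect(fake_directory, fake_try_connect, rng=random.Random(0))
```

Trying the remaining candidates after a failure covers the short window in which the directory has not yet noticed a dead node.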
Business Layer
Stateless services scale horizontally; Nginx can proxy to dozens or hundreds of backend instances without significant load.
Cache Layer
Redis 3.0 or Codis can be expanded by adding nodes during low‑traffic periods, then bringing new caches online. Consistent hashing reduces cache invalidation when nodes are added.
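When expanding, avoid removing nodes and only increase the node count. If a node must be removed, make sure the interval between node changes exceeds the data’s TTL; otherwise a key can be remapped back to a node that still holds an old copy, producing dirty reads.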
Database Layer
Use horizontal sharding (e.g., split users into 100 tables) to prepare for future scaling to many machines. Vertical splitting separates frequently accessed columns from heavy detail columns, improving index efficiency. Periodic rolling of old data to separate storage (e.g., SSD vs. slower disks) also helps.
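The "100 tables" idea can be sketched as a routing function. The table naming scheme `users_NN`, the host names, and the contiguous assignment of tables to hosts are assumptions for illustration. The point is that the number of logical tables is fixed up front, while the number of machines holding them can grow later.

```python
TABLE_COUNT = 100   # fixed number of logical tables, chosen once


def shard_index(user_id):
    return user_id % TABLE_COUNT


def table_name(user_id):
    return f"users_{shard_index(user_id):02d}"


def host_for(user_id, hosts):
    """Map the logical tables onto however many machines exist today."""
    tables_per_host = -(-TABLE_COUNT // len(hosts))   # ceiling division
    return hosts[shard_index(user_id) // tables_per_host]


print(table_name(12345))                              # users_45
print(host_for(12345, ["db0"]))                       # db0
print(host_for(12345, ["db0", "db1", "db2", "db3"]))  # db1
```

Going from one machine to four moves whole tables between hosts, but a given user always stays in the same table, so application queries do not change.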
Conclusion
Divide a system into entry, business, cache, and database layers, each with specific HA and scalability techniques: heartbeat IPs and parallel deployment for entry, stateless design for business, fine‑grained caching with consistent hashing, and replication or sharding for databases. Even small services should adopt these patterns early, as the cost is modest and prevents catastrophic failures when traffic grows. Modern platforms (GAE, Heroku, Cloud Foundry, Docker, Kubernetes, Mesos) further abstract machines, making HA and scalability inherent to the deployment environment.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
