Building Robust Distributed Systems: Reducing Dependencies and Enhancing Resilience
The article explains how to design resilient distributed systems by minimizing inter‑component dependencies, duplicating or denormalizing data, isolating failures with SLAs, protecting callers and callees, and adding buffers such as asynchronous messaging and elastic scaling to handle random faults as systems grow.
In a previous post the author introduced what distributed systems are and how they provide massive scalability at the cost of more complex design; this article focuses on making those systems resilient to random failures that become more common as the system grows.
System theory tells us that the more inter‑connected parts a system has, the higher the chance of a large‑scale failure, so building resilience means reducing connections or temporarily cutting off faulty components to prevent error cascades.
Each component must assume that any other component may fail at some point and decide how to respond when that happens. Adding buffers—relaxed requirements or slack—helps the system cope with unexpected conditions.
1. Minimize Inter‑Component Dependencies
Components communicate to obtain data or functionality; pushing that data or functionality to the caller instead of accessing it remotely reduces the number of connections. Replicating frequently accessed data locally, caching data that changes infrequently, and denormalizing relational data all lower runtime dependencies and latency.
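The local-caching idea can be sketched with a minimal TTL cache: the remote component is called once on a miss, and subsequent reads are served locally until the entry expires. The class and method names here (`LocalTtlCache`, `remoteLoader`) are illustrative, not from the article.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal local TTL cache: fetch from the remote component once on a miss,
// then serve reads locally until the entry expires, cutting runtime
// dependencies on the remote component.
class LocalTtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    LocalTtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached value, or loads it via the (remote) loader on a miss.
    V get(K key, Function<K, V> remoteLoader) {
        long now = System.currentTimeMillis();
        Entry<V> cached = entries.get(key);
        if (cached != null && cached.expiresAtMillis() > now) {
            return cached.value();   // served locally, no remote call
        }
        V value = remoteLoader.apply(key);
        entries.put(key, new Entry<>(value, now + ttlMillis));
        return value;
    }
}
```

The TTL is the trade-off knob: a longer TTL means fewer remote calls but staler data, which is why this approach suits data that changes infrequently.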
Packaging remote components as libraries can also break dependencies, though it may introduce version‑upgrade challenges.
2. Isolate Failures
Failure isolation is crucial: individual errors are routine in distributed systems, and if left unchecked they cascade across components until the whole system fails, defeating the purpose of distributing it in the first place.
Each component declares an SLA (latency, error rate, concurrency, etc.). Callers treat a component as failed if it cannot meet its SLA, using timeouts, retries (when operations are idempotent), or circuit breakers that temporarily stop calls to allow recovery.
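A circuit breaker, as described above, can be sketched as a small state machine: after a threshold of consecutive failures it opens and fails fast, then allows a trial call once a cooldown elapses. This is a simplified sketch with assumed names (`CircuitBreaker`, `failureThreshold`, `openMillis`), not a production implementation.

```java
import java.util.concurrent.Callable;

// Sketch of a circuit breaker: after `failureThreshold` consecutive failures
// the breaker opens and rejects calls immediately; once `openMillis` has
// elapsed it permits a trial call and closes again on success.
class CircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;
    private State state = State.CLOSED;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized <T> T call(Callable<T> operation) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAtMillis < openMillis) {
                // Fail fast without touching the struggling callee.
                throw new IllegalStateException("circuit open: failing fast");
            }
            // Cooldown elapsed: fall through and allow one trial call.
        }
        try {
            T result = operation.call();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAtMillis = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```

The fail-fast path is what stops a cascade: the caller stays within its own SLA instead of queueing behind a dead dependency.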
Protecting the callee involves adding random jitter to retries to avoid “retry storms” and employing back‑pressure so an overloaded component can shed load before violating its SLA.
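The jitter idea can be illustrated with a "full jitter" backoff: each retry sleeps a random duration in [0, min(cap, base · 2^attempt)], so retries from many callers spread out instead of arriving in synchronized waves. The helper name `delayMillis` is an assumption for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Full-jitter exponential backoff: the delay before retry N is drawn
// uniformly from [0, min(cap, base * 2^N)], which de-synchronizes retries
// from many callers and avoids "retry storms" against a recovering callee.
class JitteredBackoff {
    static long delayMillis(long baseMillis, long capMillis, int attempt) {
        long ceiling = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 30)));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

Without the random draw, every caller that failed at the same instant would retry at the same instant, hammering the callee exactly when it is trying to recover.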
3. Build Buffers in the System
Asynchronous communication channels such as message buses let callers invoke remote components without strict SLA dependencies, providing flexibility under load.
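In-process, the message-bus idea reduces to a bounded queue between caller and worker: the caller enqueues and returns immediately, the worker drains at its own pace, and a full buffer doubles as a back-pressure signal. This is a minimal stand-in sketch (the `AsyncBuffer` name and `String` payload are assumptions), not a real message bus.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded queue standing in for a message bus: callers enqueue work and
// return immediately; a worker drains at its own pace. The capacity bound
// itself provides back-pressure: submit() fails when the buffer is full.
class AsyncBuffer {
    private final BlockingQueue<String> queue;

    AsyncBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Caller side: non-blocking enqueue; false means the buffer is full and
    // the caller should back off rather than overload the worker.
    boolean submit(String message) {
        return queue.offer(message);
    }

    // Worker side: take the next message, or null if the buffer is empty.
    String poll() {
        return queue.poll();
    }

    int pending() {
        return queue.size();
    }
}
```

A real system would put a durable broker (e.g. a message bus) between the two sides, but the decoupling property is the same: the caller's latency no longer depends on the worker's SLA.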
Elastic scaling—adding more hardware when traffic spikes—offers a final line of defense, assuming cost constraints allow.
Overall, the article emphasizes reducing coupling, providing redundancy, enforcing SLAs, and adding buffering mechanisms to achieve fault‑tolerant, scalable distributed architectures.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.