Fundamentals 8 min read

Why Unreliable Networks Threaten Distributed Systems—and How to Mitigate Them

The article explains how network failures such as packet loss, reordering, latency, and ambiguous node failures make distributed systems unreliable, compares synchronous and asynchronous networks, and discusses the trade‑off between timeout settings and resource utilization.

Xiaokun's Architecture Exploration Notes
Xiaokun's Architecture Exploration Notes
Xiaokun's Architecture Exploration Notes
Why Unreliable Networks Threaten Distributed Systems—and How to Mitigate Them

Unreliable Network Issue Classification

Distributed systems suffer from partial failures not only from faulty components but also from unreliable network conditions.

Request/Response Loss : Packets may be completely lost due to physical link failures (e.g., a cut fiber) or protocol errors (e.g., TCP retransmission limits).

Reordering and Delay : Asynchronous networks cannot guarantee in‑order delivery; congestion control algorithms and hop count cause long‑tail latency, where the 99th‑percentile delay can be ten times the average.

Ambiguity of Node Failure : It is hard to distinguish a network partition from an actual node crash, leading to false failure detections.

Synchronous Network and Bounded Latency

A synchronous network reserves a fixed bandwidth path for the entire communication session, similar to a dedicated telephone circuit. The topology is typically star‑shaped, providing predictable, fixed end‑to‑end latency, which we call bounded latency.

Synchronous network illustration
Synchronous network illustration

Asynchronous Network and Unbounded Latency

Data‑center and mobile networks operate as asynchronous networks, akin to a highway where vehicles of different sizes (data packets) share lanes and may change lanes or queue, causing congestion. Consequently, end‑to‑end latency is unpredictable (unbounded latency).

Asynchronous network illustration
Asynchronous network illustration

Packet reordering arises from packet‑switching: packets are split into smaller units and may arrive out of order, requiring sequence numbers or delimiters to reassemble correctly.

Typical asynchronous network topologies include Fat‑Tree and Spine‑Leaf architectures:

Fat‑Tree topology
Fat‑Tree topology
Spine‑Leaf topology
Spine‑Leaf topology

Ambiguity of Node Failure

Network unreliability adds complexity to fault‑tolerance mechanisms because it is difficult to tell whether a node is truly down or merely unreachable due to network issues.

In compute clusters, load balancers must stop sending requests to a dead node (e.g., Node3). In storage clusters, a master‑slave setup requires electing a new master when the current master fails.

Compute cluster failure detection
Compute cluster failure detection
Storage cluster master election
Storage cluster master election

Timeout settings must balance between being long enough to avoid false positives and short enough to prevent prolonged failures; a heuristic of 2d + r (round‑trip delay plus processing time) is often suggested, but real‑world variability makes precise tuning difficult.

Summary

In data‑center asynchronous networks, packet‑switching and congestion‑induced queuing introduce significant uncertainty in timeout configuration, forcing engineers to trade off between latency tolerance and resource utilization.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsLatencyNode FailureNetwork Reliabilityasynchronous networksynchronous network
Xiaokun's Architecture Exploration Notes
Written by

Xiaokun's Architecture Exploration Notes

10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.