Tag: network reliability


Xiaokun's Architecture Exploration Notes
May 11, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems suffer from network unreliability, including packet loss, out‑of‑order delivery, variable latency, and ambiguous node failures, which makes choosing timeouts and detecting faults difficult. This article explains these issues, compares synchronous and asynchronous networks, and discusses strategies to balance latency and resource utilization.

asynchronous network · distributed systems · fault tolerance
0 likes · 8 min read
Xiaokun's Architecture Exploration Notes
Apr 20, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems—and How to Mitigate Them

The article explains how network failures such as packet loss, reordering, latency, and ambiguous node failures make distributed systems unreliable, compares synchronous and asynchronous networks, and discusses the trade‑off between timeout settings and resource utilization.

Node Failure · asynchronous network · distributed systems
0 likes · 8 min read
Cognitive Technology Team
Feb 2, 2025 · Fundamentals

Common Misconceptions in Distributed System Design and Their Solutions

Distributed system designs often fall prey to misconceptions such as assuming reliable networks, zero latency, unlimited bandwidth, inherent security, static topology, zero transmission cost, and full autonomy. Applying retries, idempotency, message queues, encryption, dynamic discovery, caching, and time protocols can mitigate these issues.

Consensus · Security · distributed systems
0 likes · 5 min read
Cognitive Technology Team
May 15, 2024 · Fundamentals

The Fallacies of Distributed Systems: Understanding Common Network Assumptions

This article revisits the classic “Fallacies of Distributed Systems” attributed to L. Peter Deutsch, explaining why assumptions such as reliable networks, zero latency, infinite bandwidth, and secure, homogeneous communication are false, and offering practical strategies such as retries, caching, batching, and security‑first design for building robust distributed applications.

Bandwidth · Security · distributed systems
0 likes · 4 min read
Ctrip Technology
Dec 14, 2023 · Operations

Improving Optical Transport Network Reliability at Ctrip: Architecture, Issue Analysis, and Optimization Strategies

This article describes Ctrip's optical transport network (TOTN) architecture, analyzes frequent fiber‑cut incidents and resulting device port flapping, presents technical research on fast optical switching and alarm delay, and details an optimization plan that achieved sub‑100 ms fault‑free switchover and stable Redis performance.

DCI · Link Delay · Optical Transport
0 likes · 11 min read
Alibaba Cloud Infrastructure
Oct 11, 2023 · Cloud Computing

Alibaba Cloud Infrastructure Network Team’s Contributions at ECOC 2023: Papers, Tutorial, and Workshop on Optical Data Center Networks

The Alibaba Cloud Infrastructure Network team presented a pioneering paper, a high‑profile tutorial, and a workshop at the 2023 European Conference on Optical Communications (ECOC), showcasing systematic analyses of optical network unavailability, innovative data‑center optical network designs, and multi‑fiber scaling strategies for large‑scale cloud operators.

Data Center · ECOC 2023 · cloud infrastructure
0 likes · 5 min read
Big Data Technology Architecture
Mar 15, 2023 · Big Data

Ensuring Secure Write Paths in Hadoop S3A: Experiments, Benchmarks, and Best Practices

This article analyses the security of Hadoop S3A write paths in data lakes, explains fast upload mechanisms, demonstrates disk‑IO and network‑error simulations, compares checksum algorithms, and presents Alibaba Cloud EMR JindoSDK best‑practice results with performance and reliability evaluations.

Big Data · Data Lake · Hadoop
0 likes · 24 min read
Efficient Ops
Jul 18, 2022 · Operations

When Core Switches Fail: A Network Engineer’s Close Call and Lessons Learned

A network engineer recounts a terrifying core switch outage caused by an SSD firmware bug, describes the emergency troubleshooting steps, the eventual fix through firmware upgrade, and urges manufacturers to adopt recall mechanisms for critical network equipment.

SSD firmware bug · core switch failure · network reliability
0 likes · 9 min read
Refining Core Development Skills
Jun 6, 2022 · Fundamentals

Why Large File Downloads Still Need Integrity Verification Despite TCP Reliability

Although TCP provides reliable transmission, its guarantees have limits—such as incomplete CRC checks, process crashes before data reaches the transport layer, possible ISP tampering, and kernel‑level ACKs that don’t ensure user‑space receipt—so large downloads must still be validated for full integrity.

Linux · checksum · download
0 likes · 4 min read
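As a rough illustration of the end‑to‑end check this summary argues for, the sketch below hashes a downloaded file in chunks and compares the result with a digest published by the sender. The function names, the choice of SHA‑256, and the chunk size are assumptions made for this example; the summary does not say which algorithm or tool the article uses.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks.

    Hypothetical helper: chunked reads keep memory use flat for large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's digest matches the published one.

    Assumes the publisher supplies a hex-encoded SHA-256 digest; the
    comparison ignores case and surrounding whitespace.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A caller would obtain the expected digest over a separate trusted channel and treat a `False` result as a corrupted or tampered download to be fetched again.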
Architects' Tech Alliance
Jun 1, 2022 · Fundamentals

Understanding Switch Stacking: Benefits, Supported Devices, and Configuration Process

Switch stacking connects multiple compatible switches via stacking cables to form a single logical device, enhancing reliability, expanding port count, increasing bandwidth, simplifying network topology, supporting long‑distance stacking, and reducing maintenance, with details on supported Huawei devices, roles, IDs, priorities, and the step‑by‑step setup process.

Network Topology · Switch Stacking · bandwidth scaling
0 likes · 10 min read
Architects' Tech Alliance
Sep 14, 2021 · Fundamentals

Understanding Switch Stacking: Benefits, Devices, and Configuration Process

The article explains switch stacking—a method of connecting multiple stack‑capable switches into a single logical device—to improve reliability, expand port count, increase bandwidth, simplify network design, and support long‑distance deployments, while detailing supported hardware, role definitions, and step‑by‑step setup procedures.

Switch Stacking · bandwidth aggregation · network design
0 likes · 8 min read
Youzan Coder
May 31, 2021 · Backend Development

16 TCP Network Programming Best Practices for Building Robust Applications

The article presents sixteen practical TCP network‑programming best practices—from setting SO_REUSEADDR and defining port standards to using application‑layer heartbeats, exponential backoff, connection limits, client‑side load balancing, periodic DNS refresh, optimal buffer sizing, configurable timeouts, proper connection‑pool sizing, and comprehensive metrics—to help developers build stable, reliable applications.

Linux TCP · Socket Programming · TCP Network Programming
0 likes · 28 min read
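One of the practices named in this summary, exponential backoff, can be sketched as below. This is a minimal illustration and not the article's code: the `retry_with_backoff` name, its default values, the added random jitter, and the decision to retry only on `ConnectionError` are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base=0.5, cap=30.0,
                       sleep=time.sleep, rng=random):
    """Call operation() and retry on ConnectionError with exponential backoff.

    Hypothetical helper. The wait before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)], so the ceiling doubles on each failure and
    the jitter (an assumption of this sketch) spreads out reconnecting clients.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
```

The `sleep` and `rng` parameters exist only so the delay logic can be exercised without real waiting; production callers would normally leave them at their defaults.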
vivo Internet Technology
Apr 21, 2021 · Operations

System Health Check: Principles and Implementation

System health checks, akin to medical exams, are vital for modern IT infrastructure, using active and passive monitoring, failover strategies, and tools like Spring Boot Actuator to detect hardware, network, load, or software issues, prevent single points of failure, and ensure continuous high‑availability service operation.

Failover · High Availability · RocketMQ
0 likes · 12 min read
Efficient Ops
Apr 16, 2020 · Operations

What Caused Cloudflare’s 4‑Hour Outage? Lessons on Cable Management and Process Clarity

A four‑hour Cloudflare outage was triggered by an unauthorized cable removal during planned maintenance and compounded by unclear instructions and unlabeled wiring. The incident highlights the need for better cable management, clear operational procedures, and robust single‑point‑of‑failure mitigation.

Cloudflare · Data Center Operations · cable management
0 likes · 3 min read
Ctrip Technology
Aug 5, 2016 · Mobile Development

Ctrip Mobile App Network Service Architecture and Performance Optimizations

This article details Ctrip's mobile app network service architecture, the rationale for using TCP over HTTP, and a series of channel governance and performance optimization techniques—including DNS bypass, socket connection pooling, weak‑network handling, data format improvements, retry mechanisms, Hybrid and overseas network enhancements—demonstrating how these measures raised service success rates above 99% and reduced latency.

Ctrip · Protocols · TCP optimization
0 likes · 16 min read