Tagged articles

Network Reliability

25 articles · Page 1 of 1

Jul 18, 2025 · Fundamentals

Why Does TCP Retransmission Matter? Unveiling the Secrets Behind Reliable Data Transfer

TCP's retransmission mechanisms ensure reliable data delivery by handling packet loss through timeout‑based retransmission, fast retransmission, SACK, and D‑SACK, while adaptive RTO calculation and optimization strategies keep network performance stable across varied conditions.

Karn algorithm · Network Reliability · Packet loss

0 likes · 18 min read

FunTester

May 15, 2025 · Operations

Uncovering the Eight Hidden Pitfalls That Can Crash Your Distributed System

This article dissects the classic Eight Fallacies of Distributed Computing, explaining each mistaken assumption about network reliability, latency, bandwidth, security, topology, administration, cost, and homogeneity, and provides real‑world case studies and practical recommendations to help engineers design more resilient distributed systems.

Fallacies · Latency · Network Reliability

0 likes · 16 min read

Xiaokun's Architecture Exploration Notes

May 11, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Distributed systems suffer from network unreliability, including packet loss, out‑of‑order delivery, variable latency, and ambiguous node failures, which makes timeout settings and fault detection challenging. This article explains these issues, compares synchronous and asynchronous networks, and discusses strategies to balance latency and resource utilization.

Network Reliability · asynchronous network · distributed systems

0 likes · 8 min read

Xiaokun's Architecture Exploration Notes

Apr 20, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems—and How to Mitigate Them

The article explains how network failures such as packet loss, reordering, latency, and ambiguous node failures make distributed systems unreliable, compares synchronous and asynchronous networks, and discusses the trade‑off between timeout settings and resource utilization.

Latency · Network Reliability · Node Failure

0 likes · 8 min read

Cognitive Technology Team

Feb 2, 2025 · Fundamentals

Common Misconceptions in Distributed System Design and Their Solutions

Distributed system designs often fall prey to misconceptions such as assuming reliable networks, zero latency, unlimited bandwidth, inherent security, static topology, zero transmission cost, and full autonomy. Applying retries, idempotency, message queues, encryption, dynamic discovery, caching, and time protocols can mitigate these issues.

Consensus · Latency · Network Reliability

0 likes · 5 min read

Alibaba Cloud Observability

Jan 6, 2025 · Operations

How Synthetic Monitoring Boosts Network Reliability and User Experience

This article explains the importance of network stability, outlines major real‑world outages, and introduces synthetic monitoring—its functions, advantages, disadvantages, and various types such as protocol, browser, and internal monitoring—while comparing probe point categories and guiding enterprises on selecting the right strategy to improve service reliability and performance.

Network Reliability · Observability · Operations

0 likes · 12 min read

Cognitive Technology Team

May 15, 2024 · Fundamentals

The Fallacies of Distributed Systems: Understanding Common Network Assumptions

This article revisits the classic “Fallacies of Distributed Systems” introduced by Peter Deutsch, explaining why assumptions such as reliable networks, zero latency, infinite bandwidth, secure and homogeneous communication are false, and offering practical strategies like retries, caching, batching, and security‑first design to build robust distributed applications.

Fallacies · Latency · Network Reliability

0 likes · 4 min read

Ctrip Technology

Dec 14, 2023 · Operations

Improving Optical Transport Network Reliability at Ctrip: Architecture, Issue Analysis, and Optimization Strategies

This article describes Ctrip's optical transport network (TOTN) architecture, analyzes frequent fiber‑cut incidents and resulting device port flapping, presents technical research on fast optical switching and alarm delay, and details an optimization plan that achieved sub‑100 ms fault‑free switchover and stable Redis performance.

DCI · Link Delay · Network Reliability

0 likes · 11 min read

Alibaba Cloud Infrastructure

Oct 11, 2023 · Cloud Computing

Alibaba Cloud Infrastructure Network Team’s Contributions at ECOC 2023: Papers, Tutorial, and Workshop on Optical Data Center Networks

The Alibaba Cloud Infrastructure Network team presented a pioneering paper, a high‑profile tutorial, and a workshop at the 2023 European Conference on Optical Communications (ECOC), showcasing systematic analyses of optical network unavailability, innovative data‑center optical network designs, and multi‑fiber scaling strategies for large‑scale cloud operators.

Data Center · ECOC 2023 · Network Reliability

0 likes · 5 min read

Big Data Technology Architecture

Mar 15, 2023 · Big Data

Ensuring Secure Write Paths in Hadoop S3A: Experiments, Benchmarks, and Best Practices

This article analyses the security of Hadoop S3A write paths in data lakes, explains fast upload mechanisms, demonstrates disk‑IO and network‑error simulations, compares checksum algorithms, and presents Alibaba Cloud EMR JindoSDK best‑practice results with performance and reliability evaluations.

Hadoop · Network Reliability · S3A

0 likes · 24 min read

Efficient Ops

Jul 18, 2022 · Operations

When Core Switches Fail: A Network Engineer’s Close Call and Lessons Learned

A network engineer recounts a terrifying core switch outage caused by an SSD firmware bug, describes the emergency troubleshooting steps, the eventual fix through firmware upgrade, and urges manufacturers to adopt recall mechanisms for critical network equipment.

Network Reliability · SSD firmware bug · core switch failure

0 likes · 9 min read

Refining Core Development Skills

Jun 6, 2022 · Fundamentals

Why Large File Downloads Still Need Integrity Verification Despite TCP Reliability

Although TCP provides reliable transmission, its guarantees have limits—such as incomplete CRC checks, process crashes before data reaches the transport layer, possible ISP tampering, and kernel‑level ACKs that don’t ensure user‑space receipt—so large downloads must still be validated for full integrity.

Download · Linux · Network Reliability

0 likes · 4 min read

Open Source Linux

Jun 6, 2022 · Operations

Why Stack Switches? Benefits and Step‑by‑Step Guide to Building a Stacked Network

This article explains what switch stacking is, why it improves reliability, expands port capacity, boosts bandwidth, simplifies network design, and supports long‑distance deployments, then details the devices that support stacking and provides a complete step‑by‑step process for creating a stacked network.

Network Operations · Network Reliability · bandwidth increase

0 likes · 10 min read

MaGe Linux Operations

Jun 4, 2022 · Operations

Boost Bandwidth and Reliability with Link Aggregation: Principles and Config Guide

This article explains why link aggregation is needed as networks grow, describes its bandwidth‑boosting and reliability benefits, outlines manual and LACP modes, details data‑flow handling, and provides step‑by‑step configuration commands and troubleshooting tips for Ethernet trunks.

Ethernet · LACP · Network Bandwidth

0 likes · 12 min read

Architects' Tech Alliance

Jun 1, 2022 · Fundamentals

Understanding Switch Stacking: Benefits, Supported Devices, and Configuration Process

Switch stacking connects multiple compatible switches via stacking cables to form a single logical device, enhancing reliability, expanding port count, increasing bandwidth, simplifying network topology, supporting long‑distance stacking, and reducing maintenance, with details on supported Huawei devices, roles, IDs, priorities, and the step‑by‑step setup process.

Network Reliability · bandwidth scaling · network topology

0 likes · 10 min read

Huawei Cloud Developer Alliance

Nov 24, 2021 · Cloud Computing

How Proactive Link Monitoring Transforms Cloud Network Reliability

This article explains Huawei Cloud Stack's proactive link monitoring system, detailing its point‑line‑plane architecture, golden metrics of packet loss and latency, detection techniques, system components, and key innovations such as strategy optimization, alarm aggregation, and visualized performance dashboards for cloud data‑center networks.

Cloud Monitoring · Data Center · Latency

0 likes · 13 min read

Open Source Linux

Nov 18, 2021 · Operations

Boost Network Bandwidth & Reliability with Link Aggregation: Concepts & Config

This article explains the fundamentals of link aggregation, its motivations, various deployment scenarios, core principles, manual and static LACP modes, data flow control, configuration steps for both layer‑2 and layer‑3 trunks, and troubleshooting commands, helping network engineers increase bandwidth and reliability without hardware upgrades.

Ethernet trunking · LACP · Network Bandwidth

0 likes · 12 min read

Architects' Tech Alliance

Sep 14, 2021 · Fundamentals

Understanding Switch Stacking: Benefits, Devices, and Configuration Process

The article explains switch stacking—a method of connecting multiple stack‑capable switches into a single logical device—to improve reliability, expand port count, increase bandwidth, simplify network design, and support long‑distance deployments, while detailing supported hardware, role definitions, and step‑by‑step setup procedures.

Network Reliability · bandwidth aggregation · network design

0 likes · 8 min read

Youzan Coder

May 31, 2021 · Backend Development

16 TCP Network Programming Best Practices for Building Robust Applications

The article presents sixteen practical TCP network‑programming best practices—from setting SO_REUSEADDR and defining port standards to using application‑layer heartbeats, exponential backoff, connection limits, client‑side load balancing, periodic DNS refresh, optimal buffer sizing, configurable timeouts, proper connection‑pool sizing, and comprehensive metrics—to help developers build stable, reliable applications.

Backend Development · Connection Pool · Network Reliability

0 likes · 28 min read

vivo Internet Technology

Apr 21, 2021 · Operations

System Health Check: Principles and Implementation

System health checks, akin to medical exams, are vital for modern IT infrastructure. They use active and passive monitoring, failover strategies, and tools like Spring Boot Actuator to detect hardware, network, load, or software issues, prevent single points of failure, and keep services continuously available.

High Availability · Monitoring · Network Reliability

0 likes · 12 min read

Efficient Ops

Apr 16, 2020 · Operations

What Caused Cloudflare’s 4‑Hour Outage? Lessons on Cable Management and Process Clarity

A four‑hour Cloudflare outage was triggered by an unauthorized cable removal during planned maintenance and compounded by unclear instructions and unlabeled wiring, highlighting the need for better cable management, clear operational procedures, and robust single‑point‑of‑failure mitigation.

Cloudflare · Network Reliability · Outage

0 likes · 3 min read

21CTO

Mar 30, 2020 · Cloud Computing

What Triggered the Massive Google Cloud Outage on March 26 2020?

On March 26, 2020, Google’s core services, including Search, Gmail, YouTube and G Suite, experienced a worldwide outage caused by a router failure in an Atlanta data center and a third‑party software bug that disrupted traffic across multiple regions. The incident prompted detailed analysis from Google, DownDetector, ThousandEyes and other observers.

Google Cloud · Network Reliability · Outage

0 likes · 8 min read

Suning Technology

Nov 11, 2017 · Operations

Zero-Accident O2O Festival: Suning’s Frontend, App & Network Ops Secrets

Suning’s 2017 O2O shopping festival achieved a “zero‑incident” goal by integrating real‑time browser performance monitoring, WEEX‑based WAP acceleration, comprehensive app data collection with cloud‑based analytics, precise DNS and HTTP2 optimizations, and a multi‑layer network and service monitoring system that enabled rapid fault detection and capacity planning.

App Operations · Network Reliability · O2O

0 likes · 15 min read

Tencent TDS Service

Dec 15, 2016 · Operations

Why Mobile Apps Need Their Own Timeout Strategy Beyond TCP

This article examines the design of read/write timeout mechanisms in WeChat's STN module, comparing TCP/IP layer retransmission with application‑level strategies, presenting experimental data from Android and iOS devices, and proposing total, first‑packet, packet‑gap, and dynamic timeout solutions to improve reliability on mobile networks.

Application Layer · Mobile Networking · Network Reliability

0 likes · 15 min read

Ctrip Technology

Aug 5, 2016 · Mobile Development

Ctrip Mobile App Network Service Architecture and Performance Optimizations

This article details Ctrip's mobile app network service architecture, the rationale for using TCP over HTTP, and a series of channel governance and performance optimization techniques—including DNS bypass, socket connection pooling, weak‑network handling, data format improvements, retry mechanisms, Hybrid and overseas network enhancements—demonstrating how these measures raised service success rates above 99% and reduced latency.

Ctrip · Mobile Networking · Network Reliability

0 likes · 16 min read
