Tagged articles
25 articles
FunTester
May 15, 2025 · Operations

Uncovering the Eight Hidden Pitfalls That Can Crash Your Distributed System

This article dissects the classic Eight Fallacies of Distributed Computing, explaining each mistaken assumption about network reliability, latency, bandwidth, security, topology, administration, cost, and homogeneity, and provides real‑world case studies and practical recommendations to help engineers design more resilient distributed systems.

Distributed Systems · Fallacies · Latency
0 likes · 16 min read
Xiaokun's Architecture Exploration Notes
May 11, 2025 · Fundamentals

Why Unreliable Networks Threaten Distributed Systems and How to Mitigate Them

Network unreliability, including packet loss, out‑of‑order delivery, variable latency, and ambiguous node failures, makes timeout settings and fault detection difficult in distributed systems. This article explains these issues, compares synchronous and asynchronous networks, and discusses strategies for balancing latency against resource utilization.

Distributed Systems · Network Reliability · asynchronous network
0 likes · 8 min read
Cognitive Technology Team
Feb 2, 2025 · Fundamentals

Common Misconceptions in Distributed System Design and Their Solutions

Distributed system designs often fall prey to misconceptions such as assuming reliable networks, zero latency, unlimited bandwidth, inherent security, static topology, zero transmission cost, and full autonomy. Retries, idempotency, message queues, encryption, dynamic discovery, caching, and time protocols can mitigate these issues.

Consensus · Distributed Systems · Latency
0 likes · 5 min read
Alibaba Cloud Observability
Jan 6, 2025 · Operations

How Synthetic Monitoring Boosts Network Reliability and User Experience

This article explains the importance of network stability, outlines major real‑world outages, and introduces synthetic monitoring—its functions, advantages, disadvantages, and various types such as protocol, browser, and internal monitoring—while comparing probe point categories and guiding enterprises on selecting the right strategy to improve service reliability and performance.

Network Reliability · Observability · Operations
0 likes · 12 min read
Cognitive Technology Team
May 15, 2024 · Fundamentals

The Fallacies of Distributed Systems: Understanding Common Network Assumptions

This article revisits the classic “Fallacies of Distributed Systems” introduced by Peter Deutsch, explaining why assumptions such as reliable networks, zero latency, infinite bandwidth, and secure, homogeneous communication are false, and offering practical strategies like retries, caching, batching, and security‑first design to build robust distributed applications.

Distributed Systems · Fallacies · Latency
0 likes · 4 min read
Ctrip Technology
Dec 14, 2023 · Operations

Improving Optical Transport Network Reliability at Ctrip: Architecture, Issue Analysis, and Optimization Strategies

This article describes Ctrip's optical transport network (TOTN) architecture, analyzes frequent fiber‑cut incidents and resulting device port flapping, presents technical research on fast optical switching and alarm delay, and details an optimization plan that achieved sub‑100 ms fault‑free switchover and stable Redis performance.

DCI · Link Delay · Network Reliability
0 likes · 11 min read
Alibaba Cloud Infrastructure
Oct 11, 2023 · Cloud Computing

Alibaba Cloud Infrastructure Network Team’s Contributions at ECOC 2023: Papers, Tutorial, and Workshop on Optical Data Center Networks

The Alibaba Cloud Infrastructure Network team presented a pioneering paper, a high‑profile tutorial, and a workshop at the 2023 European Conference on Optical Communications (ECOC), showcasing systematic analyses of optical network unavailability, innovative data‑center optical network designs, and multi‑fiber scaling strategies for large‑scale cloud operators.

Data center · ECOC 2023 · Network Reliability
0 likes · 5 min read
Efficient Ops
Jul 18, 2022 · Operations

When Core Switches Fail: A Network Engineer’s Close Call and Lessons Learned

A network engineer recounts a terrifying core switch outage caused by an SSD firmware bug, describes the emergency troubleshooting steps, the eventual fix through firmware upgrade, and urges manufacturers to adopt recall mechanisms for critical network equipment.

Network Reliability · SSD firmware bug · core switch failure
0 likes · 9 min read
Refining Core Development Skills
Jun 6, 2022 · Fundamentals

Why Large File Downloads Still Need Integrity Verification Despite TCP Reliability

Although TCP provides reliable transmission, its guarantees have limits—such as incomplete CRC checks, process crashes before data reaches the transport layer, possible ISP tampering, and kernel‑level ACKs that don’t ensure user‑space receipt—so large downloads must still be validated for full integrity.

Download · Linux · Network Reliability
0 likes · 4 min read
Open Source Linux
Jun 6, 2022 · Operations

Why Stack Switches? Benefits and Step‑by‑Step Guide to Building a Stacked Network

This article explains what switch stacking is, why it improves reliability, expands port capacity, boosts bandwidth, simplifies network design, and supports long‑distance deployments, then details the devices that support stacking and provides a complete step‑by‑step process for creating a stacked network.

Network Reliability · bandwidth increase · network operations
0 likes · 10 min read
Architects' Tech Alliance
Jun 1, 2022 · Fundamentals

Understanding Switch Stacking: Benefits, Supported Devices, and Configuration Process

Switch stacking connects multiple compatible switches via stacking cables to form a single logical device, enhancing reliability, expanding port count, increasing bandwidth, simplifying network topology, supporting long‑distance stacking, and reducing maintenance, with details on supported Huawei devices, roles, IDs, priorities, and the step‑by‑step setup process.

Network Reliability · bandwidth scaling · network topology
0 likes · 10 min read
Huawei Cloud Developer Alliance
Nov 24, 2021 · Cloud Computing

How Proactive Link Monitoring Transforms Cloud Network Reliability

This article explains Huawei Cloud Stack's proactive link monitoring system, detailing its point‑line‑plane architecture, golden metrics of packet loss and latency, detection techniques, system components, and key innovations such as strategy optimization, alarm aggregation, and visualized performance dashboards for cloud data‑center networks.

Data center · Latency · Network Reliability
0 likes · 13 min read
Open Source Linux
Nov 18, 2021 · Operations

Boost Network Bandwidth & Reliability with Link Aggregation: Concepts & Config

This article explains the fundamentals of link aggregation, its motivations, various deployment scenarios, core principles, manual and static LACP modes, data flow control, configuration steps for both layer‑2 and layer‑3 trunks, and troubleshooting commands, helping network engineers increase bandwidth and reliability without hardware upgrades.

Ethernet trunking · LACP · Network Bandwidth
0 likes · 12 min read
Architects' Tech Alliance
Sep 14, 2021 · Fundamentals

Understanding Switch Stacking: Benefits, Devices, and Configuration Process

The article explains switch stacking—a method of connecting multiple stack‑capable switches into a single logical device—to improve reliability, expand port count, increase bandwidth, simplify network design, and support long‑distance deployments, while detailing supported hardware, role definitions, and step‑by‑step setup procedures.

Network Reliability · bandwidth aggregation · network design
0 likes · 8 min read
Youzan Coder
May 31, 2021 · Backend Development

16 TCP Network Programming Best Practices for Building Robust Applications

The article presents sixteen practical TCP network‑programming best practices—from setting SO_REUSEADDR and defining port standards to using application‑layer heartbeats, exponential backoff, connection limits, client‑side load balancing, periodic DNS refresh, optimal buffer sizing, configurable timeouts, proper connection‑pool sizing, and comprehensive metrics—to help developers build stable, reliable applications.

Backend Development · Connection Pool · Linux TCP
0 likes · 28 min read
vivo Internet Technology
Apr 21, 2021 · Operations

System Health Check: Principles and Implementation

System health checks, akin to medical exams, are vital for modern IT infrastructure, using active and passive monitoring, failover strategies, and tools like Spring Boot Actuator to detect hardware, network, load, or software issues, prevent single points of failure, and ensure continuous high‑availability service operation.

Network Reliability · RocketMQ · Spring Boot Actuator
0 likes · 12 min read
21CTO
Mar 30, 2020 · Cloud Computing

What Triggered the Massive Google Cloud Outage on March 26 2020?

On March 26, 2020, Google’s core services, including Search, Gmail, YouTube, and G Suite, experienced a worldwide outage. The article traces it to a router failure in an Atlanta data center and a third‑party software bug that disrupted traffic across multiple regions, and summarizes analysis from Google, DownDetector, ThousandEyes, and other observers.

Google Cloud · Network Reliability · Outage
0 likes · 8 min read
Suning Technology
Nov 11, 2017 · Operations

Zero-Accident O2O Festival: Suning’s Frontend, App & Network Ops Secrets

Suning’s 2017 O2O shopping festival achieved a “zero‑incident” goal by integrating real‑time browser performance monitoring, WEEX‑based WAP acceleration, comprehensive app data collection with cloud‑based analytics, precise DNS and HTTP/2 optimizations, and a multi‑layer network and service monitoring system that enabled rapid fault detection and capacity planning.

App Operations · Network Reliability · O2O
0 likes · 15 min read
Tencent TDS Service
Dec 15, 2016 · Operations

Why Mobile Apps Need Their Own Timeout Strategy Beyond TCP

This article examines the design of read/write timeout mechanisms in WeChat's STN module, comparing TCP/IP layer retransmission with application‑level strategies, presenting experimental data from Android and iOS devices, and proposing total, first‑packet, packet‑gap, and dynamic timeout solutions to improve reliability on mobile networks.

Application Layer · Mobile Networking · Network Reliability
0 likes · 15 min read
Ctrip Technology
Aug 5, 2016 · Mobile Development

Ctrip Mobile App Network Service Architecture and Performance Optimizations

This article details Ctrip's mobile app network service architecture, the rationale for using TCP over HTTP, and a series of channel governance and performance optimization techniques—including DNS bypass, socket connection pooling, weak‑network handling, data format improvements, retry mechanisms, Hybrid and overseas network enhancements—demonstrating how these measures raised service success rates above 99% and reduced latency.

Ctrip · Mobile Networking · Network Reliability
0 likes · 16 min read