How WeChat’s STN Manages Read/Write Timeouts in Mobile Networks
This article explores the design and experimentation of read/write timeout mechanisms in WeChat's STN module, detailing TCP/IP timeout behavior, mobile platform variations, and three layered application‑level strategies—total, stepwise, and dynamic—to improve reliability and user experience on unstable networks.
Introduction
mars is a platform‑independent C++ component used by WeChat on Android, iOS, Windows, Mac, and Windows Phone. It includes several independent parts: COMM (basic library), XLOG (high‑performance logging), SDT (network diagnostics), and STN (signalling transmission network). This article focuses on STN’s read/write timeout design.
Read/Write Timeout and Design Goals
TCP/IP Timeout Design
WeChat signalling runs over the TCP/IP stack, where both the link layer and the transport layer implement their own timeout and retransmission.
Link‑Layer Timeout and Retransmission
On cellular links, the link layer typically uses Hybrid Automatic Repeat Request (HARQ), which combines forward error correction (FEC) with automatic repeat request (ARQ).
Transport‑Layer Timeout and Retransmission
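TCP arms a retransmission timer when it sends a segment; if the ACK does not arrive before the timer expires, the segment is retransmitted. Traditional Unix implementations derive the retransmission timeout (RTO) from the measured round‑trip time (RTT): Jacobson's algorithm computes the RTO from a smoothed RTT and its variation, while Karn's algorithm ignores RTT samples taken from retransmitted segments and applies exponential back‑off to the RTO on each timeout.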
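The RTO calculation can be sketched in a few lines. This is a minimal illustration that follows the constants in RFC 6298 (gains of 1/8 and 1/4, a 1-second initial and minimum RTO) and omits the clock-granularity term; the class name `RtoEstimator` and the 60-second cap are assumptions for the example, and real mobile stacks tune these values differently.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8    # gain for the smoothed RTT (RFC 6298)
BETA = 1 / 4     # gain for the RTT variation (RFC 6298)
K = 4            # variation multiplier (RFC 6298)
MIN_RTO = 1.0    # seconds; RFC 6298 lower bound, many stacks use less
MAX_RTO = 60.0   # seconds; illustrative cap, an assumption of this sketch


@dataclass
class RtoEstimator:
    srtt: Optional[float] = None     # smoothed RTT, unset until first sample
    rttvar: Optional[float] = None   # RTT variation
    rto: float = 1.0                 # initial RTO before any measurement

    def on_rtt_sample(self, rtt: float, retransmitted: bool = False) -> None:
        # Karn's algorithm: an ACK for a retransmitted segment is ambiguous,
        # so its RTT sample is discarded.
        if retransmitted:
            return
        if self.srtt is None:
            # First measurement initialises both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Jacobson's algorithm: update the variation first, then the mean.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt
        self.rto = min(MAX_RTO, max(MIN_RTO, self.srtt + K * self.rttvar))

    def on_timeout(self) -> None:
        # Exponential back-off: double the RTO on every expiry, up to the cap.
        self.rto = min(MAX_RTO, self.rto * 2)
```

With a first sample of 0.5 s the estimator yields an RTO of 1.5 s, and two consecutive timeouts push it to 3 s and then 6 s, which is the doubling pattern the back-off produces.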
Measurements on Android devices (OPPO, Samsung) show RTO intervals following exponential back‑off, while iOS exhibits more aggressive, sometimes non‑exponential patterns.
Application‑Layer Timeout Goals
Maximize success rate within user‑acceptable latency.
Ensure availability on weak networks.
Maintain network sensitivity to quickly discover better paths.
The application layer cannot rely solely on lower‑layer mechanisms; it must provide request‑level reliability.
WeChat Read/Write Timeout Strategies
Solution 1: Total Read/Write Timeout
Decompose the request RTT into send time, signalling receive time, server processing time, and waiting time, then apply a total timeout that varies with network speed.
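As a rough sketch of this decomposition, the total timeout can be written as the sum of the four components, with the transfer times derived from a per-network speed floor. The speed table, the function name `total_timeout`, and all numeric values are hypothetical and chosen only for illustration; they are not STN's actual parameters.

```python
# Hypothetical minimum throughput per network type, in bytes per second.
# These figures are assumptions for the example, not values taken from mars.
ASSUMED_MIN_SPEED_BPS = {
    "wifi": 12 * 1024,
    "mobile": 3 * 1024,
}


def total_timeout(send_bytes: int, expected_recv_bytes: int, net_type: str,
                  server_cost_s: float, wait_s: float) -> float:
    """Sum the four parts of a request's round trip into one deadline (seconds)."""
    speed = ASSUMED_MIN_SPEED_BPS[net_type]
    send_time = send_bytes / speed           # time to push the request out
    recv_time = expected_recv_bytes / speed  # time to pull the response in
    return send_time + server_cost_s + recv_time + wait_s
```

For a 3 KiB request with a 6 KiB response, 1 s of server processing, and 0.5 s of waiting, this gives 4.5 s on the assumed mobile floor and 2.25 s on the assumed Wi-Fi floor, so the same request gets a tighter deadline on the faster network.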
Solution 2: Stepwise Timeout
Introduce a first‑packet timeout to detect early failures, followed by a packet‑to‑packet timeout ("packet‑packet timeout") that estimates RTT after the first packet is acknowledged.
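One way to picture the stepwise scheme is as two timers under an overall deadline. The sketch below assumes that the packet-packet timeout bounds the gap between consecutive pieces of response data once the first has arrived, and that the total timeout from Solution 1 remains in force as an outer limit; the class `StepwiseTimeout` and its method names are hypothetical.

```python
class StepwiseTimeout:
    """Tracks first-packet, packet-packet, and total deadlines for one request.

    Times are plain floats in seconds; the caller supplies the clock value,
    which keeps the sketch independent of any particular event loop.
    """

    def __init__(self, first_packet_timeout: float, packet_packet_timeout: float,
                 total_timeout: float, now: float) -> None:
        self._packet_packet_timeout = packet_packet_timeout
        # Until any response data arrives, the first-packet timer is active.
        self._step_deadline = now + first_packet_timeout
        # The total deadline never moves once the request has been sent.
        self._total_deadline = now + total_timeout

    def on_data(self, now: float) -> None:
        # Each piece of response data re-arms the shorter packet-packet timer.
        self._step_deadline = now + self._packet_packet_timeout

    def expired(self, now: float) -> bool:
        # The request fails as soon as either the current step or the total expires.
        return now >= min(self._step_deadline, self._total_deadline)
```

A caller would create the tracker when the request is written, call `on_data` from the read callback, and poll `expired` from its timer. A dead link is then detected after the first-packet timeout instead of the full total timeout, while a slow but live transfer keeps extending its step deadline until the total limit is reached.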
Solution 3: Dynamic Timeout
Compute adaptive timeouts from real‑time network speed and server processing estimates. Obtaining these inputs precisely, however, requires costly measurements and extra signalling.
Practical dynamic optimization classifies network conditions into Excellent, Evaluating, and Poor, adjusting the first‑packet timeout accordingly.
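The three-state classification can be sketched as a small state machine driven by recent request outcomes. The transition rule (three consecutive successes to promote, two consecutive failures to demote), the halving of the first-packet timeout in the Excellent state, and the names `NetState` and `DynamicTimeout` are all assumptions made for this example; the source states only that the first-packet timeout is adjusted per state.

```python
from enum import Enum


class NetState(Enum):
    EXCELLENT = "excellent"
    EVALUATING = "evaluating"
    POOR = "poor"


class DynamicTimeout:
    def __init__(self, base_first_packet_timeout: float, promote_after: int = 3,
                 demote_after: int = 2, excellent_factor: float = 0.5) -> None:
        self._base = base_first_packet_timeout
        self._promote_after = promote_after        # hypothetical threshold
        self._demote_after = demote_after          # hypothetical threshold
        self._excellent_factor = excellent_factor  # hypothetical scaling
        self._successes = 0
        self._failures = 0
        self.state = NetState.EVALUATING           # nothing is known at start

    def record(self, success: bool) -> None:
        """Feed in the outcome of one finished request."""
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self._promote_after:
                self.state = NetState.EXCELLENT
            elif self.state is NetState.POOR:
                self.state = NetState.EVALUATING
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self._demote_after:
                self.state = NetState.POOR
            else:
                self.state = NetState.EVALUATING

    def first_packet_timeout(self) -> float:
        # On a network judged Excellent, a stalled first packet more likely
        # means a dead path, so the sketch fails fast and lets the caller retry.
        if self.state is NetState.EXCELLENT:
            return self._base * self._excellent_factor
        return self._base
```

Classifying from outcomes the client already observes avoids the dedicated speed probes that a fully dynamic scheme would need, at the price of a coarser estimate.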
Conclusion
Although lower‑layer protocols already provide reliable transmission, the application layer has distinct reliability requirements that necessitate its own timeout and retransmission mechanisms. The design goals are to improve success rate within acceptable latency, ensure availability on weak networks, and quickly adapt to better links. mars STN’s timeout mechanisms have evolved through total, first‑packet, packet‑packet, and dynamic strategies, and will continue to be refined as they are open‑sourced and validated at WeChat’s massive scale.
WeChat Client Technology Team
Official account of the WeChat mobile client development team, sharing development experience, cutting‑edge tech, and little‑known stories across Android, iOS, macOS, Windows Phone, and Windows.
