Why Mobile Apps Need Their Own Timeout Strategy Beyond TCP
This article examines the design of read/write timeout mechanisms in WeChat's STN module, comparing TCP/IP layer retransmission with application‑level strategies, presenting experimental data from Android and iOS devices, and proposing total, first‑packet, packet‑gap, and dynamic timeout solutions to improve reliability on mobile networks.
Preface
mars is a C++‑based, platform‑agnostic client‑side component that WeChat uses on Android, iOS, Windows, Mac, and Windows Phone; it is being prepared for open‑source release. It consists of several independent parts:
COMM: basic library providing sockets, threads, message queues, coroutines, etc.
XLOG: a high‑performance, high‑availability, secure, fault‑tolerant logging module for mobile devices.
SDT: network diagnostics module.
STN: signaling transmission network module responsible for the small‑data signaling channel between client and server, incorporating extensive optimization experience from WeChat's massive user base.
This article introduces the STN module, focusing on the design and considerations of read/write timeout.
Read/Write Timeout and Design Goals
TCP/IP Timeout Design
WeChat signaling communication primarily uses TCP/IP, where the link and transport layers provide timeout and retransmission mechanisms.
Link‑Layer Timeout and Retransmission
The link layer typically uses Hybrid Automatic Repeat Request (HARQ), which combines Forward Error Correction (FEC) with Automatic Repeat Request (ARQ) to provide reliable transmission over an unreliable physical link.
Through acknowledgments and timeouts, the link layer achieves reliable information transfer. This requires support from both the mobile device and the Radio Network Controller (RNC), and is implemented in EDGE, HSDPA, HSUPA, UMTS, and LTE.
Transport‑Layer Timeout and Retransmission
The TCP layer provides reliable transmission on top of the underlying unreliable link. TCP sets a timer for each segment; if the timer expires before an ACK arrives, the segment is retransmitted. Traditional Unix implementations calculate the retransmission timeout (RTO) from the measured round‑trip time (RTT), which varies with network conditions.
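As a point of reference, the widely documented RTT‑based estimator from RFC 6298 can be sketched as follows. This is not taken from any particular device's kernel: the class name, the `min_rto` parameter, and the omission of clock granularity are simplifications made for illustration.

```python
class RtoEstimator:
    """Simplified RFC 6298 RTO estimator (clock granularity ignored)."""

    ALPHA = 1 / 8  # smoothing gain for SRTT
    BETA = 1 / 4   # smoothing gain for RTTVAR
    K = 4          # variance multiplier

    def __init__(self, min_rto=1.0):
        # min_rto is a parameter because implementations choose different floors.
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None

    def update(self, rtt):
        """Feed one RTT sample (seconds) and return the new RTO (seconds)."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return max(self.min_rto, self.srtt + self.K * self.rttvar)
```

A stable RTT makes the variance term shrink, so the RTO converges toward the smoothed RTT, bounded below by the floor.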
Measured RTO intervals follow an exponential back‑off pattern in which each retransmission waits roughly twice as long as the previous one (1, 3, 6, 12, 24, 48, 64 seconds, etc.).
Experiments on mobile devices show variations in TCP timeout intervals:
OPPO devices: [0.25 s, 0.5 s, 1 s, 2 s, 4 s, 8 s, 16 s, 32 s, 64 s, …]
Samsung devices: [0.42 s, 0.9 s, 1.8 s, 3.7 s, 7.5 s, 15 s, 30 s, 60 s, 120 s, …]
iOS devices show less consistent patterns: initial experiments yielded intervals such as [1 s, 1 s, 1 s, 2 s, 4.5 s, 9 s, 13.5 s, 26 s, …], while later experiments showed the final RTO adjusted to 24 s.
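To see why these intervals matter to an application, it helps to add them up. The sketch below assumes each listed value is the wait before the next retransmission, and counts how many retransmissions fit inside a given time budget; the function names and the 15‑second budget are illustrative, not part of mars.

```python
from itertools import accumulate

# Retransmission intervals in seconds, copied from the measurements above.
OPPO = [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64]
SAMSUNG = [0.42, 0.9, 1.8, 3.7, 7.5, 15, 30, 60, 120]


def elapsed_at_each_retry(intervals):
    """Return the total time elapsed when each retransmission fires."""
    return list(accumulate(intervals))


def retries_within(intervals, budget_s):
    """Count the retransmissions that fire within budget_s seconds."""
    return sum(1 for t in accumulate(intervals) if t <= budget_s)


if __name__ == "__main__":
    for name, intervals in (("OPPO", OPPO), ("Samsung", SAMSUNG)):
        print(name, "retries within 15 s:", retries_within(intervals, 15))
```

Under this reading, both devices fit five retransmissions into the first 15 seconds, after which each further attempt costs many seconds of waiting on its own.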
Read/Write Timeout Goals
Although TCP/IP already handles timeout and retransmission, application‑level control remains necessary because each layer guarantees reliability at a different granularity:
Link‑layer HARQ ensures frame‑level reliability.
TCP ensures packet‑level reliability.
The application layer needs request‑level reliability.
The goals for application‑layer timeout and retransmission are:
Maximize success rate within limits acceptable for user experience.
Ensure availability on weak networks.
Maintain network sensitivity to quickly discover better links.
Consequently, application‑layer retransmission should:
Reduce wasted waiting time and increase the number of retries by dropping and re‑establishing the connection when TCP retransmission intervals become too large.
Switch links (e.g., IP/Port) when the current path experiences severe congestion.
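The two behaviours above can be sketched as a small retry loop. This is a minimal illustration, not the STN implementation: `send_with_failover`, the `attempt` callback, and the policy of two tries per endpoint are all hypothetical.

```python
def send_with_failover(endpoints, attempt, tries_per_endpoint=2):
    """Try endpoints in order until one attempt succeeds.

    attempt(endpoint) is a caller-supplied (hypothetical) function that opens
    a fresh connection, sends the request, and either returns the response or
    raises TimeoutError when the application-level timeout expires.
    Returns (response, history), where history lists every endpoint tried.
    """
    history = []
    for endpoint in endpoints:
        for _ in range(tries_per_endpoint):
            history.append(endpoint)
            try:
                return attempt(endpoint), history
            except TimeoutError:
                # Abandon this connection rather than wait for TCP's next,
                # longer retransmission interval. The next iteration either
                # reconnects to the same endpoint or switches to another one.
                continue
    raise TimeoutError(f"all {len(endpoints)} endpoints timed out")
```

Retrying the same endpoint first covers transient stalls, while moving to the next IP/port pair covers the case where the current path itself is congested.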
WeChat Read/Write Timeout
Solution 1: Total Read/Write Timeout
The early design decomposed the request RTT into request send time, response receive time, server processing time, and waiting time. Summing these components yields a total read/write timeout that varies with network speed.
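The decomposition can be written as a simple sum. The sketch below is an assumption about how the four components might combine; the parameter names, units, and sample numbers are illustrative and do not come from mars.

```python
def total_timeout(send_bytes, recv_bytes, speed_bytes_per_s, server_time_s, wait_s):
    """Estimate a total read/write timeout in seconds.

    send time + server processing time + receive time + waiting time,
    where send and receive times depend on the assumed network speed.
    """
    if speed_bytes_per_s <= 0:
        raise ValueError("speed_bytes_per_s must be positive")
    send_time = send_bytes / speed_bytes_per_s
    recv_time = recv_bytes / speed_bytes_per_s
    return send_time + server_time_s + recv_time + wait_s


if __name__ == "__main__":
    # Same request, two assumed speeds: halving the speed stretches the timeout.
    print(total_timeout(1024, 4096, 1024, 2.0, 1.0))
    print(total_timeout(1024, 4096, 512, 2.0, 1.0))
```

Because only the send and receive terms scale with speed, a slower network lengthens the timeout without changing the server and waiting allowances.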
Solution 2: Stepwise Read/Write Timeout
A single total timeout can be too long to react to fluctuating networks. By separately estimating the arrival time of the first data segment (first‑packet timeout) and the interval between subsequent segments (packet‑gap timeout), the system becomes more responsive.
Packet‑gap timeout, applied after the first packet is received, uses a fixed RTT estimate, greatly shortening the interval and improving sensitivity to sudden network issues.
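A small simulation makes the stepwise idea concrete. It takes the arrival times of response segments and reports when a stepwise timeout would fire; the function name, the `expected_segments` parameter, and the sample values are assumptions for illustration only.

```python
def timeout_fires_at(arrivals, expected_segments, first_packet_timeout, packet_gap_timeout):
    """Return the time (seconds after sending) at which the timeout fires,
    or None if the whole response arrives in time.

    arrivals: ascending arrival times of the segments actually received.
    """
    deadline = first_packet_timeout  # waiting for the first segment
    for received, t in enumerate(arrivals, start=1):
        if t > deadline:
            return deadline  # this segment arrived too late
        if received == expected_segments:
            return None  # complete response received in time
        deadline = t + packet_gap_timeout  # now waiting for the next segment
    return deadline  # the stream stalled before the response completed


if __name__ == "__main__":
    # Three segments expected, but the link stalls after the first one.
    print(timeout_fires_at([1.0], 3, first_packet_timeout=5, packet_gap_timeout=2))
```

In the stalled case the failure is detected 2 seconds after the last segment, rather than only when a much longer total timeout expires.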
Solution 3: Dynamic Read/Write Timeout
Ideally, real‑time network speed and server processing time could drive dynamic timeout calculations, but measuring these accurately incurs significant overhead.
Dynamic speed measurement requires high‑frequency sampling and must cope with network volatility.
Server processing time varies per business signal and must be reported by the server.
Actual response size can only be known after the server notifies the client.
Given the cost, a pragmatic approach classifies network conditions into Excellent, Evaluating, and Poor, adjusting timeout parameters accordingly. For excellent networks, the first‑packet timeout is shortened, assuming rapid recovery from transient issues.
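One way to picture this classification is a small state tracker driven by recent request outcomes. The sketch below is hypothetical throughout: the promotion and demotion thresholds, the timeout values, and the class and method names are assumptions, not the rules mars uses.

```python
from enum import Enum


class NetState(Enum):
    EXCELLENT = "excellent"
    EVALUATING = "evaluating"
    POOR = "poor"


# Hypothetical first-packet timeouts (seconds) for each network state.
FIRST_PACKET_TIMEOUT = {
    NetState.EXCELLENT: 4.0,
    NetState.EVALUATING: 7.0,
    NetState.POOR: 12.0,
}


class DynamicTimeout:
    """Classify the network from consecutive request outcomes."""

    def __init__(self, promote_after=3, demote_after=2):
        self.promote_after = promote_after  # consecutive successes needed for EXCELLENT
        self.demote_after = demote_after    # consecutive failures needed for POOR
        self.state = NetState.EVALUATING
        self._successes = 0
        self._failures = 0

    def report(self, success):
        """Record one request outcome and update the state."""
        if success:
            self._successes += 1
            self._failures = 0
        else:
            self._failures += 1
            self._successes = 0
        if self._successes >= self.promote_after:
            self.state = NetState.EXCELLENT
        elif self._failures >= self.demote_after:
            self.state = NetState.POOR
        else:
            self.state = NetState.EVALUATING

    def first_packet_timeout(self):
        """Return the first-packet timeout for the current state."""
        return FIRST_PACKET_TIMEOUT[self.state]
```

Classifying from outcomes the client already observes avoids the measurement overhead listed above, at the cost of reacting only after a few requests have completed.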
Summary
While TCP/IP provides link‑layer and transport‑layer reliability, the application layer has distinct reliability requirements that necessitate its own timeout and retransmission mechanisms to achieve high performance and availability. The design goals are:
Maximize success rate within acceptable user‑experience bounds.
Ensure usability on weak networks.
Maintain network sensitivity to quickly discover new links.
The mars STN module continuously refines its timeout strategy, employing total, first‑packet, packet‑gap, and dynamic timeouts, though it remains best suited for small‑payload signaling and request‑response patterns. Ongoing evolution will be validated across WeChat's massive user base, and future open‑source release is expected to foster broader community contributions.