16 TCP Network Programming Best Practices for Building Robust Applications

The article presents sixteen practical TCP network‑programming best practices—from setting SO_REUSEADDR and defining port standards to using application‑layer heartbeats, exponential backoff, connection limits, client‑side load balancing, periodic DNS refresh, optimal buffer sizing, configurable timeouts, proper connection‑pool sizing, and comprehensive metrics—to help developers build stable, reliable applications.

Youzan Coder

This article summarizes 16 practical recommendations for TCP network programming based on YouZan's years of experience in middleware development. The goal is to help applications avoid unexpected behaviors caused by various network anomalies, thereby improving system stability and reliability.

1. Set SO_REUSEADDR for Server Listening : To avoid "address already in use" errors during service restart, enable SO_REUSEADDR when binding to listening ports. This is safe when TCP timestamps are enabled (net.ipv4.tcp_timestamps = 1).
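
A minimal sketch of a listener with the option enabled, in Python. The helper name `make_listener` and the backlog of 128 are illustrative assumptions, not part of the original recommendation.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Set the option before bind(); it only affects binds that happen afterwards.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))  # port 0 lets the kernel pick a free port (handy for tests)
        sock.listen(backlog)
    except OSError:
        sock.close()  # do not leak the FD if bind/listen fails
        raise
    return sock
```

A real service would pass its fixed listening port instead of 0.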

2. Establish Application Listening Port Standards : Each application and protocol should have fixed, unified listening ports. Listening ports must not be within the system's ip_local_port_range (Linux default [32768, 60999]), as these are used for automatic local port allocation.
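
A startup check along these lines could catch a badly chosen port early. This is a sketch: the function names are hypothetical, and the fallback range is the Linux default quoted above.

```python
from typing import Tuple

# Linux exposes the ephemeral port range here as two whitespace-separated integers.
PORT_RANGE_PATH = "/proc/sys/net/ipv4/ip_local_port_range"


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the contents of ip_local_port_range, e.g. '32768\t60999\n'."""
    low, high = text.split()
    return int(low), int(high)


def read_ephemeral_range(path: str = PORT_RANGE_PATH) -> Tuple[int, int]:
    """Read the ephemeral range from procfs (Linux only)."""
    with open(path) as f:
        return parse_port_range(f.read())


def is_safe_listen_port(port: int, ephemeral: Tuple[int, int] = (32768, 60999)) -> bool:
    """True if the port is valid and lies outside the ephemeral range."""
    low, high = ephemeral
    return 0 < port < 65536 and not (low <= port <= high)
```

For example, `is_safe_listen_port(8080)` is true with the default range, while `is_safe_listen_port(40000)` is false.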

3. Separate Service Ports from Management Ports : Service ports handle business requests while management ports handle framework/application management requests (e.g., service registration, health checks). This separation prevents business requests from affecting management operations and enables better ACL control.

4. Set Connection Establishment Timeout : Network congestion, unreachable IPs, or full handshake queues can cause connection blocking. Set explicit timeouts rather than relying on system defaults. YouZan sets net.ipv4.tcp_syn_retries to 3, returning timeout errors within ~15 seconds.
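
The ~15 second figure can be reproduced with a little arithmetic, assuming the Linux initial retransmission timeout of 1 second that doubles after every unanswered SYN. The sketch below also shows an explicit application-level connect timeout; the function names and the 3 second default are illustrative assumptions.

```python
import socket


def syn_timeout_seconds(syn_retries: int, initial_rto: float = 1.0) -> float:
    """Total time before connect() gives up when no SYN is ever answered.

    The kernel waits initial_rto, then 2x, 4x, ... once per retry, plus one
    final wait after the last retry: 1 + 2 + 4 + 8 = 15 s for syn_retries=3.
    """
    return initial_rto * (2 ** (syn_retries + 1) - 1)


def connect_with_timeout(host: str, port: int, timeout: float = 3.0) -> socket.socket:
    """Connect with an explicit timeout instead of relying on kernel defaults.

    Note: the timeout stays set on the returned socket and also applies to
    later send/recv calls unless it is changed with settimeout().
    """
    return socket.create_connection((host, port), timeout=timeout)
```

With the stock Linux setting of 6 retries, the same formula gives 127 seconds, which is why relying on the default is rarely acceptable.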

5. Implement Application-Layer Heartbeat : TCP Keepalive exists, but it has limitations: it is not universally supported, its probes can be filtered by network equipment, it cannot detect application-layer problems (a blocked process, a deadlock, a full TCP buffer), and it conflicts with TCP retransmission control. An application-layer heartbeat is essential for robust applications; recommended settings are an interval of 5-20 seconds and a failure threshold of 2-5 consecutive failures.
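
The failure-threshold part of a heartbeat can be sketched as a small counter. The class name and the default of 3 failures are assumptions chosen from the recommended range; sending the ping and waiting for the reply is left to the caller.

```python
class HeartbeatMonitor:
    """Tracks consecutive heartbeat failures for one connection."""

    def __init__(self, max_failures: int = 3) -> None:
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record_success(self) -> None:
        """Call when a heartbeat reply arrives in time; resets the counter."""
        self.consecutive_failures = 0

    def record_failure(self) -> bool:
        """Call when a heartbeat times out.

        Returns True once the connection should be treated as dead.
        """
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_failures
```

Because only consecutive failures count, a single lost heartbeat does not tear down an otherwise healthy connection.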

6. Add Backoff and Jitter for Connection Reconnection : After network recovery, implement exponential backoff (1s, 2s, 4s, 8s...) with maximum backoff limits (e.g., 64s) and jitter to prevent thundering herd problems.
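
One way to compute the delay is sketched below. It uses "full jitter" (a uniformly random delay between 0 and the current ceiling), which is one common jitter strategy and not necessarily the one YouZan uses; the function names are hypothetical.

```python
import random


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 64.0) -> float:
    """Upper bound for the given retry attempt: 1s, 2s, 4s, ... capped at `cap`."""
    # Clamp the exponent so very large attempt counts cannot overflow.
    return min(cap, base * 2 ** min(attempt, 30))


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 64.0) -> float:
    """Randomized delay in [0, ceiling] so clients do not reconnect in lockstep."""
    return random.uniform(0, backoff_ceiling(attempt, base, cap))
```

A reconnect loop would sleep for `backoff_delay(attempt)` after each failed attempt and reset `attempt` to 0 once a connection succeeds.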

7. Limit Maximum Connections on Server Side : Constrain connections to prevent CPU/memory exhaustion and file descriptor (FD) depletion. Reserve FDs for non-TCP operations (logging, etc.).
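
A sketch of a connection limiter follows. The class name, the helper, and the reserve of 100 FDs are illustrative assumptions; an accept loop would call `try_acquire()` after `accept()` and close the new socket immediately when it returns False.

```python
import threading


def max_connections_for(fd_soft_limit: int, reserved_fds: int = 100) -> int:
    """Connection cap that leaves `reserved_fds` descriptors for logs, files, etc.

    On Unix the soft limit can be read with
    resource.getrlimit(resource.RLIMIT_NOFILE)[0].
    """
    return max(0, fd_soft_limit - reserved_fds)


class ConnectionLimiter:
    """Thread-safe counter that refuses connections beyond a fixed maximum."""

    def __init__(self, max_connections: int) -> None:
        self._max = max_connections
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot; returns False when the server is at capacity."""
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Free a slot when a connection is closed."""
        with self._lock:
            if self._active > 0:
                self._active -= 1
```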

8. Avoid Centralized L4 Load Balancers : Prefer distributed service registration/discovery with client-side load balancing over centralized solutions like LVS. Issues with centralized load balancers include: configuration changes needed for scaling, scalability bottlenecks, single points of failure, inability to perform application-layer health checks, and impact on latency.

9. Beware of Large Numbers of CLOSE_WAIT Connections : Unclosed sockets in CLOSE_WAIT state can cause FD leaks and memory exhaustion. Ensure proper socket closure using language features like defer (Go), try-catch-finally (Java), or RAII (C++).
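
Python's counterpart to the constructs listed above is the `with` statement. The echo handler below is a hypothetical example; the point is that the socket is closed on every exit path, including exceptions and early returns.

```python
import socket


def handle_client(conn: socket.socket) -> bytes:
    """Echo one message back, closing the socket on every exit path."""
    with conn:  # socket objects are context managers; close() runs on exit
        data = conn.recv(4096)
        if not data:
            # Peer closed its side. Without the close() triggered by `with`,
            # this socket would sit in CLOSE_WAIT until the process exits.
            return b""
        conn.sendall(data)
        return data
```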

10. Set Reasonable Long Connection TTL : Long-lived connections can leave load unevenly distributed as backend instances restart, because clients stay attached to the instances they originally connected to. YouZan mandates that the TTL of a long-lived TCP connection not exceed 2 hours.
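
A TTL can be tracked with a small lease object, sketched below. The class name is hypothetical, and the random shortening of the TTL (so that connections opened together do not all expire together) is an added assumption rather than something the recommendation specifies.

```python
import random
import time
from typing import Callable


class ConnectionLease:
    """Records when a long-lived connection should be retired and reopened."""

    def __init__(
        self,
        ttl_seconds: float = 7200.0,     # 2 hours, the upper bound quoted above
        jitter_fraction: float = 0.1,    # assumption: shorten the TTL by up to 10%
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._now = now
        # Jitter only ever shortens the TTL, so the upper bound is never exceeded.
        shrink = random.uniform(0, jitter_fraction)
        self.expires_at = now() + ttl_seconds * (1 - shrink)

    def expired(self) -> bool:
        """True once the connection has outlived its TTL."""
        return self._now() >= self.expires_at
```

A client would check `expired()` between requests, finish in-flight work, then close the connection and open a new one.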

11. Periodically Resolve DNS for Domain Access : When accessing services via domain names, periodically refresh DNS resolution to handle backend instance migrations. Don't resolve only at application startup.
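
A sketch of a resolver that re-resolves on a fixed interval follows. The class name and the 30 second interval are assumptions, and keeping the previous answer when a refresh fails is a design choice of this sketch.

```python
import socket
import time
from typing import Callable, List, Optional


def resolve(host: str, port: int) -> List[str]:
    """Resolve a host name to a sorted list of unique IP addresses."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class RefreshingResolver:
    """Caches DNS answers and re-resolves once they are older than refresh_seconds."""

    def __init__(
        self,
        host: str,
        port: int,
        refresh_seconds: float = 30.0,
        resolver: Callable[[str, int], List[str]] = resolve,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._host, self._port = host, port
        self._refresh = refresh_seconds
        self._resolver, self._now = resolver, now
        self._addresses: Optional[List[str]] = None
        self._resolved_at = 0.0

    def addresses(self) -> List[str]:
        now = self._now()
        if self._addresses is None or now - self._resolved_at >= self._refresh:
            try:
                self._addresses = self._resolver(self._host, self._port)
            except OSError:  # socket.gaierror is a subclass of OSError
                if self._addresses is None:
                    raise    # nothing cached to fall back on
                # Otherwise keep serving the stale answer until the next refresh.
            self._resolved_at = now
        return self._addresses
```

The `resolver` and `now` parameters exist so the refresh logic can be exercised without real DNS or real time.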

12. Reduce Network Read/Write System Call Frequency : Each read/write system call involves user-kernel context switching. Use read/write buffers and consider readv/writev for scatter-gather operations. Batch writes also avoid Nagle algorithm delays.
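
A write buffer can be sketched as follows. The class name and the 16 KiB threshold are assumptions; small messages are collected in user space and handed to the kernel in one `sendall` call.

```python
from typing import List


class BufferedWriter:
    """Coalesces small writes so they reach the kernel in fewer system calls."""

    def __init__(self, sock, flush_threshold: int = 16 * 1024) -> None:
        self._sock = sock
        self._threshold = flush_threshold
        self._chunks: List[bytes] = []
        self._size = 0

    def write(self, data: bytes) -> None:
        """Queue data; flush automatically once enough has accumulated."""
        self._chunks.append(data)
        self._size += len(data)
        if self._size >= self._threshold:
            self.flush()

    def flush(self) -> None:
        """Send everything queued so far as a single payload."""
        if not self._chunks:
            return
        payload = b"".join(self._chunks)
        self._chunks.clear()
        self._size = 0
        self._sock.sendall(payload)
```

Callers need to invoke `flush()` at message or batch boundaries so buffered data is not delayed indefinitely. On Unix, `socket.sendmsg` accepts a list of buffers and offers writev-style gather writes without the intermediate join.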

13. Carefully Set TCP Buffer Sizes : Buffer sizes should match the bandwidth-delay product (BDP). A buffer that is too small limits throughput; one that is too large wastes memory. Linux auto-tunes buffers within the bounds set by tcp_wmem and tcp_rmem, and explicitly setting SO_SNDBUF or SO_RCVBUF disables that auto-tuning for the socket, so avoid setting them manually without thorough evaluation.
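
The BDP itself is simple arithmetic: bandwidth multiplied by round-trip time. The helper below is a sketch with a hypothetical name; it takes integer megabits per second and milliseconds to keep the calculation exact.

```python
def bdp_bytes(bandwidth_mbps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes.

    bits in flight = (bandwidth_mbps * 1_000_000) * (rtt_ms / 1000)
    bytes in flight = bits / 8 = bandwidth_mbps * rtt_ms * 125
    """
    return bandwidth_mbps * rtt_ms * 125
```

For example, a 1 Gbit/s path with a 10 ms round trip needs about 1.25 MB in flight to stay full, while the same link at 0.5 ms inside a data center needs only about 62.5 KB.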

14. Support Flexible Configuration of Network Parameters : Different deployment environments (LAN vs WAN) require different network parameters. Support configurable timeouts, health check intervals, and failure thresholds.
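
One way to keep these parameters configurable is a single settings object with overridable defaults. Everything in this sketch is hypothetical: the field names, the `NET_` prefix, and the default values (the heartbeat defaults are simply picked from the ranges recommended earlier).

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NetworkConfig:
    """Network tunables with defaults that each environment may override."""

    connect_timeout_s: float = 3.0
    request_timeout_s: float = 5.0
    heartbeat_interval_s: float = 10.0
    heartbeat_max_failures: int = 3

    @classmethod
    def from_env(cls, env: Mapping[str, str], prefix: str = "NET_") -> "NetworkConfig":
        """Build a config from a mapping such as os.environ."""
        d = cls()  # instance holding the defaults
        return cls(
            connect_timeout_s=float(env.get(prefix + "CONNECT_TIMEOUT_S", d.connect_timeout_s)),
            request_timeout_s=float(env.get(prefix + "REQUEST_TIMEOUT_S", d.request_timeout_s)),
            heartbeat_interval_s=float(env.get(prefix + "HEARTBEAT_INTERVAL_S", d.heartbeat_interval_s)),
            heartbeat_max_failures=int(env.get(prefix + "HEARTBEAT_MAX_FAILURES", d.heartbeat_max_failures)),
        )
```

A WAN deployment could then raise the timeouts through its environment without a code change.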

15. Properly Size Connection Pools : For non-multiplexing protocols (HTTP/1.1, Redis), use Little's Law: concurrent connections = QPS × RT, with RT (response time) expressed in seconds. For multiplexing protocols (HTTP/2, gRPC, Dubbo), scale the connection pool based on per-connection pending request counts and use least-requests load balancing.
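
Both sizing rules can be sketched in a few lines. The function names are hypothetical, and response time is taken in integer milliseconds so the rounding is exact.

```python
from typing import Dict


def pool_size_littles_law(qps: int, rt_ms: int) -> int:
    """Connections needed for a non-multiplexing protocol: QPS x RT, rounded up."""
    # Ceiling division on integers: qps * (rt_ms / 1000), rounded up.
    return -(-(qps * rt_ms) // 1000)


def pick_least_loaded(pending: Dict[str, int]) -> str:
    """Least-requests selection: the connection with the fewest pending requests."""
    return min(pending, key=pending.get)
```

For example, 2000 QPS at a 50 ms response time needs 100 concurrent connections. In practice some headroom on top of that figure is sensible, since QPS and RT both fluctuate.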

16. Implement Comprehensive Network Metrics Monitoring : Monitor TCP connection failures, retransmission rates, connection states (ESTABLISHED, TIME_WAIT, CLOSE_WAIT), active/passive closes, health check failures, FD usage, and connection pool sizes.
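
On Linux, connection-state counts can be derived from `/proc/net/tcp` (and `/proc/net/tcp6` for IPv6), where the fourth column holds the state as a hex code. The sketch below maps only the states named above plus LISTEN; the function name is hypothetical.

```python
from collections import Counter

# Hex state codes used by the Linux kernel in /proc/net/tcp (subset).
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per state from the text of /proc/net/tcp."""
    counts: Counter = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) > 3:
            code = fields[3]
            # Unmapped states are counted under their raw hex code.
            counts[TCP_STATES.get(code, code)] += 1
    return counts
```

Exporting these counts periodically makes a steadily growing CLOSE_WAIT count visible long before FDs run out.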
