Why Your Redis and MySQL Connections Time Out on Alibaba Cloud—and How to Fix Them
This article explains how idle TCP connections are silently dropped by Alibaba Cloud security groups, causing Redis client timeouts and MySQL JDBC CommunicationsExceptions, and provides step‑by‑step diagnostics and configuration changes—including TCP keepalive and wait_timeout tweaks—to prevent the failures.
1. Introduction: Redis client library connection timeout
About a year ago we encountered a strange Redis connection issue on Alibaba Cloud: every ten minutes the Redis client reported a timeout because the cloud silently dropped long‑idle TCP connections without sending FIN or RST packets. The Redis server did not have tcp_keepalive enabled, so the connection remained in the Linux conntrack table while the client side closed the local port.
When the client later reused that local port, the server’s conntrack entry was still marked ESTABLISHED, so the SYN packet was discarded and the client saw a timeout.
The simple fix was to enable tcp_keepalive on the Redis server, but the deeper cause had serious consequences.
2. Debt: "SELECT 1" triggers jdbc4.CommunicationsException
In production, Java services repeatedly logged errors like the following:
<code>[main] ERROR jdbc.audit - 2. Statement.execute(select 1) select 1
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 576,539 ms ago.
...</code>
Because we had previously seen Redis connections being cut, we suspected a similar problem and compared client and server conntrack tables, but the issue was different.
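The idle time reported in the exception text is a useful first clue, because it can be compared against the timeouts discussed in this article. The following is a minimal sketch, assuming the message keeps the wording shown in the log above; the function name `last_packet_age_seconds` is illustrative and not part of any library.

```python
import re
from typing import Optional

# Matches the idle-time figure in the message format shown in the log above.
_LAST_PACKET_RE = re.compile(
    r"last packet successfully received from the server was ([\d,]+) ms ago"
)


def last_packet_age_seconds(log_text: str) -> Optional[float]:
    """Return the reported time since the last server packet, in seconds.

    Returns None when the text does not contain the expected phrase.
    """
    match = _LAST_PACKET_RE.search(log_text)
    if match is None:
        return None
    millis = int(match.group(1).replace(",", ""))
    return millis / 1000.0


sample = "The last packet successfully received from the server was 576,539 ms ago."
print(last_packet_age_seconds(sample))  # 576.539
```

An age of roughly 576 s is well beyond a 60 s server-side idle timeout, which points at an idle-connection problem rather than a slow query.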
After extensive testing of MySQL sysctl settings, iptables TRACE, tcpdump captures, and parameters such as tw_reuse and tw_recycle, we discovered a new problem: when connecting directly to MySQL (bypassing Alibaba Cloud SLB), some databases returned after 600 s, while others hung indefinitely.
We noticed that the hanging databases had wait_timeout set to 60 s, whereas the others used 600 s. Our service uses HikariCP with idleTimeout=600s and maxLifetime=1800s. Because the server closed the idle connection after 60 s, long before the pool would have retired it, the client was never notified of the close, and Hikari later sent a “SELECT 1” health check on a dead socket, causing the exception.
Fix: change the mis‑configured database’s wait_timeout to 600 s.
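The mismatch described above comes down to one comparison: if the server's idle timeout is shorter than the pool's, the server closes connections the pool still considers usable. A minimal sketch of that check follows; the function name and parameters are hypothetical and are not HikariCP or MySQL APIs.

```python
def server_closes_first(pool_idle_timeout_s: int, server_wait_timeout_s: int) -> bool:
    """Return True when the server drops an idle connection before the pool retires it."""
    return server_wait_timeout_s < pool_idle_timeout_s


# Values from the article: pool idleTimeout of 600 s against the two
# wait_timeout settings that were found on the databases.
for wait_timeout in (60, 600):
    print(wait_timeout, server_closes_first(600, wait_timeout))
# 60 True
# 600 False
```

Equal values pass this check but leave no safety margin, so it is best treated as a first-pass sanity check rather than a guarantee.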
3. Origin: Alibaba Cloud security groups and TCP KeepAlive
The “SELECT sleep(1000)” case revealed why some queries return and others hang. For idle connections, wait_timeout and interactive_timeout apply, but SELECT sleep(1000) is an active query limited by MySQL’s max_execution_time (usually 600 s). Some servers had this set to 6000 s, so the query completed, but the response never reached the client because Alibaba Cloud silently drops TCP connections that have been idle for ≥ 910 s.
iptables TRACE logs prove the packet loss.
When the server finished the query, it sent an ACK+PSH packet that never arrived. After wait_timeout (600 s) expired, the server sent a FIN, which also never reached the client, leaving the conntrack entry in ESTABLISHED state on the client side.
<code>Dec 14 23:58:25 client-host kernel: TRACE: raw:OUTPUT ... SYN ...
Dec 15 00:41:20 client-host kernel: TRACE: filter:OUTPUT ... FIN ...</code>
Further experiments showed a reproducible pattern:
Two VMs in different security groups with no shared group.
Server security group allows port P to client; client does not open inbound ports.
After ≥ 910 s of idle time, the security group removes the conntrack entry.
Server packets sent after this are dropped.
If the client sends any data, the connection is re‑activated, and subsequent server packets are delivered.
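The pattern above can be reproduced in a toy model. The sketch below is only an illustration of the observed behavior, not Alibaba Cloud's implementation; the class `StatefulFilter` and its methods are hypothetical names, and the only figure taken from the article is the 910 s threshold.

```python
IDLE_LIMIT_S = 910  # idle threshold observed in the article


class StatefulFilter:
    """Toy model of a stateful filter in front of the client, tracking one flow."""

    def __init__(self, idle_limit_s: int = IDLE_LIMIT_S) -> None:
        self.idle_limit_s = idle_limit_s
        self.last_seen = None  # time of the last packet that refreshed the entry

    def _entry_alive(self, now: int) -> bool:
        return self.last_seen is not None and now - self.last_seen < self.idle_limit_s

    def client_sends(self, now: int) -> bool:
        # Outbound client traffic is allowed and creates or refreshes the entry.
        self.last_seen = now
        return True

    def server_sends(self, now: int) -> bool:
        # Server-to-client traffic passes only while a tracked entry exists.
        if self._entry_alive(now):
            self.last_seen = now
            return True
        self.last_seen = None
        return False  # silently dropped: no FIN or RST is generated


fw = StatefulFilter()
fw.client_sends(0)             # client opens the connection
print(fw.server_sends(100))    # True: entry is fresh
print(fw.server_sends(1010))   # False: 910 s of silence, entry removed
print(fw.client_sends(1020))   # True: client traffic re-creates the entry
print(fw.server_sends(1030))   # True: server packets are delivered again
```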
Images illustrating the client and server sides are omitted for brevity.
4. Workarounds and proper fix
Alibaba Cloud suggested two workarounds, both undesirable:
Place server and client in the same security group.
Open all ports from the client’s security group to the server’s group.
Instead, we leveraged the fact that both the Jedis (Redis) and MySQL JDBC libraries enable SO_KEEPALIVE. Setting the kernel keepalive parameters below the 910 s limit keeps the idle connection alive.
<code>net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 1800</code>
A tcp_keepalive_time of 1800 s, as in the values above, sends the first probe too late: because Alibaba Cloud drops connections idle for ≥ 910 s, set tcp_keepalive_time to a value smaller than 910 s (e.g., 300 s). If a library does not expose setsockopt, use LD_PRELOAD to force the option.
We also observed that Java libraries handle SO_KEEPALIVE well; the Node.js Redis client enables it but the Node.js MySQL client does not, while Go libraries enable it for both.
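Where the application owns the socket, the same three parameters can also be set per connection instead of system-wide. The sketch below assumes Linux, where Python exposes TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; the helper names and the 300 s default are illustrative choices, not values mandated by any library.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 300,
                     interval_s: int = 30, probes: int = 3) -> None:
    """Turn on SO_KEEPALIVE and, where supported, override the kernel defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options exist on Linux; other platforms may lack them,
    # in which case the system-wide settings apply.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def dead_peer_detection_s(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate time until an unresponsive peer is declared dead."""
    return idle_s + interval_s * probes


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(dead_peer_detection_s(300, 30, 3))  # 390
sock.close()
```

With a 300 s idle time the first probe goes out well before the 910 s threshold, so the tracked connection is refreshed before it can be dropped.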
5. Conclusion
The root cause is Alibaba Cloud’s security‑group idle timeout (≈ 910 s) that silently discards long‑idle TCP connections. Enabling appropriate TCP keepalive settings on at least one side of the connection prevents the silent drop and avoids cascading timeouts in Redis, MySQL, and other services.