
Why Your Redis and MySQL Connections Time Out on Alibaba Cloud—and How to Fix Them

This article explains how idle TCP connections are silently dropped by Alibaba Cloud security groups, causing Redis client timeouts and MySQL JDBC CommunicationsExceptions, and provides step‑by‑step diagnostics and configuration changes—including TCP keepalive and wait_timeout tweaks—to prevent the failures.

Efficient Ops

Mar 5, 2019


1. Introduction: Redis client library connection timeout

About a year ago we hit a strange Redis connection issue on Alibaba Cloud: roughly every ten minutes the Redis client reported a timeout. The cloud was silently dropping long‑idle TCP connections without sending FIN or RST packets. Because tcp_keepalive was not enabled on the Redis server, the server kept the connection in its Linux conntrack table even after the client had closed its side and released the local port.

When the client later reused that local port, the server’s conntrack entry was still marked ESTABLISHED, so the SYN packet was discarded and the client saw a timeout.

The simple fix was to enable tcp_keepalive on the Redis server, but the deeper cause had serious consequences.

2. Debt: "SELECT 1" triggers jdbc4.CommunicationsException

In production, Java services repeatedly logged errors like the following:

[main] ERROR jdbc.audit - 2. Statement.execute(select 1) select 1
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 576,539 ms ago.
...

Because we had previously seen Redis connections being cut, we suspected a similar problem and compared client and server conntrack tables, but the issue was different.

After extensive testing of MySQL sysctl settings, iptables TRACE, tcpdump captures, and parameters such as tw_reuse and tw_recycle, we discovered a new problem: when connecting directly to MySQL (bypassing Alibaba Cloud SLB), some databases returned after 600 s, while others hung indefinitely.

We noticed that the hanging databases had wait_timeout set to 60 s, whereas the others used 600 s. Our service uses HikariCP with idleTimeout=600s and maxLifetime=1800s. The server closed each idle connection after 60 s, but the client was never notified, so Hikari still treated the connection as usable. When it later sent its “SELECT 1” health check on the dead socket, the exception was raised.

Fix: change the mis‑configured database’s wait_timeout to 600 s.
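The mismatch described above can be expressed as a simple sanity check. The following is a minimal sketch, not part of the original investigation: the function name is hypothetical, all values are in seconds, and the rule it applies (the pool must not keep idle connections longer than the server does) is an assumption drawn from the failure described here.

```python
def check_pool_timeouts(wait_timeout_s, idle_timeout_s):
    """Return warnings for a pool/server idle-timeout combination.

    Hypothetical helper for illustration only. It flags the case where the
    pool keeps idle connections longer than the server's wait_timeout, so the
    server may close a connection the pool still believes is usable.
    """
    warnings = []
    if idle_timeout_s > wait_timeout_s:
        warnings.append(
            f"idleTimeout ({idle_timeout_s}s) exceeds wait_timeout "
            f"({wait_timeout_s}s): the server can close idle connections "
            f"that the pool still holds"
        )
    return warnings


# The mis-configured database from the article: wait_timeout=60, idleTimeout=600.
for line in check_pool_timeouts(wait_timeout_s=60, idle_timeout_s=600):
    print(line)
```

With the corrected wait_timeout of 600 s the check returns no warnings. Equal values are a borderline case that this sketch deliberately does not flag, since that is the configuration the fix settled on.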

3. Origin: Alibaba Cloud security groups and TCP KeepAlive

The “SELECT sleep(1000)” case revealed why some queries return and others hang. wait_timeout and interactive_timeout apply only to idle connections, whereas SELECT sleep(1000) is an active query, limited instead by MySQL’s max_execution_time (usually 600 s). Some servers had this set to 6000 s, so the query ran to completion, but the response never reached the client because Alibaba Cloud silently drops TCP connections that have carried no traffic for ≥ 910 s.

iptables TRACE logs prove the packet loss.

When the server finished the query, it sent an ACK+PSH packet that never arrived. After the client’s wait_timeout (600 s) expired, the server sent a FIN, which also never reached the client, leaving the conntrack entry in ESTABLISHED state on the client side.

Dec 14 23:58:25 client-host kernel: TRACE: raw:OUTPUT ... SYN ...
Dec 15 00:41:20 client-host kernel: TRACE: filter:OUTPUT ... FIN ...

Further experiments showed a reproducible pattern:

Two VMs in different security groups with no shared group.

Server security group allows port P to client; client does not open inbound ports.

After ≥ 910 s of idle time, the security group removes the conntrack entry.

Server packets sent after this are dropped.

If the client sends any data, the connection is re‑activated, and subsequent server packets are delivered.
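The pattern above can be captured in a toy model. This is only a sketch of the observed behaviour, not Alibaba Cloud's implementation: the class and method names are hypothetical, and it assumes that any packet that passes refreshes the idle timer.

```python
IDLE_LIMIT_S = 910  # idle cutoff observed in the article


class StatefulFilter:
    """Toy model of a security group that forgets idle connections."""

    def __init__(self, idle_limit_s=IDLE_LIMIT_S):
        self.idle_limit_s = idle_limit_s
        self.last_seen = None  # time of the last packet that passed, None = no entry

    def client_sends(self, now_s):
        # Outbound client traffic always passes and (re)creates the entry.
        self.last_seen = now_s
        return True

    def server_sends(self, now_s):
        # Server traffic passes only while a sufficiently fresh entry exists.
        if self.last_seen is None or now_s - self.last_seen >= self.idle_limit_s:
            self.last_seen = None
            return False
        self.last_seen = now_s
        return True


fw = StatefulFilter()
fw.client_sends(0)                 # client opens the connection
print(fw.server_sends(100))        # True: entry is fresh
print(fw.server_sends(100 + 910))  # False: idle for 910 s, entry removed
fw.client_sends(1100)              # any client data re-activates the flow
print(fw.server_sends(1101))       # True: server packets are delivered again
```

The model makes the asymmetry visible: only client traffic can restore the flow, which is why a server-side response or FIN sent after the cutoff is lost.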

Images illustrating the client and server sides are omitted for brevity.

4. Workarounds and proper fix

Alibaba Cloud suggested two workarounds, both undesirable:

Place server and client in the same security group.

Open all ports from the client’s security group to the server’s group.

Instead, we relied on the fact that both the Jedis (Redis) and MySQL JDBC libraries enable SO_KEEPALIVE. With the kernel keepalive timers set below the 910 s limit, probes flow on otherwise idle connections often enough that the security group never treats them as idle.

net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 1800

Because Alibaba Cloud drops connections idle for ≥ 910 s, tcp_keepalive_time must be smaller than 910 s (e.g., 300 s); the 1800 s value listed above would send the first probe only after the connection had already been cut. If a library does not expose setsockopt, use LD_PRELOAD to force the option. We also observed that Java libraries handle SO_KEEPALIVE well; the Node.js Redis client enables it but the Node.js MySQL client does not; Go libraries enable it for both.
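For code that owns its sockets, keepalive can also be requested per socket instead of through the kernel-wide sysctl values. The sketch below is an illustration rather than the article's method: the helper name and default values are assumptions (300 s is the example figure from the text), and the per-socket timer options are applied only where the platform's socket module defines them.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=30, probes=3):
    """Enable TCP keepalive on one socket (hypothetical helper).

    idle_s must stay below the 910 s cutoff so the first probe is sent
    before the security group forgets the connection.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timer overrides. These option names exist on Linux; on
    # platforms that lack one, the kernel default stays in effect.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Setting the timers per socket leaves other applications on the host unaffected, at the cost of having to touch every client that opens long-lived connections.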

5. Conclusion

The root cause is Alibaba Cloud’s security‑group idle timeout (≈ 910 s) that silently discards long‑idle TCP connections. Enabling appropriate TCP keepalive settings on at least one side of the connection prevents the silent drop and avoids cascading timeouts in Redis, MySQL, and other services.
