Why Migrated Kubernetes Pods Stall on Redis Calls: Uncovering Connection‑Pool Misconfigurations
A data-center migration exposed multi-second latency on certain Kubernetes pods. The root cause was a Redis connection pool that was never actually enabled, compounded by per-connection validation locks. This post walks through the packet captures, ARMS code-hotspot analysis, and step-by-step fixes that restored performance across the cloud-native environment.
Problem Background
A data‑center migration moved some Kubernetes nodes to a new rack. After migration, pods of the same Gateway service that were scheduled on the new nodes exhibited response times up to several seconds, while pods on the original nodes responded in sub‑second latency.
Initial Investigation
Engineers traced the request flow with Alibaba Cloud ARMS. The trace showed that more than 7 seconds of latency originated inside the Gateway itself, specifically in the custom filter logic executed by FilteringWebHandler.handle.
Packet Capture Analysis
Network captures from a slow‑node Gateway pod revealed the timeline:
Ingress → Gateway received request at 31 s.
Gateway → backend request sent at 37 s.
Backend responded quickly (within the same second).
Gateway → Ingress response sent at 43 s.
The six‑second gap (31 s → 37 s) contained multiple Redis connection attempts, indicating that the delay occurred while the Gateway was interacting with Redis.
Redis Connection‑Pool Diagnosis
Further inspection showed that the Redis connections in use were long-lived and already busy during the six-second window, causing contention:
Long-lived connections were being reused rather than drawn from a pool.
Under high concurrency, requests starved while waiting for a free connection.
The Gateway's pool configuration defined min-idle, max-idle, and max-active, but the enabled flag was missing, so the pool never became active.
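For illustration, a configuration like the following (values are hypothetical) defines the pool sizes yet still leaves pooling off, because spring.redis.lettuce.pool.enabled is unset:

spring.redis.lettuce.pool.min-idle=16
spring.redis.lettuce.pool.max-idle=32
spring.redis.lettuce.pool.max-active=64
# enabled is not set, so the decision falls back to a classpath check

Spring Boot makes that fallback decision in RedisConnectionConfiguration: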
protected boolean isPoolEnabled(Pool pool) {
    Boolean enabled = pool.getEnabled();
    return (enabled != null) ? enabled : COMMONS_POOL2_AVAILABLE;
}

private static final boolean COMMONS_POOL2_AVAILABLE =
        ClassUtils.isPresent("org.apache.commons.pool2.ObjectPool",
                RedisConnectionConfiguration.class.getClassLoader());

Code-Hotspot Analysis with ARMS
The Gateway runs on the Reactor framework, so ARMS probes were upgraded to version 4.2.1 to support asynchronous code‑hotspot tracing. Flame graphs showed that getNativeConnection() consumed 1.86 seconds of a 2‑second request, pinpointing Lettuce’s connection acquisition as the bottleneck.
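That hot path makes sense if each request performs a Redis lookup inside the filter chain. A hypothetical sketch (the article does not show the Gateway's actual filter; the class and key names here are illustrative):

import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.data.redis.core.ReactiveStringRedisTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

// Hypothetical per-request Redis lookup in a gateway filter.
@Component
public class SessionLookupFilter implements GlobalFilter {

    private final ReactiveStringRedisTemplate redis;

    public SessionLookupFilter(ReactiveStringRedisTemplate redis) {
        this.redis = redis;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        String token = exchange.getRequest().getHeaders().getFirst("Authorization");
        // Each call must acquire a Redis connection; with pooling disabled,
        // every request funnels through the single shared Lettuce connection.
        return redis.opsForValue().get("session:" + token)
                .then(chain.filter(exchange));
    }
}

Under this pattern, any stall in connection acquisition shows up directly as request latency, which matches the flame graph.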
Netstat monitoring under load showed queued data backing up (high Recv-Q/Send-Q values) while only a single Redis socket (destination port 6379) was active.
Root Cause
The Lettuce pool was never created: spring.redis.lettuce.pool.enabled=true was not set, and the org.apache.commons.pool2 dependency was absent, so the classpath fallback disabled pooling.
Enabling validateConnection forced a synchronized validation step on every connection acquisition, causing lock contention under load (see the sketch below).
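The contention pattern can be modeled with a minimal sketch (illustrative only, not the actual Spring Data Redis source): all callers serialize on one monitor, and with validation enabled, each acquisition performs a round trip to Redis while holding the lock.

import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative model of a single shared connection guarded by one lock.
// The connection type parameter and all names here are hypothetical.
final class SharedConnection<C> {

    private final Object monitor = new Object();
    private final Supplier<C> connector;   // opens a new connection
    private final Predicate<C> validator;  // e.g. a blocking PING

    private C connection;

    SharedConnection(Supplier<C> connector, Predicate<C> validator) {
        this.connector = connector;
        this.validator = validator;
    }

    C getConnection() {
        synchronized (monitor) {               // all callers queue here
            if (connection == null || !validator.test(connection)) {
                connection = connector.get();  // reconnect while holding the lock
            }
            return connection;                 // validation cost paid on every acquisition
        }
    }
}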
Fixes Applied
Add the Commons‑Pool2 library to the classpath.
Enable the Lettuce pool and configure its size:
spring.redis.lettuce.pool.enabled=true
spring.redis.lettuce.pool.max-active=64
spring.redis.lettuce.pool.min-idle=16
spring.redis.lettuce.pool.max-idle=32
spring.redis.lettuce.pool.max-wait=100

Remove the LettuceConnectionFactory.setValidateConnection(true) call, or set validation to false.
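The same pool can also be configured programmatically; a sketch assuming Spring Data Redis's LettucePoolingClientConfiguration, with sizes mirroring the properties above (host and port are placeholders):

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

@Configuration
public class RedisPoolConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // GenericObjectPoolConfig comes from commons-pool2, which must be on the classpath.
        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(64);       // max-active
        poolConfig.setMinIdle(16);
        poolConfig.setMaxIdle(32);
        poolConfig.setMaxWaitMillis(100); // max-wait, in milliseconds

        LettucePoolingClientConfiguration clientConfig =
                LettucePoolingClientConfiguration.builder()
                        .poolConfig(poolConfig)
                        .build();

        return new LettuceConnectionFactory(
                new RedisStandaloneConfiguration("localhost", 6379), clientConfig);
    }
}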
After the changes, latency on the migrated nodes dropped from ~1.3 seconds to ~380 ms, matching pre‑migration performance.
Takeaways
Even when pool-size parameters look correct, the pool must be explicitly enabled, and per-connection validation should be avoided; otherwise high concurrency can cause severe latency spikes. A scheduled background validation task can replace per-connection validation without incurring lock contention:
@Scheduled(cron = "0/2 * * * * *")
public void validateRedis() {
    // redisConnectionFactory is the injected RedisConnectionFactory bean
    if (redisConnectionFactory instanceof LettuceConnectionFactory) {
        ((LettuceConnectionFactory) redisConnectionFactory).validateConnection();
    }
}

Reference: https://help.aliyun.com/zh/arms/application-monitoring/user-guide/use-code-hotspots-to-diagnose-code-level-problems