Why Redis Connections Slowed After Data Center Migration – A Cloud‑Native Debug Guide
After a recent data‑center migration, some Kubernetes pods showed significantly higher latency. The cause was traced to a Redis connection‑pool misconfiguration and lock contention in the Gateway service, and it was resolved using ARMS code‑hotspot analysis and configuration adjustments.
Background
While migrating an enterprise application to a new data center, the customer observed that pods of the same application behaved differently depending on the Kubernetes node they were scheduled on: pods on the migrated nodes had markedly higher response times, while pods on the original nodes performed normally.
Investigation
The engineer first mapped the application’s call chain and identified the Gateway service as the most problematic component. Using Alibaba Cloud ARMS, a slow call trace revealed that over 7 seconds of latency originated in the Gateway, specifically within FilteringWebHandler.handle, the Spring Cloud Gateway handler that runs the filter chain, including the application's custom filter.
Log analysis showed that the business logic itself was fast (millisecond‑level), shifting suspicion to network delays.
Packet Capture
Packet captures from a Gateway pod on a slow node showed the request flow: Ingress → Gateway → Backend. The timeline showed a 6‑second gap between the Gateway receiving the request (at the 31 s mark) and forwarding it to the backend (at the 37 s mark).
The delay was traced to Redis interactions: the Gateway spent the majority of the 6 seconds establishing or using a Redis connection.
Redis Connection‑Pool Misconfiguration
ARMS code‑hotspot flame graphs showed that getNativeConnection() (Lettuce) consumed 1.86 seconds of a 2‑second request, indicating a bottleneck in obtaining Redis connections. Examination of the Redis pool configuration revealed that the enabled flag was not set. As the excerpt below shows, an unset flag falls back to whether commons‑pool2 is on the classpath; that dependency was also missing, so the pool was not actually active even though min‑idle, max‑idle and max‑active were set.

    protected boolean isPoolEnabled(Pool pool) {
        Boolean enabled = pool.getEnabled();
        return (enabled != null) ? enabled : COMMONS_POOL2_AVAILABLE;
    }

    private static final boolean COMMONS_POOL2_AVAILABLE = ClassUtils.isPresent("org.apache.commons.pool2.ObjectPool",
            RedisConnectionConfiguration.class.getClassLoader());

Adding the enabled=true setting and the commons‑pool2 dependency reduced the average latency from 1.3 s to 680 ms, but the problem persisted.
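The fallback in the isPoolEnabled excerpt can be reproduced in isolation. The following is a minimal sketch using only the JDK, not Spring Boot's actual classes: the class name PoolEnabledSketch and its method signatures are hypothetical, and the classpath probe is only an approximation of what ClassUtils.isPresent does.

```java
// Hypothetical stand-alone sketch; none of these names are Spring Boot APIs.
class PoolEnabledSketch {

    // Approximates ClassUtils.isPresent: true only if the class can be loaded.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, PoolEnabledSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Same decision as the excerpt: an explicit flag wins,
    // otherwise the classpath check decides.
    static boolean isPoolEnabled(Boolean enabled, boolean commonsPool2Available) {
        return (enabled != null) ? enabled : commonsPool2Available;
    }

    public static void main(String[] args) {
        boolean available = isPresent("org.apache.commons.pool2.ObjectPool");
        System.out.println("commons-pool2 on classpath: " + available);
        System.out.println("flag unset -> pool enabled: " + isPoolEnabled(null, available));
        System.out.println("flag true  -> pool enabled: " + isPoolEnabled(Boolean.TRUE, available));
    }
}
```

With the flag unset and the library absent, the sketch reports the pool as disabled, which matches the situation described above: the sizing options were present but had nothing to apply to. An explicit flag alone presumably does not help either, since a pooled configuration still needs the commons‑pool2 classes at runtime, which is consistent with the fix adding both.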
ValidateConnection Lock Contention
Further profiling identified that the LettuceConnectionFactory.validateConnection() method acquired a lock on each connection validation, causing many Reactor threads to block under high concurrency. The lock contention was the primary cause of the remaining delay.

    void validateConnection() {
        synchronized (this.connectionMonitor) {
            boolean valid = false;
            if (connection != null && connection.isOpen()) {
                try {
                    if (connection instanceof StatefulRedisConnection) {
                        ((StatefulRedisConnection) connection).sync().ping();
                    }
                    if (connection instanceof StatefulRedisClusterConnection) {
                        ((StatefulRedisClusterConnection) connection).sync().ping();
                    }
                    valid = true;
                } catch (Exception e) {
                    log.debug("Validation failed", e);
                }
            }
            if (!valid) {
                log.info("Validation of shared connection failed. Creating a new connection.");
                resetConnection();
                this.connection = getNativeConnection();
            }
        }
    }

Removing the call LettuceConnectionFactory.setValidateConnection(true) eliminated the lock contention, bringing response times down to ~380 ms, comparable to pre‑migration performance.
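To illustrate why a blocking ping inside a synchronized block serializes callers, here is a minimal JDK-only sketch. It does not use Lettuce or Spring: SharedConnectionSketch, its methods, and the simulated ping (a Thread.sleep) are all hypothetical stand-ins, and the ping duration is an arbitrary assumption.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a shared connection guarded by one monitor.
class SharedConnectionSketch {
    private final Object connectionMonitor = new Object();
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();
    private final long pingMillis;

    SharedConnectionSketch(long pingMillis) { this.pingMillis = pingMillis; }

    // Simulated blocking PING; records how many callers are pinging at once.
    private void ping() throws InterruptedException {
        maxInside.accumulateAndGet(inside.incrementAndGet(), Math::max);
        try {
            Thread.sleep(pingMillis);
        } finally {
            inside.decrementAndGet();
        }
    }

    // Like the excerpt: every validation takes the same monitor.
    void validateConnection() throws InterruptedException {
        synchronized (connectionMonitor) { ping(); }
    }

    void pingWithoutLock() throws InterruptedException { ping(); }

    int maxConcurrentPings() { return maxInside.get(); }

    // Releases `callers` threads at once and returns the elapsed milliseconds.
    static long run(SharedConnectionSketch c, int callers, boolean locked) {
        ExecutorService pool = Executors.newFixedThreadPool(callers);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(callers);
        for (int i = 0; i < callers; i++) {
            pool.submit(() -> {
                try {
                    start.await();
                    if (locked) c.validateConnection(); else c.pingWithoutLock();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        long t0 = System.nanoTime();
        start.countDown();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for callers", e);
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
    }

    public static void main(String[] args) {
        SharedConnectionSketch locked = new SharedConnectionSketch(20);
        long lockedMs = run(locked, 8, true);
        System.out.println("locked:   " + lockedMs + " ms, max concurrent pings " + locked.maxConcurrentPings());
        SharedConnectionSketch free = new SharedConnectionSketch(20);
        long freeMs = run(free, 8, false);
        System.out.println("unlocked: " + freeMs + " ms, max concurrent pings " + free.maxConcurrentPings());
    }
}
```

In the locked run only one caller can ping at a time, so total time should grow roughly with the number of callers multiplied by the ping round trip. That is the same shape as the problem described above: once the Redis round trip became slower after the migration, every request waiting on the monitor paid for the pings of the requests ahead of it.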
Conclusion
The root causes were:
1. The Redis connection pool was not enabled because the enabled flag and commons‑pool2 dependency were missing.
2. Enabling validateConnection introduced lock contention in high‑concurrency scenarios.
Fixes included adding the missing dependency, setting enabled=true, and disabling per‑request connection validation.
Further Thoughts
Even though the same pool settings existed before migration, the cross‑region latency after migration amplified the impact of the misconfiguration. For environments with frequent Redis interactions, disabling validateConnection or moving validation to a scheduled task is advisable.

    @Autowired
    private RedisConnectionFactory redisConnectionFactory;

    // Validate the shared connection every 2 seconds, off the request path.
    @Scheduled(cron = "0/2 * * * * *")
    public void task() {
        if (redisConnectionFactory instanceof LettuceConnectionFactory) {
            LettuceConnectionFactory f = (LettuceConnectionFactory) redisConnectionFactory;
            f.validateConnection();
        }
    }

Reference: ARMS Code Hotspots Documentation
Alibaba Cloud Observability
Driving continuous progress in observability technology!
