
Why Redis Connections Slowed After Data Center Migration – A Cloud‑Native Debug Guide

A recent data‑center migration caused certain Kubernetes pods to experience significantly higher latency, which was traced to Redis connection‑pool misconfiguration and lock contention in the Gateway service, and resolved through ARMS code‑hotspot analysis and configuration adjustments.

Alibaba Cloud Observability

Background

During the migration of an enterprise application to a new data center, the customer observed that pods of the same application behaved differently depending on the Kubernetes node they were scheduled on: pods on the migrated nodes showed markedly higher response times, while pods on the original nodes performed normally.

Investigation

The engineer first mapped the application’s call chain and identified the Gateway service as the most problematic component. A slow call trace in Alibaba Cloud ARMS revealed that over 7 seconds of latency originated in the Gateway, specifically within the filter chain invoked from FilteringWebHandler.handle, which includes the application's custom filter.

Log analysis showed that the business logic itself was fast (millisecond‑level), shifting suspicion to network delays.

Packet Capture

Packet captures from a Gateway pod on a slow node showed the request flow: Ingress → Gateway → Backend. The timeline showed a 6‑second gap between the Gateway receiving the request (timestamp 31 s) and forwarding it to the backend (timestamp 37 s).

The delay was traced to Redis interactions: the Gateway spent the majority of the 6 seconds establishing or using a Redis connection.

Redis Connection‑Pool Misconfiguration

ARMS code‑hotspot flame graphs showed that getNativeConnection() (Lettuce) consumed 1.86 seconds of a 2‑second request, indicating a bottleneck in obtaining Redis connections. Examination of the Redis pool configuration revealed that the enabled flag was not set. As the Spring Boot code below shows, an unset flag falls back to whether commons‑pool2 is on the classpath, and that dependency was also missing, so the pool was never active despite min‑idle, max‑idle and max‑active being set.

    protected boolean isPoolEnabled(Pool pool) {
        Boolean enabled = pool.getEnabled();
        return (enabled != null) ? enabled : COMMONS_POOL2_AVAILABLE;
    }

    private static final boolean COMMONS_POOL2_AVAILABLE = ClassUtils.isPresent("org.apache.commons.pool2.ObjectPool",
        RedisConnectionConfiguration.class.getClassLoader());

Adding the enabled=true setting and the commons‑pool2 dependency reduced the average latency from 1.3 s to 680 ms, but responses were still noticeably slower than before the migration.
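The fallback described above can be reproduced in a few lines of plain Java. This is only a sketch of the decision logic, not Spring Boot's actual classes: the class name PoolEnabledSketch and both method names are hypothetical, and Class.forName stands in for Spring's ClassUtils.isPresent.

```java
// Plain-Java sketch of the pool-enablement decision; names are hypothetical.
final class PoolEnabledSketch {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isPresent(String className) {
        try {
            // 'false' = do not initialize the class, only check that it resolves.
            Class.forName(className, false, PoolEnabledSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * An explicit flag wins; an unset (null) flag falls back to whether
     * commons-pool2 was detected on the classpath.
     */
    static boolean isPoolEnabled(Boolean enabled, boolean commonsPool2Available) {
        return (enabled != null) ? enabled : commonsPool2Available;
    }

    public static void main(String[] args) {
        boolean available = isPresent("org.apache.commons.pool2.ObjectPool");
        System.out.println("commons-pool2 on classpath: " + available);
        System.out.println("flag unset -> pooling: " + isPoolEnabled(null, available));
        System.out.println("flag true  -> pooling: " + isPoolEnabled(true, available));
    }
}
```

Run without commons‑pool2 on the classpath, the unset case resolves to false, which matches the situation in this incident: the pool size settings were present but had no effect. An explicit true changes only the decision and does not supply the pooling library, which is presumably why the fix added both.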

ValidateConnection Lock Contention

Further profiling showed that LettuceConnectionFactory.validateConnection() synchronizes on a shared monitor and issues a blocking PING every time a connection is validated. Under high concurrency, many Reactor threads queued on that lock, and this contention was the primary cause of the remaining delay.

    void validateConnection() {
        synchronized (this.connectionMonitor) {
            boolean valid = false;
            if (connection != null && connection.isOpen()) {
                try {
                    if (connection instanceof StatefulRedisConnection) {
                        ((StatefulRedisConnection) connection).sync().ping();
                    }
                    if (connection instanceof StatefulRedisClusterConnection) {
                        ((StatefulRedisClusterConnection) connection).sync().ping();
                    }
                    valid = true;
                } catch (Exception e) {
                    log.debug("Validation failed", e);
                }
            }
            if (!valid) {
                log.info("Validation of shared connection failed. Creating a new connection.");
                resetConnection();
                this.connection = getNativeConnection();
            }
        }
    }

Removing the call LettuceConnectionFactory.setValidateConnection(true) eliminated the lock contention, bringing response times down to ~380 ms, comparable to pre‑migration performance.
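To see why a blocking call inside a shared lock hurts under concurrency, the following simulation may help. It is a sketch under stated assumptions, not a reproduction of the Gateway: Thread.sleep stands in for the PING round trip, the 25 ms round trip and 8 callers are made-up values, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simulation only: no Redis involved, the "ping" is a sleep.
final class ValidationContentionSketch {

    private static final Object MONITOR = new Object();

    /** Pretends to validate a connection by blocking for one round trip. */
    static void fakePing(long roundTripMillis) {
        try {
            Thread.sleep(roundTripMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Runs one simulated validation per caller thread and returns the elapsed
     * wall-clock time in milliseconds. With locked=true all callers queue on
     * the same monitor, as they do in validateConnection().
     */
    static long run(int callers, long roundTripMillis, boolean locked) {
        ExecutorService pool = Executors.newFixedThreadPool(callers);
        CountDownLatch done = new CountDownLatch(callers);
        long start = System.nanoTime();
        for (int i = 0; i < callers; i++) {
            pool.execute(() -> {
                try {
                    if (locked) {
                        synchronized (MONITOR) {
                            fakePing(roundTripMillis);
                        }
                    } else {
                        fakePing(roundTripMillis);
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        int callers = 8;           // hypothetical number of concurrent requests
        long roundTripMillis = 25; // hypothetical round trip, not a measured value
        System.out.println("shared lock: " + run(callers, roundTripMillis, true) + " ms");
        System.out.println("no lock:     " + run(callers, roundTripMillis, false) + " ms");
    }
}
```

With the lock, the total time grows roughly as callers × round trip, because only one simulated PING can be in flight at a time. Without it, the callers overlap and the total stays close to a single round trip. The same arithmetic suggests why a longer network path to Redis makes the contention more visible.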

Conclusion

The root causes were:

The Redis connection pool was not enabled because the enabled flag and commons‑pool2 dependency were missing.

Enabling validateConnection introduced lock contention in high‑concurrency scenarios.

Fixes included adding the missing dependency, setting enabled=true, and disabling per‑request connection validation.

Further Thoughts

Even though the same pool settings existed before migration, the cross‑region latency after migration amplified the impact of the misconfiguration, most likely because every newly established connection and every validation PING now paid a longer round trip. For environments with frequent Redis interactions, it is advisable to disable validateConnection or to move validation off the request path into a scheduled task.

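    // Assumes an injected field: private final RedisConnectionFactory redisConnectionFactory;
    // Runs every 2 seconds (Spring cron fields: second minute hour day-of-month month day-of-week).
    @Scheduled(cron = "0/2 * * * * *")
    public void task() {
        if (redisConnectionFactory instanceof LettuceConnectionFactory) {
            LettuceConnectionFactory f = (LettuceConnectionFactory) redisConnectionFactory;
            f.validateConnection();
        }
    }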
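Outside a Spring scheduling context, the same idea can be sketched with the JDK's ScheduledExecutorService. The class name BackgroundValidationSketch, the thread name, and the validator argument are all hypothetical; in a real service the Runnable would wrap a call such as the factory's validateConnection().

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: run connection validation on one background thread instead of per request.
final class BackgroundValidationSketch {

    /**
     * Starts a single daemon thread that runs the validator with a fixed delay
     * between runs. A single thread means validations never overlap.
     */
    static ScheduledExecutorService start(Runnable validator, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "redis-validation");
            t.setDaemon(true); // do not keep the JVM alive just for validation
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                validator.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so report and continue.
                System.err.println("validation failed: " + e.getMessage());
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder validator; a real one would ping Redis.
        ScheduledExecutorService scheduler =
                start(() -> System.out.println("validating shared connection"), 2000);
        Thread.sleep(5000);
        scheduler.shutdownNow();
    }
}
```

The try/catch around the validator matters: scheduleWithFixedDelay suppresses subsequent executions once a run throws, so a single failed validation would otherwise silently end the checks.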

Reference: ARMS Code Hotspots Documentation
