How We Fixed Redis Crashes in Production: 9 Debugging Iterations

When traffic surged, a hotel-search service saw about 40% of requests fail with HTTP 500 errors caused by Redis connection problems. Over nine debugging iterations, including client swaps, a version upgrade, offloading persistence, adopting a proxy, and sharding data, the team stabilized the system and gained practical lessons for running Redis in production.

Java High-Performance Architecture
This article is a translation of Andy Grunwald’s post describing the Redis problems faced by the author’s company and the lessons learned.

Background

Product type: hotel search. Technology stack: front‑end PHP, back‑end Java, both using Redis. Redis was used for caching and temporary storage before persistence. Redis was introduced in 2010, initially accessed via the Predis client, switched to phpredis in 2013, and problems started appearing in 2014.

Problem Description

Rapid user growth doubled traffic in a short time. Hardware capacity was sufficient, but the software was not coping: about 40% of requests returned HTTP 500 Internal Server Error. Logs pointed to failures in the connection handling between PHP and Redis.

Debugging Process

Attempt 1

Initial attempts did not find the root cause. Measures tried included:

Increasing PHP connection count and raising timeout from 500 ms to 2.5 s.

Disabling default_socket_timeout in PHP.

Disabling SYN cookies on the host OS.

Checking file descriptor limits for Redis and web servers.

Increasing the host’s mbuffer.

Adjusting TCP backlog size.

All attempts were ineffective.

Attempt 2

Tried to reproduce the issue in a pre-release environment, but the traffic there was too low to trigger it.

Attempt 3

Considered whether Redis connections were not being closed. Modified code to manually close connections, but the problem persisted.

Attempt 4

Suspected the phpredis client. Ran an A/B test that switched 20% of users back to Predis. The errors appeared in both groups, indicating that phpredis was not the cause.
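The article does not say how the 20% of users were selected. One common approach, shown here as a minimal Python sketch rather than the team's actual code, is to hash a stable user identifier into 100 buckets so that each user always lands in the same group. The function name use_fallback_client and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def use_fallback_client(user_id: str, rollout_percent: int = 20) -> bool:
    """Return True if this user belongs to the test group.

    Hypothetical helper: hashes the user id into one of 100 buckets and
    selects the lowest `rollout_percent` buckets. The same user id always
    yields the same answer, so a user does not flip between clients.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    in_test = sum(use_fallback_client(u) for u in users)
    print(f"{in_test / len(users):.1%} of users routed to the fallback client")
```

Because the assignment depends only on the user id, error rates can be compared between the two groups without storing any per-user state.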

Attempt 5

Checked the Redis version (v2.6) and upgraded to the latest release at the time (v2.8.9). The upgrade did not resolve the issue.

Attempt 6

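Turned to Redis's latency diagnostics, including the Software Watchdog. Sampling latency with redis-cli (values in milliseconds) showed a low average but spikes of up to 463 ms: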

$ redis-cli --latency -p 6380 -h 1.2.3.4
min: 0, max: 463, avg: 2.03 (19443 samples)

Log excerpts showed a stall of about 400 ms every few minutes that coincided with background saving. With a large data set, the fork that starts each background save is expensive, and Redis cannot serve requests while the fork is in progress.

Solution: offload persistence to a dedicated slave whose only job is background saving, so the instances serving traffic no longer pay the fork cost.
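A spike like the 463 ms maximum in the redis-cli summary above is easy to miss when only the average is watched. The following Python sketch parses that summary line and flags a high maximum. The function names and the 100 ms threshold are assumptions for illustration; they are not part of Redis or of the original article.

```python
import re
from typing import Dict

# Matches the summary line printed by `redis-cli --latency`, for example:
#   min: 0, max: 463, avg: 2.03 (19443 samples)
_SUMMARY = re.compile(
    r"min:\s*(?P<min>\d+),\s*max:\s*(?P<max>\d+),"
    r"\s*avg:\s*(?P<avg>[\d.]+)\s*\((?P<samples>\d+) samples\)"
)


def parse_latency_summary(line: str) -> Dict[str, float]:
    """Extract min/max/avg (milliseconds) and the sample count."""
    match = _SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a latency summary line: {line!r}")
    return {
        "min": int(match["min"]),
        "max": int(match["max"]),
        "avg": float(match["avg"]),
        "samples": int(match["samples"]),
    }


def has_latency_spike(stats: Dict[str, float], max_ms: int = 100) -> bool:
    """Flag a spike when the worst sample reaches `max_ms` (assumed threshold)."""
    return stats["max"] >= max_ms


if __name__ == "__main__":
    stats = parse_latency_summary("min: 0, max: 463, avg: 2.03 (19443 samples)")
    print(stats, "spike" if has_latency_spike(stats) else "ok")
```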

Attempt 7

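Found slow queries that used keys *. The KEYS command walks the entire keyspace in a single call, so it blocked Redis for longer and longer as the data grew. It was replaced with SCAN, which iterates over the keyspace in small batches using a cursor.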
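The cursor loop behind SCAN can be sketched in Python as follows. The iter_keys helper is written against the scan(cursor, match, count) call that the redis-py client exposes, which returns a (next_cursor, keys) pair. FakeRedis is a hypothetical in-memory stand-in so the sketch runs without a server; it is not a real client and only mimics the cursor contract.

```python
import fnmatch
from typing import Iterable, Iterator, List, Tuple


def iter_keys(client, pattern: str = "*", batch: int = 100) -> Iterator[str]:
    """Yield matching keys in batches instead of one blocking KEYS call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch)
        yield from keys
        if cursor == 0:  # the server signals a finished iteration with cursor 0
            break


class FakeRedis:
    """Hypothetical stand-in for a Redis client, for demonstration only."""

    def __init__(self, keys: Iterable[str]) -> None:
        self._keys: List[str] = sorted(keys)

    def scan(self, cursor: int = 0, match: str = None,
             count: int = 10) -> Tuple[int, List[str]]:
        chunk = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0  # iteration complete
        if match is not None:
            chunk = [k for k in chunk if fnmatch.fnmatchcase(k, match)]
        return next_cursor, chunk


if __name__ == "__main__":
    fake = FakeRedis([f"hotel:{i}" for i in range(250)] + ["user:1"])
    print(sum(1 for _ in iter_keys(fake, "hotel:*")), "hotel keys found")
```

Two caveats apply with a real server: SCAN may return the same key more than once, so callers that need uniqueness should deduplicate, and count is only a hint for how much work each call does. redis-py also ships a scan_iter helper that wraps this loop.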

Attempt 8

After these adjustments the system stayed stable for months despite continued traffic growth. Then a new problem emerged: every request opened a new Redis connection, and the connection setup itself became a significant overhead.

Solution: introduced twemproxy (Twitter’s proxy) to maintain persistent connections from web servers to Redis, reducing connection overhead.

twemproxy also supports memcached and can block expensive commands such as keys and flushall.

Attempt 9

Further optimization via data sharding:

Separate data by context.

Apply consistent‑hash sharding for related data.

The result was lower load per machine and a more reliable cache.
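Consistent hashing, mentioned above, can be illustrated with a small hash ring. This is a minimal Python sketch under stated assumptions, not the team's implementation: the HashRing class, the SHA-256 hash, and the 100 virtual points per node are all choices made for illustration.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Hypothetical consistent-hash ring with virtual points per node."""

    def __init__(self, nodes: Iterable[str], replicas: int = 100) -> None:
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            for i in range(replicas):
                # Each node is placed on the ring at `replicas` positions.
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the node owning `key`: the first ring point clockwise."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    before = HashRing(["redis-a", "redis-b", "redis-c"])
    after = HashRing(["redis-a", "redis-b", "redis-c", "redis-d"])
    keys = [f"hotel:{i}" for i in range(2000)]
    moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
    print(f"{moved / len(keys):.0%} of keys moved after adding a node")
```

The point of the design is that adding or removing a node remaps only the keys adjacent to that node's ring points, whereas a plain hash(key) % N scheme would remap most keys and empty most of the cache at once.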

Conclusion

The original author provides a detailed account of their Redis journey, offering valuable practical experience for anyone operating Redis in production.

Original article: http://tech.trivago.com/2017/01/25/learn-redis-the-hard-way-in-production
