Why a Massive KEYS * Command Crashed Our Redis Service and How to Fix It

The article recounts a sudden Redis performance crisis caused by massive KEYS * commands, explains how monitoring, INFO, COMMANDSTATS and SLOWLOG revealed the issue, and outlines temporary and long‑term remediation steps for preventing similar outages.

Programmer DD

On a Monday morning, users reported that web pages were loading extremely slowly. Logging in to the server showed that calls to Redis were timing out: the cache that was supposed to speed things up had become the bottleneck.

Web Monitoring

Grafana monitoring indicated normal CPU, memory, and network usage on the server, pointing to Redis as the problem.

The Alibaba Cloud single‑node Redis (32 MB, 16 GB) showed CPU usage spiking to 100%.

QPS rose from just over 1,000 to 6,000 and connections grew from 0 to 3,000. Both were still far below the instance limits, but the sudden surge of commands queued up and drove CPU usage up.

Temporary solution: rent a new Redis instance, update the application configuration, and restart the app.

Server Command Monitoring

Using redis-cli and the INFO command, two abnormal points were identified:

The slow log showed that the ten slowest commands were all keys *. KEYS walks the entire keyspace, and because Redis executes commands on a single thread, every other request waits while it runs. Yet our application exposes no API that issues keys *.

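Further inspection of the command execution statistics revealed severe latency for several commands:

setnx: 75 million calls, avg 6 s
setex: 84 million calls, avg 7.33 s
del: 260 million calls, avg 69 s
hmset: 100 million calls, avg 64 s
hmget: 68 million calls, avg 9 s
hgetall: 1.4 billion calls, avg 205 s
keys: 20 million calls, avg 3740 s

The averages are quoted in seconds, but commandstats reports usec_per_call in microseconds, so they are more plausibly microseconds: 1.4 billion hgetall calls at 205 seconds each would not be possible.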

These command latencies typically correlate with the size of the stored values, so recent data growth or a new feature that heavily uses these commands could explain the CPU spike.

INFO commandstats reports one line per command, giving the call count, the total time in microseconds, and the average time per call in microseconds:

cmdstat_XXX: calls=XXX,usec=XXX,usec_per_call=XXX
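
To rank commands by average cost, the commandstats lines can be parsed with a few lines of Python. This is a minimal sketch: the function names and the sample text are made up for illustration (the numbers are not from the incident), and newer Redis versions may add extra fields to each line, which the parser simply carries along.

```python
def parse_commandstats(info_text):
    """Parse `INFO commandstats` text into {command: {field: number}}."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue  # skip section headers and blank lines
        name, _, fields = line.partition(":")
        parsed = {}
        for pair in fields.split(","):
            key, _, value = pair.partition("=")
            try:
                parsed[key.strip()] = float(value)
            except ValueError:
                continue  # ignore fields that are not numeric
        stats[name[len("cmdstat_"):]] = parsed
    return stats


def slowest_commands(stats, top=5):
    """Return the `top` commands ordered by usec_per_call, slowest first."""
    return sorted(
        stats.items(),
        key=lambda item: item[1].get("usec_per_call", 0.0),
        reverse=True,
    )[:top]


# Hypothetical sample in the commandstats format; values are illustrative only.
SAMPLE = """\
# Commandstats
cmdstat_get:calls=1000,usec=2000,usec_per_call=2.00
cmdstat_keys:calls=20,usec=74800,usec_per_call=3740.00
cmdstat_hgetall:calls=400,usec=82000,usec_per_call=205.00
"""

for name, fields in slowest_commands(parse_commandstats(SAMPLE)):
    print(f"{name}: {fields['calls']:.0f} calls, {fields['usec_per_call']:.2f} usec/call")
```

In practice the input would be the text printed by redis-cli INFO commandstats rather than the SAMPLE string.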

The SLOWLOG GET 10 command returns entries like:

1) (integer) 411
2) (integer) 1545386469
3) (integer) 232663
4) 1) "keys"
   2) "mecury:*"

The fields are, in order: the log ID, the Unix timestamp at which the command ran, the execution time in microseconds, and the command with its arguments.
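
The raw timestamp and microsecond duration are easier to read once converted. The sketch below assumes a slow log entry has already been obtained as a plain list in the field order just described; the function name describe_slowlog_entry is hypothetical. Redis 4.0 and later append the client address and name to each entry, which this sketch ignores by reading only the first four fields.

```python
from datetime import datetime, timezone


def describe_slowlog_entry(entry):
    """Turn one raw SLOWLOG entry into a readable dict.

    `entry` is assumed to be [id, unix_timestamp, duration_usec, [cmd, args...]].
    Any additional trailing fields are ignored.
    """
    entry_id, timestamp, duration_usec, args = entry[0], entry[1], entry[2], entry[3]
    return {
        "id": entry_id,
        "started_at": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "duration_ms": duration_usec / 1000,
        "command": " ".join(args),
    }


# The entry shown above, written as a Python list.
sample_entry = [411, 1545386469, 232663, ["keys", "mecury:*"]]
info = describe_slowlog_entry(sample_entry)
print(f"#{info['id']} {info['started_at'].isoformat()} "
      f"{info['duration_ms']:.1f} ms: {info['command']}")
```

For the entry above, the single keys call took roughly 233 ms, during which no other command could be served.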

These logs confirmed that a flood of keys * commands caused the CPU surge and response delays; our application had not exposed such a command.

Further investigation revealed that another application mistakenly pointed to our Redis instance and performed massive data crawling using keys *, overwhelming the server. After correcting the configuration, the issue was resolved.
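
If an application genuinely needs to enumerate keys, the usual alternative to keys * is the cursor-based SCAN command, which returns a small batch per call so other commands can run in between. The following is a minimal sketch, not the fix applied in this incident. It assumes a client object whose scan(cursor=..., match=..., count=...) method returns a (next_cursor, keys) pair, as redis-py's Redis.scan does. FakeClient is a hypothetical in-memory stand-in so the example runs without a server.

```python
from fnmatch import fnmatchcase


def scan_keys(client, pattern, count=1000):
    """Yield keys matching `pattern` using repeated SCAN calls.

    SCAN may return the same key more than once, so callers that need
    uniqueness should deduplicate.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        for key in keys:
            yield key
        if int(cursor) == 0:  # a cursor of 0 means the iteration is complete
            break


class FakeClient:
    """Hypothetical stand-in for a Redis client; pages through a list."""

    def __init__(self, keys, page_size=2):
        self._keys = list(keys)
        self._page_size = page_size

    def scan(self, cursor=0, match=None, count=None):
        page = self._keys[cursor:cursor + self._page_size]
        next_cursor = cursor + self._page_size
        if next_cursor >= len(self._keys):
            next_cursor = 0  # signal the end of the iteration
        if match is not None:
            page = [key for key in page if fnmatchcase(key, match)]
        return next_cursor, page


client = FakeClient(["mecury:1", "other:1", "mecury:2", "mecury:3", "other:2"])
for key in scan_keys(client, "mecury:*"):
    print(key)
```

With a real redis-py client, keys come back as bytes unless the client was created with decode_responses=True. A full scan still touches every key, so it costs the same work overall; the difference is that the work is split into short steps instead of one long blocking call.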

Summary

Check web monitoring dashboards first when Redis performance degrades.

Use INFO and SLOWLOG to inspect command statistics and identify slow commands.

Review and optimize Redis usage in application code.

Consider scaling or upgrading Redis if traffic continues to grow.
