Redis Monitoring and Alerting Practices: Metrics, Thresholds, and Troubleshooting
This article presents a comprehensive guide to Redis monitoring and alerting, covering metric classification, threshold settings, client traffic collection, host resource usage, instance health checks, cluster failover diagnostics, and detailed explanations of Redis INFO sections with practical code examples.
1. Background
The practices described here grew out of the development of the CacheCloud cloud platform. As the number of Redis instances grew, so did the variety of problems, and a complete monitoring and alerting mechanism became necessary to locate issues quickly and to support development and operations.
When developing and using Redis, common problems include:
Problem 1: When does Redis memory need to be expanded?
Problem 2: How to collect client traffic and exceptions?
Problem 3: Which system resources does a Redis instance consume?
Problem 4: Is the Redis instance running normally?
Problem 5: Can a Redis cluster guarantee successful failover during machine failures?
Starting from these questions, we share how to improve Redis monitoring and alerting during development.
2. Monitoring Metric Classification
Redis is a key‑value NoSQL database that stores all data in memory, so comprehensive monitoring must consider internal and external factors such as client connections, OPS, hot keys or big keys, CPU contention, host disk I/O pressure, and cluster instance distribution. The metrics are divided into five categories:
Redis internal metrics
Client collection metrics
Host/container environment metrics
Instance runtime status metrics
Cluster topology metrics
Problem 1: When to expand Redis memory?
First we need to know which internal metrics are available. They can be obtained via info all, which returns information about clients, memory usage, runtime statistics, persistence, command statistics, cluster information, etc. The following internal metrics are key for monitoring:
| Index | Config Name | Description | Relation | Threshold |
|---|---|---|---|---|
| 1 | aof_current_size | AOF current size (MB) | Greater than | 6000 |
| 2 | aof_delayed_fsync | Number of AOF blocks per minute | Greater than | 3 |
| 3 | client_biggest_input_buf | Maximum input buffer size (MB) | Greater than | 10 |
| 4 | client_longest_output_list | Maximum output buffer queue length | Greater than | 50000 |
| 5 | instantaneous_ops_per_sec | Real‑time OPS | Greater than | 60000 |
| 6 | latest_fork_usec | Last fork time (µs) | Greater than | 400000 |
| 7 | mem_fragmentation_ratio | Memory fragmentation ratio (alert when > 1.5) | Greater than | 1.5 |
| 8 | rdb_last_bgsave_status | Last BGSAVE status | Not equal to | ok |
| 9 | total_net_output_bytes | Network output per minute (MB) | Greater than | 5000 |
| 10 | total_net_input_bytes | Network input per minute (MB) | Greater than | 1200 |
| 11 | sync_partial_err | Partial replication failures per minute | Greater than | 0 |
| 12 | sync_partial_ok | Partial replication successes per minute | Greater than | 0 |
| 13 | sync_full | Full replication executions per minute | Greater than | 0 |
| 14 | rejected_connections | Rejected connections per minute | Greater than | 400000 |
| 15 | master_slave_offset_diff | Master‑slave offset difference (bytes) | Greater than | 20000000 |
| 16 | cluster_state | Cluster state | Not equal to | ok |
| 17 | cluster_slots_ok | Number of successfully allocated slots | Not equal to | 16384 |
| 18 | used_memory_percent | Memory usage percentage | Greater than | 80% |
Memory expansion is needed when used_memory approaches or exceeds maxmemory. The default alert threshold for memory usage is 80%; exceeding it triggers an email alert to administrators.
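The relation and threshold columns of the table above can be evaluated mechanically. The following is a minimal sketch, not CacheCloud's implementation: the Rule type, the check function, and the way used_memory_percent is derived from used_memory and maxmemory are assumptions made for illustration, and only a few rows of the table are included.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union

Value = Union[float, str]


@dataclass(frozen=True)
class Rule:
    """One row of the threshold table (hypothetical representation)."""
    metric: str
    relation: str      # "gt" for "Greater than", "ne" for "Not equal to"
    threshold: Value


# A few rows copied from the table; thresholds are the article's defaults.
RULES = (
    Rule("instantaneous_ops_per_sec", "gt", 60000),
    Rule("mem_fragmentation_ratio", "gt", 1.5),
    Rule("rdb_last_bgsave_status", "ne", "ok"),
    Rule("used_memory_percent", "gt", 80.0),
)


def used_memory_percent(used_memory: int, maxmemory: int) -> float:
    """Share of maxmemory in use; 0.0 when maxmemory is unset (0)."""
    return 100.0 * used_memory / maxmemory if maxmemory > 0 else 0.0


def check(metrics: Dict[str, Value], rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the names of metrics whose rule fires; missing metrics are skipped."""
    fired = []
    for rule in rules:
        if rule.metric not in metrics:
            continue
        value = metrics[rule.metric]
        if rule.relation == "gt" and float(value) > float(rule.threshold):
            fired.append(rule.metric)
        elif rule.relation == "ne" and str(value) != str(rule.threshold):
            fired.append(rule.metric)
    return fired


# Values taken from the sample outputs in this article.
sample = {
    "instantaneous_ops_per_sec": 1560,
    "rdb_last_bgsave_status": "ok",
    "used_memory_percent": used_memory_percent(6441145376, 6442450944),
}
alerts = check(sample)
```

With the sample values, only the memory rule fires, because used_memory is within a fraction of a percent of maxmemory.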
1). Current instance memory usage
127.0.0.1:6470> info memory
# Memory
used_memory:6441145376
used_memory_human:6.00G
used_memory_rss:6622314496
used_memory_rss_human:6.17G
used_memory_peak:6442453200
used_memory_peak_human:6.00G
used_memory_peak_perc:99.98%
maxmemory:6442450944
maxmemory_human:6.00G
...
2). Eviction policy (volatile‑lru)
config get maxmemory-policy
1) "maxmemory-policy"
2) "volatile-lru"
3). Expired and evicted keys
127.0.0.1:6470> info stats
# Stats
...
expired_keys:19125
evicted_keys:1120
Problem 2: How to collect client traffic and exceptions?
A unified client SDK is required to embed data collection points that record Redis call statistics and exception information, enabling rapid problem location from the client perspective.
Client command statistics include:
Basic fields: collectTime, client IP, application ID.
Command fields: command, count, input/output traffic, cumulative latency (ms).
Client exception statistics cover connection failures, latency counts, and timeout details.
Data collection process:
Metrics are temporarily stored in a queue (max size 1024).
A single thread aggregates metrics per minute.
A dedicated thread sends HTTP requests to the server.
Aggregated data is visualized in charts and tables.
Daily aggregated client metrics allow quick identification of abnormal applications.
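The collection flow described above can be sketched as follows. This is an illustration in Python rather than the SDK's actual code: the Sample fields mirror the fields listed earlier, while record, aggregate, and the drop-when-full policy are assumptions. The thread that sends the aggregated data over HTTP is omitted.

```python
import queue
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Sample:
    collect_time: int   # Unix seconds when the command finished
    command: str
    cost_ms: float
    in_bytes: int
    out_bytes: int


# Bounded buffer between caller threads and the aggregator (max size 1024).
BUFFER: "queue.Queue[Sample]" = queue.Queue(maxsize=1024)


def record(sample: Sample) -> bool:
    """Enqueue without blocking; drop the sample if the buffer is full (assumed policy)."""
    try:
        BUFFER.put_nowait(sample)
        return True
    except queue.Full:
        return False


def aggregate() -> Dict[Tuple[int, str], Dict[str, float]]:
    """Drain the buffer and sum count, latency and traffic per (minute, command)."""
    totals = defaultdict(lambda: {"count": 0, "cost_ms": 0.0, "in_bytes": 0, "out_bytes": 0})
    while True:
        try:
            s = BUFFER.get_nowait()
        except queue.Empty:
            break
        minute_start = s.collect_time // 60 * 60
        bucket = totals[(minute_start, s.command)]
        bucket["count"] += 1
        bucket["cost_ms"] += s.cost_ms
        bucket["in_bytes"] += s.in_bytes
        bucket["out_bytes"] += s.out_bytes
    return dict(totals)


record(Sample(1576117607, "GET", 0.4, 20, 120))
record(Sample(1576117610, "GET", 0.6, 20, 80))
record(Sample(1576117670, "SET", 1.0, 300, 5))
report = aggregate()
```

Dropping samples when the buffer is full keeps the business thread from ever blocking on monitoring, at the cost of slightly undercounting during bursts.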
Problem 3: Which system resources does a Redis instance consume?
Redis runs as a process on the host; resource contention such as CPU competition, disk I/O pressure, or network blockage can affect performance. The following host‑level metrics are monitored:
| Index | Metric Type | Level | Description |
|---|---|---|---|
| 1 | CPU limit duration | Container | Top‑10 containers by CPU limit time |
| 2 | CPU usage | Container | Top‑10 containers by CPU usage |
| 3 | RSS memory | Container | Memory usage threshold 85% to prevent OOM |
| 4 | Disk I/O | Host | Diagnose disk read/write pressure |
| 5 | Network traffic | Host | Diagnose abnormal network traffic |
| 6 | Disk space | Host | Check if disk space is sufficient |
| 7 | Host uptime | Host | Diagnose host machine failures |
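The two container-level checks listed above, the top-10 ranking and the 85% RSS threshold, are simple to express. The sketch below uses made-up container names and numbers, and the function names are illustrative; how the CPU and memory figures are collected is outside its scope.

```python
import heapq
from typing import Dict, List, Tuple


def top_n(usage: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """Return the n containers with the highest value, largest first."""
    return heapq.nlargest(n, usage.items(), key=lambda item: item[1])


def rss_over_threshold(rss_bytes: int, limit_bytes: int, threshold: float = 0.85) -> bool:
    """True when RSS exceeds the given share of the container memory limit."""
    return limit_bytes > 0 and rss_bytes / limit_bytes > threshold


# Hypothetical CPU usage (in cores) per container.
cpu_usage = {"redis-a": 0.42, "redis-b": 1.90, "redis-c": 0.95}
busiest = top_n(cpu_usage, n=2)
```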
Problem 4: Is the Redis instance running normally?
Instances usually run in containers; if a container is paused or the instance is blocked for a long time, service calls are affected. The following status metrics are monitored:
| Index | Metric Type | Meaning | Monitoring Interval |
|---|---|---|---|
| 1 | Redis instance status | 0: heartbeat stopped, 1: running, 2: offline, 3: permanently offline | Every minute |
| 2 | Pod status | 0: offline, 1: online | Callback notification |
Instance status is detected by a background task; abnormal responses trigger email alerts.
Pod status changes are reported via callbacks to notify unavailability or automatic recovery.
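A minimal sketch of the per-minute instance check, assuming the caller supplies a ping callable (for example a wrapper around a Redis PING). The status codes come from the table above; the names InstanceStatus, probe, and needs_alert are illustrative, and states 2 and 3 are assumed to be set by operators rather than detected automatically.

```python
from enum import IntEnum
from typing import Callable


class InstanceStatus(IntEnum):
    HEARTBEAT_STOPPED = 0
    RUNNING = 1
    OFFLINE = 2
    PERMANENTLY_OFFLINE = 3


def probe(ping: Callable[[], bool], current: InstanceStatus) -> InstanceStatus:
    """Run one heartbeat check; instances taken offline on purpose are left alone."""
    if current in (InstanceStatus.OFFLINE, InstanceStatus.PERMANENTLY_OFFLINE):
        return current
    try:
        alive = ping()
    except Exception:
        alive = False   # connection errors and timeouts count as a missed heartbeat
    return InstanceStatus.RUNNING if alive else InstanceStatus.HEARTBEAT_STOPPED


def needs_alert(previous: InstanceStatus, latest: InstanceStatus) -> bool:
    """Alert only on the transition from running to heartbeat stopped."""
    return previous == InstanceStatus.RUNNING and latest == InstanceStatus.HEARTBEAT_STOPPED
```

Alerting on the transition rather than on the state avoids sending the same email every minute while an instance stays down.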
Problem 5: Can the Redis cluster guarantee failover during machine failures?
During normal use of Redis Sentinel or Redis Cluster, machine failures or restarts can cause instances on the failed machine to become unavailable, leading to topology anomalies and possible failover failures.
| Index | Type | Meaning |
|---|---|---|
| 1 | Redis Sentinel cluster | Diagnose whether the cluster topology is abnormal |
| 2 | Redis Cluster | Diagnose whether the cluster topology is abnormal |
Sentinel topology requirements: at least three physical machines, master and slave nodes on different machines, and at least one slave for each master.
Cluster topology requirements: at least three physical machines, the ability to fail over when a host fails, and slaves for every master.
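These requirements can be checked from an inventory of nodes. The sketch below assumes a hypothetical Node record (name, host, role, master) built from whatever inventory the platform keeps; it does not call any Sentinel or Cluster API. It also interprets "different machines" as "at least one slave is on a machine other than its master's", which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    host: str                     # physical machine the instance runs on
    role: str                     # "master" or "slave"
    master: Optional[str] = None  # name of the master, for slaves only


def topology_problems(nodes: List[Node]) -> List[str]:
    """Return human-readable violations of the topology requirements."""
    problems = []
    if len({n.host for n in nodes}) < 3:
        problems.append("fewer than three physical machines")
    for m in (n for n in nodes if n.role == "master"):
        slaves = [n for n in nodes if n.role == "slave" and n.master == m.name]
        if not slaves:
            problems.append(f"{m.name}: no slave")
        elif all(s.host == m.host for s in slaves):
            problems.append(f"{m.name}: every slave shares the master's machine")
    return problems


good = [
    Node("m1", "host-a", "master"),
    Node("s1", "host-b", "slave", "m1"),
    Node("s2", "host-c", "slave", "m1"),
]
bad = [
    Node("m1", "host-a", "master"),
    Node("s1", "host-a", "slave", "m1"),
]
```

Running topology_problems(good) yields an empty list, while bad reports both the machine count and the co-located slave.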
3. Monitoring Metric Explanation
The previous section introduced the categories of Redis monitoring metrics; this section provides detailed analysis of possible causes for metric alarms.
3.1 Redis INFO metric description
The info all command returns the most complete system status. Modules include Clients, Cluster, Commandstats, CPU, Keyspace, Memory, Persistence, Replication, Server, etc.
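INFO output is plain text: section headers start with "#" and each field is a name:value line. A small parser along these lines is enough to feed the metrics into a monitoring job. The function name parse_info is an assumption, and values are deliberately left as strings.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, Dict[str, str]]:
    """Split INFO output into {section: {field: value}}; values are kept as strings."""
    sections: Dict[str, Dict[str, str]] = {}
    current = sections.setdefault("", {})   # fields seen before any "# Section" header
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = sections.setdefault(line.lstrip("#").strip(), {})
        elif ":" in line:
            # Split on the first colon only; values may themselves contain colons.
            key, value = line.split(":", 1)
            current[key] = value
    if not sections[""]:
        del sections[""]
    return sections


raw_info = """# Clients
connected_clients:300
blocked_clients:0

# Replication
role:master
slave0:ip=x.x.x.x,port=6387,state=online,offset=4764522381799,lag=1
"""
info = parse_info(raw_info)
```

After parsing, a field is addressed as info["Clients"]["connected_clients"], and numeric conversion is left to the caller.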
1) info clients module
Provides client connection count, blocked command count, input/output buffers.
127.0.0.1:6399> info clients
# Clients
connected_clients:300
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
connected_clients: current client connections, alarm when > 2000.
client_longest_output_list: max output buffer queue length, alarm when > 500.
client_biggest_input_buf: max input buffer size (MB), alarm when > 10.
2) info persistence module
Shows RDB and AOF persistence statistics.
127.0.0.1:6399> info persistence
# Persistence
loading:0
rdb_changes_since_last_save:589319695
rdb_bgsave_in_progress:0
rdb_last_save_time:1576117607
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:2
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:2129920
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:30
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:20979712
aof_current_size:1654862653
aof_base_size:1584570244
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:15
aof_current_size: AOF size (MB), alarm when > 6000.
aof_delayed_fsync: AOF block count per minute, alarm when > 3.
rdb_last_bgsave_status: last BGSAVE status, alarm when not "ok" (e.g., disk space shortage).
3) info stats
Basic statistics covering connections, commands, network, expiration, replication, etc.
127.0.0.1:6399> info stats
# Stats
total_connections_received:5261797
total_commands_processed:9448523137
instantaneous_ops_per_sec:1560
total_net_input_bytes:1307208851742
total_net_output_bytes:5338907609106
instantaneous_input_kbps:68.02
instantaneous_output_kbps:137.20
rejected_connections:0
sync_full:3
sync_partial_ok:1
sync_partial_err:3
expired_keys:7984396
expired_stale_perc:0.10
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:3477205762
keyspace_misses:5308763830
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:114520
...
total_net_output_bytes: network output per minute (MB), alarm when > 2500.
total_net_input_bytes: network input per minute (MB), alarm when > 1200.
latest_fork_usec: last fork time (µs), alarm when > 600000.
sync_full: full replication executions per minute, alarm when > 0.
sync_partial_ok: partial replication successes per minute, alarm when > 0.
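Counters such as total_net_output_bytes and sync_full are cumulative since the server started, so the "per minute" thresholds presumably apply to the difference between two consecutive samples. A sketch of that calculation follows; the function name and the reset handling are assumptions.

```python
from typing import Dict

BYTES_PER_MB = 1024 * 1024


def per_minute(prev: Dict[str, int], curr: Dict[str, int], elapsed_s: float) -> Dict[str, float]:
    """Convert two cumulative samples into per-minute rates."""
    rates = {}
    for name, now in curr.items():
        before = prev.get(name)
        if before is None or elapsed_s <= 0:
            continue
        # A smaller value means the counter was reset (restart or CONFIG RESETSTAT);
        # in that case count from zero instead of reporting a negative rate.
        delta = now - before if now >= before else now
        rates[name] = delta * 60.0 / elapsed_s
    return rates


# First sample uses the values shown above; the second sample is hypothetical.
prev = {"total_net_output_bytes": 5338907609106, "sync_full": 3}
curr = {"total_net_output_bytes": 5338907609106 + 300 * BYTES_PER_MB, "sync_full": 4}
rates = per_minute(prev, curr, elapsed_s=60)
output_mb_per_min = rates["total_net_output_bytes"] / BYTES_PER_MB
```

In this example the instance sent 300 MB in the minute, which is below the output threshold, while one full resynchronization would trigger the sync_full rule.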
4) info replication
Replication statistics.
127.0.0.1:6399> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=x.x.x.x,port=6387,state=online,offset=4764522381799,lag=1
master_replid:ac5d8f6938d752f8f6d453a9841b2a4ed261bcfb
master_replid2:82ab36d6a6f7059757b96c83337f4a6132597e1a
master_repl_offset:4764522381799
second_repl_offset:3823758361609
repl_backlog_active:1
repl_backlog_size:10000000
repl_backlog_first_byte_offset:4764512381800
repl_backlog_histlen:10000000
repl_backlog_size: replication backlog size, alarm when > 10 MB.
master_slave_offset_diff: master‑slave offset difference (bytes), alarm when > 10 MB.
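master_slave_offset_diff does not appear in the INFO output shown above, so it is presumably derived from master_repl_offset and the offset reported on each slaveN line. A minimal sketch under that assumption, with illustrative function names:

```python
from typing import Dict


def parse_slave_line(value: str) -> Dict[str, str]:
    """Turn 'ip=...,port=...,offset=...' into a dict."""
    return dict(pair.split("=", 1) for pair in value.split(","))


def offset_diffs(replication: Dict[str, str]) -> Dict[str, int]:
    """Return master_repl_offset minus each slave's reported offset, in bytes."""
    master_offset = int(replication["master_repl_offset"])
    diffs = {}
    for key, value in replication.items():
        # Match slave0, slave1, ... only; other fields starting with "slave" are skipped.
        if key.startswith("slave") and key[5:].isdigit():
            diffs[key] = master_offset - int(parse_slave_line(value)["offset"])
    return diffs


# Fields copied from the sample output above.
replication = {
    "role": "master",
    "connected_slaves": "1",
    "slave0": "ip=x.x.x.x,port=6387,state=online,offset=4764522381799,lag=1",
    "master_repl_offset": "4764522381799",
}
diffs = offset_diffs(replication)
```

In the sample the slave has caught up completely, so the difference is 0 bytes.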
5) info cpu
CPU usage of the Redis process and its children.
127.0.0.1:6399> info cpu
# CPU
used_cpu_sys:156315.03
used_cpu_user:78421.15
used_cpu_sys_children:5608.28
used_cpu_user_children:19055.72
If the main process accumulates a large amount of CPU time in kernel mode, it indicates heavy load and may cause client blocking or slow responses.
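The used_cpu_* values are cumulative CPU seconds, so a single reading says little about current load. One way to turn them into a utilization figure is to compare two samples, as in this sketch; the function name and the second sample are assumptions.

```python
def cpu_percent(prev_sys: float, prev_user: float,
                curr_sys: float, curr_user: float, elapsed_s: float) -> float:
    """Share of one core used between two samples, in percent."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    busy = (curr_sys - prev_sys) + (curr_user - prev_user)
    return 100.0 * busy / elapsed_s


# First sample is the output above; the second is a hypothetical reading 60 s later.
usage = cpu_percent(156315.03, 78421.15, 156333.03, 78433.15, elapsed_s=60)
```

Here the process consumed about 30 CPU seconds in a 60 second window, roughly half of one core, with most of it in kernel mode.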
3.2 Client exception metric description
The SDK mainly uses the Jedis client. Common exception messages and typical causes include:
JedisConnectionException: Could not get a resource from the pool – connection pool leak, insufficient pool size, or slow queries causing blockage.
java.net.SocketTimeoutException: Read timed out – timeout too short, slow queries, or unstable network.
JedisDataException: ERR max number of clients reached – client connections exceed the configured maxclients (default 10000).
JedisDataException: LOADING Redis is loading the dataset in memory – Redis is loading persistence files and cannot serve reads/writes.
JedisDataException: NOAUTH Authentication required – client did not provide a password.
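When exception messages are collected centrally, the list above can be used as a lookup table. The sketch below matches on message fragments; it is written in Python for consistency with the other examples rather than in Java, and the names KNOWN_CAUSES and classify are illustrative.

```python
from typing import Optional

# Message fragment -> typical cause, taken from the list above.
KNOWN_CAUSES = (
    ("Could not get a resource from the pool", "pool leak, pool too small, or slow queries"),
    ("Read timed out", "timeout too short, slow queries, or unstable network"),
    ("ERR max number of clients reached", "connections exceed maxclients"),
    ("LOADING", "Redis is loading persistence files"),
    ("NOAUTH", "client did not provide a password"),
)


def classify(message: str) -> Optional[str]:
    """Return the typical cause for a known exception message, else None."""
    for fragment, cause in KNOWN_CAUSES:
        if fragment in message:
            return cause
    return None
```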
4. Summary
Monitoring and alerting are crucial for service quality. For Redis, the key take‑aways are:
Client side: SDK integration is simple, data collection is transparent, and SDK reliability is ensured.
Server side: Internal monitoring (Redis metrics and client metrics) enables rapid problem location; external monitoring (host resources and cluster topology) supports quick migration and repair; proper resource allocation and fault recovery capabilities improve overall stability.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.