
Redis Monitoring and Alerting Practices: Metrics, Thresholds, and Troubleshooting

This article presents a comprehensive guide to Redis monitoring and alerting, covering metric classification, threshold settings, client traffic collection, host resource usage, instance health checks, cluster failover diagnostics, and detailed explanations of Redis INFO sections with practical code examples.

Sohu Tech Products

1. Background

Our Redis monitoring and alerting practice grew out of the development of the CacheCloud cloud platform. As the number of Redis instances grows, so do the problems, and a complete monitoring and alerting mechanism is needed to locate issues quickly and to support development and operations.

When developing and using Redis, common problems include:

Problem 1: When does Redis memory need to be expanded?

Problem 2: How to collect client traffic and exceptions?

Problem 3: Which system resources does a Redis instance consume?

Problem 4: Is the Redis instance running normally?

Problem 5: Can a Redis cluster guarantee successful failover during machine failures?

Starting from these questions, we share how to improve Redis monitoring and alerting during development.

2. Monitoring Metric Classification

Redis is a key‑value NoSQL database that stores all data in memory, so comprehensive monitoring must consider internal and external factors such as client connections, OPS, hot keys or big keys, CPU contention, host disk I/O pressure, and cluster instance distribution. The metrics are divided into five categories:

Redis internal metrics

Client collection metrics

Host/container environment metrics

Instance runtime status metrics

Cluster topology metrics

Problem 1: When to expand Redis memory?

First we need to know which internal metrics are available. They can be obtained with info all, which returns information about clients, memory usage, runtime statistics, persistence, command statistics, cluster information, etc. The following internal metrics are key for monitoring:

| Index | Config Name | Description | Relation | Threshold |
|---|---|---|---|---|
| 1 | aof_current_size | AOF current size (MB) | Greater than | 6000 |
| 2 | aof_delayed_fsync | Number of AOF blocks per minute | Greater than | 3 |
| 3 | client_biggest_input_buf | Maximum input buffer size (MB) | Greater than | 10 |
| 4 | client_longest_output_list | Maximum output buffer queue length | Greater than | 50000 |
| 5 | instantaneous_ops_per_sec | Real-time OPS | Greater than | 60000 |
| 6 | latest_fork_usec | Last fork time (µs) | Greater than | 400000 |
| 7 | mem_fragmentation_ratio | Memory fragmentation ratio (alert when > 1.5) | Greater than | 1.5 |
| 8 | rdb_last_bgsave_status | Last BGSAVE status | Not equal to | ok |
| 9 | total_net_output_bytes | Network output per minute (MB) | Greater than | 5000 |
| 10 | total_net_input_bytes | Network input per minute (MB) | Greater than | 1200 |
| 11 | sync_partial_err | Partial replication failures per minute | Greater than | 0 |
| 12 | sync_partial_ok | Partial replication successes per minute | Greater than | 0 |
| 13 | sync_full | Full replication executions per minute | Greater than | 0 |
| 14 | rejected_connections | Rejected connections per minute | Greater than | 400000 |
| 15 | master_slave_offset_diff | Master-slave offset difference (bytes) | Greater than | 20000000 |
| 16 | cluster_state | Cluster state | Not equal to | ok |
| 17 | cluster_slots_ok | Number of successfully allocated slots | Not equal to | 16384 |
| 18 | used_memory_percent | Memory usage percentage | Greater than | 80% |
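
The threshold table above can be read as a list of (metric, comparison, threshold) rules. The following is a minimal sketch of how such rules might be evaluated against one sample of metric values. The names COMPARATORS, RULES and evaluate_rules are hypothetical and only a few rows are copied from the table; this is not CacheCloud's actual implementation.

```python
import operator

# Comparison names as they appear in the threshold table.
COMPARATORS = {
    "Greater than": operator.gt,
    "Not equal to": operator.ne,
}

# (metric, comparison, threshold): a few rows copied from the table.
RULES = [
    ("instantaneous_ops_per_sec", "Greater than", 60000),
    ("mem_fragmentation_ratio", "Greater than", 1.5),
    ("rdb_last_bgsave_status", "Not equal to", "ok"),
    ("cluster_slots_ok", "Not equal to", 16384),
]


def evaluate_rules(metrics, rules=RULES):
    """Return (metric, value, threshold) for every rule whose condition holds."""
    triggered = []
    for name, comparison, threshold in rules:
        if name not in metrics:
            continue  # metric was not collected for this instance
        if COMPARATORS[comparison](metrics[name], threshold):
            triggered.append((name, metrics[name], threshold))
    return triggered


# Hypothetical sample values, already converted to numbers where needed.
sample = {
    "instantaneous_ops_per_sec": 1560,
    "mem_fragmentation_ratio": 1.72,
    "rdb_last_bgsave_status": "err",
    "cluster_slots_ok": 16384,
}
alerts = evaluate_rules(sample)
for alert in alerts:
    print(alert)
```

Keeping the rules as data rather than code makes it straightforward to adjust a threshold per instance without touching the evaluation logic.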

Memory needs to be expanded when used_memory approaches or exceeds maxmemory. The default alarm threshold for memory usage (used_memory relative to maxmemory) is 80%; exceeding it triggers an email alert to administrators.

1). Current instance memory usage
127.0.0.1:6470> info memory
# Memory
used_memory:6441145376
used_memory_human:6.00G
used_memory_rss:6622314496
used_memory_rss_human:6.17G
used_memory_peak:6442453200
used_memory_peak_human:6.00G
used_memory_peak_perc:99.98%
maxmemory:6442450944
maxmemory_human:6.00G
...

2). Eviction policy (volatile‑lru)
config get maxmemory-policy
1) "maxmemory-policy"
2) "volatile-lru"

3). Expired and evicted keys
127.0.0.1:6470> info stats
# Stats
...
expired_keys:19125
evicted_keys:1120
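
As an illustration of the memory check, the sketch below parses INFO output like the sample above and computes the usage percentage against the 80% threshold. The helper names parse_info and used_memory_percent are hypothetical, and the input text is a shortened copy of the output shown above.

```python
def parse_info(text):
    """Parse 'key:value' lines of INFO output into a dict of strings."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers such as '# Memory', and elisions.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        result[key] = value
    return result


def used_memory_percent(info):
    """Return used_memory as a percentage of maxmemory, or None if unlimited."""
    maxmemory = int(info["maxmemory"])
    if maxmemory == 0:
        return None  # maxmemory 0 means no memory limit is configured
    return 100.0 * int(info["used_memory"]) / maxmemory


INFO_MEMORY = """\
# Memory
used_memory:6441145376
used_memory_rss:6622314496
maxmemory:6442450944
"""

info = parse_info(INFO_MEMORY)
percent = used_memory_percent(info)
needs_alert = percent is not None and percent > 80.0
print(f"used_memory_percent={percent:.2f}% alert={needs_alert}")
```

For the sample instance the usage is about 99.98%, well above the threshold, which matches the non-zero evicted_keys counter in the output above.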

Problem 2: How to collect client traffic and exceptions?

A unified client SDK is required to embed data collection points that record Redis call statistics and exception information, enabling rapid problem location from the client perspective.

Client command statistics include:

Basic fields: collectTime, client IP, application ID.

Command fields: command, count, input/output traffic, cumulative latency (ms).

Client exception statistics cover connection failures, latency counts, and timeout details.

Data collection process:

Metrics are temporarily stored in a queue (max size 1024).

A single thread aggregates metrics per minute.

A dedicated thread sends HTTP requests to the server.

Aggregated data is visualized in charts and tables.
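
The collection steps above can be sketched as a bounded queue that is drained and summed once per minute. This is only an outline under stated assumptions: the names METRIC_QUEUE, record and aggregate are hypothetical, dropping samples when the queue is full is an assumed policy, and the scheduling thread and the HTTP upload are left out.

```python
import queue
from collections import defaultdict

# Bounded buffer between business threads and the aggregator (max 1024 entries).
METRIC_QUEUE = queue.Queue(maxsize=1024)


def record(command, cost_ms, in_bytes, out_bytes):
    """Enqueue one call sample; drop it instead of blocking when the queue is full."""
    try:
        METRIC_QUEUE.put_nowait((command, cost_ms, in_bytes, out_bytes))
        return True
    except queue.Full:
        return False  # assumed policy: losing a sample beats slowing the caller


def aggregate():
    """Drain the queue and sum the samples per command (meant to run once a minute)."""
    stats = defaultdict(
        lambda: {"count": 0, "cost_ms": 0, "in_bytes": 0, "out_bytes": 0}
    )
    while True:
        try:
            command, cost_ms, in_bytes, out_bytes = METRIC_QUEUE.get_nowait()
        except queue.Empty:
            break
        entry = stats[command]
        entry["count"] += 1
        entry["cost_ms"] += cost_ms
        entry["in_bytes"] += in_bytes
        entry["out_bytes"] += out_bytes
    return dict(stats)


record("get", 2, 20, 300)
record("get", 4, 20, 500)
record("set", 1, 120, 5)
report = aggregate()
print(report)
```

The non-blocking put keeps the collection point off the critical path of the Redis call, which is one way to keep the SDK transparent to the application.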

Daily aggregated client metrics allow quick identification of abnormal applications.

Problem 3: Which system resources does a Redis instance consume?

Redis runs as a process on the host; resource contention such as CPU competition, disk I/O pressure, or network blockage can affect performance. The following host‑level metrics are monitored:

| Index | Metric Type | Level | Description |
|---|---|---|---|
| 1 | CPU limit duration | Container | Top-10 containers by CPU limit time |
| 2 | CPU usage | Container | Top-10 containers by CPU usage |
| 3 | RSS memory | Container | Memory usage threshold 85% to prevent OOM |
| 4 | Disk I/O | Host | Diagnose disk read/write pressure |
| 5 | Network traffic | Host | Diagnose abnormal network traffic |
| 6 | Disk space | Host | Check if disk space is sufficient |
| 7 | Host uptime | Host | Diagnose host machine failures |

Problem 4: Is the Redis instance running normally?

Instances usually run in containers; if a container is paused or the instance is blocked for a long time, service calls are affected. The following status metrics are monitored:

| Index | Metric Type | Meaning | Monitoring Interval |
|---|---|---|---|
| 1 | Redis instance status | 0: heartbeat stopped, 1: running, 2: offline, 3: permanently offline | Every minute |
| 2 | Pod status | 0: offline, 1: online | Callback notification |

Instance status is detected by a background task; abnormal responses trigger email alerts.

Pod status changes are reported via callbacks to notify unavailability or automatic recovery.

Problem 5: Can the Redis cluster guarantee failover during machine failures?

During normal use of Redis Sentinel or Redis Cluster, machine failures or restarts can cause instances on the failed machine to become unavailable, leading to topology anomalies and possible failover failures.

| Index | Type | Meaning |
|---|---|---|
| 1 | Redis Sentinel cluster | Diagnose whether the cluster topology is abnormal |
| 2 | Redis Cluster | Diagnose whether the cluster topology is abnormal |

Sentinel topology requirements: at least three physical machines; master and slave nodes on different machines; every master has at least one slave.

Cluster topology requirements: at least three physical machines; failover must still succeed when any single host fails; every master has at least one slave.
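
A topology check along these lines could look like the sketch below. It covers only the three requirements stated above (at least three machines, every master has a slave, a master does not share a host with all of its slaves). The node dictionary layout and the function name check_topology are assumptions made for illustration, not the platform's data model.

```python
def check_topology(nodes):
    """Return a list of human-readable problems found in the node layout."""
    problems = []
    hosts = {node["host"] for node in nodes}
    if len(hosts) < 3:
        problems.append(f"only {len(hosts)} physical machine(s), need at least 3")
    for master in (n for n in nodes if n["role"] == "master"):
        slaves = [n for n in nodes if n.get("master_id") == master["id"]]
        if not slaves:
            problems.append(f"master {master['id']} has no slave")
        elif all(s["host"] == master["host"] for s in slaves):
            # Losing this host would take the master and every slave with it.
            problems.append(
                f"master {master['id']} shares host {master['host']} with all of its slaves"
            )
    return problems


# Hypothetical layouts: one healthy, one that violates all three requirements.
healthy = [
    {"id": "m1", "role": "master", "host": "10.0.0.1", "master_id": None},
    {"id": "s1", "role": "slave", "host": "10.0.0.2", "master_id": "m1"},
    {"id": "s2", "role": "slave", "host": "10.0.0.3", "master_id": "m1"},
]
broken = [
    {"id": "m1", "role": "master", "host": "10.0.0.1", "master_id": None},
    {"id": "s1", "role": "slave", "host": "10.0.0.1", "master_id": "m1"},
    {"id": "m2", "role": "master", "host": "10.0.0.2", "master_id": None},
]
print(check_topology(healthy))
for problem in check_topology(broken):
    print(problem)
```

Running such a check periodically, and again after every host restart, surfaces layouts that would fail over badly before a machine actually fails.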

3. Monitoring Metric Explanation

The previous section introduced the categories of Redis monitoring metrics; this section provides detailed analysis of possible causes for metric alarms.

3.1 Redis INFO metric description

The info all command returns the most complete system status. Modules include Clients, Cluster, Commandstats, CPU, Keyspace, Memory, Persistence, Replication, Server, etc.

1) info clients module

Provides client connection count, blocked command count, input/output buffers.

127.0.0.1:6399> info clients
# Clients
connected_clients:300
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

connected_clients: current client connections, alarm when > 2000.

client_longest_output_list: max output buffer queue length, alarm when > 500.

client_biggest_input_buf: max input buffer size (MB), alarm when > 10.

2) info persistence module

Shows RDB and AOF persistence statistics.

127.0.0.1:6399> info persistence
# Persistence
loading:0
rdb_changes_since_last_save:589319695
rdb_bgsave_in_progress:0
rdb_last_save_time:1576117607
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:2
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:2129920
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:30
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:20979712
aof_current_size:1654862653
aof_base_size:1584570244
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:15

aof_current_size: AOF size (MB), alarm when > 6000.

aof_delayed_fsync: number of delayed AOF fsyncs (AOF blocking events) per minute, alarm when > 3.

rdb_last_bgsave_status: last BGSAVE status, alarm when not "ok" (e.g., disk space shortage).

3) info stats

Basic statistics covering connections, commands, network, expiration, replication, etc.

127.0.0.1:6399> info stats
# Stats
total_connections_received:5261797
total_commands_processed:9448523137
instantaneous_ops_per_sec:1560
total_net_input_bytes:1307208851742
total_net_output_bytes:5338907609106
instantaneous_input_kbps:68.02
instantaneous_output_kbps:137.20
rejected_connections:0
sync_full:3
sync_partial_ok:1
sync_partial_err:3
expired_keys:7984396
expired_stale_perc:0.10
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:3477205762
keyspace_misses:5308763830
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:114520
...

total_net_output_bytes: network output per minute (MB), alarm when > 2500.

total_net_input_bytes: network input per minute (MB), alarm when > 1200.

latest_fork_usec: last fork time (µs), alarm when > 600000.

sync_full: full replication executions per minute, alarm when > 0.

sync_partial_ok: partial replication successes per minute, alarm when > 0.
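
Several of these fields are counters that accumulate from instance start, so a per-minute value has to be derived from two consecutive samples. The sketch below shows one way to do that. The function name per_minute_deltas and the second sample are hypothetical, and treating a decreasing counter as a reset is an assumption about how restarts should be handled.

```python
PER_MINUTE_COUNTERS = (
    "total_net_input_bytes",
    "total_net_output_bytes",
    "sync_full",
    "sync_partial_ok",
    "sync_partial_err",
    "rejected_connections",
)


def per_minute_deltas(previous, current, counters=PER_MINUTE_COUNTERS):
    """Difference of cumulative counters between two samples taken a minute apart."""
    deltas = {}
    for name in counters:
        if name not in previous or name not in current:
            continue
        diff = int(current[name]) - int(previous[name])
        # A negative difference means the counter restarted from zero
        # (instance restart or CONFIG RESETSTAT), so use the new value as is.
        deltas[name] = diff if diff >= 0 else int(current[name])
    return deltas


# First sample: values from the INFO output above. Second sample: hypothetical.
previous = {"total_net_output_bytes": "5338907609106", "sync_full": "3"}
current = {
    "total_net_output_bytes": str(5338907609106 + 3 * 1024**3),
    "sync_full": "3",
}
deltas = per_minute_deltas(previous, current)
output_mb = deltas["total_net_output_bytes"] / (1024 * 1024)
print(f"network output: {output_mb:.0f} MB/min, full syncs: {deltas['sync_full']}")
```

Comparing the raw cumulative value against a per-minute threshold would alert permanently on any long-running instance, which is why the difference is taken first.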

4) info replication

Replication statistics.

127.0.0.1:6399> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=x.x.x.x,port=6387,state=online,offset=4764522381799,lag=1
master_replid:ac5d8f6938d752f8f6d453a9841b2a4ed261bcfb
master_replid2:82ab36d6a6f7059757b96c83337f4a6132597e1a
master_repl_offset:4764522381799
second_repl_offset:3823758361609
repl_backlog_active:1
repl_backlog_size:10000000
repl_backlog_first_byte_offset:4764512381800
repl_backlog_histlen:10000000

repl_backlog_size: replication backlog size, alarm when > 10 MB.

master_slave_offset_diff: master-slave offset difference (bytes), alarm when > 10 MB.
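
master_slave_offset_diff is not a field of the INFO output itself, so it presumably has to be derived. A plausible sketch, assuming it is master_repl_offset minus the offset reported in each slaveN line, is shown below. The function names and the lagging slave1 entry are hypothetical; the 10 MB threshold is the one quoted above.

```python
def parse_slave_line(value):
    """Parse 'ip=...,port=...,state=online,offset=...,lag=1' into a dict."""
    return dict(item.split("=", 1) for item in value.split(","))


def offset_diffs(info):
    """Return {slave_key: master offset minus slave offset} in bytes."""
    master_offset = int(info["master_repl_offset"])
    diffs = {}
    for key, value in info.items():
        # Match slave0, slave1, ... but not other keys starting with 'slave'.
        if key.startswith("slave") and key[len("slave"):].isdigit():
            diffs[key] = master_offset - int(parse_slave_line(value)["offset"])
    return diffs


# slave0 is taken from the INFO output above; slave1 is a made-up lagging slave.
replication = {
    "role": "master",
    "slave0": "ip=x.x.x.x,port=6387,state=online,offset=4764522381799,lag=1",
    "slave1": "ip=x.x.x.x,port=6388,state=online,offset=4764500000000,lag=9",
    "master_repl_offset": "4764522381799",
}
THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB
diffs = offset_diffs(replication)
lagging = [name for name, diff in diffs.items() if diff > THRESHOLD_BYTES]
print(diffs, lagging)
```

A difference that stays above the backlog size is worth watching, because a slave that falls further behind than repl_backlog_size can no longer resynchronize partially after a disconnect.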

5) info cpu

CPU usage of the Redis process and its children.

127.0.0.1:6399> info cpu
# CPU
used_cpu_sys:156315.03
used_cpu_user:78421.15
used_cpu_sys_children:5608.28
used_cpu_user_children:19055.72

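These values are cumulative CPU seconds since the instance started, so compare successive samples rather than reading the absolute numbers. If the main process's kernel-mode time (used_cpu_sys) grows quickly, the instance is under heavy load, which may cause client blocking or slow responses.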
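
Because used_cpu_sys and used_cpu_user only ever increase, a usage rate is easier to alert on than the raw totals. The following sketch derives the percentage of one core used between two samples. The function name cpu_usage_percent, the second sample, and the 60-second interval are assumptions made for illustration.

```python
def cpu_usage_percent(previous, current, interval_seconds):
    """CPU usage of the main process between two samples, in percent of one core."""
    usage = {}
    for mode in ("sys", "user"):
        key = f"used_cpu_{mode}"
        consumed = float(current[key]) - float(previous[key])
        usage[mode] = 100.0 * consumed / interval_seconds
    return usage


# First sample: values from the INFO output above. Second sample: hypothetical,
# taken 60 seconds later.
first = {"used_cpu_sys": "156315.03", "used_cpu_user": "78421.15"}
second = {"used_cpu_sys": "156345.03", "used_cpu_user": "78427.15"}
usage = cpu_usage_percent(first, second, interval_seconds=60)
print(f"sys={usage['sys']:.1f}% user={usage['user']:.1f}%")
```

In this made-up sample the process spent roughly half of the interval in kernel mode, which is the kind of pattern the paragraph above describes.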

3.2 Client exception metric description

The SDK mainly uses the Jedis client. Common exception messages and typical causes include:

JedisConnectionException: Could not get a resource from the pool – connection pool leak, insufficient pool size, or slow queries causing blockage.

java.net.SocketTimeoutException: Read timed out – timeout too short, slow queries, or unstable network.

JedisDataException: ERR max number of clients reached – client connections exceed the configured maxclients (default 10000).

JedisDataException: LOADING Redis is loading the dataset in memory – Redis is loading persistence files and cannot serve reads/writes.

JedisDataException: NOAUTH Authentication required – client did not provide a password.
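
To turn these messages into client exception statistics, the collected messages can be bucketed by a distinguishing substring. The sketch below is written in Python for brevity even though the SDK itself is Java-based. The category labels and function names are hypothetical, and the last sample message is made up to show the fallback bucket.

```python
from collections import Counter

# (substring of the exception message, hypothetical category label)
PATTERNS = [
    ("Could not get a resource from the pool", "pool_exhausted"),
    ("Read timed out", "read_timeout"),
    ("max number of clients reached", "max_clients_reached"),
    ("LOADING Redis is loading the dataset", "loading_dataset"),
    ("NOAUTH Authentication required", "auth_required"),
]


def classify(message):
    """Map an exception message to a category, or 'other' if nothing matches."""
    for needle, category in PATTERNS:
        if needle in message:
            return category
    return "other"


def count_by_category(messages):
    """Count how many messages fall into each category."""
    return Counter(classify(message) for message in messages)


messages = [
    "JedisConnectionException: Could not get a resource from the pool",
    "java.net.SocketTimeoutException: Read timed out",
    "java.net.SocketTimeoutException: Read timed out",
    "JedisDataException: NOAUTH Authentication required",
    "JedisDataException: ERR unknown command",
]
summary = count_by_category(messages)
print(summary.most_common())
```

Grouping by cause rather than by exception class matters here because several distinct problems all surface as JedisDataException.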

4. Summary

Monitoring and alerting are crucial for service quality. For Redis, the key take‑aways are:

Client side: SDK integration is simple, data collection is transparent, and SDK reliability is ensured.

Server side: Internal monitoring (Redis metrics and client metrics) enables rapid problem location; external monitoring (host resources and cluster topology) supports quick migration and repair; proper resource allocation and fault recovery capabilities improve overall stability.
