Master Redis Monitoring: Essential Metrics, Scripts, and Alerting Strategies
This guide walks operations engineers through building a complete Redis monitoring system—covering why monitoring matters, which metrics to collect, how to gather them with Prometheus and Grafana, and practical Bash scripts for health checks, memory, persistence, replication, client connections, and alert thresholds.
Background and Purpose
Redis is a high‑performance in‑memory data store widely used for caching, session storage and message queues. Continuous monitoring of its runtime state, memory consumption, persistence health, replication status, client connections and command latency is essential for maintaining service stability.
Prerequisites
Linux command‑line skills, basic Redis concepts, and a Redis 7.4.x instance. The examples assume Prometheus 2.50.x and Grafana 11.x for metric collection and visualization.
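Several scripts in this guide read fields that only newer Redis releases report, so it can help to confirm the version prerequisite before running them. The following is a minimal sketch, not part of the original scripts: the function names `version_at_least` and `redis_version_from_info` are hypothetical, and it assumes GNU `sort -V` is available on the host.

```shell
#!/bin/bash
# version_at_least HAVE WANT: succeed when dotted version HAVE >= WANT.
# Assumption: GNU coreutils `sort -V` (version sort) is installed.
version_at_least() {
    local have=$1 want=$2
    # The smaller of the two versions sorts first; HAVE is new enough
    # when that smaller value is WANT itself.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# redis_version_from_info: read `INFO server` text on stdin and print
# the value of the redis_version field.
redis_version_from_info() {
    tr -d '\r' | awk -F: '$1 == "redis_version" { print $2; exit }'
}
```

A possible use is `version_at_least "$(redis-cli INFO server | redis_version_from_info)" 7.4 || echo "Redis older than 7.4"`.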
Monitoring Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Application │────▶│ Redis │────▶│ Monitoring │
│ (code) │ │ (target) │ │ (collect/store)│
└─────────────┘ └─────────────┘ └─────────────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Alerting │ │ Alertmanager│
└─────────────┘ └─────────────┘
1. Basic Runtime Metrics
1.1 Version and Process Information
# Show Redis version and process details
redis-cli INFO server | grep -E '^(redis_version|redis_mode|os|arch_bits|process_id|tcp_port):'
# Example output
redis_version:7.4.0
redis_mode:standalone
os:Linux 6.8.5 x86_64
arch_bits:64
process_id:12345
tcp_port:6379
1.2 Uptime
# Show uptime in seconds and days
redis-cli INFO server | grep uptime_in
# Example output
uptime_in_seconds:864000
uptime_in_days:10
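# Simple Bash script to compute days
#!/bin/bash
# INFO lines end in \r\n, so strip the carriage return before doing arithmetic
uptime_seconds=$(redis-cli INFO server | grep '^uptime_in_seconds:' | cut -d: -f2 | tr -d '\r')
uptime_days=$(echo "scale=2; $uptime_seconds/86400" | bc)
echo "Redis uptime: ${uptime_days} days"
# Alert if uptime < 1 hour (possible recent restart)
if [ "$uptime_seconds" -lt 3600 ]; then
echo "WARNING: Redis restarted within the last hour"
fi
1.3 Basic Health‑Check Script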
#!/bin/bash
# redis_health_check.sh – basic health check
echo "=== Redis Health Check ==="
# 1. Process alive?
if redis-cli PING 2>/dev/null | grep -q PONG; then
echo "✓ Redis process is alive"
else
echo "✗ Redis process unavailable"
exit 1
fi
# 2. Version
VERSION=$(redis-cli INFO server | grep redis_version | cut -d: -f2 | tr -d '\r')
echo "✓ Version: $VERSION"
# 3. Uptime (days)
UPTIME=$(redis-cli INFO server | grep uptime_in_days | cut -d: -f2 | tr -d '\r')
echo "✓ Uptime: ${UPTIME} days"
# 4. Write test (checks for an explicit OK reply rather than relying on the exit code)
if [ "$(redis-cli SET test_key test_value EX 10 2>/dev/null | tr -d '\r')" = "OK" ]; then
redis-cli DEL test_key > /dev/null
echo "✓ Writable"
else
echo "✗ Not writable (out of memory, or a read-only replica)"
fi
# 5. Replication role
ROLE=$(redis-cli INFO replication | grep '^role:' | cut -d: -f2 | tr -d '\r')
echo "✓ Role: $ROLE"
2. Memory Metrics
2.1 Core Memory Fields
redis-cli INFO memory
# used_memory – total bytes allocated by Redis
# used_memory_human – human‑readable format
# used_memory_rss – OS‑reported resident set size
# used_memory_peak – historical peak usage
# used_memory_peak_perc – current used_memory as a percentage of used_memory_peak
# used_memory_lua – memory used by the Lua engine
# maxmemory – configured memory limit (bytes)
# maxmemory_policy – eviction policy (e.g. allkeys-lru)
# mem_fragmentation_ratio – used_memory_rss / used_memory (ideal close to 1.0)
# mem_fragmentation_bytes – used_memory_rss minus used_memory, in bytes
2.2 Memory Alert Script
#!/bin/bash
# redis_memory_alert.sh – memory usage alert
WARNING_THRESHOLD=80 # percent
CRITICAL_THRESHOLD=90 # percent
INFO=$(redis-cli INFO memory)
USED=$(echo "$INFO" | grep '^used_memory:' | cut -d: -f2 | tr -d '\r')
MAX=$(echo "$INFO" | grep '^maxmemory:' | cut -d: -f2 | tr -d '\r')
RSS=$(echo "$INFO" | grep '^used_memory_rss:' | cut -d: -f2 | tr -d '\r')
FRAG=$(echo "$INFO" | grep '^mem_fragmentation_ratio:' | cut -d: -f2 | tr -d '\r')
if [ "$MAX" != "0" ]; then
USAGE=$(echo "scale=2; $USED*100/$MAX" | bc)
echo "Memory usage: ${USAGE}%"
echo "Used: $(echo "scale=2; $USED/1024/1024" | bc) MB"
echo "Max : $(echo "scale=2; $MAX/1024/1024" | bc) MB"
echo "RSS : $(echo "scale=2; $RSS/1024/1024" | bc) MB"
echo "Fragmentation ratio: $FRAG"
USAGE_INT=${USAGE%.*}
USAGE_INT=${USAGE_INT:-0} # bc prints values below 1 as ".50", leaving no integer part
if [ "$USAGE_INT" -ge "$CRITICAL_THRESHOLD" ]; then
echo "ALERT: Memory usage ${USAGE}% exceeds ${CRITICAL_THRESHOLD}% – immediate action required!"
elif [ "$USAGE_INT" -ge "$WARNING_THRESHOLD" ]; then
echo "WARNING: Memory usage ${USAGE}% exceeds ${WARNING_THRESHOLD}%"
fi
else
echo "maxmemory not set"
fi
2.3 Memory Metric Reference
used_memory – total memory allocated by Redis (bytes)
used_memory_rss – resident set size reported by the OS
mem_fragmentation_ratio – fragmentation ratio (used_memory_rss / used_memory); normal range 1.0‑1.5, >2.0 indicates need for optimization
maxmemory – configured memory limit
mem_not_counted_for_evict – memory excluded from eviction calculations
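The ranges above can be turned into a small classifier. This is a sketch under stated assumptions: the function names `info_field` and `frag_status` and the labels `elevated` (for the 1.5 to 2.0 gap the reference leaves unnamed) and `below-1.0` are illustrative choices, not Redis terminology. A ratio under 1.0 is reported separately because it often means part of the dataset has been swapped out.

```shell
#!/bin/bash
# info_field NAME: read INFO text on stdin and print the value of one field.
# The exact match on $1 keeps used_memory from also matching used_memory_rss.
info_field() {
    tr -d '\r' | awk -F: -v name="$1" '$1 == name { print $2; exit }'
}

# frag_status RATIO: classify a fragmentation ratio using the ranges
# from the reference list (labels are hypothetical).
frag_status() {
    awk -v r="$1" 'BEGIN {
        if (r > 2.0)       print "optimize"
        else if (r > 1.5)  print "elevated"
        else if (r >= 1.0) print "normal"
        else               print "below-1.0"
    }'
}
```

For example, `frag_status "$(redis-cli INFO memory | info_field mem_fragmentation_ratio)"` prints one label for the local instance.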
3. Persistence Metrics
3.1 RDB Persistence
redis-cli INFO persistence
# rdb_changes_since_last_save – number of changes since last RDB save
# rdb_bgsave_in_progress – 1 if a BGSAVE is running
# rdb_last_save_time – Unix timestamp of the last successful save
# rdb_last_bgsave_status – ok or err
# rdb_last_bgsave_time_sec – duration of the last BGSAVE (seconds)
# rdb_current_bgsave_time_sec – elapsed time of the current BGSAVE
# rdb_saves – total number of RDB saves performed
3.2 AOF Persistence
redis-cli INFO persistence
# aof_enabled – 1 if AOF is enabled
# aof_rewrite_in_progress – 1 if an AOF rewrite is running
# aof_last_rewrite_time_sec – duration of the last AOF rewrite
# aof_current_rewrite_time_sec– elapsed time of the current rewrite
# aof_last_write_status – ok or err for the last AOF write
# aof_delayed_fsync – number of delayed fsync operations
3.3 Persistence Health‑Check Script
#!/bin/bash
# redis_persistence_check.sh – verify RDB and AOF health
echo "=== Redis Persistence Check ==="
INFO=$(redis-cli INFO persistence)
# RDB status
rdb_status=$(echo "$INFO" | grep '^rdb_last_bgsave_status:' | cut -d: -f2 | tr -d '\r')
rdb_in_progress=$(echo "$INFO" | grep '^rdb_bgsave_in_progress:' | cut -d: -f2 | tr -d '\r')
rdb_time=$(echo "$INFO" | grep '^rdb_last_bgsave_time_sec:' | cut -d: -f2 | tr -d '\r')
echo "RDB status: $rdb_status"
echo "RDB in progress: $rdb_in_progress"
if [ "$rdb_time" != "-1" ]; then
echo "RDB duration: ${rdb_time}s"
fi
# AOF status
aof_enabled=$(echo "$INFO" | grep '^aof_enabled:' | cut -d: -f2 | tr -d '\r')
aof_status=$(echo "$INFO" | grep '^aof_last_write_status:' | cut -d: -f2 | tr -d '\r')
echo "AOF enabled: $aof_enabled"
echo "AOF last write status: $aof_status"
# Alert on failures
if [ "$rdb_status" != "ok" ]; then
echo "⚠️ RDB save failed!"
fi
if [ "$aof_enabled" = "1" ] && [ "$aof_status" != "ok" ]; then
echo "⚠️ AOF write failed!"
fi
4. Replication Metrics
4.1 View Replication State
# Master node
redis-cli INFO replication
# Example output
# role:master
# connected_slaves:2
# slave0:ip=192.168.1.101,port=6379,state=online,offset=123456
# slave1:ip=192.168.1.102,port=6379,state=online,offset=123456
# Slave node
redis-cli INFO replication
# role:slave
# master_host:192.168.1.100
# master_port:6379
# master_link_status:up
# master_repl_offset:123456
# slave_repl_offset:123456
4.2 Replication Lag Check Script
#!/bin/bash
# check_replication_lag.sh – monitor replication delay
INFO=$(redis-cli INFO replication)
role=$(echo "$INFO" | grep '^role:' | cut -d: -f2 | tr -d '\r')
if [ "$role" = "master" ]; then
echo "Role: master"
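slaves=$(echo "$INFO" | grep -cE '^slave[0-9]+:')
echo "Number of slaves: $slaves"
master_offset=$(echo "$INFO" | grep '^master_repl_offset:' | cut -d: -f2 | tr -d '\r')
echo "$INFO" | tr -d '\r' | grep -E '^slave[0-9]+:' | while IFS= read -r line; do
name=${line%%:*}
offset=$(echo "$line" | grep -oE 'offset=[0-9]+' | cut -d= -f2)
lag=$(echo "$line" | grep -oE 'lag=[0-9]+' | cut -d= -f2)
echo " $name: $((master_offset - ${offset:-0})) bytes behind, last ack ${lag:-?}s ago"
if [ "${lag:-0}" -gt 10 ]; then echo "⚠️ $name replication lag exceeds 10 s"; fi
done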
else
echo "Role: slave"
master=$(echo "$INFO" | grep '^master_host:' | cut -d: -f2 | tr -d '\r')
link_status=$(echo "$INFO" | grep '^master_link_status:' | cut -d: -f2 | tr -d '\r')
echo "Master: $master"
echo "Link status: $link_status"
if [ "$link_status" != "up" ]; then
echo "⚠️ Replication link down!"
fi
fi
4.3 Replication Metric Reference
connected_slaves – number of replicas attached to the master
master_link_status – replication link state (up/down)
slave_repl_offset – offset of the slave
master_repl_offset – offset of the master
lag – per-replica field in each slaveN line on the master: seconds since the replica's last acknowledgement; normally 0 or 1, alert if >10 s
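On a master, each slaveN line of INFO replication carries `offset=` and `lag=` fields, so the backlog for a replica can be estimated as master_repl_offset minus that replica's offset. The sketch below parses INFO text from stdin; the function name `replica_lag` and the output layout are hypothetical choices for illustration.

```shell
#!/bin/bash
# replica_lag: read a master's `INFO replication` output on stdin and print
# one line per replica: "<name> <ip:port> <state> <bytes_behind> <lag_seconds>".
replica_lag() {
    tr -d '\r' | awk -F: '
        $1 == "master_repl_offset" { master = $2 }
        # Keep everything after the first colon so the key=value list stays whole.
        $1 ~ /^slave[0-9]+$/ { name[++n] = $1; body[n] = substr($0, index($0, ":") + 1) }
        END {
            for (i = 1; i <= n; i++) {
                split("", f)                      # reset the field map
                cnt = split(body[i], kv, ",")
                for (j = 1; j <= cnt; j++) { split(kv[j], p, "="); f[p[1]] = p[2] }
                printf "%s %s:%s %s %.0f %s\n", name[i], f["ip"], f["port"], f["state"], master - f["offset"], f["lag"]
            }
        }'
}
```

Running `redis-cli INFO replication | replica_lag` on a master prints one row per attached replica; the byte figure is the offset gap and the last column is the `lag` value in seconds.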
5. Client Connection Metrics
5.1 Connection Statistics
redis-cli INFO clients
# connected_clients – current client count
# cluster_connections – number of cluster connections
# maxclients – configured client limit
# blocked_clients – clients blocked on commands such as BLPOP
# tracking_clients – number of clients using client‑tracking
5.2 Client List
# List all clients
redis-cli CLIENT LIST
# Example line:
# id=1 addr=192.168.1.100:12345 fd=8 name= age=100 idle=0 flags=N db=0 sub=0 pub=0 multi=-1 cmd=ping
# Fields: id, addr, fd, idle, flags, cmd, etc.
5.3 Connection Monitoring Script
#!/bin/bash
# redis_clients_check.sh – monitor client connections
echo "=== Redis Client Check ==="
INFO=$(redis-cli INFO clients)
CLIENTS=$(echo "$INFO" | grep '^connected_clients:' | cut -d: -f2 | tr -d '\r')
MAX=$(echo "$INFO" | grep '^maxclients:' | cut -d: -f2 | tr -d '\r')
BLOCKED=$(echo "$INFO" | grep '^blocked_clients:' | cut -d: -f2 | tr -d '\r')
echo "Current connections: $CLIENTS"
echo "Maximum limit: $MAX"
echo "Blocked clients: $BLOCKED"
if [ "$MAX" != "0" ]; then
USAGE=$(echo "scale=2; $CLIENTS*100/$MAX" | bc)
echo "Connection usage: ${USAGE}%"
if (( $(echo "$USAGE > 80" | bc -l) )); then
echo "⚠️ Connection count approaching limit!"
fi
fi
# List idle connections > 1 hour
echo "Idle connections > 1h:"
redis-cli CLIENT LIST | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^idle=/) { split($i, kv, "="); if (kv[2] + 0 > 3600) print } }'
6. Keyspace Statistics
6.1 Keyspace Info
redis-cli INFO keyspace
# db0:keys=1000000,expires=500000,avg_ttl=3600000000
# db1:keys=0,expires=0,avg_ttl=0
# Fields: keys, expires, avg_ttl (milliseconds)
6.2 Keyspace Analysis Script
#!/bin/bash
# redis_keyspace_stats.sh – analyze keyspace growth and large keys
echo "=== Redis Keyspace Statistics ==="
# Show current keyspace info
redis-cli INFO keyspace
# Record history (db0 only)
DBINFO=$(redis-cli INFO keyspace | tr -d '\r' | grep '^db0:')
KEYS=$(echo "$DBINFO" | cut -d: -f2 | cut -d, -f1 | cut -d= -f2)
KEYS=${KEYS:-0} # db0 is absent from INFO keyspace when it holds no keys
DATE=$(date '+%Y-%m-%d %H:%M:%S')
CUR_TS=$(date +%s)
echo "$DATE $KEYS" >> /tmp/redis_keys_history.txt
# Show recent history
tail -20 /tmp/redis_keys_history.txt
# Compute growth rate if previous record exists (format: "<epoch_seconds> <keys>")
if [ -f /tmp/redis_keys_prev.txt ]; then
read -r PREV_TS PREV_KEYS < /tmp/redis_keys_prev.txt
DIFF=$((KEYS - PREV_KEYS))
TIME_DIFF=$((CUR_TS - PREV_TS))
if [ "$TIME_DIFF" -gt 0 ]; then
RATE=$(echo "scale=2; $DIFF/$TIME_DIFF" | bc)
echo "Key growth rate: $RATE keys/s"
fi
fi
# Save current for next run (epoch seconds avoid parsing a date string that contains a space)
echo "$CUR_TS $KEYS" > /tmp/redis_keys_prev.txt
# Top 10 large keys (sample 1000 keys)
echo "Top large keys:"
redis-cli --scan --pattern '*' | head -1000 | while read key; do
type=$(redis-cli TYPE "$key" 2>/dev/null)
case $type in
string) len=$(redis-cli STRLEN "$key" 2>/dev/null) ;;
list) len=$(redis-cli LLEN "$key" 2>/dev/null) ;;
set) len=$(redis-cli SCARD "$key" 2>/dev/null) ;;
zset) len=$(redis-cli ZCARD "$key" 2>/dev/null) ;;
hash) len=$(redis-cli HLEN "$key" 2>/dev/null) ;;
*) len=0 ;;
esac
echo "$key|$type|$len"
done | sort -t'|' -k3,3nr | head -10
7. Command Statistics
7.1 commandstats
redis-cli INFO commandstats
# cmdstat_get:calls=1000000,usec=5000000,usec_per_call=5.00
# cmdstat_set:calls=500000,usec=3000000,usec_per_call=6.00
# cmdstat_del:calls=100000,usec=1000000,usec_per_call=10.00
7.2 Slow Command Analysis
#!/bin/bash
# analyze_slow_commands.sh – list commands with highest average latency
echo "=== Slow Command Analysis ==="
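redis-cli INFO commandstats | tr -d '\r' | grep '^cmdstat_' | while IFS=: read -r cmd stats; do
# Anchor on the start of the field so rejected_calls and failed_calls are not matched
calls=$(echo "$stats" | grep -oP '(^|,)calls=\K\d+')
usec=$(echo "$stats" | grep -oP '(^|,)usec=\K\d+')
if [ -n "$calls" ] && [ "$calls" -gt 0 ] && [ -n "$usec" ]; then
avg=$(echo "scale=2; $usec/$calls" | bc)
# Emit the average first so the output can be sorted numerically, then drop it
echo "$avg $cmd: $calls calls, avg $avg µs"
fi
done | sort -rn | head -10 | cut -d' ' -f2-
8. Throughput Metrics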
8.1 QPS (Queries Per Second)
# View stats
redis-cli INFO stats
# instantaneous_ops_per_sec – operations per second (QPS)
# total_commands_processed – cumulative command count
# rejected_connections – connections refused because the maxclients limit was reached
8.2 QPS Monitoring Script
#!/bin/bash
# redis_qps_monitor.sh – monitor queries per second
echo "=== Redis QPS Monitoring ==="
INFO=$(redis-cli INFO stats)
OPS=$(echo "$INFO" | grep '^instantaneous_ops_per_sec:' | cut -d: -f2 | tr -d '\r')
TOTAL=$(echo "$INFO" | grep '^total_commands_processed:' | cut -d: -f2 | tr -d '\r')
REJECTED=$(echo "$INFO" | grep '^rejected_connections:' | cut -d: -f2 | tr -d '\r')
CONN=$(echo "$INFO" | grep '^total_connections_received:' | cut -d: -f2 | tr -d '\r')
echo "Current QPS: $OPS"
echo "Total commands: $TOTAL"
echo "Rejected connections: $REJECTED"
echo "Total connections: $CONN"
# Record history for trend analysis
echo "$(date '+%Y-%m-%d %H:%M:%S') $OPS" >> /tmp/redis_qps_history.txt
echo ""
echo "QPS trend (last 10 entries):"
tail -10 /tmp/redis_qps_history.txt
8.3 Latency Monitoring
#!/bin/bash
# redis_latency_monitor.sh – basic latency benchmark and built‑in test
echo "=== Redis Latency Monitoring ==="
# Simple PING round trip (5 runs); each figure includes redis-cli start-up time, so treat it as an upper bound
for i in {1..5}; do
start=$(date +%s%N)
redis-cli PING > /dev/null
end=$(date +%s%N)
latency=$(( (end-start)/1000000 ))
echo " Test $i: ${latency} ms"
done
# Built‑in latency sampling; --latency-history never exits on its own, so bound it with timeout
echo ""
echo "Latency distribution:"
timeout 16 redis-cli --latency-history -i 5
9. Sentinel and Cluster Status
9.1 Redis Sentinel Monitoring
# Sentinel listens on its own port (26379 by default), not the Redis data port
# View Sentinel masters
redis-cli -p 26379 SENTINEL masters
# View a specific master
redis-cli -p 26379 SENTINEL master mymaster
# View slaves of a master
redis-cli -p 26379 SENTINEL slaves mymaster
# View Sentinel instances
redis-cli -p 26379 SENTINEL sentinels mymaster
9.2 Sentinel Monitoring Script
#!/bin/bash
# redis_sentinel_monitor.sh – check Sentinel state
echo "=== Redis Sentinel Monitoring ==="
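MASTER_NAME="mymaster"
SENTINEL_PORT=26379 # default Sentinel port; adjust to your deployment
MASTER=$(redis-cli -p "$SENTINEL_PORT" SENTINEL get-master-addr-by-name "$MASTER_NAME" | tr -d '\r' | paste -sd: -)
echo "Master address: $MASTER"
# SENTINEL master prints alternating field-name / value lines; the state lives in "flags"
MASTER_FLAGS=$(redis-cli -p "$SENTINEL_PORT" SENTINEL master "$MASTER_NAME" | tr -d '\r' | awk '$0 == "flags" { getline; print; exit }')
echo "Master flags: $MASTER_FLAGS"
# List slaves
echo ""
echo "Slaves:"
redis-cli -p "$SENTINEL_PORT" SENTINEL slaves "$MASTER_NAME" | while read -r line; do echo " $line"; done
# Subjective (s_down) and objective (o_down) down states are reported in the flags
case "$MASTER_FLAGS" in
*o_down*) echo "⚠️ Master is objectively down (o_down)" ;;
*s_down*) echo "⚠️ Master is subjectively down (s_down)" ;;
*) echo "Master is not flagged down" ;;
esac
9.3 Redis Cluster Status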
# Cluster info
redis-cli cluster info
# Sample output
# cluster_state:ok
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_known_nodes:6
9.4 Cluster Monitoring Script
#!/bin/bash
# redis_cluster_monitor.sh – monitor cluster health
echo "=== Redis Cluster Monitoring ==="
INFO=$(redis-cli cluster info)
STATE=$(echo "$INFO" | grep '^cluster_state:' | cut -d: -f2 | tr -d '\r')
SLOTS=$(echo "$INFO" | grep '^cluster_slots_assigned:' | cut -d: -f2 | tr -d '\r')
SLOTS_OK=$(echo "$INFO" | grep '^cluster_slots_ok:' | cut -d: -f2 | tr -d '\r')
NODES=$(echo "$INFO" | grep '^cluster_known_nodes:' | cut -d: -f2 | tr -d '\r')
echo "Cluster state: $STATE"
echo "Assigned slots: $SLOTS"
echo "Healthy slots: $SLOTS_OK"
echo "Node count: $NODES"
if [ "$SLOTS" != "16384" ]; then
echo "⚠️ Slot assignment incomplete!"
fi
if [ "$SLOTS" != "$SLOTS_OK" ]; then
echo "⚠️ Faulty slots detected!"
fi
# Detailed node info
echo ""
echo "Node details:"
redis-cli cluster nodes | while read -r line; do echo " $line"; done
10. Monitoring Tool Comparison
10.1 redis-cli Built‑in Tools
MONITOR – real‑time command stream
INFO – runtime statistics
--bigkeys – scan for large keys
--latency / --latency-history – latency testing
SLOWLOG GET – slow query log
MEMORY STATS / MEMORY USAGE – memory analysis
10.2 Prometheus + Grafana
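# prometheus.yml: scrape redis_exporter (default port 9121), not Redis itself;
# Redis does not serve /metrics on port 6379
scrape_configs:
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9121']
10.3 Grafana Dashboards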
Redis Dashboard (ID: 763)
Redis / Prometheus (ID: 14091)
10.4 Tool Comparison Summary
redis-cli – no installation, full feature set, but no long‑term storage.
Prometheus – time‑series storage and alerting, requires exporter and server components.
Grafana – rich visualizations, depends on a data source such as Prometheus.
redis_exporter – standard metric collector for Prometheus, adds a small resource overhead.
11. Alert Threshold Reference (Prometheus Rules)
11.1 Memory Alerts
groups:
  - name: redis
    rules:
      # Warning when memory usage > 80% (skipped when maxmemory is unset and reported as 0)
      - alert: RedisMemoryUsageHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"
          description: "Instance {{ $labels.instance }} memory usage {{ $value | humanizePercentage }}"
      # Critical when memory usage > 90%
      - alert: RedisMemoryUsageCritical
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9 and redis_memory_max_bytes > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis memory usage critical"
          description: "Instance {{ $labels.instance }} memory almost exhausted!"
      # Fragmentation ratio > 1.5
      - alert: RedisHighFragmentation
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 10m
        labels:
          severity: warning
11.2 Connection Alerts
      - alert: RedisHighConnections
        expr: redis_connected_clients / redis_config_maxclients > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connections high"
          description: "Instance {{ $labels.instance }} connection usage {{ $value | humanizePercentage }}"
11.3 Replication Alerts
      # Meaningful on master instances only; replicas and standalone nodes normally report 0
      - alert: RedisReplicationDown
        expr: redis_connected_slaves < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis replication down"
          description: "No replica connected for instance {{ $labels.instance }}"
11.4 Performance Alerts
      - alert: RedisHighEviction
        expr: rate(redis_evicted_keys_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
      # Connections refused because maxclients was reached
      - alert: RedisConnectionsRejected
        expr: increase(redis_rejected_connections_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
12. Dashboard Design Checklist
12.1 Core Critical Metrics
Memory usage (used_memory / maxmemory)
Client connections (connected_clients)
Replication link status (master_link_status)
Persistence status (rdb_last_bgsave_status, aof_last_write_status)
12.2 Important Metrics
QPS (instantaneous_ops_per_sec)
Replication lag (master_repl_offset minus the replica's offset)
Fragmentation ratio (mem_fragmentation_ratio)
Evicted keys count
Command latency (commandstats)
12.3 Optional Metrics
Lua engine memory (used_memory_lua)
Tracking clients (tracking_clients)
Blocked clients (blocked_clients)
12.4 Monitoring Checklist
【Infrastructure】
- Process alive (redis-cli PING)
- Restart count
- Uptime
【Memory】
- Usage ratio
- Fragmentation
- Eviction count
- maxmemory config
【Connections】
- Client count
- Blocked clients
- Max client limit
【Persistence】
- RDB last save status
- AOF last write status
- BGSAVE progress
【Replication】
- Master‑slave link status
- Replication lag
- Slave count
【Performance】
- QPS
- Command latency
- Slow query count
12.5 Alert Handling Workflow
【Alert Trigger】
↓
【Validate Alert】
↓
【Initial Triage】
- Memory → check growth, eviction policy
- Connections → check leaks, idle connections
- Replication → check network, slave status
- Performance → check slow queries, big keys
↓
【Quick Fix】
- Scale memory
- Clean idle connections
- Restart replication
↓
【Root Cause Analysis】
- Identify data growth source
- Review application logic
- Tune configuration
↓
【Long‑Term Measures】
- Adjust thresholds
- Refine capacity planning
- Improve alert rules
12.6 Quick Command Reference
Memory – INFO memory (grep used_memory)
Connections – INFO clients (grep connected_clients)
Persistence – INFO persistence (grep rdb/aof)
Replication – INFO replication (grep role)
Keyspace – INFO keyspace (grep db0)
Command stats – INFO commandstats (cmdstat_*)
Performance – INFO stats (instantaneous_ops_per_sec)
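The reference above can be collapsed into a single pass over the full `INFO` output. The following is a rough sketch rather than a finished tool: the function name `redis_snapshot`, the chosen field list, and the `n/a` placeholder for fields the server does not report are all assumptions made for illustration.

```shell
#!/bin/bash
# redis_snapshot: read full `redis-cli INFO` output on stdin and print the
# fields named in the quick reference, one "name=value" pair per line.
redis_snapshot() {
    tr -d '\r' | awk -F: '
        BEGIN {
            n = split("used_memory connected_clients rdb_last_bgsave_status aof_enabled role db0 instantaneous_ops_per_sec", want, " ")
        }
        # Store everything after the first colon, so db0:keys=...,expires=... stays intact.
        { val[$1] = substr($0, index($0, ":") + 1) }
        END {
            for (i = 1; i <= n; i++)
                printf "%s=%s\n", want[i], ((want[i] in val) ? val[want[i]] : "n/a")
        }'
}
```

Calling `redis-cli INFO | redis_snapshot` prints seven lines that can be logged or compared between runs.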
MaGe Linux Operations
