How to Build a Fine‑Grained Redis Monitoring System with Elastic‑Stack
This article explains how a large car‑IoT platform built a comprehensive Redis monitoring solution using Elastic‑Stack, covering server‑side metrics, application‑side instrumentation, joint analysis, implementation details, findings, and future optimization recommendations.
Background
Redis is widely used in both business applications and big‑data scenarios, but it can become fragile if not managed properly. Historically, deployments evolved from memcached to various Redis modes (single instance, master‑slave, sentinel, proxy, cluster), yet many companies still lack a complete view of Redis health.
Problem Statement
The company operates a car‑IoT service with millions of users, using a Redis‑Cluster architecture shared by dozens of backend applications (over 200 instances). As usage grew, the following issues appeared:
Cluster nodes crashing
Nodes becoming unresponsive (zombie state)
Some applications experiencing very slow responses
The root cause was traced to gaps in architecture and operations monitoring: the standard Redis INFO command exposes only coarse metrics and cannot reveal detailed usage patterns.
Monitoring Goals
The monitoring system should answer questions such as:
What is the hotness distribution of Redis keys?
Which applications consume the most memory?
Which applications generate the highest request volume?
Are any applications using Redis data types incorrectly?
How is Redis usage distributed across modules?
Where are the hotspot problems in the cluster?
Solution Overview
The team chose the Elastic‑Stack (Elasticsearch, Logstash, Kibana) combined with Metricbeat to collect both server‑side and application‑side data, enabling joint analysis.
Server‑Side Monitoring
Metricbeat gathers OS‑level metrics (CPU, memory, network I/O, disk I/O, process info). For Redis‑specific metrics, a custom agent periodically runs INFO on each cluster node, parses the output, computes simple statistics, and writes JSON files that Logstash ingests into Elasticsearch.
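The parsing step of such an agent can be sketched as follows. This is a minimal illustration, not the team's actual agent: the function names parse_info and to_json_line, the document layout, and the sample node address are assumptions. The INFO field names in the sample are standard Redis fields.

```python
import json


def parse_info(raw: str) -> dict:
    """Parse the text returned by the Redis INFO command into a flat dict."""
    metrics = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # Keep integers and floats as numbers; leave everything else as text.
        try:
            metrics[key] = int(value)
        except ValueError:
            try:
                metrics[key] = float(value)
            except ValueError:
                metrics[key] = value
    return metrics


def to_json_line(node: str, collected_at: float, metrics: dict) -> str:
    """Build one JSON document per node and sample (hypothetical layout)."""
    doc = {"node": node, "collected_at": collected_at, **metrics}
    return json.dumps(doc, sort_keys=True)


sample = (
    "# Stats\r\n"
    "total_commands_processed:1200\r\n"
    "# Memory\r\n"
    "used_memory:1048576\r\n"
    "mem_fragmentation_ratio:1.25\r\n"
    "# Replication\r\n"
    "role:master\r\n"
)
print(to_json_line("10.0.0.1:7000", 1700000000.0, parse_info(sample)))
```

Writing one JSON object per line keeps the files easy for a Logstash file input with a JSON codec to consume, although the article does not say which input the team used.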
Limitations of Metricbeat’s built‑in Redis module include:
Inability to represent master‑slave relationships in a cluster
Only cumulative counters are provided (e.g., total command count), making peak detection difficult
Dynamic cluster changes (node addition/removal) are not captured
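The cumulative-counter limitation is the kind of thing the "simple statistics" step of a custom agent can address. The sketch below is one way to turn successive samples of a cumulative counter such as total_commands_processed into per-interval rates so that peaks become visible. The function names and the sample values are hypothetical.

```python
def commands_per_second(prev, curr):
    """Rate between two (timestamp_seconds, cumulative_count) samples.

    Returns None when the rate cannot be computed, for example when the
    counter went backwards because the node restarted or its stats were reset.
    """
    (t0, c0), (t1, c1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0 or c1 < c0:
        return None
    return (c1 - c0) / elapsed


def peak_rate(samples):
    """Highest per-interval rate across consecutive samples, or None."""
    rates = [
        rate
        for rate in (commands_per_second(a, b) for a, b in zip(samples, samples[1:]))
        if rate is not None
    ]
    return max(rates) if rates else None


# Hypothetical samples taken every 5 seconds; the last one follows a restart.
samples = [(0, 1000), (5, 6000), (10, 56000), (15, 100)]
print(peak_rate(samples))
```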
Server‑side Redis logs are collected with Filebeat using its Redis module.
Application‑Side Monitoring
Instrumentation was added to the Java Jedis client:
Modified Connection.java to record command start time, end time, duration, key size, etc.
Modified JedisClusterCommand.java to capture the full key name.
Each Redis command now logs the following fields:
r_host – server address and port
r_cmd – command type (GET, SET, HGET, …)
r_start – timestamp when the command started
r_cost – execution time
r_size – size of the key/value being set or retrieved
r_key – key name
r_keys – hierarchical key components (e.g., app_module_variable)
A custom Logback layout adds the application host IP and name (app_ip, app_host) to each log entry. Logstash then forwards these logs to Elasticsearch.
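The shape of one such log entry can be sketched as below. The team's instrumentation lives in the Java Jedis client. This Python version only illustrates the record layout. The helper names, the underscore separator, and the millisecond unit for r_start and r_cost are assumptions, since the article does not state the units.

```python
import json


def split_key(key: str, sep: str = "_") -> list:
    """Split a key such as app_module_variable into its components."""
    return key.split(sep)


def build_record(host, cmd, key, size, start_ms, end_ms, app_ip, app_host):
    """Assemble one per-command log record (timestamps assumed to be in ms)."""
    return {
        "r_host": host,
        "r_cmd": cmd,
        "r_start": start_ms,
        "r_cost": end_ms - start_ms,
        "r_size": size,
        "r_key": key,
        "r_keys": split_key(key),
        "app_ip": app_ip,
        "app_host": app_host,
    }


record = build_record(
    host="10.0.0.1:7000",
    cmd="GET",
    key="app_module_variable",
    size=42,
    start_ms=1000,
    end_ms=1003,
    app_ip="10.0.1.15",      # hypothetical application host
    app_host="order-svc-1",  # hypothetical application name
)
print(json.dumps(record))
```

Storing the key components as a list lets a dashboard aggregate by application or module without parsing key names at query time.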
Joint Analysis
By correlating server‑side metrics with application‑side behavior, the team can identify scenarios such as a large key causing CPU spikes or a misused data type leading to performance degradation.
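The correlation logic can be sketched in plain Python. In practice this kind of join would more likely be done with Elasticsearch queries and Kibana dashboards. The function name, the thresholds, the time window, and the shape of the CPU samples are all assumptions made for illustration.

```python
from collections import defaultdict


def suspects_for_spikes(cpu_samples, commands, cpu_threshold=80.0,
                        size_threshold=1_000_000, window_ms=5_000):
    """For each high-CPU sample, list large-value keys seen on that node nearby in time.

    cpu_samples: dicts with "node", "ts" (epoch ms), "cpu" (percent).
    commands: application-side records with r_host, r_start (epoch ms), r_size, r_key.
    """
    large_by_host = defaultdict(list)
    for cmd in commands:
        if cmd["r_size"] >= size_threshold:
            large_by_host[cmd["r_host"]].append(cmd)

    findings = []
    for sample in cpu_samples:
        if sample["cpu"] < cpu_threshold:
            continue
        nearby = [
            cmd for cmd in large_by_host.get(sample["node"], [])
            if abs(cmd["r_start"] - sample["ts"]) <= window_ms
        ]
        if nearby:
            keys = sorted({cmd["r_key"] for cmd in nearby})
            findings.append((sample["node"], sample["ts"], keys))
    return findings


cpu = [
    {"node": "10.0.0.1:7000", "ts": 10_000, "cpu": 95.0},
    {"node": "10.0.0.2:7000", "ts": 10_000, "cpu": 20.0},
]
cmds = [
    {"r_host": "10.0.0.1:7000", "r_start": 9_000, "r_size": 2_000_000, "r_key": "app_cache_blob"},
    {"r_host": "10.0.0.1:7000", "r_start": 9_500, "r_size": 100, "r_key": "app_user_name"},
    {"r_host": "10.0.0.1:7000", "r_start": 60_000, "r_size": 3_000_000, "r_key": "app_report_dump"},
]
print(suspects_for_spikes(cpu, cmds))
```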
Implementation Details
The original article presents the key configuration as images: the Metricbeat YAML, the Logstash pipelines, and the Logback XML layout. Its architecture diagrams show the data flowing from the Redis nodes and application instances through Metricbeat and Filebeat into Logstash and Elasticsearch, and finally to Kibana dashboards.
Findings After Two Weeks of Monitoring
Some keys exceeded 1 MB, causing long latency and potential blocking.
Several applications were using Redis as a primary database.
Lists were being used as message queues, with hundreds of thousands of items per operation.
Certain applications generated more than half of the total cluster traffic.
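Findings like these can be derived from the application-side records with simple aggregations. The sketch below assumes records carrying the app_host, r_key, and r_size fields described earlier. The function names, the 1 MB limit expressed in bytes, and the sample data are illustrative.

```python
from collections import Counter


def traffic_share(records):
    """Fraction of all logged commands issued by each application."""
    counts = Counter(r["app_host"] for r in records)
    total = sum(counts.values())
    return {app: n / total for app, n in counts.items()} if total else {}


def oversized_keys(records, limit_bytes=1024 * 1024):
    """Distinct keys whose logged size exceeds the limit (r_size assumed in bytes)."""
    return sorted({r["r_key"] for r in records if r["r_size"] > limit_bytes})


records = [
    {"app_host": "order", "r_key": "order_cart_1", "r_size": 512},
    {"app_host": "order", "r_key": "order_cart_2", "r_size": 2 * 1024 * 1024},
    {"app_host": "order", "r_key": "order_cart_3", "r_size": 128},
    {"app_host": "billing", "r_key": "billing_invoice_9", "r_size": 64},
]
print(traffic_share(records))
print(oversized_keys(records))
```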
Future Plans
Eliminate misuse of Redis data structures on the application side.
Partition the cluster into dedicated sub‑clusters for heavy‑use applications.
Require architectural review for any new Redis‑dependent modules.
Conclusion
The monitoring system acted as the “eyes” of the architecture team, enabling early detection and resolution of performance problems. Proper monitoring is essential for mastering Redis at scale, especially for architecture and operations teams.
Q&A
Q1: How many Redis instances should run on a single machine?
A: Base the number on CPU cores (each Redis instance executes commands on a single thread), available memory (leave headroom for the child process that an instance forks for persistence), and network bandwidth.
Q2: How to handle large hash keys?
A: Split large hashes into smaller strings; avoid expiring whole hashes when only parts need expiration.
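One reading of this advice is to store each hash field under its own string key, so that each field can carry its own TTL. The sketch below only plans the resulting SET commands and sends nothing to Redis. The function name and the "hash_key:field" naming scheme are assumptions.

```python
def split_hash(hash_key, fields, ttl_seconds=None):
    """Plan one SET command per hash field, with an optional per-field TTL."""
    ttl_seconds = ttl_seconds or {}
    commands = []
    for field, value in fields.items():
        command = ["SET", f"{hash_key}:{field}", value]
        if field in ttl_seconds:
            # EX sets the expiry in seconds for this key only.
            command += ["EX", str(ttl_seconds[field])]
        commands.append(command)
    return commands


plan = split_hash(
    "user:42",
    {"profile": "p", "token": "t"},
    ttl_seconds={"token": 3600},
)
for command in plan:
    print(" ".join(command))
```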
Q3: Does logging large keys affect QPS?
A: Only the key name and size are logged, not the value. If needed, Logback can be configured to write asynchronously, or the logs can be forwarded to Kafka.
Q4: How to build a slow‑query report for MongoDB with Elasticsearch?
A: Use the same Elastic‑Stack approach: ingest both metrics and logs into Elasticsearch and build dashboards.
Q5: How to avoid blocking when running INFO ALL frequently?
A: Schedule the collection at a reasonable interval (e.g., every 5 seconds) and capture the snapshot via a custom client wrapper.
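A fixed-interval collection loop can be sketched as follows. The fetch_info and handle callables stand in for the custom client wrapper and the downstream writer, which the article does not show. Their names, and the injectable sleep and clock parameters, are assumptions.

```python
import time


def poll(fetch_info, handle, interval_s=5.0, iterations=None,
         sleep=time.sleep, clock=time.monotonic):
    """Call fetch_info() once per interval and pass each snapshot to handle().

    iterations=None runs forever; a number bounds the loop, which is useful in tests.
    """
    done = 0
    while iterations is None or done < iterations:
        started = clock()
        handle(fetch_info())
        done += 1
        # Sleep only for the rest of the interval so slow calls do not add drift.
        remaining = interval_s - (clock() - started)
        if remaining > 0:
            sleep(remaining)


# Usage with stand-ins: a fake INFO source and a list that collects snapshots.
snapshots = []
poll(lambda: "used_memory:1024", snapshots.append,
     interval_s=5.0, iterations=2, sleep=lambda seconds: None)
print(len(snapshots))
```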
Q6: How to instrument Jedis on the client side?
A: Modify the two core classes, Connection.java and JedisClusterCommand.java, as described in the Application‑Side Monitoring section; the original article shows the code in its figures.
Q7: Should the monitoring stack run inside Kubernetes?
A: The server‑side components (Metricbeat, Filebeat) can run in Docker/Kubernetes, but the client‑side instrumentation remains in the application process.
Q8: How many Elasticsearch nodes are needed and should SSD be used?
A: A minimal production cluster starts with three nodes; SSDs are optional for hot data, while cold data can reside on HDD.
Q9: Which monitoring tool is best for limited resources: Elasticsearch, Prometheus, or Zabbix?
A: The Elastic Stack handles both logs and metrics, so it is generally more versatile than metrics‑focused tools such as Prometheus or Zabbix.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
