Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions
This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.
Problem scenario: a company built its own infrastructure with a single Prometheus node monitoring hosts, operating systems, databases, and Kubernetes clusters. The node frequently runs out of memory and crashes, even after its memory was doubled, and restarts take a long time.
Cause analysis: Prometheus keeps the most recent samples in memory for the current two-hour window (the head block) and simultaneously appends them to a write-ahead log (WAL) on disk for crash recovery. With a large number of targets, the samples for that two-hour window outgrow the available memory, and the native TSDB is not designed for massive data volumes.
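To confirm that head-block growth is the culprit, Prometheus's own self-instrumentation metrics can be queried directly on the affected instance (these metric names are part of the standard TSDB instrumentation; the `job` label value is an assumption about the local scrape config):

```promql
# Number of active series currently held in the in-memory head block
prometheus_tsdb_head_series

# Resident memory of the Prometheus process itself
process_resident_memory_bytes{job="prometheus"}
```

If `prometheus_tsdb_head_series` keeps climbing with every added target, memory usage will grow roughly in proportion, regardless of how much RAM is provisioned.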
Example directory layout of a Prometheus data directory:
$ ls -l ./data/
total 20
drwxr-xr-x. 3 root root 68 Jan 7 13:33 01FRSGF2NBJ21HZW1QM2594CB9 # block directory
drwxr-xr-x. 3 root root 68 Jan 7 15:00 01FRSNCMS0Q1Y9Q9KKT6TTXH3C # block directory
drwxr-xr-x. 2 root root 34 Jan 7 15:00 chunks_head
-rw-r--. 1 root root 0 Jan 7 10:33 lock
-rw-r--. 1 root root 20001 Jan 7 15:19 queries.active
drwxr-xr-x. 2 root root 54 Jan 7 15:00 wal # write-ahead log directory

To see how quickly samples are being appended to the head block, query Prometheus's own instrumentation:

rate(prometheus_tsdb_head_samples_appended_total[5m])

Solution: Reduce the number of scraped metrics and shorten the retention time, or, better, adopt a Prometheus federation architecture that distributes targets across multiple instances and aggregates results at a central node. Remote Write/Remote Read can offload historical data to an external storage system, relieving pressure on the native TSDB.
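A straightforward way to cut sample volume is to drop unneeded or high-cardinality metrics at scrape time with `metric_relabel_configs`. A minimal sketch, assuming a standard node-exporter job; the target address and the metric name pattern below are illustrative, not taken from the original setup:

```yaml
# prometheus.yml (fragment): discard unwanted series before they reach the TSDB
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']   # hypothetical target
    metric_relabel_configs:
      # Drop per-interrupt and scheduler-detail metrics, which are rarely queried
      - source_labels: [__name__]
        regex: 'node_(interrupts|schedstat)_.*'
        action: drop
```

Retention is tuned separately via a startup flag, for example `--storage.tsdb.retention.time=15d`, which bounds how long blocks are kept on disk.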
The proposed architecture separates collection from storage, uses remote storage for long-term data, and keeps the federation hierarchy shallow (no more than three layers), with alerting rules evaluated on the leaf nodes closest to the data.
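In this layout, the central node scrapes aggregated series from the leaf instances via the `/federate` endpoint and forwards data to external long-term storage via `remote_write`. A minimal sketch of the central node's configuration; the leaf hostnames and the remote-storage URL are placeholders:

```yaml
# prometheus.yml on the central (global) node
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only leaf self-metrics and pre-aggregated recording-rule series
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:                      # leaf Prometheus instances (placeholders)          
        - 'prom-leaf-1:9090'
        - 'prom-leaf-2:9090'

# Offload historical data to an external TSDB (placeholder endpoint)
remote_write:
  - url: 'http://remote-storage.example.com/api/v1/write'
```

Restricting `match[]` to aggregated series is what keeps the central node's own head block small; federating raw series would simply recreate the original memory problem one layer up.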
While the federation approach solves the immediate memory and TSDB bottlenecks, it introduces network complexity and potential data latency between leaf and central nodes.