Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions
This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.
Problem scenario: a company built its own infrastructure with a single Prometheus node monitoring hosts, operating systems, databases, and Kubernetes clusters. The node frequently runs out of memory and crashes, even after its memory was doubled, and restarts take a long time.
Cause analysis: Prometheus keeps the most recent samples in memory for the current two-hour window (the head block) and simultaneously appends them to a write-ahead log (WAL) on disk for crash recovery. With a large number of targets, the samples for that two-hour window outgrow the available memory, and the native TSDB is not designed for massive data volumes.
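To confirm that head-block growth is the culprit, Prometheus's own self-instrumentation metrics can be queried directly on the affected instance (these metric names are part of the standard TSDB instrumentation; the `job` label value is an assumption about the local scrape config):

```promql
# Number of active series currently held in the in-memory head block
prometheus_tsdb_head_series

# Resident memory of the Prometheus process itself
process_resident_memory_bytes{job="prometheus"}
```

If `prometheus_tsdb_head_series` keeps climbing with every added target, memory usage will grow roughly in proportion, regardless of how much RAM is provisioned.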
Example directory layout of a Prometheus data directory:
$ ls -l ./data/
total 20
drwxr-xr-x. 3 root root 68 Jan 7 13:33 01FRSGF2NBJ21HZW1QM2594CB9 # block directory
drwxr-xr-x. 3 root root 68 Jan 7 15:00 01FRSNCMS0Q1Y9Q9KKT6TTXH3C # block directory
drwxr-xr-x. 2 root root 34 Jan 7 15:00 chunks_head
-rw-r--. 1 root root 0 Jan 7 10:33 lock
-rw-r--. 1 root root 20001 Jan 7 15:19 queries.active
drwxr-xr-x. 2 root root 54 Jan 7 15:00 wal # write-ahead log directory

To see how quickly samples are being appended to the head block, query Prometheus's own instrumentation:

rate(prometheus_tsdb_head_samples_appended_total[5m])

Solution: Reduce the number of scraped metrics and shorten the retention time, or, better, adopt a Prometheus federation architecture that distributes targets across multiple instances and aggregates results at a central node. Remote Write/Remote Read can offload historical data to an external storage system, relieving pressure on the native TSDB.
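A straightforward way to cut sample volume is to drop unneeded or high-cardinality metrics at scrape time with `metric_relabel_configs`. A minimal sketch, assuming a standard node-exporter job; the target address and the metric name pattern below are illustrative, not taken from the original setup:

```yaml
# prometheus.yml (fragment): discard unwanted series before they reach the TSDB
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']   # hypothetical target
    metric_relabel_configs:
      # Drop per-interrupt and scheduler-detail metrics, which are rarely queried
      - source_labels: [__name__]
        regex: 'node_(interrupts|schedstat)_.*'
        action: drop
```

Retention is tuned separately via a startup flag, for example `--storage.tsdb.retention.time=15d`, which bounds how long blocks are kept on disk.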
The proposed architecture separates collection from storage, uses remote storage for long-term data, and keeps the federation hierarchy shallow (no more than three layers), with alerting rules evaluated on the leaf nodes closest to the data.
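In this layout, the central node scrapes aggregated series from the leaf instances via the `/federate` endpoint and forwards data to external long-term storage via `remote_write`. A minimal sketch of the central node's configuration; the leaf hostnames and the remote-storage URL are placeholders:

```yaml
# prometheus.yml on the central (global) node
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only leaf self-metrics and pre-aggregated recording-rule series
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:                      # leaf Prometheus instances (placeholders)          
        - 'prom-leaf-1:9090'
        - 'prom-leaf-2:9090'

# Offload historical data to an external TSDB (placeholder endpoint)
remote_write:
  - url: 'http://remote-storage.example.com/api/v1/write'
```

Restricting `match[]` to aggregated series is what keeps the central node's own head block small; federating raw series would simply recreate the original memory problem one layer up.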
While the federation approach solves the immediate memory and TSDB bottlenecks, it introduces network complexity and potential data latency between leaf and central nodes.