
Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions

This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.

DevOps Operations Practice

Problem scenario: a company runs its own infrastructure with a single Prometheus node monitoring hosts, operating systems, databases, and Kubernetes clusters. The node frequently runs out of memory and crashes, and restarts take a long time, even after its memory was doubled.

Cause analysis: Prometheus keeps the most recent samples in memory in the head block, which covers the current two‑hour window, while appending every sample to a write‑ahead log (WAL) on disk for crash recovery. With a large number of targets, the sample ingestion rate makes the head block outgrow the available memory, and after every crash or restart Prometheus must replay the entire WAL before serving queries, which explains the long reload times. The native TSDB was simply not designed for this data volume on a single node.

Example directory layout of a Prometheus data directory:

$ ls -l ./data/
total 20
drwxr-xr-x. 3 root root    68 Jan  7 13:33 01FRSGF2NBJ21HZW1QM2594CB9   # block directory
drwxr-xr-x. 3 root root    68 Jan  7 15:00 01FRSNCMS0Q1Y9Q9KKT6TTXH3C   # block directory
drwxr-xr-x. 2 root root    34 Jan  7 15:00 chunks_head
-rw-r--r--. 1 root root     0 Jan  7 10:33 lock
-rw-r--r--. 1 root root 20001 Jan  7 15:19 queries.active
drwxr-xr-x. 2 root root    54 Jan  7 15:00 wal                         # write-ahead log directory

PromQL query to see the per-second rate at which samples are appended to the head block:

rate(prometheus_tsdb_head_samples_appended_total[5m])
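Two related self-monitoring series help quantify the memory pressure directly; both are standard Prometheus metrics (the `job="prometheus"` selector assumes the default self-scrape job name):

```promql
# Number of active series currently held in the head block;
# this is the main driver of Prometheus memory usage
prometheus_tsdb_head_series

# Resident memory of the Prometheus process itself,
# useful for correlating series growth with actual RAM consumption
process_resident_memory_bytes{job="prometheus"}
```

If the active-series count keeps climbing while memory grows in step, the instance is ingesting more than a single node can hold for the two-hour window.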

Solution: reduce the number of scraped metrics and shorten the retention period, or, better, adopt a Prometheus federation architecture that distributes scrape targets across multiple instances and aggregates results at a central node. Remote write/remote read can offload historical data to an external storage system, relieving pressure on the native TSDB.
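The first two measures can be sketched in configuration. The snippet below is a minimal example: the job names, targets, and regex are illustrative, while the `metric_relabel_configs` mechanism, the `/federate` endpoint, and the `--storage.tsdb.retention.time` flag are standard Prometheus features:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  # 1. Drop unused high-cardinality metrics at scrape time
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_systemd_unit_state|node_scrape_collector_.*'   # example regex
        action: drop

  # 2. On the central node: federate pre-aggregated series from leaf instances
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
        - '{__name__=~"job:.*"}'     # recording-rule aggregates only
    static_configs:
      - targets:
          - 'prometheus-leaf-1:9090'
          - 'prometheus-leaf-2:9090'
```

Retention is set via a startup flag rather than the config file, for example `prometheus --storage.tsdb.retention.time=7d`. Pulling only recording-rule aggregates (`job:.*`) through `/federate` keeps the central node's series count far below the sum of its leaves.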

The proposed architecture splits collection and storage, uses remote storage for long‑term data, and limits the federation hierarchy to three layers. It also recommends placing alerting rules on leaf nodes and keeping the overall hierarchy shallow.
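Offloading long-term data uses the standard `remote_write`/`remote_read` configuration blocks; the endpoint URLs below are placeholders for whatever external storage system (Thanos, VictoriaMetrics, Cortex, etc.) is chosen:

```yaml
# prometheus.yml (excerpt) — long-term storage offload
remote_write:
  - url: 'http://remote-storage:9201/write'   # placeholder endpoint
    queue_config:
      capacity: 10000              # samples buffered per shard
      max_samples_per_send: 500    # batch size per request

remote_read:
  - url: 'http://remote-storage:9201/read'    # placeholder endpoint
```

With this in place, local retention can be kept short, since queries that reach past the local window are transparently served via remote read.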

While the federation approach solves the immediate memory and TSDB bottlenecks, it introduces network complexity and potential data latency between leaf and central nodes.

For a deeper dive, the author points to a paid tutorial series on mastering Prometheus monitoring.

Tags: Monitoring, Performance, Prometheus, Scaling, Federation
Written by

DevOps Operations Practice

We share professional insights on cloud-native, DevOps & operations, Kubernetes, observability & monitoring, and Linux systems.
