
How to Prevent ZooKeeper Disk Exhaustion: Snapshots, Logs, and Tuning Tips

This article explains why ZooKeeper can run out of disk space due to excessive snapshots and transaction logs, describes the underlying file‑generation mechanism, and provides concrete configuration parameters and best‑practice recommendations to control file growth and keep the cluster stable.

Alibaba Cloud Native

Background

High transaction rates or missing cleanup policies cause ZooKeeper to accumulate many snapshot and transaction‑log files, eventually filling the disk, crashing the server, and making the node unavailable.

Data File Generation in ZooKeeper

ZooKeeper keeps all znode data in an in-memory map. Periodically the map is serialized to a snapshot file, and every subsequent change is appended to a separate transaction log. On restart, the server loads the latest snapshot and replays the transaction log on top of it to restore its state.
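The recovery path can be illustrated with a toy model. The names below (RecoverySketch, Txn, Snapshot, restore) are hypothetical and are not ZooKeeper's classes; the sketch only assumes that every transaction carries an increasing id (a zxid) and that a snapshot records the id of the last transaction it already reflects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of snapshot-plus-log recovery; these are not ZooKeeper's classes. */
class RecoverySketch {
    /** One logged write: set path to value, or delete path when value is null. */
    static final class Txn {
        final long zxid;
        final String path;
        final String value;

        Txn(long zxid, String path, String value) {
            this.zxid = zxid;
            this.path = path;
            this.value = value;
        }
    }

    /** A point-in-time copy of the map plus the last zxid it includes. */
    static final class Snapshot {
        final long lastZxid;
        final Map<String, String> data;

        Snapshot(long lastZxid, Map<String, String> data) {
            this.lastZxid = lastZxid;
            this.data = data;
        }
    }

    /** Rebuilds state: load the snapshot, then replay only the newer transactions. */
    static Map<String, String> restore(Snapshot snap, List<Txn> log) {
        Map<String, String> state = new HashMap<>(snap.data);
        for (Txn txn : log) {
            if (txn.zxid <= snap.lastZxid) {
                continue; // already reflected in the snapshot
            }
            if (txn.value == null) {
                state.remove(txn.path);
            } else {
                state.put(txn.path, txn.value);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Snapshot snap = new Snapshot(2, Map.of("/a", "1", "/b", "2"));
        List<Txn> log = List.of(
            new Txn(2, "/b", "2"),   // covered by the snapshot, skipped
            new Txn(3, "/a", "10"),  // update written after the snapshot
            new Txn(4, "/b", null),  // delete written after the snapshot
            new Txn(5, "/c", "3"));  // create written after the snapshot
        System.out.println(restore(snap, log));
    }
}
```

The longer the log that follows the latest snapshot, the more work the replay loop has to do, which is why snapshot frequency and restart time are linked.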

File Types and Disk Impact

The two disk‑consuming file types are:

Snapshot files – directory configured by dataDir.

Transaction‑log files – directory configured by dataLogDir.

Frequent writes or large data volumes cause these files to grow rapidly.
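To see which file type is consuming space, the two directories can be measured separately. The following is a minimal sketch, assuming the usual file naming in which snapshot files start with "snapshot." and transaction logs start with "log."; the class name DiskUsageSketch and the default paths are hypothetical placeholders for your own dataDir and dataLogDir.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Sums file sizes per type under a ZooKeeper data directory (illustrative only). */
class DiskUsageSketch {
    /** Total bytes of regular files under dir whose name starts with prefix. */
    static long bytesUsed(Path dir, String prefix) throws IOException {
        if (!Files.isDirectory(dir)) {
            return 0L; // nothing to measure if the directory does not exist
        }
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith(prefix))
                .mapToLong(p -> p.toFile().length())
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical locations; pass your own dataDir and dataLogDir as arguments.
        Path dataDir = Path.of(args.length > 0 ? args[0] : "/var/lib/zookeeper/data");
        Path dataLogDir = Path.of(args.length > 1 ? args[1] : "/var/lib/zookeeper/datalog");
        System.out.printf("snapshots: %d bytes%n", bytesUsed(dataDir, "snapshot."));
        System.out.printf("txn logs:  %d bytes%n", bytesUsed(dataLogDir, "log."));
    }
}
```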

[Figure: ZooKeeper snapshot and transaction log architecture]

Automatic Cleanup Configuration

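autopurge.snapRetainCount – number of most recent snapshots (and their corresponding transaction logs) to retain; defaults to 3, which is also the minimum (new in ZooKeeper 3.4.0).

autopurge.purgeInterval – cleanup interval in hours; set it to a positive integer (minimum 1) to enable purging. The default of 0 disables automatic cleanup.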

When both parameters are set, ZooKeeper periodically deletes older snapshots and logs, reducing disk pressure.
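The retention rule can be sketched as follows. This is a simplified model, not ZooKeeper's purge implementation: the class PurgeSketch and its methods are hypothetical, files are represented only by the zxid in their name, and the sketch assumes a log file is named after the first transaction it contains.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Simplified model of "keep the N newest snapshots plus the logs they need". */
class PurgeSketch {
    /** Snapshot zxids to delete, keeping the retainCount newest ones. */
    static List<Long> snapshotsToDelete(List<Long> snapshotZxids, int retainCount) {
        List<Long> sorted = new ArrayList<>(snapshotZxids);
        Collections.sort(sorted);
        int cut = Math.max(0, sorted.size() - retainCount);
        return new ArrayList<>(sorted.subList(0, cut));
    }

    /** Log start zxids to delete, given the oldest snapshot that is kept. */
    static List<Long> logsToDelete(List<Long> logStartZxids, long oldestRetainedSnapshot) {
        List<Long> sorted = new ArrayList<>(logStartZxids);
        Collections.sort(sorted);
        // The newest log that starts at or before the snapshot may still hold
        // transactions written after it, so that log and all later ones are kept.
        int firstKept = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i) <= oldestRetainedSnapshot) {
                firstKept = i;
            }
        }
        return new ArrayList<>(sorted.subList(0, firstKept));
    }

    public static void main(String[] args) {
        List<Long> snapshots = List.of(100L, 200L, 300L, 400L, 500L);
        List<Long> logs = List.of(1L, 101L, 201L, 301L, 401L);
        // With a retain count of 3, snapshots 300, 400 and 500 are kept.
        System.out.println(snapshotsToDelete(snapshots, 3)); // prints [100, 200]
        System.out.println(logsToDelete(logs, 300L));        // prints [1, 101]
    }
}
```

Note that nothing is removed between purge runs, so files written during one interval all stay on disk until the next run.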

When Snapshots Are Created

Node startup after loading existing data files.

Election of a new leader in the ensemble.

During normal operation when SyncRequestProcessor decides a snapshot is needed (see shouldSnapshot logic).

Snapshot‑Generation Logic

private boolean shouldSnapshot() {
    int logCount = zks.getZKDatabase().getTxnCount();
    long logSize = zks.getZKDatabase().getTxnSize();
    return (logCount > (snapCount / 2 + randRoll))
        || (snapSizeInBytes > 0 && logSize > (snapSizeInBytes / 2 + randSize));
}

The method returns true when either the number of transactions since the last snapshot (logCount) or the size of the current log (logSize) exceeds its threshold: half of the configured snapCount or snapSizeInBytes, plus a random offset. When that happens, a new snapshot is written, the current log is flushed, and a fresh log file is started.
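The effect of the random offset is easier to see in isolation. The sketch below copies only the count-based half of the condition; it assumes the offset randRoll is drawn uniformly from [0, snapCount / 2), which places the effective trigger point somewhere between snapCount / 2 + 1 and snapCount transactions. The class name SnapshotThresholdSketch and the helper firstTriggeringCount are hypothetical. A per-server offset of this kind keeps the members of an ensemble from all snapshotting at the same moment.

```java
import java.util.Random;

/** Standalone illustration of the count-based snapshot trigger. */
class SnapshotThresholdSketch {
    /** Same comparison as the count-based half of shouldSnapshot(). */
    static boolean shouldSnapshot(int logCount, int snapCount, int randRoll) {
        return logCount > (snapCount / 2 + randRoll);
    }

    /** Smallest transaction count that triggers a snapshot for a given offset. */
    static int firstTriggeringCount(int snapCount, int randRoll) {
        return snapCount / 2 + randRoll + 1;
    }

    public static void main(String[] args) {
        int snapCount = 100_000; // example value, not read from a real server
        Random random = new Random();
        for (int server = 1; server <= 3; server++) {
            // Assumption: the offset is uniform in [0, snapCount / 2).
            int randRoll = random.nextInt(snapCount / 2);
            System.out.printf("server %d snapshots after %d transactions%n",
                server, firstTriggeringCount(snapCount, randRoll));
        }
    }
}
```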

Configurable Snapshot Thresholds

snapCount – Java system property zookeeper.snapCount; number of transactions that cause a snapshot.

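snapSizeLimitInKb – Java system property zookeeper.snapSizeLimitInKb; size‑based trigger in kilobytes (the source of the snapSizeInBytes value used in shouldSnapshot). A non‑positive value disables the size check, as the snapSizeInBytes > 0 guard shows.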

Increasing these values reduces snapshot frequency but may lengthen restart time, because more of the transaction log must be replayed on top of the latest snapshot.
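Because both thresholds are Java system properties, they can be supplied as -D flags on the server's JVM command line, for example -Dzookeeper.snapCount=200000. The sketch below shows one way such properties could be read and converted; the class SnapshotSettingsSketch is hypothetical, and the fallback values (100,000 transactions and 4,194,304 KB) are assumed defaults that should be checked against the documentation for your ZooKeeper version.

```java
/** Reads the two snapshot thresholds from JVM system properties (illustrative). */
class SnapshotSettingsSketch {
    // Assumed defaults; verify against the admin guide for your version.
    static final int DEFAULT_SNAP_COUNT = 100_000;
    static final long DEFAULT_SNAP_SIZE_KB = 4_194_304L;

    /** Transaction-count threshold from zookeeper.snapCount. */
    static int snapCount() {
        return Integer.getInteger("zookeeper.snapCount", DEFAULT_SNAP_COUNT);
    }

    /** Size threshold from zookeeper.snapSizeLimitInKb, converted to bytes. */
    static long snapSizeInBytes() {
        return Long.getLong("zookeeper.snapSizeLimitInKb", DEFAULT_SNAP_SIZE_KB) * 1024L;
    }

    public static void main(String[] args) {
        System.out.println("snapCount = " + snapCount());
        System.out.println("snapSizeInBytes = " + snapSizeInBytes());
    }
}
```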

Best‑Practice Recommendations

Configure the four parameters together:

autopurge.snapRetainCount and autopurge.purgeInterval – schedule regular removal of old files.

snapCount and snapSizeLimitInKb – tune how often snapshots are taken.

Balance the settings: larger snapCount and snapSizeLimitInKb values mean fewer snapshots on disk but can increase recovery time. Even at its minimum of 1 hour, autopurge.purgeInterval may not keep up with extremely high‑write workloads, so monitor disk usage and adjust accordingly.
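A rough back-of-the-envelope estimate helps when choosing these values. The sketch below is an assumption-laden upper bound rather than a measurement: it considers only the count-based trigger, takes the earliest possible trigger point (snapCount / 2 + 1 transactions), and treats every snapshot as the same size. The class DiskEstimateSketch and the workload numbers in main are hypothetical.

```java
/** Rough upper bound on snapshot disk usage between purge runs (illustrative). */
class DiskEstimateSketch {
    /** Most snapshots that can be written during one purge interval. */
    static long maxSnapshotsPerPurgeInterval(long txnPerSecond, int purgeIntervalHours, int snapCount) {
        long txns = txnPerSecond * 3600L * purgeIntervalHours;
        long minTxnsPerSnapshot = snapCount / 2 + 1; // earliest possible trigger point
        return txns / minTxnsPerSnapshot;
    }

    /** Peak bytes: retained snapshots plus everything written since the last purge. */
    static long peakSnapshotBytes(long snapshotsPerInterval, int retainCount, long snapshotBytes) {
        return (retainCount + snapshotsPerInterval) * snapshotBytes;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 5,000 writes per second, 100 MiB per snapshot.
        long perInterval = maxSnapshotsPerPurgeInterval(5_000, 1, 100_000);
        long peak = peakSnapshotBytes(perInterval, 3, 100L * 1024 * 1024);
        System.out.println("snapshots per purge interval (upper bound): " + perInterval);
        System.out.println("peak snapshot bytes (upper bound): " + peak);
    }
}
```

If the estimate approaches the size of the volume, raising snapCount or moving snapshots and logs to larger, separate disks is worth considering before relying on the purge interval alone.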
