
Troubleshooting TiKV Disk Space Issues: Causes, Diagnosis, and Solutions

This guide explains how to diagnose and fix TiKV disk‑space problems by identifying oversized log files, redundant space‑placeholder files, and excessive RocksDB/Titan data. It offers command‑line checks and configuration tweaks, such as enabling log rotation, disabling reserved space, and tuning GC and Titan's discardable‑ratio, to restore balanced storage.

This article, authored by Yuan Jianwei from the Vivo Internet Storage team, introduces the troubleshooting methodology and solutions for TiKV disk space problems.

Background

Vivo’s rapid business expansion led many services to adopt lightweight Redis clusters for KV storage. As data volume grew, some Redis clusters became cold but still stored large amounts of data, prompting a shift to a self‑developed TiKV‑based KV system that separates compute and storage and provides Redis‑compatible protocols.

The mixed‑deployment of TiKV instances on the same servers caused various issues, including unexpected disk space consumption, latency jitter, and unbalanced load.

Problem Overview

When a store's disk usage exceeds PD's low-space-ratio threshold, the scheduler marks the node as low on space, leading to uneven load distribution and potentially affecting the availability of all TiKV instances on that machine.

Typical contributors to high disk usage are:

Log files

Buffer (placeholder files)

Data (RocksDB and Titan)

The following sections examine each of these three dimensions in turn.

1. Log Files

Older TiKV versions may lack log rotation for RocksDB and raftdb logs, causing logs to accumulate indefinitely. Logs are usually located under the Data directory.

Example commands to quantify log usage:

# Check raftdb.info logs
du -sh -c raftdb.info.*

# Check rocksdb.info logs
du -sh -c rocksdb.info.*

Typical output shows several gigabytes of log data, e.g., a 6.5G total. If logs exceed tens of gigabytes, they become a primary cause of disk pressure.

Remediation suggestions include:

Upload logs to a centralized system (e.g., Graylog) before deletion.

Separate log disks from data disks.

Transfer logs to cheaper storage and compress before removal.

Enable log rotation in newer TiKV releases.
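The "compress before removal" step above can be sketched as a small helper. This is a hypothetical script, not part of TiKV; the function name and example paths are assumptions to adapt to your deployment layout.

```shell
# Sketch: compress rocksdb/raftdb info logs older than N days into an
# archive directory, then delete the originals. archive_logs is a
# hypothetical helper; the paths in the usage comment are examples.
archive_logs() {
  src=$1 dst=$2 days=$3
  mkdir -p "$dst"
  # -mtime +N selects files last modified strictly more than N days ago
  find "$src" -maxdepth 1 \( -name 'rocksdb.info.*' -o -name 'raftdb.info.*' \) \
       -mtime +"$days" -print0 |
    while IFS= read -r -d '' f; do
      # Only remove the original once the compressed copy is written
      gzip -c "$f" > "$dst/$(basename "$f").gz" && rm -f "$f"
    done
}

# e.g. archive_logs /data/tikv/data /backup/tikv-logs 7
```

Run it from cron (or a systemd timer) until you can upgrade to a TiKV release with built-in log rotation.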

2. Placeholder Files

TiKV creates space_placeholder_file under the deployment bin or Data directories to reserve space for replay after a restart. In mixed‑deployment scenarios, these files are duplicated across nodes, inflating total disk usage.

Command to locate them:

ls -l | grep space_place

Sample output:

-rw-r--r-- 1 root root 199941579980 Aug 5 2021 space_placeholder_file

If the placeholder file size is comparable to the KV data size, it should be considered for removal or reduction.

To disable the placeholder, adjust TiKV’s configuration:

[storage]
reserve-space = "0MB"
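After the config change takes effect, the existing placeholder files still occupy disk until removed. The sketch below only lists them so their sizes can be reviewed first; list_placeholders is a hypothetical helper and the path in the usage comment is an example.

```shell
# Sketch: locate space_placeholder_file under a deployment tree and
# report each file's size, so it can be compared against KV data
# before deciding to delete it.
list_placeholders() {
  find "$1" -maxdepth 3 -name 'space_placeholder_file' -exec du -h {} +
}

# e.g. list_placeholders /data/tikv
# Verify each instance restarted with reserve-space = "0MB" before
# deleting the files it reports.
```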

3. Slow GC and Titan Data

When gc.enable-compaction-filter is enabled, GC runs as part of RocksDB compaction, so on clusters with little compaction activity stale MVCC versions can accumulate. The TiKV-Details > GC Grafana panels (for example, GC speed and TiKV Auto GC Working) help identify the issue.

One way to accelerate GC in this situation is to fall back to the dedicated GC worker by disabling the compaction filter:

[gc]
# Disable GC by compaction filter; the GC worker handles GC instead
enable-compaction-filter = false

Adjust RocksDB write‑rate to limit I/O impact:

[rocksdb]
rate-bytes-per-sec = "500MB"

After applying changes, restart TiKV and observe the GC speed metrics.
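Besides the Grafana panels, GC progress can be spot-checked directly from TiKV's Prometheus endpoint (status port 20180 by default). The filter below is a sketch: the metric-name pattern is an assumption, and exact series names vary by TiKV version, so check your own /metrics output.

```shell
# Sketch: filter GC-related series out of TiKV's /metrics text.
# gc_metrics is a hypothetical helper; the grep pattern is an example.
gc_metrics() {
  # $1: raw metrics text, e.g. from: curl -s http://127.0.0.1:20180/metrics
  printf '%s\n' "$1" | grep -E '^tikv_gcworker|gc_keys'
}
```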

Titan, introduced to mitigate write amplification, stores large values in blob files. Excessive Titan data can be diagnosed via monitoring panels (e.g., Server > CF size, TitanDB-kv > Live blob size) and by inspecting the data directories:

$ du -h -d 1 db/titandb
937G    db/titandb

$ du -h -d 1 raft/titandb/
1.1T    raft/titandb/

To control Titan’s space consumption, tune rocksdb.defaultcf.titan.discardable-ratio (default 0.5). Lowering it to 0.2 or 0.1 reduces retained obsolete data but may affect performance.

[rocksdb.defaultcf.titan]
discardable-ratio = 0.2
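The trade-off can be estimated with a back-of-envelope model (an approximation for intuition, not a formula from the TiKV docs): a blob file is only rewritten once its discardable fraction exceeds the ratio, so retained garbage per file can approach ratio × file size, giving a worst-case blob space amplification of roughly 1 / (1 − ratio).

```shell
# Back-of-envelope estimate: worst-case Titan blob space
# amplification implied by a given discardable-ratio.
# space_amp is a hypothetical helper, not a TiKV tool.
space_amp() {
  awk -v r="$1" 'BEGIN { printf "%.2f", 1 / (1 - r) }'
}

for r in 0.5 0.2 0.1; do
  echo "discardable-ratio=$r -> up to $(space_amp "$r")x live blob data"
done
# discardable-ratio=0.5 -> up to 2.00x live blob data
# discardable-ratio=0.2 -> up to 1.25x live blob data
# discardable-ratio=0.1 -> up to 1.11x live blob data
```

By this estimate, dropping the ratio from 0.5 to 0.2 bounds blob space overhead near 25% instead of 100%, at the cost of more frequent blob-file rewrites.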

After adjusting the ratio, perform a rolling restart, ensuring leader regions are migrated before each node restarts.

Summary

The article walks through three main dimensions—logs, placeholder files, and data (RocksDB/Titan)—to diagnose and resolve common TiKV disk‑space issues, providing concrete commands, configuration snippets, and operational precautions.
