
Why a Wrong mount_maxsize Crashed Our TFS Cluster and How We Fixed It

A misconfigured mount_maxsize capped each Data Server at 20 GB and drove reported storage usage to 96%. Correcting the setting left corrupted blocks behind, which had to be removed with a custom script. The incident shows why storage settings need to be right from the start and why remediation should be automated in TFS operations.

ITPUB

Environment Overview

The company's TFS cluster consists of 2 Name Servers and 5 Data Servers. Image upload is handled by the Tengine-tfs module, while image processing uses Lua to invoke the gm command. Each Data Server has two 300 GB disks and twelve 2 TB disks. Counting only the 2 TB disks, that is 24 TB per server and 120 TB of raw capacity across the cluster.

Problem

During a routine inspection, the TFS cluster reported 96% storage usage, and each Data Server showed only 20 GB of usable space. Investigation revealed that the mount_maxsize parameter in ds.conf was set to 20 GB on every Data Server, so the 2 TB disks were never fully used.
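A mismatch like this can be caught by comparing the configured limit with the real filesystem size. The following is a minimal sketch, not the procedure used in the incident. It assumes that ds.conf uses "key = value" lines and that mount_maxsize is expressed in KB; the function names and the 80% threshold are illustrative.

```shell
#!/bin/bash
# Sketch: flag a Data Server whose mount_maxsize is far below the real disk size.
# Assumptions (not from the article): ds.conf uses "key = value" lines and
# mount_maxsize is in KB, the same unit that df -k reports.

# Print the mount_maxsize value found in the given config file.
get_mount_maxsize() {
    awk -F'=' '$1 ~ /^[[:space:]]*mount_maxsize[[:space:]]*$/ {
        sub(/#.*/, "", $2)            # drop a trailing comment, if any
        gsub(/[[:space:]]/, "", $2)   # drop surrounding whitespace
        print $2
        exit
    }' "$1"
}

# Succeed when the configured size covers at least min_pct percent of the disk.
# usage: maxsize_ok <configured_kb> <disk_kb> [min_pct]
maxsize_ok() {
    local configured_kb=$1 disk_kb=$2 min_pct=${3:-80}
    [ $(( configured_kb * 100 )) -ge $(( disk_kb * min_pct )) ]
}
```

A possible invocation, with a hypothetical ds.conf path: read the limit with conf_kb=$(get_mount_maxsize /path/to/ds.conf), read the disk size with disk_kb=$(df -kP /opt/websuite/tfs/data/disk1 | awk 'NR==2 {print $2}'), then run maxsize_ok "$conf_kb" "$disk_kb" and alert when it fails.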

Initial Fix

The fix was to raise mount_maxsize to a value sized for the 2 TB data disks. The Data Servers were taken offline one at a time, reformatted, and brought back online with the corrected configuration.

Side Effects

After the adjustment, image uploads became intermittent. Logs showed errors such as "block creation failed" and "block loss, unable to execute copy task," eventually causing a full service outage.

Investigation

The block list was examined with the command ssm -s NSIP:PORT -i block. The COPYS column should match the replica count (2). Some blocks showed COPYS values of 0 or 1, indicating corrupted or unsynchronized blocks.
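The COPYS check can be summarized with a short filter. This is a sketch under one assumption taken from the article's cleanup script: COPYS is whitespace-separated field 8 of the listing. The function names are illustrative, and the column number can be passed explicitly if the ssm output differs.

```shell
#!/bin/bash
# Sketch: summarize the COPYS column of a block listing read on stdin.
# Assumption: COPYS is field 8, as in the article's cleanup script.

# Count blocks per COPYS value.
# usage: ssm -s NSIP:PORT -i block | copys_histogram [column]
copys_histogram() {
    awk -v col="${1:-8}" '
        $col ~ /^[0-9]+$/ { count[$col]++ }   # skips headers and short lines
        END { for (c in count) printf "COPYS=%s blocks=%d\n", c, count[c] }
    ' | sort
}

# Succeed only when no block has fewer copies than the replica count.
# usage: ssm -s NSIP:PORT -i block | all_replicated <replica_count> [column]
all_replicated() {
    awk -v want="$1" -v col="${2:-8}" '
        $col ~ /^[0-9]+$/ && $col + 0 < want + 0 { bad++ }
        END { exit (bad > 0) }
    '
}
```

Running copys_histogram before and after a cleanup gives a quick view of how many blocks sit at 0, 1, or 2 copies, and all_replicated 2 can serve as a pass/fail check in a monitoring job.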

Disk Usage Sample (df output)

Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdb1   1922859824 1616498672 208685480  89% /opt/websuite/tfs/data/disk1
/dev/sdc1   1922859824 1616498660 208685492  89% /opt/websuite/tfs/data/disk2
... (remaining disks omitted for brevity)

Cleanup Script

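#!/bin/bash
# Remove every block whose COPYS column (field 8 of the ssm block listing) is exactly 0.
NS=172.16.4.71:8100
TFS_BIN=/opt/websuite/tfs/bin
# Compare as a string so header lines and values such as 10 or 20 are not matched.
badblk=$("$TFS_BIN/ssm" -s "$NS" -i block | awk '$8 == "0" {print $1}')
for i in $badblk; do
    "$TFS_BIN/admintool" -s "$NS" -i "removeblk $i"
done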

The script identifies blocks with COPYS equal to 0 and removes them via admintool. After execution, all blocks reported COPYS = 2 and image upload functionality returned to normal.
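Because removeblk is destructive, it can help to review the list of commands before running them. The sketch below is a dry-run variant, not part of the original remediation. It prints one admintool command per zero-copy block instead of executing it. It assumes the block ID is in field 1 and COPYS in field 8, as in the script above; the plan_removals name and the NS and ADMINTOOL variables are illustrative.

```shell
#!/bin/bash
# Sketch: print the removeblk commands instead of running them.
# Assumptions: block ID in field 1 and COPYS in field 8, as in the cleanup script.
# NS and ADMINTOOL default to the article's values and can be overridden.
NS=${NS:-172.16.4.71:8100}
ADMINTOOL=${ADMINTOOL:-/opt/websuite/tfs/bin/admintool}

# Read a block listing on stdin and emit one admintool command per zero-copy block.
plan_removals() {
    awk -v tool="$ADMINTOOL" -v ns="$NS" '
        $8 == "0" { printf "%s -s %s -i \"removeblk %s\"\n", tool, ns, $1 }
    '
}
```

One way to use it is to pipe the ssm block listing into plan_removals, save the output to a file, inspect the file, and only then execute it with bash.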

Lessons Learned

Automating repetitive maintenance tasks reduces human error. Configuring mount_maxsize correctly is critical for using the available disk capacity. Verifying that management commands are compatible with the installed TFS version prevents unexpected failures during remediation.

