Why a Wrong mount_maxsize Crashed Our TFS Cluster and How We Fixed It
A misconfigured mount_maxsize capped each Data Server at 20 GB and pushed cluster storage usage to 96%. Correcting the setting then led to block corruption that required a custom script to clean up. The incident illustrates the importance of proper storage settings and automated remediation in TFS operations.
Environment Overview
The company's TFS cluster consists of 2 Name Servers and 5 Data Servers. Image upload is handled by the Tengine‑tfs module, while image processing uses Lua to invoke the gm command. Each Data Server is equipped with 2 × 300 GB and 12 × 2 TB disks, so the 2 TB disks give the cluster a raw capacity of 120 TB (5 servers × 12 disks × 2 TB).
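The capacity figure can be checked with quick shell arithmetic. This is a sketch under two assumptions: that the 120 TB figure counts only the twelve 2 TB disks on each of the five Data Servers, and that the replica count is 2, as stated in the Investigation section. The usable estimate ignores filesystem and TFS overhead.

```shell
#!/bin/bash
# Assumption: only the 2 TB disks hold TFS data; the 300 GB disks are not counted.
servers=5
disks_per_server=12
disk_tb=2
replicas=2

# Raw capacity across the cluster, and a rough usable figure once every
# block is stored `replicas` times.
raw_tb=$(( servers * disks_per_server * disk_tb ))
usable_tb=$(( raw_tb / replicas ))

echo "raw=${raw_tb}TB usable_with_${replicas}_replicas=${usable_tb}TB"
```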
Problem
During a routine inspection, the storage usage of the TFS cluster was reported at 96%, and each Data Server showed only 20 GB of usable space. Investigation revealed that the mount_maxsize parameter in ds.conf on every Data Server was set to 20 GB, which prevented the 2 TB disks from being fully utilized.
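A check like the following could catch this kind of misconfiguration during inspection. It is a minimal sketch, assuming ds.conf uses "key = value" lines and that mount_maxsize is expressed in KB (verify both against your TFS version). The function names and the 50% threshold are illustrative and not part of TFS.

```shell
#!/bin/bash
# Print the mount_maxsize value from a ds.conf-style file.
# Assumption: the setting appears as "mount_maxsize = <number>", in KB.
# Commented-out lines (starting with #) are not matched.
get_mount_maxsize() {
    awk -F'=' '$1 ~ /^[[:space:]]*mount_maxsize[[:space:]]*$/ {
        gsub(/[[:space:]]/, "", $2); print $2; exit
    }' "$1"
}

# Succeed (exit status 0) when the configured limit is suspiciously small,
# i.e. below min_pct percent of the disk size. Both sizes are in KB.
is_undersized() {
    local configured_kb=$1 disk_kb=$2 min_pct=${3:-50}
    [ $(( configured_kb * 100 )) -lt $(( disk_kb * min_pct )) ]
}

# Example usage (paths are placeholders for your own layout):
# conf=/path/to/ds.conf
# disk_kb=$(df -k --output=size /path/to/data/disk1 | tail -n 1)
# is_undersized "$(get_mount_maxsize "$conf")" "$disk_kb" \
#     && echo "mount_maxsize looks too small for this disk"
```

With the numbers from this incident, a 20 GB limit (20971520 KB) against a disk of 1922859824 1K-blocks is flagged as undersized.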
Initial Fix
The solution was to set mount_maxsize to a reasonable value. Each Data Server was taken offline one by one, reformatted, and then brought back online with the corrected configuration.
Side Effects
After the adjustment, image uploads became intermittent. Logs showed errors such as "block creation failed" and "block loss, unable to execute copy task," eventually causing a full service outage.
Investigation
The block list was examined with the command ssm -s NSIP:PORT -i block. In that output, the COPYS column should match the replica count (2). Some blocks displayed COPYS values of 0 or 1, indicating corrupted or unsynchronized blocks.
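The same check can be scripted. The sketch below reads block-list output on standard input and prints the ID and copy count of every block whose copy count is below the expected replica count. It assumes, as the cleanup script in this article does, that the block ID is column 1 and COPYS is column 8. The function name is illustrative.

```shell
#!/bin/bash
# List blocks whose COPYS value is below the expected replica count.
# Assumption: column 1 is the block ID and column 8 is COPYS, both plain
# integers on data rows; header or summary rows are skipped.
under_replicated() {
    local expected=${1:-2}
    awk -v want="$expected" \
        '$1 ~ /^[0-9]+$/ && $8 ~ /^[0-9]+$/ && ($8 + 0) < (want + 0) { print $1, $8 }'
}

# Example usage with the command from the article:
# /opt/websuite/tfs/bin/ssm -s NSIP:PORT -i block | under_replicated 2
```

Comparing numerically, rather than matching the digit 0 as a pattern, keeps values such as 10 or 20 from being mistaken for 0.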
Disk Usage Sample (df output)
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sdb1      1922859824 1616498672  208685480  89% /opt/websuite/tfs/data/disk1
/dev/sdc1      1922859824 1616498660  208685492  89% /opt/websuite/tfs/data/disk2
... (remaining disks omitted for brevity)
Cleanup Script
#!/bin/bash
# Collect the IDs (column 1) of blocks whose COPYS value (column 8) is exactly 0.
badblk=$(/opt/websuite/tfs/bin/ssm -s 172.16.4.71:8100 -i block | awk '$8 == "0" {print $1}')
# Remove each of those blocks through admintool.
for i in $badblk; do
    /opt/websuite/tfs/bin/admintool -s 172.16.4.71:8100 -i "removeblk $i"
done
The script identifies blocks with COPYS equal to 0 and removes them via admintool. After execution, all blocks reported COPYS = 2 and image upload functionality returned to normal.
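Because removeblk is destructive, it may be worth previewing the commands before running them. The following is a sketch of a dry-run wrapper, not the script used in the incident: it reads block IDs on standard input and only prints the admintool command for each one unless DRY_RUN is set to 0. The function name and the DRY_RUN variable are illustrative.

```shell
#!/bin/bash
ADMINTOOL=/opt/websuite/tfs/bin/admintool
NS_ADDR=172.16.4.71:8100

# Read block IDs (one per line). By default only print the command that
# would run; execute it when DRY_RUN=0. Non-numeric lines are ignored.
remove_blocks() {
    local id
    while read -r id; do
        case $id in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "${DRY_RUN:-1}" = "0" ]; then
            # Redirect stdin so the tool cannot consume the remaining IDs.
            "$ADMINTOOL" -s "$NS_ADDR" -i "removeblk $id" < /dev/null
        else
            echo "$ADMINTOOL -s $NS_ADDR -i \"removeblk $id\""
        fi
    done
}

# Example usage: preview first, then rerun with DRY_RUN=0 to apply.
# printf '%s\n' 1001 1002 | remove_blocks
```

Defaulting to dry-run means a mistyped filter upstream shows up as a list of unexpected commands rather than as removed blocks.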
Lessons Learned
Automating repetitive maintenance tasks reduces human error; proper configuration of mount_maxsize is critical for utilizing available disk capacity; and verifying that management commands are compatible with the installed TFS version prevents unexpected failures during remediation.
ITPUB
