Why Does PyTorch Struggle with UFS Storage? Insights and Optimizations
A detailed case study of why PyTorch training on UFS file storage suffers severe I/O bottlenecks. We compare UFS with local SSD and SSHFS, then present practical optimizations (using cv2.imdecode, caching DataLoader file handles, and packing small-file datasets into large UFS files) that close the performance gap.
Background
When assisting an AI customer with a performance case on UFS file storage, we discovered that PyTorch training I/O performance was far below the hardware I/O capability.
Initial Observations
fio tests showed that UFS can deliver over 10 Gbps of bandwidth and more than 20K IOPS per instance, and the customer's earlier MXNet/TensorFlow workloads reached UFS's theoretical performance. However, in the customer's single-node, small-model PyTorch training scenario, I/O throughput fell far short of those numbers.
Parameter Tuning Attempts
We first tried adjusting PyTorch parameters such as batch_size and the DataLoader's num_workers. Raising batch_size did not improve throughput, because the underlying I/O remains single-queue; raising num_workers increased CPU and memory overhead without reducing per-request network latency. A sketch of this tuning follows.
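For illustration, a minimal sketch of the kind of tuning we tried; the dataset class here is a hypothetical stand-in for the customer's per-file image dataset, not their actual code:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyImageDataset(Dataset):
    """Hypothetical stand-in for the customer's small-file image dataset."""

    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        # In the real workload this is one small-file read from UFS.
        return torch.rand(3, 224, 224), idx % 10

loader = DataLoader(
    DummyImageDataset(),
    batch_size=64,   # raising this did not help: each item is still one small read
    num_workers=8,   # more workers added CPU/memory cost without hiding network latency
    pin_memory=True,
)
```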
Storage Comparison Tests
We benchmarked three storage options (local SSD, SSHFS, UFS) using CV2 and PIL image loading. UFS was significantly slower: with CV2, 72 imgs/s on UFS vs. 555 imgs/s on local SSD; with PIL, 115 imgs/s on UFS vs. 3508 imgs/s on local SSD.
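A sketch of how such a throughput comparison can be run; the mount point and glob pattern are assumptions, and since PIL's Image.open is lazy, .load() is called to force the full read:

```python
import glob
import time

import cv2
from PIL import Image

def throughput(paths, load):
    """Return images loaded per second for a given loader function."""
    start = time.perf_counter()
    for p in paths:
        load(p)
    return len(paths) / (time.perf_counter() - start)

paths = glob.glob("/mnt/ufs/train/*.jpg")  # hypothetical UFS mount point
print(f"cv2: {throughput(paths, cv2.imread):.0f} imgs/s")
print(f"PIL: {throughput(paths, lambda p: Image.open(p).load()):.0f} imgs/s")
```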
Conclusions from Initial Tests
TensorFlow and MXNet workloads show no I/O issues on the same UFS volume.
Tuning PyTorch parameters (batch_size, num_workers) has little effect.
PyTorch combined with UFS suffers a large I/O performance gap.
Root Cause Analysis
strace revealed that CV2 reads each small file with many SEEK (lseek) calls, and on NFS-based UFS each call becomes a network round trip; UFS's three-layer architecture (access, index, data) adds further latency per trip. PIL avoids the seeks but still incurs per-file network overhead.
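The syscall pattern is easy to reproduce: run a one-file load under strace and watch the lseek traffic. The image path below is hypothetical:

```python
# Run this script under strace to see the per-file syscall pattern, e.g.:
#   strace -f -e trace=openat,read,lseek python load_one.py
import cv2

img = cv2.imread("/mnt/ufs/train/0001.jpg")  # hypothetical path on the UFS mount
print(img.shape if img is not None else "read failed")
```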
Optimization Direction 1: Reduce UFS Interaction Frequency
Switching from cv2.imread to reading each file once and decoding in memory with cv2.imdecode halves the number of system calls, cutting per-file latency from roughly 12 ms to roughly 6 ms. Caching file handles in the DataLoader further lowers metadata overhead, bringing performance close to local SSD for moderate dataset sizes.
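A minimal sketch of both changes, assuming a torch-style dataset over a list of image paths; the class and function names are illustrative, not the customer's actual code:

```python
import cv2
import numpy as np
from torch.utils.data import Dataset

def decode(buf):
    # Decode entirely in memory: no SEEK round trips against UFS.
    return cv2.imdecode(np.frombuffer(buf, np.uint8), cv2.IMREAD_COLOR)

class CachedHandleDataset(Dataset):
    """Sketch: read each file in one sequential pass and keep the handle
    open, so later epochs skip the open()/close() metadata round trips."""

    def __init__(self, paths):
        self.paths = paths
        self._handles = {}  # filled lazily, one cache per DataLoader worker

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        fh = self._handles.get(idx)
        if fh is None:
            fh = self._handles[idx] = open(self.paths[idx], "rb")
        fh.seek(0)                # rewind the cached handle (client-side)
        return decode(fh.read())  # one sequential read, then in-memory decode
```

Keeping thousands of handles open trades file-descriptor budget for latency, which is why this approach fits moderate dataset sizes.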
Optimization Direction 2: Convert Small Files to a Large Dataset
We built a tool that packs the directory of small images into a single UFS file with an index, allowing random reads without per‑file metadata lookups. After conversion, UFS read performance reached or exceeded local SSD.
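A simplified sketch of the packing idea; the file names and pickle-based index format here are assumptions, not the actual tool's layout:

```python
import os
import pickle

def pack(image_dir, data_path="train.pack", index_path="train.idx"):
    """Concatenate every small image into one large file on UFS and
    record (offset, length) per image so it can be read back randomly."""
    index, offset = {}, 0
    with open(data_path, "wb") as out:
        for name in sorted(os.listdir(image_dir)):
            with open(os.path.join(image_dir, name), "rb") as f:
                buf = f.read()
            out.write(buf)
            index[name] = (offset, len(buf))
            offset += len(buf)
    with open(index_path, "wb") as f:
        pickle.dump(index, f)

def read_packed(data_file, index, name):
    # One seek + one read on an already-open handle: no per-file
    # open()/close() or metadata lookup against UFS.
    offset, length = index[name]
    data_file.seek(offset)
    return data_file.read(length)
```

During training, the packed file is opened once per worker, and the bytes returned by read_packed feed straight into cv2.imdecode.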
Performance After Optimizations
Server-side metadata pre-fetch reduced latency by more than 50%. In benchmark runs spanning 5-20 hours of training, the packed UFS dataset completed a number of epochs comparable to local SSD.
Final Recommendation
For PyTorch training over many small files, either convert the dataset with the packing tool or switch to cv2.imdecode with file-handle caching. Combined with these optimizations, UFS remains well suited to large-scale training.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.