Why Does PyTorch Struggle with UFS Storage? Insights and Optimizations
A detailed case study of why PyTorch training on UFS file storage suffers severe I/O bottlenecks. We compare UFS with local SSD and SSHFS, then present practical optimizations (using cv2.imdecode, caching DataLoader file handles, and packing small-file datasets into large UFS files) that close the performance gap.
Background
When assisting an AI customer with a performance case on UFS file storage, we discovered that PyTorch training I/O performance was far below the hardware I/O capability.
Initial Observations
fio tests showed that UFS can deliver over 10 Gbps of bandwidth and more than 20K IOPS per instance, and the customer's earlier MXNet/TensorFlow workloads reached UFS's theoretical performance. However, in the customer's single-node, small-model PyTorch training scenario, I/O throughput fell far short of those numbers.
Parameter Tuning Attempts
We first tried adjusting PyTorch parameters such as batch_size and the DataLoader's num_workers. Raising batch_size did not improve throughput, because the underlying I/O remains single-queue; raising num_workers increased CPU and memory overhead without reducing per-request network latency. A sketch of this tuning follows.
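For illustration, a minimal sketch of the kind of tuning we tried; the dataset class here is a hypothetical stand-in for the customer's per-file image dataset, not their actual code:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyImageDataset(Dataset):
    """Hypothetical stand-in for the customer's small-file image dataset."""

    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        # In the real workload this is one small-file read from UFS.
        return torch.rand(3, 224, 224), idx % 10

loader = DataLoader(
    DummyImageDataset(),
    batch_size=64,   # raising this did not help: each item is still one small read
    num_workers=8,   # more workers added CPU/memory cost without hiding network latency
    pin_memory=True,
)
```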
Storage Comparison Tests
We benchmarked three storage options (local SSD, SSHFS, UFS) using CV2 and PIL image loading. UFS was significantly slower: with CV2, 72 imgs/s on UFS vs. 555 imgs/s on local SSD; with PIL, 115 imgs/s on UFS vs. 3508 imgs/s on local SSD.
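A sketch of how such a throughput comparison can be run; the mount point and glob pattern are assumptions, and since PIL's Image.open is lazy, .load() is called to force the full read:

```python
import glob
import time

import cv2
from PIL import Image

def throughput(paths, load):
    """Return images loaded per second for a given loader function."""
    start = time.perf_counter()
    for p in paths:
        load(p)
    return len(paths) / (time.perf_counter() - start)

paths = glob.glob("/mnt/ufs/train/*.jpg")  # hypothetical UFS mount point
print(f"cv2: {throughput(paths, cv2.imread):.0f} imgs/s")
print(f"PIL: {throughput(paths, lambda p: Image.open(p).load()):.0f} imgs/s")
```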
Conclusions from Initial Tests
TensorFlow and MXNet workloads show no I/O issues on the same UFS volume.
Tuning PyTorch parameters (batch_size, num_workers) has little effect.
PyTorch combined with UFS suffers a large I/O performance gap.
Root Cause Analysis
strace revealed that CV2 reads each small file with many SEEK (lseek) calls, and on NFS-based UFS each call becomes a network round trip; UFS's three-layer architecture (access, index, data) adds further latency per trip. PIL avoids the seeks but still incurs per-file network overhead.
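The syscall pattern is easy to reproduce: run a one-file load under strace and watch the lseek traffic. The image path below is hypothetical:

```python
# Run this script under strace to see the per-file syscall pattern, e.g.:
#   strace -f -e trace=openat,read,lseek python load_one.py
import cv2

img = cv2.imread("/mnt/ufs/train/0001.jpg")  # hypothetical path on the UFS mount
print(img.shape if img is not None else "read failed")
```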
Optimization Direction 1: Reduce UFS Interaction Frequency
Switching from cv2.imread to reading each file once and decoding in memory with cv2.imdecode halves the number of system calls, cutting per-file latency from roughly 12 ms to roughly 6 ms. Caching file handles in the DataLoader further lowers metadata overhead, bringing performance close to local SSD for moderate dataset sizes.
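A minimal sketch of both changes, assuming a torch-style dataset over a list of image paths; the class and function names are illustrative, not the customer's actual code:

```python
import cv2
import numpy as np
from torch.utils.data import Dataset

def decode(buf):
    # Decode entirely in memory: no SEEK round trips against UFS.
    return cv2.imdecode(np.frombuffer(buf, np.uint8), cv2.IMREAD_COLOR)

class CachedHandleDataset(Dataset):
    """Sketch: read each file in one sequential pass and keep the handle
    open, so later epochs skip the open()/close() metadata round trips."""

    def __init__(self, paths):
        self.paths = paths
        self._handles = {}  # filled lazily, one cache per DataLoader worker

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        fh = self._handles.get(idx)
        if fh is None:
            fh = self._handles[idx] = open(self.paths[idx], "rb")
        fh.seek(0)                # rewind the cached handle (client-side)
        return decode(fh.read())  # one sequential read, then in-memory decode
```

Keeping thousands of handles open trades file-descriptor budget for latency, which is why this approach fits moderate dataset sizes.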
Optimization Direction 2: Convert Small Files to a Large Dataset
We built a tool that packs the directory of small images into a single UFS file with an index, allowing random reads without per‑file metadata lookups. After conversion, UFS read performance reached or exceeded local SSD.
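A simplified sketch of the packing idea; the file names and pickle-based index format here are assumptions, not the actual tool's layout:

```python
import os
import pickle

def pack(image_dir, data_path="train.pack", index_path="train.idx"):
    """Concatenate every small image into one large file on UFS and
    record (offset, length) per image so it can be read back randomly."""
    index, offset = {}, 0
    with open(data_path, "wb") as out:
        for name in sorted(os.listdir(image_dir)):
            with open(os.path.join(image_dir, name), "rb") as f:
                buf = f.read()
            out.write(buf)
            index[name] = (offset, len(buf))
            offset += len(buf)
    with open(index_path, "wb") as f:
        pickle.dump(index, f)

def read_packed(data_file, index, name):
    # One seek + one read on an already-open handle: no per-file
    # open()/close() or metadata lookup against UFS.
    offset, length = index[name]
    data_file.seek(offset)
    return data_file.read(length)
```

During training, the packed file is opened once per worker, and the bytes returned by read_packed feed straight into cv2.imdecode.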
Performance After Optimizations
Server-side metadata pre-fetch reduced latency by more than 50%. In benchmark runs spanning 5-20 hours of training, the packed UFS dataset completed a number of epochs comparable to local SSD.
Final Recommendation
For PyTorch training over many small files, either convert the dataset with the packing tool or switch to cv2.imdecode with file-handle caching. Combined with these optimizations, UFS remains well suited to large-scale training.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.