Lustre Distributed File System: Overview, Stripe Mechanism, I/O Performance Characteristics, and Optimization Practices
This article provides a comprehensive overview of the Lustre parallel distributed file system, detailing its architecture, stripe configuration, I/O performance traits, challenges with small files, and practical optimization techniques for high‑performance computing environments.
Lustre is a parallel distributed file system commonly used in large computer clusters; the name is a portmanteau of "Linux" and "cluster". Development began in 1999 under Peter Braam, and Lustre 1.0 was released in 2003 by his company, Cluster File Systems, Inc., under the GNU GPLv2 license.
Lustre was developed with backing from HP, Intel, and the U.S. Department of Energy. Intel at one point announced the discontinuation of its commercial HPC edition but later confirmed continued support for the stable open-source releases, underscoring Lustre's prominence among distributed file systems.
1. Lustre Overview
Lustre is a cluster‑oriented storage architecture built on Linux, offering a POSIX‑compatible interface. Its two main features are high scalability and high performance, supporting tens of thousands of clients, petabyte‑scale storage, and hundreds of gigabytes of aggregated I/O throughput.
As a scale‑out storage solution, Lustre expands capacity and performance simply by adding servers. It excels in workloads with many concurrent large‑file reads/writes, though it is less suitable for workloads dominated by a large number of small files (LOSF).
Lustre is widely deployed in high‑performance computing (HPC); roughly 70% of the TOP‑10, 50% of the TOP‑30, and 40% of the TOP‑100 supercomputers use Lustre. It is also common in oil & gas, manufacturing, media, and finance sectors.
2. Lustre Stripe
Lustre uses object storage to split large files across multiple Object Storage Targets (OSTs) in a RAID‑0‑like fashion. Metadata resides on a Metadata Target (MDT). The stripe layout—stripe_count (number of OST objects), stripe_size (chunk size), stripe_offset (starting OST)—is stored as extended attributes on the inode.
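The RAID-0-like mapping above can be sketched with plain shell arithmetic. The stripe_count and stripe_size values below are illustrative example settings, not values read from a live Lustre layout:

```shell
#!/bin/sh
# Illustrative stripe mapping: which OST object holds a given file byte?
# stripe_count / stripe_size are example settings, not queried from Lustre.
stripe_count=4
stripe_size=$((1024 * 1024))              # 1 MiB chunks

file_offset=$((5 * stripe_size + 42))     # a byte 5 MiB + 42 into the file

chunk=$(( file_offset / stripe_size ))    # global chunk index within the file
ost_index=$(( chunk % stripe_count ))     # round-robin across the OST objects
obj_offset=$(( (chunk / stripe_count) * stripe_size + file_offset % stripe_size ))

echo "chunk=$chunk ost=$ost_index offset_in_object=$obj_offset"
```

For this offset the byte lands on the second OST object (index 1), a little over 1 MiB into that object, which is exactly how RAID-0 places blocks across member disks.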
When a client accesses a file, it first obtains the stripe layout from the Metadata Server (MDS) and then performs concurrent I/O directly with the relevant OSTs, boosting parallelism and aggregate bandwidth.
Stripe also enables storage of very large files, overcoming single‑OST size limits. However, it introduces additional load and risk, as failure of any OST object can render the whole file inaccessible.
Typical Lustre deployments limit a single file to 160 OST objects; with an EXT4-based backend, a single file can reach up to 320 TB. OST selection uses two algorithms: round-robin allocation by default, switching to a weighted random algorithm once the free-space imbalance between OSTs exceeds a 20% threshold.
Stripe parameters can be set manually with lfs setstripe or left to defaults (stripe_count = 1, stripe_size = 1 MB, stripe_offset = ‑1). In practice, the stripe count should be tuned based on data size, network, and access patterns; excessive stripe counts can degrade performance and increase risk.
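As a sketch of how these knobs are set, assuming a mount at /lustre (all paths here are hypothetical), the lfs commands look like the following; note that newer releases use -S for the stripe size, while older ones used -s:

```shell
# Hypothetical paths; adapt to your own mount point.
# Inspect the current layout of a file or directory:
lfs getstripe /lustre/project/data.bin

# Stripe a large file across 8 OSTs in 4 MiB chunks; -i -1 lets the
# MDS choose the starting OST:
lfs setstripe -c 8 -S 4M -i -1 /lustre/project/data.bin

# Set a directory default so that new small files created under it
# each get a single OST object:
lfs setstripe -c 1 /lustre/project/small_files
```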
3. Lustre I/O Performance Characteristics
(1) Write performance exceeds read performance: Writes are asynchronous and can be aggregated, while reads require more disk seeks and lack a read cache on OSTs, leading to lower throughput and higher CPU usage.
(2) Excellent performance for large files: The separation of metadata and data, stripe distribution, and network design favor sequential I/O on large files, achieving aggregate bandwidths up to 240 GB/s in verified tests.
(3) Poor performance for small files: Each small‑file operation incurs extra metadata lookups and network round‑trips, and the underlying EXT3/EXT4 backend is not optimized for small‑file metadata access, resulting in bandwidths below 4 MB/s on a 4‑OST gigabit setup.
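The contrast between (2) and (3) can be reproduced with a crude dd-based pattern test. TARGET is a hypothetical path; point it at a Lustre mount to measure real numbers (it falls back to a local temp directory here, sizes kept small for illustration):

```shell
#!/bin/sh
# Crude illustration of sequential vs. small-file write patterns.
# TARGET is hypothetical; set it to a Lustre mount for real measurements.
TARGET="${TARGET:-$(mktemp -d)}"

# One large sequential write: few RPCs, all stripes engaged in parallel.
dd if=/dev/zero of="$TARGET/big.dat" bs=1M count=16 2>/dev/null

# Many tiny files: each costs an MDS create plus OST object setup.
for i in $(seq 1 100); do
  dd if=/dev/zero of="$TARGET/small.$i" bs=4k count=1 2>/dev/null
done

ls "$TARGET" | wc -l   # 101 entries: big.dat plus 100 small files
```

Wrapping each dd in `time` on a real Lustre mount makes the throughput gap between the two patterns directly visible.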
4. Lustre Small‑File Optimizations
Because Lustre is not suited for LOSF workloads, the following mitigations are recommended:
(1) Aggregate small files into larger archives (e.g., tar) or use loopback mounts to store them as a single large file.
(2) Use O_DIRECT I/O with 4 KB request sizes and disable file locking.
(3) Write data sequentially; sequential I/O outperforms random small‑file access.
(4) Deploy SSDs or higher‑performance disks for OSTs to improve IOPS.
(5) Prefer RAID 1+0 over RAID 5/6 for OSTs, avoiding the read-modify-write parity penalty that RAID 5/6 incurs on small writes.
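Mitigation (1) can be sketched with standard tar; all paths below are temporary and illustrative, and on a real system the archive would live on the Lustre mount:

```shell
#!/bin/sh
# Sketch of mitigation (1): pack many small files into one tar archive so
# Lustre handles a single large object instead of many tiny ones.
SRC="$(mktemp -d)"
OUT="$(mktemp -d)"
for i in $(seq 1 50); do echo "record $i" > "$SRC/part.$i"; done

# One archive instead of 50 metadata-heavy small files on Lustre:
tar -cf "$OUT/parts.tar" -C "$SRC" .

# Consumers extract locally, avoiding a per-file MDS round-trip:
mkdir "$OUT/unpacked"
tar -xf "$OUT/parts.tar" -C "$OUT/unpacked"
ls "$OUT/unpacked" | wc -l   # 50
```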
Additional tuning parameters include:
(1) Disable LNET debugging: sysctl -w lnet.debug=0
(2) Increase the client dirty cache: lctl set_param osc.*.max_dirty_mb=256
(3) Raise RPC parallelism: echo 32 > /proc/fs/lustre/osc/*-OST000*/max_rpcs_in_flight
(4) Set the stripe count to 1 for small files: lfs setstripe -c 1 /path/filename (-c 0 keeps the default; -c -1 stripes across all OSTs)
(5) Use client-local file locking: mount -t lustre -o localflock
For extreme cases, a loopback mount can be used to hide metadata overhead:
dd if=/dev/zero of=/mnt/lustre/loopback/scratch bs=1048576 count=1024
losetup /dev/loop0 /mnt/lustre/loopback/scratch
mkfs -t ext4 /dev/loop0
mount /dev/loop0 /mnt/losf
5. Lustre I/O Best Practices
Key recommendations for achieving optimal performance:
(1) Use a single process to read shared small files and distribute the data internally to other processes.
(2) For small files (1 MB–1 GB), set the stripe count to 1.
(3) For medium files (>1 GB), limit the stripe count to 4 or fewer.
(4) For very large files, use more than 4 OST objects and avoid serial I/O and file-per-process patterns.
(5) Limit the number of files per directory; set the stripe count to 1 on directories containing many small files.
(6) Store small files on a single OST to improve single-process performance.
(7) Avoid frequent open/close operations.
(8) Prefer ls or lfs find over ls -l to reduce metadata traffic.
(9) Consider I/O middleware such as ADIOS, HDF5, or MPI-IO for further gains.
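Recommendation (8) in practice, with illustrative paths: listing a large directory with ls -l forces a stat of every file, which requires extra RPCs to the OSTs to fetch file sizes, whereas plain ls and lfs find are answered by the MDS alone.

```shell
# Answered by the MDS only; cheap even for large directories:
ls /lustre/project
lfs find /lustre/project -type f -name '*.h5'

# Avoid: stats every entry and queries OSTs for per-object sizes.
# ls -l /lustre/project
```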
Article source: ICT_Architect WeChat public account.