How Nydus’s RAFS v6 and EROFS over fscache Deliver Near‑Native Container Image Performance
This article explains how Nydus’s RAFS v6 image format addresses the performance and scalability limitations of traditional OCI v1 images and user‑space container image solutions. RAFS v6 is built on the kernel‑native EROFS filesystem and integrated with fscache, which enables kernel‑mode on‑demand loading, fine‑grained chunk deduplication, asynchronous prefetching, and more efficient I/O. Benchmark results show near‑native speeds.
Background
OCI v1 container images are stored as layered tar archives. The format forces the whole image (or large layers) to be downloaded before the container can start, leading to slow start‑up, coarse deduplication granularity and high kernel‑user communication overhead.
Limitations of Existing User‑Space Solutions
Third‑party formats wrap OCI layers in custom block formats but still rely on a user‑space file‑system layer (FUSE or virtiofs). This introduces:
excessive system‑call overhead, especially for random small I/O;
inability to separate metadata from data, preventing efficient compression or scanning;
no support for rootless mounts;
complex merging or sub‑setting of images without rewriting blobs.
Nydus Architecture Review
The original Nydus implementation used a pure user‑space image format accessed via FUSE/virtiofs. While functional, it suffered the same performance penalties as other user‑space solutions when containers accessed many small files.
RAFS v6 Image Format
To overcome these issues, the Dragonfly community designed RAFS v6, a kernel‑native image format built on the EROFS read‑only filesystem. Key characteristics:
Image data blobs are stored separately from a bootstrap metadata file.
File data is split into fixed‑size chunks (typically 4 KB) and deduplicated at the chunk level.
Metadata can be read without pulling the corresponding data blob, dramatically reducing data transferred during start‑up.
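The chunk‑level deduplication and the metadata/data split described above can be illustrated with a small sketch. This is only a toy model, not the RAFS v6 on‑disk format: the names `build_image`, `read_file`, `bootstrap` and `blob` are hypothetical, SHA‑256 is an assumed digest, and the chunk size is the 4 KB figure quoted above.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: the 4 KB chunk size quoted in the text


def build_image(files, chunk_size=CHUNK_SIZE):
    """Split each file into fixed-size chunks and store each unique chunk once.

    Returns (bootstrap, blob):
      bootstrap maps a path to its ordered list of chunk digests (metadata only);
      blob maps a digest to the chunk bytes (data only).
    """
    bootstrap = {}
    blob = {}
    for path, data in files.items():
        digests = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            blob.setdefault(digest, chunk)  # identical chunks are stored once
            digests.append(digest)
        bootstrap[path] = digests
    return bootstrap, blob


def read_file(bootstrap, blob, path):
    """Reassemble a file from its chunk list."""
    return b"".join(blob[d] for d in bootstrap[path])


# Two hypothetical files that share the same 4 KB chunk of "A" bytes.
files = {
    "/bin/a": b"A" * 8192 + b"B" * 4096,
    "/bin/b": b"A" * 4096 + b"C" * 100,
}
bootstrap, blob = build_image(files)
```

In this model, listing paths or sizes only needs `bootstrap`, which is why metadata can be served before any data chunk has been fetched.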
EROFS File System Features
Native read‑only block filesystem with page‑aligned blocks.
Tail‑packing reduces space while keeping high access performance.
Block‑level addressing is mmap‑friendly and eliminates post‑I/O processing.
Supports direct I/O, multiple back‑ends (block device, FSDAX), and an extensible on‑disk format.
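To give a feel for why tail‑packing saves space, the arithmetic can be sketched as follows. This is a deliberately simplified model under stated assumptions: a 4 KB block size, no accounting for inode or xattr sizes, and no check that the tail actually fits next to the inode. The function names are hypothetical.

```python
BLOCK_SIZE = 4096  # assumption: page-sized blocks


def block_aligned_size(file_size, block_size=BLOCK_SIZE):
    """Bytes used when every file is padded to a whole number of blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def tail_packed_size(file_size, block_size=BLOCK_SIZE):
    """Bytes used when the final partial block is stored inline
    instead of occupying its own padded block."""
    full_blocks, tail = divmod(file_size, block_size)
    return full_blocks * block_size + tail


# Hypothetical file sizes in bytes.
sizes = [100, 4096, 5000, 12289]
aligned = sum(block_aligned_size(s) for s in sizes)
packed = sum(tail_packed_size(s) for s in sizes)
```

The saving is largest for images with many small files, where most of each padded block would otherwise be empty.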
EROFS over fscache On‑Demand Loading
The solution integrates Linux’s fscache/cachefiles subsystem. When a container accesses a file:
If the data is already cached, the kernel reads it directly from the cache file (no user‑space round‑trip).
If the data is missing, the request is forwarded to the nydusd daemon, which fetches the required chunk from the remote registry, writes it into the fscache file, and wakes the waiting container process.
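The hit/miss flow above can be modelled in a few lines. This is a user‑space toy, not the kernel or nydusd implementation: `OnDemandCache` and `fake_registry_fetch` are hypothetical names, the dictionary stands in for the fscache cache file, and the fetch callback stands in for the daemon's registry access.

```python
class OnDemandCache:
    """Toy model of the cache hit/miss flow."""

    def __init__(self, fetch_chunk):
        self._fetch_chunk = fetch_chunk  # stands in for the daemon's registry fetch
        self._cache = {}                 # stands in for the fscache cache file
        self.hits = 0
        self.misses = 0

    def read(self, chunk_index):
        if chunk_index in self._cache:
            self.hits += 1               # served locally, daemon not involved
            return self._cache[chunk_index]
        self.misses += 1                 # miss: fetch remotely, then fill the cache
        data = self._fetch_chunk(chunk_index)
        self._cache[chunk_index] = data
        return data


def fake_registry_fetch(chunk_index):
    # Hypothetical remote fetch returning 4 KB of placeholder data.
    return bytes([chunk_index % 256]) * 4096


cache = OnDemandCache(fake_registry_fetch)
first = cache.read(7)   # miss: goes through the fetch callback
second = cache.read(7)  # hit: served from the local cache
```

Only the first access to a chunk pays the remote fetch cost; every later access takes the local path.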
Additional optimizations:
Asynchronous prefetching – the daemon can download ahead of the container’s request.
Network‑I/O aggregation – larger blocks are fetched than the immediate request to amortize latency.
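One way to picture network‑I/O aggregation is to round each requested byte range out to a larger aligned window before fetching. The sketch below assumes a 1 MiB window, which is an illustrative value and not a documented nydusd default; `aggregate_range` is a hypothetical helper.

```python
def aggregate_range(offset, length, window=1 << 20, blob_size=None):
    """Expand a requested byte range to window-aligned boundaries.

    Returns (start, end) with end exclusive. If blob_size is given,
    the range is clipped so it never runs past the end of the blob.
    """
    start = (offset // window) * window             # round down
    end = -(-(offset + length) // window) * window  # round up
    if blob_size is not None:
        end = min(end, blob_size)
    return start, end


# A 4 KB read inside the first window fetches the whole window.
small = aggregate_range(5000, 4096)
# A read that straddles a window boundary fetches both windows.
straddle = aggregate_range(1048000, 4096)
# Near the end of a hypothetical 1,500,000-byte blob, the range is clipped.
clipped = aggregate_range(1048000, 4096, blob_size=1_500_000)
```

Neighbouring reads that fall into an already fetched window then become cache hits, so one larger request replaces many small ones.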
Performance Evaluation
Benchmarks were run on an ecs.i2ne.4xlarge instance (16 vCPU, 128 GiB RAM, NVMe SSD). Three workload groups were measured.
Read / RandRead I/O: fscache mode matched the “loop” mode (EROFS mounted via a loop device) and outperformed the FUSE‑based solution, though it remained slightly behind native ext4.
Metadata Operations: Tar‑based metadata tests showed EROFS‑based images surpassing ext4 because all metadata is tightly packed in a read‑only layout.
Typical Workload (Linux Kernel Compilation): fscache mode achieved performance comparable to loop and native ext4, and better than FUSE.
Test commands:
fio -ioengine=psync -bs=4k -direct=0 -rw=read -numjobs=1
fio -ioengine=psync -bs=4k -direct=0 -rw=randread -numjobs=1
tar -cf /dev/null <linux_src_dir>
time make -j16
Advantages of the New Scheme
Asynchronous Prefetch: nydusd can download data before the container accesses it, so later reads hit the cache in the kernel and avoid the round‑trip to user space.
Network I/O Optimization: Fetching larger ranges than the immediate request amortizes per‑request network latency.
Higher Performance: Near‑native read speeds once the image is fully cached (EROFS is read‑only).
High‑Density Deployment: Each container image appears as a single cache file, simplifying management of hundreds of containers.
Fault Recovery & Hot Upgrade: With the image fully cached, the user‑space daemon can be restarted or upgraded without disrupting running containers.
Unified Solution for runc and rund: RAFS v6 and EROFS over fscache work for both traditional OCI runtimes and the newer rund runtime.
Future Work and Integration
The EROFS‑over‑fscache feature has been merged into the Linux 5.19 mainline kernel. Ongoing work includes back‑porting to OpenAnolis kernels (5.10 and 4.19), adding FSDAX support, and further performance tuning.
References
OpenAnolis high‑performance storage SIG: https://openanolis.cn/sig/high-perf-storage
Linux 5.19 commit (EROFS over fscache): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=65965d9530b0c320759cd18a9a5975fb2e098462
FUSE passthrough example: https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc
Nydus image service repository: https://github.com/dragonflyoss/image-service
LWN.net coverage: https://lwn.net/SubscriberLink/896140/3d7b8c63b70776d4/
Alibaba Cloud Native
