Nydus Container Image Acceleration: Design, Implementation, and Production Experience
This article introduces Nydus, an open‑source container image acceleration solution. Nydus replaces the traditional tarball‑based OCI image format with a new format served through a user‑space file system, which enables on‑demand loading, chunk‑level deduplication, and secure distribution via Dragonfly. The article also describes its large‑scale deployment during Alibaba's 2020 Double‑11 event and in production workloads.
During the 2020 Double‑11 shopping festival, Ant Group’s technical team leveraged the newly open‑sourced Nydus container image acceleration project to boost core transaction chain performance, achieving over 50% faster scaling and dramatically reduced failure rates.
Challenges of Traditional Container Images
Early container deployments, often run as “fat containers,” did not prioritize image pull speed. As cloud‑native adoption grew, large images led to long download times, exposure to network jitter, I/O bottlenecks, and wasted storage, because most of the image data (over 90%) is never read during container start‑up.
Workarounds such as image pre‑warming and image trimming proved insufficient: pre‑warming wastes storage and bandwidth, and trimming cannot be done reliably without precise data on which files a workload actually reads. Studies show that pulling images can consume up to 76% of container start time, while only 6.4% of the pulled data is actually used.
Design Goals of Nydus
Nydus was built to provide on‑demand loading, efficient deduplication, and strong security for serverless and high‑performance scenarios. The solution replaces the tarball‑based OCI format with a new image format consisting of two layers:
Metadata Layer: A self‑verifying Merkle tree in which each file and directory is a hash node; a file's hash is derived from its chunk data, and a directory's hash from the hashes of its children.
Data Layer: Files are split into fixed‑size chunks, compressed, and stored in blob files; chunks can be shared across files and images, enabling cross‑image deduplication.
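The two layers can be illustrated with a minimal Python sketch. It is not the Nydus on‑disk format: the names BlobStore, split_chunks, file_hash, and dir_hash are hypothetical, SHA‑256 and zlib are stand‑ins for whatever digest and compression algorithms an image actually uses, and the 1 MiB chunk size is only an illustrative default.

```python
import hashlib
import zlib
from typing import Dict, List

CHUNK_SIZE = 1024 * 1024  # illustrative default, not a documented Nydus value


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Split file content into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlobStore:
    """Hypothetical data layer: compressed chunks keyed by content digest."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        # A chunk that is already present is not stored again, so identical
        # chunks are shared across files and across images.
        if digest not in self.chunks:
            self.chunks[digest] = zlib.compress(chunk)
        return digest

    def get(self, digest: str) -> bytes:
        return zlib.decompress(self.chunks[digest])


def file_hash(chunk_digests: List[str]) -> str:
    """Metadata node for a file: hash over its ordered chunk digests."""
    h = hashlib.sha256()
    for d in chunk_digests:
        h.update(bytes.fromhex(d))
    return h.hexdigest()


def dir_hash(children: Dict[str, str]) -> str:
    """Metadata node for a directory: hash over (name, child hash) pairs."""
    h = hashlib.sha256()
    for name in sorted(children):
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(bytes.fromhex(children[name]))
    return h.hexdigest()


# Two files that share their first two chunks (4-byte chunks for the demo).
store = BlobStore()
file_a = [store.put(c) for c in split_chunks(b"AAAABBBBCCCC", 4)]
file_b = [store.put(c) for c in split_chunks(b"AAAABBBBDDDD", 4)]
root = dir_hash({"a": file_hash(file_a), "b": file_hash(file_b)})
```

In this sketch the six logical chunks occupy only four slots in the store, and changing any byte of any file changes the root hash, which is the property that makes the metadata tree self‑verifying.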
The read‑only user‑space file system is exposed via FUSE (suitable for runc) or VirtioFS (for secure containers such as Kata). A P2P distribution layer built on Dragonfly provides low‑latency, encrypted transfers, caching, and seed/super‑node coordination.
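The on‑demand read path of such a file system can be sketched as follows. This is a simplified model under stated assumptions, not the Nydus daemon: LazyFile is a hypothetical class, the fetch callable stands in for any remote source (registry, object storage, or a P2P peer), and the cache is a plain in‑memory dictionary.

```python
from typing import Callable, Dict, List


class LazyFile:
    """Hypothetical lazily loaded file: chunks are fetched on first read."""

    def __init__(self, size: int, chunk_size: int, chunk_ids: List[str],
                 fetch: Callable[[str], bytes]) -> None:
        self.size = size
        self.chunk_size = chunk_size
        self.chunk_ids = chunk_ids
        self.fetch = fetch                 # stand-in for the remote backend
        self.cache: Dict[int, bytes] = {}  # chunk index -> chunk data
        self.fetch_count = 0

    def _chunk(self, index: int) -> bytes:
        # Only chunks that are not cached yet trigger a remote fetch.
        if index not in self.cache:
            self.cache[index] = self.fetch(self.chunk_ids[index])
            self.fetch_count += 1
        return self.cache[index]

    def read(self, offset: int, length: int) -> bytes:
        if offset < 0 or offset >= self.size or length <= 0:
            return b""
        end = min(offset + length, self.size)
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        # Assemble only the chunks that overlap the requested byte range.
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        start = offset - first * self.chunk_size
        return data[start:start + (end - offset)]


# A 12-byte file in three 4-byte chunks, served from a dict-backed "remote".
backend = {"c0": b"0123", "c1": b"4567", "c2": b"89ab"}
f = LazyFile(12, 4, ["c0", "c1", "c2"], backend.__getitem__)
first_read = f.read(5, 2)  # touches only chunk c1
```

A read inside one chunk costs one fetch, and later reads that overlap an already cached chunk fetch only the chunks still missing. In this model, that is why a container can start before most of its image has been downloaded.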
Key Features
On‑demand image download – containers start after only a small metadata layer (tens to hundreds of KB) is fetched.
Chunk‑level deduplication – reduces storage and bandwidth consumption.
Elimination of unused data – only needed data is retained.
End‑to‑end data integrity verification via Merkle‑tree hashes.
Compatibility with OCI distribution and artifact standards.
Support for multiple back‑ends (registry, NAS, S3‑compatible object storage).
Integration with Dragonfly P2P for efficient large‑scale distribution.
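The integrity feature above implies that data arriving from any back‑end or peer is checked against the digest recorded in trusted metadata before it is served. A minimal sketch of that check, assuming SHA‑256 digests and using the hypothetical names verified_fetch and IntegrityError:

```python
import hashlib
from typing import Callable


class IntegrityError(Exception):
    """Raised when fetched data does not match its expected digest."""


def verified_fetch(fetch: Callable[[str], bytes], expected_digest: str) -> bytes:
    """Fetch a chunk by digest and refuse to return it if it was altered.

    `fetch` is a stand-in for any untrusted source (registry, NAS, object
    storage, or a P2P peer); `expected_digest` comes from trusted metadata.
    """
    data = fetch(expected_digest)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_digest:
        raise IntegrityError(
            "digest mismatch: expected %s, got %s" % (expected_digest, actual)
        )
    return data


# Demo: an honest source passes verification, a tampered one is rejected.
chunk = b"container layer data"
digest = hashlib.sha256(chunk).hexdigest()
honest_source = {digest: chunk}
tampered_source = {digest: b"container layer dat4"}

ok = verified_fetch(honest_source.__getitem__, digest)
try:
    verified_fetch(tampered_source.__getitem__, digest)
    rejected = False
except IntegrityError:
    rejected = True
```

Because the check depends only on the digest, the same function works regardless of which back‑end or peer delivered the bytes.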
Production Experience
In Alibaba Cloud’s massive production environment, Nydus achieved 5‑10× read/write performance improvements over traditional FUSE/VirtioFS, added read‑IO auditing, and introduced metadata signing to protect against tampering. Offline compute tasks saw a 5× increase in throughput, failure rates dropped to 0.01%, and pod ready times shrank to seconds.
During the 2020 Double‑11 event, Nydus powered core transaction services, enabling rapid batch scaling with zero failures.
Future Outlook
The project remains open source, and its maintainers actively participate in the OCI community’s OCIv2 discussions. Ant Group and Alibaba Cloud aim to make Nydus the reference implementation for the next‑generation OCI image format, improving stability, security, and ease of use across the cloud‑native ecosystem.
AntTech
Technology is the core driver of Ant's future creation.