
Dragonfly: Alibaba's P2P Large‑Scale File and Container Image Distribution System

Dragonfly is Alibaba's P2P‑based infrastructure for massive file and container image distribution. By forming peer networks among hosts and applying smart compression and flow control, it reduces registry traffic by over 99% and lets tens of thousands of servers receive multi‑gigabyte files simultaneously during peak events.

Alibaba Cloud Infrastructure

During the 2017 Tmall Double‑11 event, Alibaba reached a transaction peak of 325,000 transactions per second and a database processing peak of 42 million operations per second; Dragonfly (蜻蜓) was used to push a 5 GB data file to tens of thousands of servers simultaneously, demonstrating its large‑scale file distribution capability.

The Birth of Dragonfly

Rapid growth of Alibaba services in 2015 pushed daily release volumes past 20,000, overwhelming file servers and network bandwidth. Adding more servers only shifted the bottleneck to storage and cross‑IDC bandwidth, prompting the development of a P2P‑based solution: Dragonfly.

Design Goals

Alleviate file‑source overload by forming P2P networks among hosts.

Accelerate distribution while keeping per‑host download speed stable.

Reduce cross‑region bandwidth consumption.

Support large‑file download with resumable capability.

Control host disk I/O and network I/O to avoid impacting business workloads.

System Architecture

Dragonfly consists of three layers:

Config Service: manages all Cluster Managers, provides hosts with the nearest Cluster Manager list, and handles system configuration.

Cluster Manager (CM): downloads files from the source, creates torrent‑like chunk metadata, and schedules P2P data exchange among peers.

Host (dfget): a wget‑like client that downloads files, participates in P2P sharing, and supports resumable downloads.

Hosts receive dfget commands via Alibaba's StarAgent, allowing a single command to trigger simultaneous downloads on thousands of machines. A Java SDK is also available for programmatic file push.

P2P Networking Logic

When multiple hosts request the same file, the CM checks its local cache; if the file is absent, the CM downloads it from the source, splits it into chunks, and serves those chunks to the hosts. Hosts download chunks concurrently, share completed chunks with peers, and record progress in metadata so interrupted downloads can resume. After completion, an MD5 check ensures file integrity. The CM also honors HTTP cache policies and periodically cleans its disk space.
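The chunk/verify/resume flow above can be sketched in Python. This is an illustrative sketch, not Dragonfly's actual code: the 4 MiB chunk size and all helper names are assumptions.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; Dragonfly picks its own

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a payload into fixed-size chunks, as the CM does for a cached file."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_metadata(chunks):
    """Per-chunk MD5 digests; a host persists these to make downloads resumable."""
    return [hashlib.md5(c).hexdigest() for c in chunks]

def missing_chunks(have, meta):
    """Chunks a restarted host must still fetch: absent, or failing their digest."""
    return [i for i, digest in enumerate(meta)
            if i >= len(have) or hashlib.md5(have[i]).hexdigest() != digest]

def verify_file(data: bytes, expected_md5: str) -> bool:
    """Whole-file MD5 check after all chunks are assembled."""
    return hashlib.md5(data).hexdigest() == expected_md5
```

A host that crashed mid‑download compares the stored digests against the chunks it already has on disk and re‑fetches only the missing or corrupted ones, which is what makes the download resumable.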

Performance Evaluation

Tests comparing traditional download with Dragonfly P2P download show that traditional mode's download time grows with client count, while Dragonfly sustains performance up to 7,000 concurrent clients. Beyond 1,200 clients, traditional mode stalls due to source overload, whereas Dragonfly continues smoothly.

From Release System to Infrastructure

After Double‑11 2015, Dragonfly handled 120,000 downloads per month (4 TB). By Double‑11 2016, it reached 140 million downloads per month (708 TB). By April 2017, it covered over 90% of Alibaba's large‑scale file distribution, processing 300 million downloads per week and 977 TB of data.

Alibaba’s Container Technology

Alibaba’s container runtime, Pouch, evolved from the LXC‑based T4 (2011) and is now open‑source, powering almost all business workloads. Container images consist of layered filesystems; each layer is identified by an ID and size.

Image Distribution Challenges

Traditional Docker pull fetches each image layer directly from the registry, causing the registry to become a bottleneck when thousands of hosts request the same image, especially across regions.

Dragonfly integrates via a dfget proxy that intercepts Docker pull requests, obtains or creates chunk tasks from the CM, and distributes image layers using the same P2P mechanism.
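The key routing decision such a proxy makes can be sketched as follows. The regex and function are illustrative assumptions, not Dragonfly's implementation; the blob path shape comes from the Docker Registry HTTP API, where image layers are fetched as content‑addressed blobs.

```python
import re

# Registry blob pulls look like /v2/<repo>/blobs/sha256:<64 hex digits>.
BLOB_RE = re.compile(r"^/v2/(?P<repo>.+)/blobs/(?P<digest>sha256:[0-9a-f]{64})$")

def route_request(path: str) -> str:
    """Send image-layer (blob) pulls through the P2P network; pass everything
    else (manifests, tag lists, auth) straight through to the registry."""
    return "p2p" if BLOB_RE.match(path) else "passthrough"
```

Because only the large, content‑addressed layers go through P2P, the integration stays non‑intrusive: the Docker daemon and the registry see ordinary HTTP, and small metadata requests are unaffected.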

Design Objectives for Image Distribution

Support tens of thousands of concurrent pulls.

Non‑intrusive to Docker, Registry, or other runtimes.

Compatible with Docker, Pouch, rkt (Rocket), Hyper, etc.

Enable image pre‑warming.

Handle images up to 30 GB.

Provide security (HTTP headers, symmetric encryption).

Native Docker vs. Dragonfly

Two experiments were conducted:

Single‑client tests showed comparable latency, with Dragonfly’s smart compression slightly faster.

Multi‑client tests (10, 200, and 1,000 concurrent clients) demonstrated up to a 20× speed‑up, and up to 57× when source bandwidth was limited, while reducing registry outbound traffic by over 99.5%.

Alibaba Practice Results

Dragonfly now processes nearly 2 billion distribution events per month, moving 3.4 PB of data, with container image distribution accounting for about half of the traffic.

Intelligent Features

Smart Flow Control: dynamically adjusts disk and network I/O limits based on real‑time workload analysis.
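A minimal stand‑in for this idea, under stated assumptions: the real system analyzes many interference signals, while this sketch throttles the download rate from a single business I/O utilization figure in [0, 1], never dropping below a small floor so transfers still make progress.

```python
def adjusted_rate(base_rate_mbps: float, business_io_util: float,
                  floor_mbps: float = 5.0) -> float:
    """Scale the P2P download rate down as business disk/network utilization
    rises, so distribution never starves the co-located workload.

    business_io_util: observed utilization of business I/O, in [0, 1].
    floor_mbps: minimum rate kept so downloads still complete eventually.
    """
    rate = base_rate_mbps * (1.0 - business_io_util)
    return max(rate, floor_mbps)
```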

Smart Scheduling: uses multi‑dimensional data (hardware, location, network, historical rates) and gradient‑descent algorithms to assign optimal chunk tasks.
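At its core this is a scoring problem over candidate peers. The sketch below uses fixed, made‑up weights and feature names; in Dragonfly the weights would be learned (e.g. by gradient descent over historical transfer data) rather than hand‑picked.

```python
def peer_score(peer: dict, weights: dict) -> float:
    """Weighted sum over peer features; higher means a better chunk source."""
    return sum(weights[k] * peer[k] for k in weights)

def pick_peer(peers, weights):
    """Assign the chunk task to the highest-scoring peer."""
    return max(peers, key=lambda p: peer_score(p, weights))
```

For example, weighting same‑IDC locality heavily steers chunk traffic toward nearby peers, which is how scheduling also serves the goal of cutting cross‑region bandwidth.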

Smart Compression: compresses only the file portions with high compression gain, achieving a ~40% compression ratio and up to 60% traffic reduction at 1,000 concurrent pulls.

Security: supports HTTP header authentication and symmetric encryption for sensitive files.

Open Source: Dragonfly's code is released on GitHub to foster community collaboration.

Conclusion

By combining P2P technology with intelligent compression, flow control, and security, Dragonfly solves large‑scale file and container image distribution challenges, delivering up to a 57× speed‑up and reducing registry traffic by over 99.5%. It has become a core infrastructure component supporting Alibaba's rapid business expansion and massive Double‑11 promotions.
