How Virtio-fs Achieves Crash Recovery for High‑Availability Secure Containers
This article explains the design of Virtio-fs, its architecture and high‑availability features, and details the crash‑recovery mechanism—including crash models, state preservation, supervisor coordination, request idempotence, downtime optimization, and hot upgrade/migration—implemented by ByteDance's STE team for secure container workloads.
Virtio-fs Background Introduction
Virtio-fs, proposed by Red Hat in 2018, enables sharing host directories with virtual machines, addressing the storage challenges of secure containers such as Kata Containers, which run runc containers inside lightweight VMs and cannot directly mount host directories.
Virtio-fs Architecture
Based on the FUSE protocol, Virtio-fs adds a Virtio transport layer in the kernel FUSE module, allowing VM kernels to send FUSE requests to a host-side virtiofsd daemon, which processes them on the host filesystem, providing POSIX‑compatible, high‑performance directory sharing.
Virtio-fs High‑Availability Features
The open‑source Virtio-fs lacks built‑in HA features. When the virtiofsd process crashes, VM applications block on I/O, making the VM unavailable. The STE team implemented crash recovery and is exploring hot upgrade and live migration.
Crash Model
Crashes are modeled as fail‑stop errors caused by external termination, OOM killer, or process bugs. Restoring the VM after a crash requires resetting the virtiofsd state to the point of failure.
State Preservation
Three key state components are saved:
In‑flight requests: Unfinished FUSE/virtio requests are logged via the vhost‑user inflight I/O tracking feature and replayed after restart.
Guest inode ↔ host file mapping: A flat‑map structure records dynamic inode and file descriptor mappings for later restoration.
File descriptors: Open file descriptors are saved as global file handles using name_to_handle_at(2) and restored with open_by_handle_at(2).
Crash Recovery Process
1. The supervisor detects the virtiofsd crash and launches a new daemon.
2. The new daemon loads the persisted state and listens on the vhost‑user socket.
3. QEMU reconnects, renegotiates memory layout and virtqueue addresses, and sends the inflight log.
4. The daemon reprocesses the unfinished requests.
5. Normal file operations resume, completing the recovery.
Idempotence and Consistency
Replaying unfinished requests can violate idempotence and consistency because some operations may have already altered internal state. The team classified FUSE request types and applied specific handling, such as relaxing error checks or adding lightweight logging, to ensure safe re‑execution.
Downtime Optimization
Two optimizations reduce recovery downtime: (1) millisecond‑level retry intervals for vhost‑user socket reconnection, and (2) on‑demand file‑handle restoration that spreads the cost over the first file access after recovery, achieving sub‑100 ms downtime.
Hot Upgrade and Migration
Hot upgrade can be performed by terminating the old daemon and starting an upgraded one, leveraging the crash‑recovery path. A dedicated control channel between the supervisor and daemon enables state‑preserving upgrades without full reconnection, further reducing downtime. For live migration, the team extended the approach to transfer saved state and coordinated with QEMU’s vhost‑user‑fs device migration.
Outlook
The crash‑recovery feature has been submitted as an RFC patchset to the QEMU and Virtio‑fs upstream projects, with community feedback incorporated. Future work will continue to explore deeper integration of Virtio‑fs with FUSE and virtualization file‑system research.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.