
How Virtio-fs Achieves Crash Recovery for High‑Availability Secure Containers

This article explains the design of Virtio-fs, its architecture and high‑availability features, and details the crash‑recovery mechanism—including crash models, state preservation, supervisor coordination, request idempotence, downtime optimization, and hot upgrade/migration—implemented by ByteDance's STE team for secure container workloads.

ByteDance SYS Tech

Virtio-fs Background

Virtio-fs, proposed by Red Hat in 2018, enables sharing host directories with virtual machines, addressing the storage challenges of secure containers such as Kata Containers, which run runc containers inside lightweight VMs and cannot directly mount host directories.

Virtio-fs Architecture

Virtio-fs is based on the FUSE protocol and adds a virtio transport layer to the kernel FUSE module. The VM kernel sends FUSE requests to a host-side virtiofsd daemon, which executes them against the host filesystem, providing POSIX‑compatible, high‑performance directory sharing.

Virtio-fs High‑Availability Features

The open‑source Virtio-fs lacks built‑in HA features: when the virtiofsd process crashes, VM applications block on I/O and the VM becomes unavailable. ByteDance's STE team implemented crash recovery and is exploring hot upgrade and live migration.

Crash Model

Crashes are modeled as fail‑stop errors caused by external termination, the OOM killer, or process bugs. Restoring the VM after a crash requires restoring virtiofsd's state to what it was at the point of failure.

State Preservation

Three key state components are saved:

In‑flight requests: Unfinished FUSE/virtio requests are logged via the vhost‑user inflight I/O tracking feature and replayed after restart.

Guest inode ↔ host file mapping: A flat‑map structure records dynamic inode and file descriptor mappings for later restoration.

File descriptors: Open file descriptors are saved as global file handles using name_to_handle_at(2) and restored with open_by_handle_at(2).

Crash Recovery Process

1. The supervisor detects a virtiofsd crash and launches a new daemon.
2. The new daemon loads persisted state and listens on the vhost‑user socket.
3. QEMU reconnects, renegotiates memory layout and virtqueue addresses, and sends the inflight log.
4. The daemon reprocesses unfinished requests.
5. Normal file operations resume, completing the recovery.

Idempotence and Consistency

Replaying unfinished requests can violate idempotence and consistency because some operations may have already altered internal state. The team classified FUSE request types and applied specific handling, such as relaxing error checks or adding lightweight logging, to ensure safe re‑execution.

Downtime Optimization

Two optimizations reduce recovery downtime: (1) millisecond‑level retry intervals for vhost‑user socket reconnection, and (2) on‑demand file‑handle restoration that spreads the cost over the first file access after recovery, achieving sub‑100 ms downtime.

Hot Upgrade and Migration

Hot upgrade can be performed by terminating the old daemon and starting an upgraded one, leveraging the crash‑recovery path. A dedicated control channel between the supervisor and daemon enables state‑preserving upgrades without full reconnection, further reducing downtime. For live migration, the team extended the approach to transfer saved state and coordinated with QEMU’s vhost‑user‑fs device migration.

Outlook

The crash‑recovery feature has been submitted as an RFC patchset to the QEMU and Virtio‑fs upstream projects, with community feedback incorporated. Future work will continue to explore deeper integration of Virtio‑fs with FUSE and virtualization file‑system research.

Tags: High Availability, Container Security, Virtualization, FUSE, KVM, crash recovery, virtio-fs
Written by

ByteDance SYS Tech

Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.
