How ByteFUSE Revolutionizes High‑Performance Cloud‑Native Storage with FUSE and RDMA
ByteFUSE, a user‑space FUSE‑based solution for ByteNAS, delivers low‑latency, high‑throughput, POSIX‑compatible storage across AI training, database backup, and search services by replacing NFS with a cloud‑native architecture that leverages CSI, RDMA, and kernel‑module hot‑upgrade techniques.
ByteFUSE is a project jointly developed by the ByteNAS and STE teams, offering high reliability, extreme performance, POSIX compatibility, and support for a wide range of scenarios such as online services, AI training, system disks, database backup, message queues, symbol tables, and compilation workloads. It is deployed at a scale of tens of thousands of machines with a total throughput close to 100 GB/s and capacity of dozens of petabytes.
Background
ByteNAS is a self‑developed, high‑performance, highly scalable distributed file system that fully complies with POSIX semantics and powers critical ByteDance services. Early external access relied on NFS via the TTGW layer‑4 load balancer, which imposed throughput limits, added extra network hops and machine costs, and made customization and performance tuning difficult.
Goal
High performance and low latency with a business‑friendly architecture.
Full POSIX semantics compatibility.
Support for one‑write‑multiple‑read and many‑write‑many‑read patterns.
Strong maintainability and customizable feature capabilities.
Evolution Roadmap
ByteFUSE 1.0 – Basic Features and Cloud‑Native Deployment
ByteFUSE 1.0 integrates native (kernel) FUSE with ByteNAS. (Architecture diagram omitted.)
The solution includes a ByteFUSE Daemon that embeds the ByteNAS SDK and forwards FUSE requests to the storage cluster. A CSI plugin built on the Kubernetes CSI specification enables cloud‑native deployment:
CSI‑Driver supports static volumes; mount/unmount operations launch or destroy a FUSE client, and the driver records mount‑point status for recovery.
In the 1.0 architecture, a separate FUSE client runs for every mount point.
ByteFUSE 2.0 – Cloud‑Native Architecture Upgrade
The architecture moves to a single Daemon model, separating the FUSE Daemon from the CSI‑Driver. This allows resource reuse across mount points and enables smooth CSI upgrades without disrupting I/O.
Support for Kata containers is added via VDUSE, a ByteDance‑developed framework that uses virtio and shared‑memory communication instead of the traditional /dev/fuse character device, providing crash recovery and hot‑upgrade capabilities.
Consistency model enhancements introduce a CTO (Close‑to‑Open) model and five configurable consistency options.
Daemon high availability is achieved through VDUSE’s inflight I/O tracking, which persists in‑flight requests and enables crash recovery and hot upgrades.
Kernel module hot‑upgrade is realized with DKMS, allowing automatic recompilation on kernel updates and multi‑version coexistence of kernel modules.
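A minimal dkms.conf for such a module might look like the following; the module name, version, and paths are assumptions for illustration, not ByteFUSE's actual packaging. `AUTOINSTALL="yes"` is what makes DKMS rebuild the module automatically whenever a new kernel is installed, and per‑version source trees under /usr/src allow multiple module versions to coexist.

```shell
# /usr/src/bytefuse-2.0/dkms.conf  (module name and version are assumed)
PACKAGE_NAME="bytefuse"
PACKAGE_VERSION="2.0"

# Rebuild the module automatically when a new kernel is installed.
AUTOINSTALL="yes"

BUILT_MODULE_NAME[0]="bytefuse"
DEST_MODULE_LOCATION[0]="/updates/dkms"
```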
ByteFUSE 3.0 – Extreme Performance Optimization
The 3.0 version adopts a Run‑to‑Completion (RTC) thread model, eliminating the four thread switches per request that the 2.0 architecture incurred. A shared‑nothing design and non‑blocking synchronization ensure the RTC thread never blocks, reducing latency.
RDMA and a user‑space protocol stack (Tarzan) replace the kernel TCP/IP stack, cutting context switches, data copies, and CPU usage.
Full‑link zero‑copy is achieved by registering the RDMA/Tarzan DMA buffer with VDUSE as umem, eliminating the copy between the bounce buffer and the DMA buffer; the only copy left on the path is from the FUSE page cache to the bounce buffer.
Kernel FUSE optimizations include multi‑queue support, huge‑page usage, and increased maximum I/O sizes (8 MB data, 32 KB directory reads). These changes raise metadata performance by ~25 % and data throughput by 2.5×, with a 1 MB write latency reduction of several hundred microseconds.
Metadata performance was measured with: mdtest -d /mnt/mdtest/ -b 6 -I 8 -z 4 -i 30
Use Cases
ES Compute‑Storage Separation
ByteFUSE provides high‑throughput, low‑latency storage for Elasticsearch shard replicas, reducing storage costs by tens of millions of yuan annually.
AI Training
In AI training scenarios, ByteFUSE meets the demand for ultra‑high throughput and low latency, avoiding NFS‑related failures and performance bottlenecks.
Other Business Scenarios
Database backup, message queue, symbol table, and compilation workloads have migrated from NFS to ByteFUSE to overcome TTGW throughput and stability limitations.
Future Outlook
Expand ByteFUSE to B2B scenarios requiring ultra‑low latency and ultra‑high throughput.
Add non‑POSIX semantics and custom interfaces such as I/O fencing.
Develop a FUSE PageCache Extension to allow user‑space cache access.
Further enhance kernel module hot‑upgrade for seamless updates of existing volumes.
Support GPU Direct Storage to enable direct data transfer between RDMA NICs and GPUs.
References
https://kubernetes-csi.github.io/docs/
https://www.redhat.com/en/blog/introducing-vduse-software-defined-datapath-virtio
https://juejin.cn/post/7171280231238467592
https://lore.kernel.org/lkml/[email protected]/
https://lwn.net/Articles/900178/
https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.