How ByteFUSE Revolutionizes High‑Performance Cloud‑Native Storage with FUSE and RDMA
ByteFUSE, a user‑space FUSE‑based solution for ByteNAS, delivers low‑latency, high‑throughput, POSIX‑compatible storage across AI training, database backup, and search services by replacing NFS with a cloud‑native architecture that leverages CSI, RDMA, and kernel‑module hot‑upgrade techniques.
ByteFUSE is a project jointly developed by the ByteNAS and STE teams, offering high reliability, extreme performance, POSIX compatibility, and support for a wide range of scenarios such as online services, AI training, system disks, database backup, message queues, symbol tables, and compilation workloads. It is deployed at a scale of tens of thousands of machines with a total throughput close to 100 GB/s and capacity of dozens of petabytes.
Background
ByteNAS is a self‑developed, high‑performance, highly scalable distributed file system that fully complies with POSIX semantics and powers critical ByteDance services. Early external access relied on NFS via the TTGW layer‑4 load balancer, which imposed throughput limits, added extra network hops and machine costs, and made customization and performance tuning difficult.
Goal
High performance and low latency with a business‑friendly architecture.
Full POSIX semantics compatibility.
Support for one‑write‑multiple‑read and many‑write‑many‑read patterns.
Strong maintainability and customizable feature capabilities.
Evolution Roadmap
ByteFUSE 1.0 – Basic Features and Cloud‑Native Deployment
ByteFUSE 1.0 integrates native (kernel) FUSE with ByteNAS. (Architecture diagram omitted.)
The solution includes a ByteFUSE Daemon that embeds the ByteNAS SDK and forwards FUSE requests to the storage cluster. A CSI plugin built on the Kubernetes CSI specification enables cloud‑native deployment:
CSI‑Driver supports static volumes; mount/unmount operations launch or destroy a FUSE client, and the driver records mount‑point status for recovery.
In the 1.0 architecture, a separate FUSE client runs for every mount point.
ByteFUSE 2.0 – Cloud‑Native Architecture Upgrade
The architecture moves to a single Daemon model, separating the FUSE Daemon from the CSI‑Driver. This allows resource reuse across mount points and enables smooth CSI upgrades without disrupting I/O.
Support for Kata containers is added via VDUSE, a ByteDance‑developed framework that uses virtio and shared‑memory communication instead of the traditional /dev/fuse character device, providing crash recovery and hot‑upgrade capabilities.
Consistency model enhancements introduce a CTO (Close‑to‑Open) model and five configurable consistency options.
Daemon high availability is achieved through VDUSE’s inflight I/O tracking, which persists in‑flight requests and enables crash recovery and hot upgrades.
Kernel module hot‑upgrade is realized with DKMS, allowing automatic recompilation on kernel updates and multi‑version coexistence of kernel modules.
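A minimal dkms.conf for such a module might look like the following; the module name, version, and paths are assumptions for illustration, not ByteFUSE's actual packaging. `AUTOINSTALL="yes"` is what makes DKMS rebuild the module automatically whenever a new kernel is installed, and per‑version source trees under /usr/src allow multiple module versions to coexist.

```shell
# /usr/src/bytefuse-2.0/dkms.conf  (module name and version are assumed)
PACKAGE_NAME="bytefuse"
PACKAGE_VERSION="2.0"

# Rebuild the module automatically when a new kernel is installed.
AUTOINSTALL="yes"

BUILT_MODULE_NAME[0]="bytefuse"
DEST_MODULE_LOCATION[0]="/updates/dkms"
```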
ByteFUSE 3.0 – Extreme Performance Optimization
The 3.0 version adopts a Run‑to‑Completion (RTC) thread model, eliminating the four thread switches per request that the 2.0 architecture incurred. A shared‑nothing design and non‑blocking synchronization ensure the RTC thread never blocks, reducing latency.
RDMA and a user‑space protocol stack (Tarzan) replace the kernel TCP/IP stack, cutting context switches, data copies, and CPU usage.
Full‑link zero‑copy is achieved by registering the RDMA/Tarzan DMA buffer with VDUSE as umem, eliminating the copy between the bounce buffer and the DMA buffer; the only copy left on the path is from the FUSE page cache to the bounce buffer.
Kernel FUSE optimizations include multi‑queue support, huge‑page usage, and increased maximum I/O sizes (8 MB data, 32 KB directory reads). These changes raise metadata performance by ~25 % and data throughput by 2.5×, with a 1 MB write latency reduction of several hundred microseconds.
Metadata performance was measured with: mdtest -d /mnt/mdtest/ -b 6 -I 8 -z 4 -i 30
Use Cases
ES Compute‑Storage Separation
ByteFUSE provides high‑throughput, low‑latency storage for Elasticsearch shard replicas, reducing storage costs by tens of millions of yuan annually.
AI Training
In AI training scenarios, ByteFUSE meets the demand for ultra‑high throughput and low latency, avoiding NFS‑related failures and performance bottlenecks.
Other Business Scenarios
Database backup, message queue, symbol table, and compilation workloads have migrated from NFS to ByteFUSE to overcome TTGW throughput and stability limitations.
Future Outlook
Expand ByteFUSE to B2B scenarios requiring ultra‑low latency and ultra‑high throughput.
Add non‑POSIX semantics and custom interfaces such as I/O fencing.
Develop a FUSE PageCache Extension to allow user‑space cache access.
Further enhance kernel module hot‑upgrade for seamless updates of existing volumes.
Support GPU Direct Storage to enable direct data transfer between RDMA NICs and GPUs.
References
https://kubernetes-csi.github.io/docs/
https://www.redhat.com/en/blog/introducing-vduse-software-defined-datapath-virtio
https://juejin.cn/post/7171280231238467592
https://lore.kernel.org/lkml/[email protected]/
https://lwn.net/Articles/900178/
https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.