How PolarFS Achieves Ultra‑Low Latency and High Reliability for Cloud‑Native Databases
PolarFS is a user‑space, ultra‑low‑latency distributed file system designed for POLARDB that leverages RDMA, NVMe SSDs, and a novel ParallelRaft protocol to deliver near‑local‑SSD performance, strong consistency, and seamless failover in a cloud‑native environment.
Background
PolarFS is a distributed file system built to support POLARDB, a cloud‑native database that separates compute and storage. By moving the I/O stack to user space and exploiting RDMA and NVMe SSDs, PolarFS reduces end‑to‑end latency to levels comparable with a local PCIe SSD.
Design Goals
Separate hardware for compute and storage nodes, allowing independent customization.
Aggregate storage across nodes into a single pool, reducing fragmentation and enabling horizontal scaling.
Provide high availability and reliability for database instances, simplifying migration and failover.
Enable cloud‑database services to benefit from virtualized compute environments and enhanced features such as multi‑read replicas and snapshots.
System Architecture
PolarFS consists of two management layers: virtualized storage resource management (providing logical volumes for each database instance) and metadata management (handling file operations and concurrency).
Key Components
libpfs: a lightweight user‑space library that replaces the standard file‑system interface, keeping the entire I/O path in user space.
PolarSwitch: a daemon on compute nodes that forwards I/O requests to the appropriate ChunkServer.
ChunkServer: runs on storage nodes, manages I/O for each Chunk, uses a hybrid 3D XPoint + NVMe SSD WAL buffer, and replicates writes via the custom ParallelRaft protocol.
PolarCtrl: the control‑plane master that monitors ChunkServers, manages volume creation, chunk layout, and metadata, and performs periodic CRC checks.
Storage Organization
PolarFS organizes storage into three layers:
Volume: logical storage space per database instance (10 GB–100 TB) containing filesystem metadata, journal, and Paxos files.
Chunk: the smallest data distribution unit, each stored on a single NVMe SSD (typical size 10 GB), reducing metadata overhead and enabling efficient load balancing.
Block: 64 KB units within a Chunk, dynamically mapped and cached in memory for fast I/O.
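To make the three-layer layout concrete, here is a minimal sketch of how a byte offset inside a volume could be resolved to a chunk and a block. It assumes that a volume is a simple linear concatenation of fixed-size chunks and that "GB" and "KB" mean binary units; the function name locate and the constants are hypothetical and are not taken from PolarFS itself.

```python
from typing import Tuple

# Sizes taken from the text; binary units are an assumption.
CHUNK_SIZE = 10 * 1024**3   # 10 GB per chunk (typical size)
BLOCK_SIZE = 64 * 1024      # 64 KB per block
BLOCKS_PER_CHUNK = CHUNK_SIZE // BLOCK_SIZE


def locate(volume_offset: int) -> Tuple[int, int, int]:
    """Map a byte offset in a volume to (chunk index, block index, offset in block).

    Hypothetical helper: assumes chunks are laid out back to back in the volume.
    """
    if volume_offset < 0:
        raise ValueError("offset must be non-negative")
    chunk_index, offset_in_chunk = divmod(volume_offset, CHUNK_SIZE)
    block_index, offset_in_block = divmod(offset_in_chunk, BLOCK_SIZE)
    return chunk_index, block_index, offset_in_block
```

Under these assumptions each chunk holds 163,840 blocks, so even a 100 TB volume needs only about ten thousand chunk entries at the volume level, which is consistent with the low metadata overhead the chunk size is meant to provide.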
I/O Flow
A write request from POLARDB travels through libpfs to PolarSwitch, which maps it to the target Chunk and forwards it to the primary ChunkServer. The request is placed in a pre‑allocated buffer, written to the WAL via SPDK, replicated to follower ChunkServers using RDMA, and finally applied to the data block after majority acknowledgment.
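The majority-acknowledgment step of this flow can be sketched as follows. This is an in-memory illustration only: the Replica class and replicate_write function are hypothetical names, a Python list stands in for the SPDK-backed WAL, and a direct method call stands in for the RDMA transfer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Hypothetical stand-in for a ChunkServer replica."""
    name: str
    healthy: bool = True
    wal: List[bytes] = field(default_factory=list)

    def append_wal(self, payload: bytes) -> bool:
        """Persist the entry to the local WAL; True acts as the acknowledgment."""
        if not self.healthy:
            return False
        self.wal.append(payload)
        return True


def replicate_write(leader: Replica, followers: List[Replica], payload: bytes) -> bool:
    """Return True once a majority of replicas (leader included) hold the entry."""
    if not leader.append_wal(payload):
        return False
    acks = 1 + sum(1 for f in followers if f.append_wal(payload))
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority
```

With three replicas, a write succeeds when the leader and at least one follower have persisted it, so one slow or failed follower does not block the request.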
ParallelRaft Protocol
To overcome Raft’s serialization bottleneck under high concurrency, PolarFS introduces ParallelRaft, which relaxes strict ordering while preserving safety properties. Log entries that do not overlap in storage range can be committed and applied out of order; conflicting entries are serialized. A look‑behind buffer records recent LBA modifications to detect conflicts, enabling safe out‑of‑order application.
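The conflict check can be illustrated with a small sketch. It assumes each log entry carries the LBA ranges written by the previous N entries (the look‑behind buffer) and that a follower may apply an entry once every overlapping earlier entry in that window has been applied. The names LogEntry, build_log, and can_apply, and the window size n, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Range = Tuple[int, int]  # (first LBA, number of blocks)


@dataclass
class LogEntry:
    index: int
    lba_range: Range
    look_behind: Dict[int, Range]  # index -> range written by each of the previous N entries


def overlaps(a: Range, b: Range) -> bool:
    """True if the two half-open LBA ranges share at least one block."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def build_log(writes: List[Range], n: int) -> List[LogEntry]:
    """Attach to each entry the ranges of the n entries that precede it."""
    log = []
    for i, rng in enumerate(writes):
        behind = {j: writes[j] for j in range(max(0, i - n), i)}
        log.append(LogEntry(i, rng, behind))
    return log


def can_apply(entry: LogEntry, applied: Set[int]) -> bool:
    """An entry is safe to apply if every overlapping earlier entry is already applied."""
    return all(
        j in applied or not overlaps(entry.lba_range, rng)
        for j, rng in entry.look_behind.items()
    )
```

For example, with writes to LBA ranges (0, 8), (100, 8), and (4, 8), the second entry can be applied before the first because the ranges are disjoint, while the third must wait until the first has been applied.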
Centralized Control with Local Autonomy
PolarCtrl acts as a centralized master for metadata and resource management, while ChunkServers operate autonomously, handling replication and leader election locally via ParallelRaft. This hybrid design avoids a single point of failure and minimizes metadata I/O.
Performance Evaluation
Benchmarks using Sysbench show that POLARDB on PolarFS achieves write latency close to a single‑node SSD and significantly higher TPS compared to traditional RDS offerings, while maintaining strong data reliability.
Snapshots and Failover
PolarFS provides instant filesystem snapshots built from per‑ChunkServer local snapshots, enabling rapid logical backups of massive databases. The shared‑access design allows read‑only instances to serve queries without lock contention, and when the write instance fails, a read‑only instance can be promoted to a writable node without data inconsistency.
Conclusion
PolarFS demonstrates that a purpose‑built, user‑space, cloud‑native distributed file system can deliver ultra‑low latency, high availability, and seamless integration with cloud databases, paving the way for future optimizations with emerging hardware such as NVM and FPGA.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
