
How InfiniFS Revolutionizes Metadata for Billion-File Distributed Filesystems

This article summarizes the InfiniFS paper, detailing how its access‑content decoupling, speculative path resolution, and optimistic metadata caching enable efficient metadata handling for data‑center‑scale file systems supporting billions of files.

Big Data Technology Tribe

Abstract

The paper proposes InfiniFS, a metadata service designed for ultra-large distributed file systems that manage billions of files. It addresses three core challenges: partitioning the directory tree for both locality and load balance, high path-resolution latency, and near-root hotspots. InfiniFS tackles these with three techniques: decoupling directory metadata into access and content parts, speculative path resolution, and an optimistic access-metadata cache. Experiments show InfiniFS achieves higher throughput and lower latency than existing systems.

Background and Motivation

Modern data centers aim to run a single file‑system instance across the entire facility to serve hundred‑billion‑file workloads. This raises metadata‑service challenges such as balancing locality with load distribution, deep‑directory path latency, and hotspot contention near the root.

2.1 Large‑Scale File Systems

File systems provide a hierarchical namespace where each file or directory carries metadata. Path resolution and metadata processing are the two fundamental steps when accessing a file like /home/Alice/paper.tex. A single, global file system offers three advantages:

Global data sharing: a unified namespace eliminates cross-cluster data duplication.

Higher resource utilization: idle capacity in one cluster can be used by others.

Reduced operational complexity: maintaining one system is far simpler than thousands of independent clusters.

2.2 Scalable Metadata Challenges

Challenge 1: Achieving both high metadata locality and good load balancing is difficult as the directory tree grows.

Challenge 2: Deep directory structures cause path-resolution latency to grow linearly with depth; real workloads show many files nested more than ten levels deep.

Challenge 3: Path resolution repeatedly reads directories near the root, creating hotspots, and traditional client-side caches (e.g., lease-based) struggle to stay consistent under massive concurrency.

2.3 Real Workload Characteristics

Analysis of production workloads from three Pangu file‑system instances at Alibaba reveals:

File operations constitute 95.8% of all metadata ops.

Directory readdir accounts for ~93.3% of directory ops.

Rename and set_permission are negligible (~0.0083%).

Design and Implementation

InfiniFS’s architecture rests on three core technologies:

Decoupled directory metadata: separates access metadata (name, ID, permissions) from content metadata (entry list, timestamps).

Speculative path resolution: generates predictable directory IDs by hashing the parent ID, name, and version, allowing clients to predict IDs along a path and issue lookups in parallel.

Lazy-invalidating client cache: clients cache directory access metadata optimistically; servers lazily verify cache validity when handling each request.
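The speculation idea can be sketched in a few lines. This is an illustrative scheme, not the paper's exact hash construction: directory IDs are derived deterministically from the parent's ID, the component name, and a rename version, so a client can predict every ID on a path locally and fire off the per-component lookups concurrently instead of walking the path one round trip at a time.

```python
import hashlib

def predict_dir_id(parent_id: int, name: str, version: int = 0) -> int:
    """Derive a directory's ID from its parent ID, name, and rename version.
    Deterministic, so any client computes the same ID without asking a server.
    (Hash layout is illustrative; the paper's exact scheme may differ.)"""
    digest = hashlib.sha256(f"{parent_id}/{name}/{version}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def speculate_path_ids(path: str, root_id: int = 0):
    """Predict the ID of every directory component on a path, enabling the
    per-component metadata lookups to be issued in parallel."""
    ids = []
    parent = root_id
    for name in path.strip("/").split("/"):
        parent = predict_dir_id(parent, name)
        ids.append((name, parent))
    return ids
```

If a rename has bumped a directory's version, the speculated ID misses and the client falls back to step-by-step resolution for that component, which is why the version participates in the hash.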

3.1 Overview

The system consists of:

Clients that access InfiniFS via a user‑space library or FUSE, employing speculative path resolution and optimistic caching.

Metadata servers that store access‑content decoupled partitions in a KV store, keep data in memory, and log updates to NVMe SSDs, using an invalidation list for lazy validation.

A rename coordinator that serializes rename and permission‑setting operations and propagates invalidations.
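The invalidation-list mechanism can be sketched as follows, under the assumption (names and fields are invented for illustration) that each client request carries the version of its cached access metadata and the server compares it against versions recorded by the rename coordinator, rather than pushing invalidations to clients eagerly:

```python
class MetadataServer:
    """Sketch of lazy cache validation: clients cache optimistically and the
    server rejects requests whose cached metadata version is stale."""

    def __init__(self):
        # dir_id -> latest version recorded after a rename / set_permission
        self.invalidated = {}

    def handle(self, dir_id: int, cached_version: int) -> dict:
        latest = self.invalidated.get(dir_id)
        if latest is not None and cached_version < latest:
            # Stale cache entry: tell the client to refetch, then retry.
            return {"ok": False, "refresh_to": latest}
        return {"ok": True}  # cache still valid; serve the request

    def on_rename(self, dir_id: int, new_version: int):
        # Invoked via the rename coordinator: record the bumped version so
        # later requests carrying old cached metadata are rejected lazily.
        self.invalidated[dir_id] = new_version
```

Because rename and set_permission are rare (~0.0083% of operations in the measured workloads), the invalidation list stays small and the common-case request pays only a dictionary lookup.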

3.2 Access‑Content Decoupled Partitioning

Traditional fine‑ or coarse‑grained partitioning cannot simultaneously achieve locality and balance because they treat directory metadata as a monolith. InfiniFS observes that directory metadata naturally splits into two independent parts, enabling independent grouping for locality and partitioning for load distribution.

Metadata operations are classified into three groups:

Operations affecting only the target file/directory (e.g., open, stat).

Operations affecting the target and its parent (e.g., create, delete, readdir).

Rename operations affecting two files/directories and their parents.

By grouping each directory’s content metadata with its children’s access metadata, InfiniFS forms “per‑directory groups” that preserve locality for the majority (~90%) of operations while allowing fine‑grained hash‑based partitioning for load balancing. Consistent hashing maps these groups to servers, minimizing data movement during scaling.
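The placement step can be sketched with a standard consistent-hash ring (a minimal sketch, not InfiniFS's actual implementation; server names and the vnode count are invented). The partition key is the directory ID, so an entire per-directory group lands on one server, preserving locality, while virtual nodes spread groups evenly and limit data movement when servers join or leave:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps per-directory metadata groups (a directory's content metadata
    plus its children's access metadata) to metadata servers."""

    def __init__(self, servers, vnodes: int = 64):
        # Each server owns `vnodes` points on the ring to smooth the load.
        self.ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def server_for(self, dir_id: int) -> str:
        # The whole group keyed by this directory ID maps to one server.
        idx = bisect.bisect(self.keys, _hash(str(dir_id))) % len(self.ring)
        return self.ring[idx][1]
```

Adding a server only reassigns the groups adjacent to its new ring positions, which is what keeps rebalancing cheap during scaling.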

Metadata is stored in a KV store with three key types:

Directory access metadata

Directory content metadata

File metadata

Lookup example:

1. Retrieve access metadata for <0, A> → ID = 1.
2. Retrieve access metadata for <1, B> → ID = 2.
3. Retrieve content metadata for <2> (directory B).
4. Retrieve file metadata for <2, file>.
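The same walk can be traced against a toy in-memory KV store. The key tuples and IDs below are invented for illustration; the point is that directory access entries are keyed by (parent ID, name), while content and file entries hang off the resolved directory ID:

```python
# Toy KV store holding the three key types for the path /A/B/file.
# Root directory has ID 0; A gets ID 1; B gets ID 2 (IDs are illustrative).
kv = {
    ("access", 0, "A"): {"id": 1, "perm": 0o755},   # A's access metadata
    ("access", 1, "B"): {"id": 2, "perm": 0o755},   # B's access metadata
    ("content", 2): {"entries": ["file"]},          # B's entry list
    ("file", 2, "file"): {"size": 4096},            # file metadata under B
}

def lookup(path: str) -> dict:
    parent = 0
    parts = path.strip("/").split("/")
    for name in parts[:-1]:                  # steps 1-2: resolve directories
        parent = kv[("access", parent, name)]["id"]
    _ = kv[("content", parent)]              # step 3: fetch content metadata
    return kv[("file", parent, parts[-1])]   # step 4: fetch file metadata
```

Note that only access metadata is touched while walking the path; content metadata is needed just once, at the final directory.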


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: metadata service, large-scale storage, distributed file systems, filesystem design, InfiniFS
Written by

Big Data Technology Tribe

Focused on computer science and cutting‑edge tech, we distill complex knowledge into clear, actionable insights. We track tech evolution, share industry trends and deep analysis, helping you keep learning, boost your technical edge, and ride the digital wave forward.
