Fundamentals 14 min read

Inside Taobao’s High‑Performance Distributed File System (TFS): Architecture & Scaling

This article explains the design, storage mechanisms, high‑availability architecture, scaling strategies, multi‑data‑center disaster recovery, operational management, and future plans of Taobao’s distributed file system (TFS), a highly available and scalable storage solution for massive unstructured data.

21CTO

Nov 14, 2015

Inside Taobao’s High‑Performance Distributed File System (TFS): Architecture & Scaling

Architecture Overview

FS (Taobao File System, TFS) is a highly available, high‑performance, and highly scalable distributed file system built on ordinary Linux servers, providing massive unstructured data storage. It is widely used across Taobao services, with the largest deployed cluster storing nearly a hundred billion files. TFS is open‑sourced on TaoCode.

Storage Mechanism

TFS clusters consist of a name server (nameserver) and data servers (dataservers). Data is stored and organized in blocks (default 64 MiB, configurable). Multiple small files share a block, which has an index for fast location. Each block is replicated across different racks for reliability. Blocks have globally unique IDs assigned by the nameserver; files within a block have unique file IDs assigned by the dataserver.

Nameservers run in HA mode with two servers sharing a virtual IP. The active server handles requests and synchronizes changes to the standby; an HA agent monitors health and fails over the VIP if needed.

Each dataserver runs multiple processes, each managing a disk. Upon startup, a dataserver reports its blocks to the nameserver and sends periodic heartbeats. The nameserver tracks heartbeats and replicates blocks from failed dataservers to maintain the configured replica count.

All metadata resides in memory on the nameserver. It builds a block‑to‑server map to allocate writable blocks (using a simple round‑robin strategy) and to locate blocks for read requests. Background threads monitor block health and rebalance load by migrating data when necessary.

When a client writes a file, the nameserver assigns a block, and the client receives a filename composed of cluster ID, block ID, and file ID. The client can later read or delete the file using this name. Deletions are marked; when a block’s deleted‑file ratio exceeds a threshold, the block is compacted during low‑traffic periods.

Clients cache block‑to‑dataserver mappings locally to reduce nameserver load. If the cache is stale, the client falls back to the nameserver. For higher hit rates, TFS can cache mappings in Tair, a distributed key/value store.

TFS also supports custom filenames and large‑file storage. Custom filenames are mapped to internal TFS names via a dedicated metadata server. Large files are split into 2 MiB chunks, each stored as a separate TFS file; the client reassembles them on read.

Standard C++ and Java clients are provided. To simplify client upgrades, an Nginx module proxies all TFS read/write requests and offers a RESTful interface, making backend upgrades transparent to users.

Smooth Scaling

When new machines are added, deploying dataserver processes on them automatically integrates them into the cluster. The nameserver creates new blocks on the added dataservers, which then begin serving read/write requests.

Because TFS front‑ends cache via CDN, access patterns are random with little hotspot. The nameserver balances load by migrating data from heavily used dataservers to newly added ones, keeping capacity usage across the cluster roughly even.

Data Center Disaster Recovery

TFS uses multi‑replica storage and can span multiple data centers. Each physical cluster synchronizes data to form a logical cluster. In a primary‑secondary setup, the primary provides read/write, the secondary read‑only; logs are replayed to keep them in sync.

For active‑active multi‑data‑center deployments, each cluster assigns distinct block ID ranges (e.g., odd IDs to one cluster, even IDs to another) to avoid write conflicts. Clients fail over to the nearest cluster and retry other clusters if a file is unavailable.

Operations Management

All cluster metadata is stored in MySQL and managed by a resource‑management server (rcserver). Configuration templates define replica counts per cluster. When a new cluster is launched, rcserver generates its configuration from the template.

Applications receive an appkey and resource allocation. Clients fetch configuration from rcserver, report usage statistics, and receive updates (e.g., switching to a healthy cluster). Monitoring agents detect hardware failures and trigger automatic failover or capacity‑based scaling.

Future Work

Future development focuses on improving efficiency and reducing storage and operational costs. TFS plans to adopt erasure coding to replace traditional replica‑based backup, potentially cutting storage costs by 25‑50%.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Taobao High Availability Distributed File System scalable storage TFS

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.