
Inside Taobao’s High‑Performance Distributed File System (TFS): Architecture & Scaling

Taobao’s File System (TFS) is a highly available, high‑performance distributed storage solution built on Linux servers, featuring name‑server and data‑server clusters, block‑level replication, HA mechanisms, client caching, seamless scaling, multi‑data‑center disaster recovery, and open‑source support for C++, Java, and Nginx integration.

21CTO

Architecture Overview

TFS (Taobao File System) is a high‑availability, high‑performance, highly scalable distributed file system built on ordinary Linux servers, providing storage for massive amounts of unstructured data. It is widely used across Taobao services, with the largest deployed cluster storing nearly a hundred billion files. TFS is open‑sourced on TaoCode.

The TFS cluster consists of a name server (nameserver) and data servers (dataservers). Data is stored and organized in blocks (default 64 MB, configurable). Multiple small files share a block, which is indexed for fast location. Each block is replicated across different racks for reliability. Blocks have globally unique IDs assigned by the nameserver; files within a block have unique file IDs assigned by the dataserver, together uniquely identifying a file.
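
The way small files share a block can be sketched as follows. This is a minimal in-memory model, not the TFS on-disk format: the `Block` class, its field names, and the choice to start file IDs at 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical in-memory model of one block holding many small files."""
    block_id: int
    data: bytearray = field(default_factory=bytearray)
    index: dict = field(default_factory=dict)  # file_id -> (offset, length)
    next_file_id: int = 1                      # assumed starting value

    def append(self, payload: bytes) -> int:
        """Append a small file and return the file ID assigned within this block."""
        file_id = self.next_file_id
        self.next_file_id += 1
        self.index[file_id] = (len(self.data), len(payload))
        self.data.extend(payload)
        return file_id

    def read(self, file_id: int) -> bytes:
        """Locate a file through the block index instead of scanning the block."""
        offset, length = self.index[file_id]
        return bytes(self.data[offset:offset + length])


block = Block(block_id=1001)
fid_a = block.append(b"thumbnail-a")
fid_b = block.append(b"thumbnail-b")

# The pair (block ID, file ID) is what identifies a file across the cluster.
key = (block.block_id, fid_b)
```

Here the block ID stands in for the value the nameserver would assign, and the file ID for the value the dataserver would assign.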

High‑Availability Design

Nameservers run in HA mode as a pair of servers sharing a virtual IP (VIP). The active nameserver holds the VIP; an HA agent monitors both nodes and moves the VIP to the standby if the active node fails, so service continues without client reconfiguration.

Dataservers typically run multiple processes per machine, each managing a disk to maximize I/O. Upon startup, a dataserver reports its blocks to the nameserver and sends periodic heartbeats. If a dataserver stops reporting, the nameserver replicates its blocks to maintain the configured replica count.
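
The heartbeat check and the resulting re-replication decision might look roughly like this. The timeout, the replica count, and the function names are assumptions; the source does not give concrete values.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; assumed value
REPLICA_COUNT = 2         # assumed configured replica count


def dead_servers(last_heartbeat: dict, now: float) -> set:
    """Return dataservers whose last heartbeat is older than the timeout."""
    return {s for s, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT}


def under_replicated(block_map: dict, dead: set) -> dict:
    """Return block_id -> number of replicas to re-create after removing dead servers."""
    missing = {}
    for block_id, servers in block_map.items():
        alive = servers - dead
        if len(alive) < REPLICA_COUNT:
            missing[block_id] = REPLICA_COUNT - len(alive)
    return missing


last_heartbeat = {"ds1": 100.0, "ds2": 100.0, "ds3": 85.0}
block_map = {1: {"ds1", "ds2"}, 2: {"ds2", "ds3"}, 3: {"ds1", "ds3"}}

dead = dead_servers(last_heartbeat, now=101.0)
todo = under_replicated(block_map, dead)
```

In this example `ds3` has been silent for 16 seconds, so blocks 2 and 3 each need one new replica.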

Storage Mechanism

All metadata resides in the nameserver’s memory without persistent storage. The nameserver builds a block‑to‑server map from reports, allocating writable blocks for writes and locating blocks for reads. Background threads monitor block health and balance load by migrating data when needed.
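
Because the metadata is not persisted, a restarted nameserver has to rebuild its map from dataserver reports. A minimal sketch of that rebuild, with a hypothetical report shape (server name mapped to the block IDs it holds):

```python
from collections import defaultdict


def build_block_map(reports: dict) -> dict:
    """Invert per-server block reports into block_id -> set of servers."""
    block_map = defaultdict(set)
    for server, block_ids in reports.items():
        for block_id in block_ids:
            block_map[block_id].add(server)
    return dict(block_map)


# Each dataserver reports the blocks it holds when it starts up.
reports = {"ds1": [1, 3], "ds2": [1, 2], "ds3": [2, 3]}
block_map = build_block_map(reports)
```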

Write operations use a simple round‑robin block allocation. After a file is written to multiple dataservers, the client receives a filename encoding the cluster ID, block ID, and file ID. Reads resolve the block via the nameserver, then fetch data from the appropriate dataserver, retrying other replicas on failure.
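
The write and read paths above can be sketched in a few functions. The filename layout used here (`T<cluster>.<block>.<file>`) is purely illustrative and is not the real TFS encoding; the function names are also assumptions.

```python
from itertools import cycle


def make_allocator(writable_blocks):
    """Return a function that hands out writable block IDs in round-robin order."""
    ring = cycle(writable_blocks)
    return lambda: next(ring)


def encode_name(cluster_id: int, block_id: int, file_id: int) -> str:
    # Illustrative layout only; the actual TFS filename format differs.
    return f"T{cluster_id}.{block_id}.{file_id}"


def decode_name(name: str):
    cluster_id, block_id, file_id = name[1:].split(".")
    return int(cluster_id), int(block_id), int(file_id)


def read_with_retry(replicas, fetch):
    """Try each replica in order; raise if every replica fails."""
    last_error = None
    for server in replicas:
        try:
            return fetch(server)
        except IOError as exc:
            last_error = exc
    raise IOError("all replicas failed") from last_error


allocate = make_allocator([101, 102, 103])
picked = [allocate() for _ in range(4)]
name = encode_name(1, picked[0], 7)


def fetch(server):
    # Simulated dataserver call: ds1 is down, ds2 answers.
    if server == "ds1":
        raise IOError("ds1 is down")
    return b"payload from " + server.encode()


data = read_with_retry(["ds1", "ds2"], fetch)
```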

Deletion marks files but does not immediately reclaim space; when a block’s deleted‑file ratio exceeds a threshold, a compaction process runs during low‑traffic periods.
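
A sketch of the compaction trigger, assuming a 30% deleted-file threshold and an off-peak window of 01:00 to 05:59. Neither number comes from the source; both are placeholders.

```python
COMPACT_THRESHOLD = 0.3       # assumed ratio; the source gives no number
OFF_PEAK_HOURS = range(1, 6)  # assumed low-traffic window, 01:00 to 05:59


def should_compact(total_files: int, deleted_files: int, hour: int) -> bool:
    """Compact only off-peak and only when enough of the block is dead space."""
    if total_files == 0 or hour not in OFF_PEAK_HOURS:
        return False
    return deleted_files / total_files > COMPACT_THRESHOLD


def compact(files: dict) -> dict:
    """files maps file_id -> (payload, deleted). Keep live files only."""
    return {fid: entry for fid, entry in files.items() if not entry[1]}


files = {1: (b"a", True), 2: (b"b", False), 3: (b"c", True)}
deleted = sum(1 for _, is_deleted in files.values() if is_deleted)
if should_compact(len(files), deleted, hour=3):
    files = compact(files)
```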

Clients cache block‑to‑dataserver mappings locally to reduce nameserver load. When the cache is stale, the client falls back to the nameserver. Remote caching via Tair (Taobao’s distributed key/value store) further improves hit rates.
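
The local cache with nameserver fallback could be modeled as below. `BlockLocationCache` and its methods are hypothetical names, and the remote Tair layer is left out of this sketch.

```python
class BlockLocationCache:
    """Hypothetical client-side cache of block ID -> dataserver list."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable standing in for a nameserver query
        self._cache = {}
        self.nameserver_calls = 0

    def locate(self, block_id):
        """Serve from the cache; fall back to the nameserver on a miss."""
        if block_id not in self._cache:
            self.nameserver_calls += 1
            self._cache[block_id] = self._lookup(block_id)
        return self._cache[block_id]

    def invalidate(self, block_id):
        """Drop an entry that turned out to be stale, e.g. after a failed read."""
        self._cache.pop(block_id, None)


locations = {7: ["ds1", "ds2"]}
cache = BlockLocationCache(lambda block_id: list(locations[block_id]))

cache.locate(7)                 # miss: asks the nameserver
cache.locate(7)                 # hit: no nameserver call
locations[7] = ["ds3", "ds4"]   # the block was migrated
cache.invalidate(7)             # the client notices the entry is stale
fresh = cache.locate(7)
```

Three lookups cost two nameserver calls here; in steady state most lookups would be served locally.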

Support for Custom Names and Large Files

Custom filenames are managed by a separate metadata server (metaserver) that maps user‑defined names to TFS filenames. Large files are split into 2 MB chunks, each stored as a separate TFS file; the client assembles these chunks on read.
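
Splitting and reassembling a large file might look like this. The 2 MB chunk size comes from the text; the `store` and `fetch` callables are stand-ins for real TFS client calls, and the `chunk-N` names are made up.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # 2 MB, as described in the text


def split(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Cut a large payload into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def write_large(data: bytes, store) -> list:
    """Store each chunk as its own file and return the ordered chunk names."""
    return [store(chunk) for chunk in split(data)]


def read_large(names: list, fetch) -> bytes:
    """Fetch the chunks in order and join them back into the original payload."""
    return b"".join(fetch(name) for name in names)


# A dict stands in for the cluster; names are hypothetical.
storage = {}


def store(chunk: bytes) -> str:
    name = f"chunk-{len(storage)}"
    storage[name] = chunk
    return name


payload = bytes(5 * 1024 * 1024)  # 5 MB of zeros
names = write_large(payload, store)
restored = read_large(names, storage.__getitem__)
```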

Client Interfaces and Nginx Proxy

TFS provides standard C++ and Java clients. To simplify client upgrades, an open‑source Nginx module proxies all TFS read/write requests, exposing a RESTful API. Adding support for new languages only requires implementing the HTTP protocol to the Nginx proxy.

Smooth Scaling

When expanding capacity, operators add new machines with dataserver processes. The nameserver detects the new dataservers, creates blocks on them, and they immediately begin serving reads and writes. Because front‑end CDN caching makes file access random, load is roughly proportional to stored data size. The nameserver rebalances data by migrating portions from heavily used dataservers to the new ones, keeping capacity utilization balanced.
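
One simple way to model the rebalancing step is a greedy loop that moves one block at a time from the fullest dataserver to the emptiest. This is an assumed strategy for illustration; the source does not describe the actual migration algorithm.

```python
def rebalance(usage: dict, tolerance: int = 1):
    """usage maps server -> block count. Returns (moves, balanced usage)."""
    usage = dict(usage)
    moves = []
    while True:
        src = max(usage, key=usage.get)
        dst = min(usage, key=usage.get)
        # Stop once the fullest and emptiest servers are close enough.
        if usage[src] - usage[dst] <= tolerance:
            return moves, usage
        usage[src] -= 1
        usage[dst] += 1
        moves.append((src, dst))


# ds3 is a newly added, empty dataserver.
moves, balanced = rebalance({"ds1": 10, "ds2": 8, "ds3": 0})
```

A real system would also weigh network cost and current traffic before migrating; this sketch only balances block counts.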

Data‑Center Disaster Recovery

TFS uses multi‑replica storage and supports multi‑data‑center disaster recovery by deploying physical clusters in different sites and synchronizing them to form a logical cluster. A typical logical cluster has one primary and multiple backups; the primary handles reads/writes, while backups serve reads only. Logs of write/delete operations are replayed on backups to keep data consistent.

For active‑active deployments across data centers, each primary cluster assigns distinct block ID ranges (e.g., odd vs. even) to avoid write conflicts, and synchronizes writes to the other primary.
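
The odd/even split can be illustrated with two generators. The function name and the starting values are assumptions.

```python
from itertools import count, islice


def block_ids(parity: int):
    """parity 1 yields 1, 3, 5, ...; parity 0 yields 2, 4, 6, ..."""
    start = 1 if parity == 1 else 2
    return count(start, 2)


cluster_a = list(islice(block_ids(1), 5))  # primary in one data center
cluster_b = list(islice(block_ids(0), 5))  # primary in the other
```

Since the two ranges never intersect, a block created in one primary can be synchronized to the other without colliding with a locally created block.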

Clients perform failover by first contacting the nearest physical cluster; if the file is unavailable, they retry other clusters.
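
A sketch of that failover order, assuming each cluster carries a numeric distance and that a missing file is signalled by `None`. The cluster names and the filename are made up.

```python
def read_with_failover(name: str, clusters: list, fetch):
    """clusters is a list of (distance, cluster); try the nearest first."""
    for _, cluster in sorted(clusters):
        data = fetch(cluster, name)
        if data is not None:
            return cluster, data
    raise FileNotFoundError(name)


# Hypothetical state: the file has not yet been synchronized to site-a.
stores = {"site-a": {}, "site-b": {"example-file": b"img"}}


def fetch(cluster, name):
    return stores[cluster].get(name)


served_by, data = read_with_failover(
    "example-file", [(20, "site-b"), (1, "site-a")], fetch
)
```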

Operations Management

All TFS resources are stored in a MySQL database and managed by a resource‑management server (rcserver). Configuration templates define replica counts per cluster. When a new machine joins, rcserver generates its configuration from the template.

Applications receive an appkey and resource allocation; clients fetch configuration from rcserver at startup and periodically send keep‑alive messages with usage statistics. rcserver can redirect applications to alternative clusters if issues arise.

Monitoring agents run on every TFS node, tracking service health and capacity. Automatic actions include taking faulty nodes offline and triggering expansions when usage exceeds thresholds.

Future Work

Future development focuses on improving efficiency and reducing storage and operational costs. TFS plans to adopt erasure coding to replace traditional replica‑based redundancy, potentially cutting storage costs by 25‑50%.
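
The quoted savings range is consistent with simple overhead arithmetic. The sketch below assumes a Reed-Solomon style code with 4 data and 2 parity fragments, compared against 2 and 3 full replicas; these parameters are illustrative and do not come from the source.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data with full replicas."""
    return float(replicas)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data with a (k, m) erasure code."""
    return (data_fragments + parity_fragments) / data_fragments


def savings(before: float, after: float) -> float:
    """Fraction of raw storage saved when moving from `before` to `after`."""
    return 1 - after / before


ec = erasure_overhead(4, 2)                          # 1.5x raw storage
vs_two_copies = savings(replication_overhead(2), ec)
vs_three_copies = savings(replication_overhead(3), ec)
```

Under these assumptions the saving is 25% against two replicas and 50% against three.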
