
Design and Evolution of ByteDance's Multi‑Datacenter HDFS Architecture

This article explains how ByteDance extended the Apache HDFS architecture with a multi‑datacenter design, introducing components such as DanceNN, NNProxy, and BookKeeper to achieve scalable storage, cross‑datacenter data placement, and datacenter‑level disaster recovery for exabyte‑scale workloads.

DataFunTalk

HDFS (Hadoop Distributed File System) is the foundational storage module of Apache Hadoop, providing high‑throughput storage for massive datasets. Released in April 2006, it remains widely used; at ByteDance the service has grown to a "double‑10" scale: over 100,000 nodes and more than 10 EB of data.

Typical usage scenarios include offline workloads (e.g., Hive, ClickHouse, Presto, machine‑learning training data) and near‑line workloads (e.g., ByteMQ, streaming checkpoint). While many companies operate multiple isolated HDFS clusters, ByteDance adopts a federated large‑cluster deployment spanning three data centers (A, B, C) with a single logical cluster and multiple nameservices.

Motivated by rapid business growth, the need for low‑latency near‑line access, rack‑level disaster tolerance, and efficient operations of a massive cluster, ByteDance focused on two key challenges: scaling capacity across data centers and providing data‑center‑level disaster recovery.

The community HDFS architecture consists of three layers: Client (access via HDFS SDK), metadata management (NameNode with Federation/NameService), and data management (DataNode). ByteDance’s custom version retains this core but adds three proprietary components:

DanceNN – a C++ reimplementation of the NameNode, protocol‑compatible with the community version.

NNProxy – a NameNode proxy that unifies namespaces for Federation.

BookKeeper – Apache BookKeeper used as a shared edit‑log store, providing data‑center‑aware configuration for HA.
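NNProxy's namespace unification can be pictured as longest‑prefix routing of paths to nameservices. The sketch below is illustrative only, assuming a hypothetical mount table and `route` function; it is not ByteDance's actual NNProxy code:

```python
# Hypothetical mount table mapping path prefixes to nameservices;
# entries and names are invented for illustration.
MOUNT_TABLE = {
    "/user": "nameservice-1",
    "/warehouse": "nameservice-2",
}

def route(path):
    """Route a path to its owning nameservice by longest-prefix match.

    A real proxy would match on path components rather than raw string
    prefixes (so "/userdata" does not match "/user"); this sketch keeps
    the matching deliberately simple.
    """
    best = max((p for p in MOUNT_TABLE if path.startswith(p)),
               key=len, default=None)
    if best is None:
        raise KeyError(f"no nameservice mounted for {path}")
    return MOUNT_TABLE[best]
```

With such a table, a client sees one logical namespace while each nameservice owns only its subtree.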

In the dual‑datacenter evolution (A → A,B), data placement ensures each file has at least one replica in each data center, with clients preferring local replicas to reduce cross‑datacenter bandwidth. The design leverages the unchanged DanceNN protocol, so applications need no changes.
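The dual‑datacenter placement invariant can be sketched as follows. This is a minimal illustration of the stated policy (at least one replica per datacenter, first replica local to the client), not ByteDance's implementation; the function name and defaults are assumptions:

```python
def place_replicas(client_dc, datacenters=("A", "B"), replication=3):
    """Return an ordered list of target datacenters for a file's replicas.

    Invariants from the article:
      * at least one replica lands in every datacenter;
      * the first replica is written to the client's local datacenter,
        so local reads avoid cross-datacenter bandwidth.
    """
    remote_dcs = [dc for dc in datacenters if dc != client_dc]
    placement = [client_dc]              # first replica stays local
    placement += remote_dcs             # one replica per remote datacenter
    while len(placement) < replication:  # any extra replicas stay local
        placement.append(client_dc)
    return placement[:replication]
```

For a client in datacenter A with replication 3, this yields one replica in B and two in A, so reads from either datacenter can be served locally.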

For disaster recovery, the metadata layer is extended: the NameNode stack (ZooKeeper → BookKeeper → DanceNN) operates independently across data centers, storing edit‑log replicas (four copies, two in each data center for a 1:1 distribution) to enable rapid failover while maintaining consistency.
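The 1:1 edit‑log distribution and its failover property can be sketched numerically. The replica counts come from the article; the function names and the survival rule below are assumptions for illustration:

```python
def editlog_placement(dcs=("A", "B"), copies=4):
    """Spread edit-log copies evenly (1:1) across datacenters."""
    per_dc = copies // len(dcs)
    return {dc: per_dc for dc in dcs}

def survives_dc_loss(placement, lost_dc):
    """After losing one datacenter, at least one edit-log copy must
    remain in the surviving datacenter(s) so DanceNN can fail over
    without losing committed metadata."""
    remaining = sum(n for dc, n in placement.items() if dc != lost_dc)
    return remaining >= 1
```

With two copies in each datacenter, losing either one still leaves two copies available for recovery.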

Supporting systems such as Balancer, Mover, and operational tools are adapted to be aware of the multi‑datacenter placement and to ensure smooth transitions with minimal business impact.

The multi‑datacenter architecture extends the dual‑datacenter model to include a third data center (C) when resources in B become scarce. The design preserves existing placement policies, limits migration cost, and meets ByteDance’s disaster‑recovery standards. Data can be placed across any two of the three data centers (A/B, B/C, A/C) based on bandwidth and resource considerations.
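Choosing which two of the three datacenters host a file, based on bandwidth and resource considerations, can be sketched as a simple scoring pass. The scoring heuristic here (maximize spare capacity, break ties by cross‑DC bandwidth) is an assumption, not ByteDance's actual policy:

```python
from itertools import combinations

def choose_dc_pair(free_capacity, link_bandwidth):
    """Pick the datacenter pair (A/B, B/C, or A/C) with the most spare
    capacity, breaking ties by the bandwidth of the link between them.

    free_capacity:  {dc: free bytes}
    link_bandwidth: {frozenset({dc1, dc2}): available Gbps}
    """
    def score(pair):
        cap = min(free_capacity[dc] for dc in pair)  # bottleneck capacity
        bw = link_bandwidth[frozenset(pair)]
        return (cap, bw)
    return max(combinations(sorted(free_capacity), 2), key=score)
```

For example, if datacenter B is nearly full, the policy naturally steers new files toward the A/C pair, which matches the article's motivation of adding C when B's resources become scarce.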

Overall, ByteDance’s multi‑datacenter HDFS solution demonstrates a distinctive approach to scaling exabyte‑level storage, achieving cross‑datacenter capacity expansion, low‑latency access, and datacenter‑level fault tolerance while maintaining compatibility with the open‑source HDFS ecosystem.

Tags: disaster recovery, HDFS, multi-datacenter, ByteDance, big data storage, data placement
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
