Exploring ByteDance's EB‑Scale HDFS: Architecture, Multi‑Datacenter Challenges, Tiered Storage, and Data Protection Practices
This article presents an in‑depth overview of ByteDance's EB‑scale HDFS, covering its new features, multi‑datacenter architecture, tiered storage implementation, data management services, capacity and fault‑tolerance strategies, as well as practical data‑protection mechanisms and related Q&A.
ByteDance's HDFS has been operating in production for over ten years and now handles tens of exabytes of data across offline, near‑line, and online scenarios, supporting large‑scale analytics, machine‑learning workloads, and storage‑compute separation architectures.
The system introduces several new features: a C++‑based metadata layer with high‑availability via ZooKeeper and BookKeeper, a redesigned data layer supporting HDD/SSD media, and a Data Management service that reduces cost and improves stability by tightly integrating with upper‑level compute ecosystems.
To meet the demands of rapid business growth, ByteDance deployed a multi‑datacenter architecture with three‑zone high‑availability components, dual‑zone NameNode and BookKeeper, and five‑zone DataNode support, addressing challenges in resource management, heterogeneous infrastructure, and strict stability and compliance requirements.
Cross‑datacenter traffic is reduced through replica placement policies (ReplicaPolicy, Majority DC), write‑order optimization, and a storage‑compute affinity scheduler (ResLack) that dynamically co‑locates computation and data in the same zone, relying on the Data Management service to migrate data as needed.
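The "Majority DC" idea can be illustrated with a minimal sketch: keep a strict majority of a block's replicas in the writer's primary datacenter so most reads stay zone-local, and spread the rest elsewhere for fault tolerance. The function name and zone layout below are illustrative assumptions, not ByteDance's actual API.

```python
# Hypothetical sketch of a "Majority DC" replica-placement rule:
# the primary datacenter holds a strict majority of replicas, so
# quorum reads and most client reads avoid cross-DC traffic.

def choose_replica_zones(primary_dc, other_dcs, replication=3):
    """Return a zone list where the primary DC holds a strict majority."""
    majority = replication // 2 + 1          # e.g. 2 of 3 replicas
    zones = [primary_dc] * majority
    # Spread the remaining replicas across other DCs for fault tolerance.
    for i in range(replication - majority):
        zones.append(other_dcs[i % len(other_dcs)])
    return zones

print(choose_replica_zones("dc-a", ["dc-b", "dc-c"]))
# → ['dc-a', 'dc-a', 'dc-b']
```

With 5 replicas the same rule yields 3 local and 2 remote copies, which is one plausible reading of how placement and write-order optimization interact to keep the commit path inside one zone.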
Tiered storage is realized with four layers—SSD cache, remote SSD cache, high‑replication HDD, and EC storage—managed by a data‑scoring system that automatically migrates data based on access patterns, TTL, and usage metrics, while a distributed GC framework cleans up unused data efficiently.
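A data-scoring system like the one described can be sketched as a function that maps access heat and TTL to one of the four layers. The scoring formula, thresholds, and tier names below are illustrative assumptions, not the production heuristics.

```python
# Hypothetical sketch of tier selection in a four-layer store
# (SSD cache, remote SSD cache, high-replication HDD, EC storage):
# score a path by access heat and TTL, then map the score to a tier.

TIERS = ["ssd_cache", "remote_ssd_cache", "hdd_replicated", "ec_storage"]

def pick_tier(accesses_per_day, days_since_last_access, ttl_days=None):
    """Return the target tier for a path based on access heat and TTL."""
    if ttl_days is not None and days_since_last_access >= ttl_days:
        return "ec_storage"          # expired per TTL: cheapest EC layer
    # Recent, frequent access raises the heat score; staleness decays it.
    heat = accesses_per_day / (1 + days_since_last_access)
    if heat >= 100:
        return "ssd_cache"
    if heat >= 10:
        return "remote_ssd_cache"
    if heat >= 1:
        return "hdd_replicated"
    return "ec_storage"

print(pick_tier(accesses_per_day=500, days_since_last_access=0, ttl_days=30))
# → ssd_cache
```

In production such a score would be computed continuously from audit logs, and the migration (plus the distributed GC pass over unreferenced data) would run as background jobs rather than inline with reads.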
Data protection against accidental deletion is achieved via the ByteBrain service, which combines rule‑based filtering, machine‑learning models, and user feedback to identify and recover mistakenly deleted paths, having prevented tens of petabytes of data loss in production.
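The rule-based layer of such a deletion guard can be sketched as a classifier that intercepts obviously risky delete requests and routes suspicious ones to a recoverable trash instead of executing them immediately; in the described system an ML model and user feedback would refine these decisions. Rules, thresholds, and names below are illustrative assumptions.

```python
# Hypothetical sketch of a ByteBrain-style rule layer for deletes:
# hard-protected trees are blocked outright, large or recently-read
# data is soft-deleted (recoverable), and cold small data is allowed.

PROTECTED_PREFIXES = ("/warehouse/", "/user/ml_models/")

def classify_delete(path, size_tb, days_since_last_read):
    """Return 'block', 'trash', or 'allow' for a delete request."""
    if path.startswith(PROTECTED_PREFIXES):
        return "block"               # hard rule: never auto-delete these trees
    if size_tb >= 1 or days_since_last_read <= 7:
        return "trash"               # suspicious: soft-delete, recoverable
    return "allow"                   # cold, small data: delete directly

print(classify_delete("/tmp/job_output", size_tb=0.01, days_since_last_read=90))
# → allow
```

A soft-deleted path would stay recoverable for a retention window, which is how a guard like this can claw back mistakenly deleted data at petabyte scale.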
The discussed technologies are exposed externally through the CloudFS product on Volcano Engine, offering high‑performance, storage‑compute‑separated services for big‑data and machine‑learning workloads.
DataFunSummit