Exploring ByteDance's EB‑Scale HDFS: Architecture, Multi‑Datacenter Challenges, Tiered Storage, and Data Protection Practices
This article presents an in‑depth overview of ByteDance's EB‑scale HDFS, covering its new features, multi‑datacenter architecture, tiered storage implementation, data management services, capacity and fault‑tolerance strategies, as well as practical data‑protection mechanisms and related Q&A.
ByteDance's HDFS has been operating in production for over ten years and now handles tens of exabytes of data across offline, near‑line, and online scenarios, supporting large‑scale analytics, machine‑learning workloads, and storage‑compute separation architectures.
The system introduces several new features: a C++‑based metadata layer with high‑availability via ZooKeeper and BookKeeper, a redesigned data layer supporting HDD/SSD media, and a Data Management service that reduces cost and improves stability by tightly integrating with upper‑level compute ecosystems.
To meet the demands of rapid business growth, ByteDance deployed a multi‑datacenter architecture with three‑zone high‑availability components, dual‑zone NameNode and BookKeeper, and five‑zone DataNode support, addressing challenges in resource management, heterogeneous infrastructure, and strict stability and compliance requirements.
Cross‑datacenter traffic is reduced through replica placement policies (ReplicaPolicy, Majority DC), write‑order optimization, and a storage‑compute affinity scheduler (ResLack) that dynamically co‑locates computation and data in the same zone, relying on the Data Management service to migrate data as needed.
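The "Majority DC" idea can be illustrated with a minimal sketch: keep a strict majority of a block's replicas in the writer's primary datacenter so most reads stay zone-local, and spread the rest elsewhere for fault tolerance. The function name and zone layout below are illustrative assumptions, not ByteDance's actual API.

```python
# Hypothetical sketch of a "Majority DC" replica-placement rule:
# the primary datacenter holds a strict majority of replicas, so
# quorum reads and most client reads avoid cross-DC traffic.

def choose_replica_zones(primary_dc, other_dcs, replication=3):
    """Return a zone list where the primary DC holds a strict majority."""
    majority = replication // 2 + 1          # e.g. 2 of 3 replicas
    zones = [primary_dc] * majority
    # Spread the remaining replicas across other DCs for fault tolerance.
    for i in range(replication - majority):
        zones.append(other_dcs[i % len(other_dcs)])
    return zones

print(choose_replica_zones("dc-a", ["dc-b", "dc-c"]))
# → ['dc-a', 'dc-a', 'dc-b']
```

With 5 replicas the same rule yields 3 local and 2 remote copies, which is one plausible reading of how placement and write-order optimization interact to keep the commit path inside one zone.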
Tiered storage is realized with four layers—SSD cache, remote SSD cache, high‑replication HDD, and EC storage—managed by a data‑scoring system that automatically migrates data based on access patterns, TTL, and usage metrics, while a distributed GC framework cleans up unused data efficiently.
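A data-scoring system like the one described can be sketched as a function that maps access heat and TTL to one of the four layers. The scoring formula, thresholds, and tier names below are illustrative assumptions, not the production heuristics.

```python
# Hypothetical sketch of tier selection in a four-layer store
# (SSD cache, remote SSD cache, high-replication HDD, EC storage):
# score a path by access heat and TTL, then map the score to a tier.

TIERS = ["ssd_cache", "remote_ssd_cache", "hdd_replicated", "ec_storage"]

def pick_tier(accesses_per_day, days_since_last_access, ttl_days=None):
    """Return the target tier for a path based on access heat and TTL."""
    if ttl_days is not None and days_since_last_access >= ttl_days:
        return "ec_storage"          # expired per TTL: cheapest EC layer
    # Recent, frequent access raises the heat score; staleness decays it.
    heat = accesses_per_day / (1 + days_since_last_access)
    if heat >= 100:
        return "ssd_cache"
    if heat >= 10:
        return "remote_ssd_cache"
    if heat >= 1:
        return "hdd_replicated"
    return "ec_storage"

print(pick_tier(accesses_per_day=500, days_since_last_access=0, ttl_days=30))
# → ssd_cache
```

In production such a score would be computed continuously from audit logs, and the migration (plus the distributed GC pass over unreferenced data) would run as background jobs rather than inline with reads.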
Data protection against accidental deletion is achieved via the ByteBrain service, which combines rule‑based filtering, machine‑learning models, and user feedback to identify and recover mistakenly deleted paths, having prevented tens of petabytes of data loss in production.
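The rule-based layer of such a deletion guard can be sketched as a classifier that intercepts obviously risky delete requests and routes suspicious ones to a recoverable trash instead of executing them immediately; in the described system an ML model and user feedback would refine these decisions. Rules, thresholds, and names below are illustrative assumptions.

```python
# Hypothetical sketch of a ByteBrain-style rule layer for deletes:
# hard-protected trees are blocked outright, large or recently-read
# data is soft-deleted (recoverable), and cold small data is allowed.

PROTECTED_PREFIXES = ("/warehouse/", "/user/ml_models/")

def classify_delete(path, size_tb, days_since_last_read):
    """Return 'block', 'trash', or 'allow' for a delete request."""
    if path.startswith(PROTECTED_PREFIXES):
        return "block"               # hard rule: never auto-delete these trees
    if size_tb >= 1 or days_since_last_read <= 7:
        return "trash"               # suspicious: soft-delete, recoverable
    return "allow"                   # cold, small data: delete directly

print(classify_delete("/tmp/job_output", size_tb=0.01, days_since_last_read=90))
# → allow
```

A soft-deleted path would stay recoverable for a retention window, which is how a guard like this can claw back mistakenly deleted data at petabyte scale.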
The discussed technologies are exposed externally through the CloudFS product on Volcano Engine, offering high‑performance, storage‑compute‑separated services for big‑data and machine‑learning workloads.
DataFunSummit