Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap
This article describes how ByteDance tackled exabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as lake cold‑data layering and materialized views.
ByteDance's feature store grew rapidly to exabyte (EB) total size with petabyte (PB) daily increments, creating challenges in storage cost, read/write amplification, schema validation, and end‑to‑end user experience. The presentation outlines the full workflow: online feature extraction, protobuf storage on HDFS, distributed reading via the Primus framework, model training, and feature back‑fill.
To address the inefficiencies of row‑based storage, the team selected Apache Iceberg as the table format. Iceberg sits within a layered architecture (engine, table format, file format, cache, object storage) and abstracts away file details, enabling schema evolution, time travel, MVCC, and engine‑agnostic APIs. The roles of the catalog, metadata files, manifests, and data files are described in detail.
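To make the metadata layering concrete, the snapshot chain behind time travel and MVCC can be sketched with a toy model. This is an illustration only, not Iceberg's actual implementation; the `TinyTable` and `Snapshot` names are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    # Immutable: each commit produces a new snapshot listing the full
    # set of data files valid at that moment (MVCC-style versioning).
    snapshot_id: int
    data_files: tuple

class TinyTable:
    """Toy model of Iceberg-style snapshot metadata (illustration only)."""

    def __init__(self):
        self.snapshots = []

    def commit(self, added_files, removed_files=()):
        # A "delete" never mutates old data files; it just produces a
        # new snapshot that no longer references them.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed_files) + tuple(added_files)
        self.snapshots.append(Snapshot(len(self.snapshots), files))

    def scan(self, snapshot_id=None):
        # Time travel: read any historical snapshot by id; default is latest.
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id]
        return snap.data_files

t = TinyTable()
t.commit(["a.parquet"])
t.commit(["b.parquet"])
t.commit([], removed_files=["a.parquet"])
print(t.scan())   # latest snapshot
print(t.scan(0))  # time travel to the first snapshot
```

Because old snapshots remain readable until explicitly expired, readers and writers never block each other, which is what makes engine‑agnostic concurrent access practical.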
Two back‑fill approaches are compared: traditional Copy‑On‑Write (COW), which reads the entire snapshot and rewrites all data, and the more efficient Merge‑On‑Read (MOR) that reads only required columns and writes incremental updates. MOR reduces read/write amplification, saves storage, and lowers compute costs, while COW is simpler but resource‑intensive.
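The amplification gap between the two strategies can be shown with a back‑of‑envelope model. The numbers below are illustrative assumptions (equal‑width columns, a hypothetical 100 TB table with 1,000 feature columns), not ByteDance's actual figures.

```python
def cow_cost(table_bytes):
    # Copy-On-Write back-fill: read the entire snapshot, rewrite all data.
    return {"read": table_bytes, "write": table_bytes}

def mor_cost(table_bytes, num_columns, read_cols, new_cols):
    # Merge-On-Read back-fill: read only the columns the job needs and
    # write only the new columns as incremental files; merging is
    # deferred to read time.
    col_bytes = table_bytes / num_columns  # assume equal-width columns
    return {"read": col_bytes * read_cols, "write": col_bytes * new_cols}

TB = 1 << 40
cow = cow_cost(100 * TB)
mor = mor_cost(100 * TB, num_columns=1000, read_cols=5, new_cols=1)
print(f"COW writes {cow['write'] / TB:.1f} TB, MOR writes {mor['write'] / TB:.1f} TB")
```

Under these assumptions MOR writes three orders of magnitude less data than COW for a single‑column back‑fill, which is where the storage and compute savings come from.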
The practical implementation integrates Iceberg with Spark, Flink, and the proprietary Primus framework, using Parquet as the underlying file format and Cloud File System (CFS) for caching. Additional platform services such as snapshot expiration, data expiration, cleanup, rollback, and statistics are provided to manage the lake at scale.
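One of those platform services, snapshot expiration, can be sketched as follows. This is a hedged illustration of the idea, not ByteDance's service or Iceberg's API; the function name and snapshot layout are invented for the sketch.

```python
def expire_snapshots(snapshots, max_age_seconds, now):
    """Toy snapshot-expiration service (illustration only).

    Each snapshot is a dict: {"id": int, "timestamp": float, "files": set}.
    Returns (kept_snapshots, deletable_files): data files referenced only
    by expired snapshots are safe to physically delete.
    """
    kept = [s for s in snapshots if now - s["timestamp"] <= max_age_seconds]
    if not kept and snapshots:
        kept = [snapshots[-1]]  # always keep the latest snapshot readable
    live = set().union(*(s["files"] for s in kept)) if kept else set()
    all_files = set().union(*(s["files"] for s in snapshots)) if snapshots else set()
    return kept, all_files - live

snaps = [
    {"id": 0, "timestamp": 0, "files": {"a.parquet"}},
    {"id": 1, "timestamp": 100, "files": {"a.parquet", "b.parquet"}},
    {"id": 2, "timestamp": 200, "files": {"b.parquet", "c.parquet"}},
]
kept, deletable = expire_snapshots(snaps, max_age_seconds=150, now=300)
print([s["id"] for s in kept], deletable)
```

The key safety property is that a file is reclaimed only when no retained snapshot references it, so time travel keeps working within the retention window.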
The future roadmap includes lake cold‑data layering for cost optimization, materialized views for query acceleration, self‑optimizing mechanisms, and broader engine support. The overall platform architecture combines cloud‑native management, runtime services, and serverless operations to deliver a unified data lake solution.
In the Q&A, the authors explain why Iceberg was chosen over Hudi—primarily for predicate push‑down, robust metadata management, and schema evolution—and affirm that the Iceberg‑based lake will continue to evolve as a core component of ByteDance's data infrastructure.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.