How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study
ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.
Background
ByteDance processes petabytes of feature data daily, with millions of cores used for model training, leading to three major pain points: long feature extraction cycles, large storage footprints, and bandwidth‑limited model training.
Pain Points
Long extraction cycles – Online extraction requires weeks to validate new features, and offline research is costly.
High storage consumption – Row‑based storage inflates space usage and hampers data reuse.
Training bandwidth bottleneck – Models read all features in a row store, causing massive I/O despite needing only a few hundred columns.
Requirements
Store raw features for offline research.
Enable offline feature investigation.
Support feature backfill.
Reduce storage cost.
Lower training cost by reading only needed features.
Accelerate training by minimizing copy/serialization overhead.
Solution Architecture
The stack consists of business services (e.g., Douyin, Toutiao) → platform layer (UI, access control) → framework layer (Spark for feature processing, Primus for training) → format layer (Parquet files with Iceberg tables) → scheduler/storage layer (Yarn & K8s, HDFS).
Technical Choices
Parquet columnar storage reduces space compared to row storage and allows predicate push‑down, enabling selective reads during training. However, Parquet requires a predefined schema, making schema evolution and backfill difficult.
Iceberg was adopted to provide schema evolution, transactional writes, and concurrent read/write capabilities.
Concurrency and Schema Evolution
Iceberg creates a new snapshot for each write, so readers see a consistent view while writers operate on a new snapshot. Schema changes are handled via Iceberg’s evolution support, avoiding the “all‑columns read” problem of row stores.
Backfill Strategies
Traditional Copy‑On‑Write (COW) backfill rewrites entire files, incurring high CPU, I/O, and storage costs. ByteDance developed a MOR (Merge‑On‑Read) backfill that reads only required columns, writes update files, and merges them with data files during compaction.
Training Optimizations
After moving to Iceberg, initial training slowed because the framework still deserialized rows. The team introduced vectorized reads that deliver Arrow batches directly to the trainer, eliminating extra serialization steps.
Results
The new pipeline achieved over 40% storage cost reduction, a 13% CPU reduction, and a 40% drop in network I/O while maintaining training speed.
Future Plans
Support upserts for partial data back‑flow.
Introduce materialized views for frequently accessed datasets.
Implement data skipping to push more predicates down to storage.
Provide Arrow‑based preprocessing APIs to accelerate downstream training.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
