Big Data 13 min read

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Volcano Engine Developer Services

Jun 20, 2022

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

Background

ByteDance processes petabytes of feature data daily, with millions of cores used for model training, leading to three major pain points: long feature extraction cycles, large storage footprints, and bandwidth‑limited model training.

Pain Points

Long extraction cycles – Online extraction requires weeks to validate new features, and offline research is costly.

High storage consumption – Row‑based storage inflates space usage and hampers data reuse.

Training bandwidth bottleneck – Models read all features in a row store, causing massive I/O despite needing only a few hundred columns.

Requirements

Store raw features for offline research.

Enable offline feature investigation.

Support feature backfill.

Reduce storage cost.

Lower training cost by reading only needed features.

Accelerate training by minimizing copy/serialization overhead.

Solution Architecture

The stack consists of business services (e.g., Douyin, Toutiao) → platform layer (UI, access control) → framework layer (Spark for feature processing, Primus for training) → format layer (Parquet files with Iceberg tables) → scheduler/storage layer (Yarn & K8s, HDFS).

Technical Choices

Parquet columnar storage reduces space compared to row storage and allows predicate push‑down, enabling selective reads during training. However, Parquet requires a predefined schema, making schema evolution and backfill difficult.

Iceberg was adopted to provide schema evolution, transactional writes, and concurrent read/write capabilities.

Concurrency and Schema Evolution

Iceberg creates a new snapshot for each write, so readers see a consistent view while writers operate on a new snapshot. Schema changes are handled via Iceberg’s evolution support, avoiding the “all‑columns read” problem of row stores.

Backfill Strategies

Traditional Copy‑On‑Write (COW) backfill rewrites entire files, incurring high CPU, I/O, and storage costs. ByteDance developed a MOR (Merge‑On‑Read) backfill that reads only required columns, writes update files, and merges them with data files during compaction.

Training Optimizations

After moving to Iceberg, initial training slowed because the framework still deserialized rows. The team introduced vectorized reads that deliver Arrow batches directly to the trainer, eliminating extra serialization steps.

Results

The new pipeline achieved over 40% storage cost reduction, a 13% CPU reduction, and a 40% drop in network I/O while maintaining training speed.

Future Plans

Support upserts for partial data back‑flow.

Introduce materialized views for frequently accessed datasets.

Implement data skipping to push more predicates down to storage.

Provide Arrow‑based preprocessing APIs to accelerate downstream training.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Data Lake Training Optimization Iceberg Parquet feature storage

Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.