How ByteDance Scales Real‑Time Data Warehouses with Hudi and Flink

This article details ByteDance's practical experience building real‑time data warehouses on a data lake using Hudi, Flink, and related optimizations, covering scenario analysis, architecture, performance challenges, and future roadmap for scalable, low‑latency analytics.

ITPUB

Real‑Time Data Warehouse Scenarios

Three typical business scenarios drive the design of a real‑time data warehouse built on a data lake:

Scenario 1 – Short video & live‑streaming: massive log volume, batch‑stream reuse, latency requirement ≤ 5 minutes.

Scenario 2 – Live e‑commerce & related services: medium volume, sub‑minute latency, need for low‑cost back‑tracking and cold start.

Scenario 3 – E‑commerce & education: small data sets, second‑level latency, strong consistency and high QPS.

Challenges of Using a Data Lake for Real‑Time Warehousing

Traditional offline warehouses suffer from two major drawbacks:

Timeliness: data is refreshed only daily or hourly.

Update cost: partial updates require full‑partition rewrites, which is inefficient.

A data lake combined with Apache Hudi provides both low‑latency visibility and efficient incremental updates, enabling true batch‑stream reuse.

Implementation Steps

Video metadata ingestion – The original pipeline used three Hive tables (MySQL → Table 1, Redis → Table 2, then a join) with nightly dumps and deduplication, causing high peak resource usage and a 3.5‑hour readiness delay. Switching to hourly Hudi upserts reduces peak resource consumption by ~40% and shortens data‑ready time by ~3.5 hours.

Near‑real‑time validation – Previously, an hourly job dumped Kafka data to Hive for validation. After adopting Hudi, Flink writes directly to Hudi tables and Presto queries those tables for immediate validation, improving developer productivity and data quality.

SQL simplification – Job submissions originally required complex scripts with many parameters and DDL definitions. A unified catalog now auto‑detects schemas and parameters, so lake ingestion can be written as pure SQL.
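To make the upsert-versus-rewrite contrast in the video metadata step concrete, here is a toy in-memory model in Python. It is not Hudi's API: the function names, the `video_id` key, and the `ts` ordering field are all hypothetical, and "rows written" is only a rough stand-in for resource cost.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def full_partition_rewrite(
    partition: List[Record], changes: Iterable[Record], key: str
) -> Tuple[List[Record], int]:
    """Offline-style update: rebuild the whole partition.

    Returns the new partition and the number of rows written.
    """
    latest = {row[key]: row for row in partition}
    for change in changes:
        latest[change[key]] = change
    rewritten = list(latest.values())
    return rewritten, len(rewritten)


def upsert(
    table: Dict[object, Record], changes: Iterable[Record], key: str, ordering: str
) -> int:
    """Upsert-style update: touch only the changed keys.

    A change wins only if its ordering value is not older than the stored one.
    Returns the number of rows written.
    """
    written = 0
    for change in changes:
        current = table.get(change[key])
        if current is None or change[ordering] >= current[ordering]:
            table[change[key]] = change
            written += 1
    return written


# Hypothetical data: 1,000 existing videos, one update and one new video.
partition = [{"video_id": i, "title": f"v{i}", "ts": 1} for i in range(1000)]
changes = [
    {"video_id": 7, "title": "renamed", "ts": 2},
    {"video_id": 1000, "title": "new", "ts": 2},
]

_, rewrite_cost = full_partition_rewrite(partition, changes, "video_id")
table = {row["video_id"]: row for row in partition}
upsert_cost = upsert(table, changes, "video_id", "ts")
print(rewrite_cost, upsert_cost)  # 1001 rows rewritten vs. 2 rows upserted
```

The gap between the two counts grows with partition size, which is the intuition behind the lower peak resource usage reported above. The actual savings depend on file layout and compaction, which this sketch ignores.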

Overall Architecture

Data is ingested from MySQL and Kafka via Flink into Hudi tables. Lake‑side computation (e.g., dimension enrichment) can be performed before the data lands in Hudi. Analytics are served by Spark or Presto for ad‑hoc queries, while high‑QPS online services read from a KV store that is populated from Hudi.

Real‑Time Multi‑Dimensional Aggregation

Kafka streams are upserted into a lightweight Hudi aggregation layer. Presto performs heavy multi‑dimensional aggregations for dashboards. For high‑QPS use cases, pre‑computed results are materialized into a KV store; future work includes materialized view support.
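The flow above (upsert into a light layer, aggregate on demand, materialize hot results into a KV store) can be sketched with plain Python dictionaries. This is an illustrative model only: the `order_id`, `region`, `category`, `amount`, and `ts` fields are hypothetical, and the dictionaries stand in for the Hudi table and the KV store.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def upsert_events(layer: Dict[int, dict], events: Iterable[dict]) -> None:
    """Keep only the latest event per order_id in the light aggregation layer."""
    for event in events:
        current = layer.get(event["order_id"])
        if current is None or event["ts"] >= current["ts"]:
            layer[event["order_id"]] = event


def aggregate(
    layer: Dict[int, dict], dims: Sequence[str], measure: str
) -> Dict[Tuple, float]:
    """Sum one measure grouped by an arbitrary set of dimensions."""
    totals: Dict[Tuple, float] = defaultdict(float)
    for row in layer.values():
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


layer: Dict[int, dict] = {}
upsert_events(layer, [
    {"order_id": 1, "region": "east", "category": "books", "amount": 10.0, "ts": 1},
    {"order_id": 2, "region": "east", "category": "toys", "amount": 5.0, "ts": 1},
    # A later correction for order 1 replaces the earlier row instead of adding to it.
    {"order_id": 1, "region": "east", "category": "books", "amount": 12.0, "ts": 2},
])

# Ad-hoc query path: aggregate on whatever dimensions the dashboard asks for.
by_region = aggregate(layer, ("region",), "amount")

# High-QPS path: precompute one aggregation and keep it in a key-value map.
kv_store = aggregate(layer, ("region", "category"), "amount")
print(by_region, kv_store)
```

Because the layer stores the latest row per key rather than running totals, corrections do not double count, and the same layer can serve several different groupings.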

Performance Optimizations

Write stability – Async compaction service: Flink only performs incremental writes and schedules compaction plans; a separate Compaction Service pulls pending plans from the Hudi Metastore and runs Spark compaction jobs, isolating write tasks from compaction.
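The division of labor described here, where the writer only appends and schedules while a separate service executes the plans, can be modeled in a few lines of Python. Everything below is a hypothetical stand-in: `ToyMetastore`, `Writer`, `CompactionService`, and the `logs_per_plan` threshold are not Hudi, Flink, or Spark classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileGroup:
    base: Dict[str, dict] = field(default_factory=dict)        # compacted rows by key
    logs: List[Dict[str, dict]] = field(default_factory=list)  # uncompacted deltas


@dataclass
class ToyMetastore:
    pending_plans: List[str] = field(default_factory=list)     # file group ids to compact


class Writer:
    """Stands in for the streaming job: appends deltas and schedules plans only."""

    def __init__(self, groups: Dict[str, FileGroup], metastore: ToyMetastore,
                 logs_per_plan: int = 2) -> None:
        self.groups = groups
        self.metastore = metastore
        self.logs_per_plan = logs_per_plan

    def write(self, group_id: str, records: Dict[str, dict]) -> None:
        group = self.groups.setdefault(group_id, FileGroup())
        group.logs.append(dict(records))
        due = len(group.logs) >= self.logs_per_plan
        if due and group_id not in self.metastore.pending_plans:
            self.metastore.pending_plans.append(group_id)  # schedule, never execute


class CompactionService:
    """Stands in for the separate service that pulls and executes pending plans."""

    def __init__(self, groups: Dict[str, FileGroup], metastore: ToyMetastore) -> None:
        self.groups = groups
        self.metastore = metastore

    def run_once(self) -> int:
        compacted = 0
        while self.metastore.pending_plans:
            group = self.groups[self.metastore.pending_plans.pop(0)]
            for log in group.logs:       # apply deltas oldest first
                group.base.update(log)
            group.logs.clear()
            compacted += 1
        return compacted


groups: Dict[str, FileGroup] = {}
metastore = ToyMetastore()
writer = Writer(groups, metastore)
service = CompactionService(groups, metastore)

writer.write("fg-0", {"a": {"v": 1}})
writer.write("fg-0", {"a": {"v": 2}, "b": {"v": 3}})
pending_before = list(metastore.pending_plans)
base_before = dict(groups["fg-0"].base)

compacted = service.run_once()
print(pending_before, compacted, groups["fg-0"].base)
```

The writer never touches `base`, so a slow or failed compaction run cannot stall ingestion in this model; it only lets log files accumulate.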

Efficient update indexing: Hash‑based file location and hash filtering accelerate both writes and query pruning.
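A minimal sketch of hash-based file location, assuming a fixed bucket count and CRC32 as the hash (both are illustrative choices, not details given in the article). The point is that a key maps to its file by arithmetic alone, so neither writes nor point lookups need to probe an index.

```python
import zlib
from typing import Iterable, List

NUM_BUCKETS = 8  # hypothetical; fixed at table creation in this sketch


def bucket_of(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket with a hash that is stable across processes."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_for(record_key: str) -> str:
    """Hypothetical naming scheme: one file group per bucket."""
    return f"bucket-{bucket_of(record_key):04d}"


def prune_files(all_files: Iterable[str], lookup_keys: Iterable[str]) -> List[str]:
    """Hash filtering: keep only the files that could contain the requested keys."""
    wanted = {file_for(key) for key in lookup_keys}
    return [name for name in all_files if name in wanted]


all_files = [f"bucket-{i:04d}" for i in range(NUM_BUCKETS)]
target = file_for("video-42")
to_scan = prune_files(all_files, ["video-42"])
print(target, to_scan)
```

A fixed bucket count is what makes the lookup this cheap, and it is also the limitation that motivates the extensible hash index listed in the roadmap.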

Request model optimization: Timeline polling is moved from the Hudi Metastore to a JobManager cache, raising capacity from hundreds of thousands of requests per second to nearly ten million.
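The effect of moving timeline polling behind a cache can be illustrated by counting requests. The classes below are hypothetical stand-ins, and the sketch assumes that tasks can tolerate a timeline view that is only as fresh as the last refresh.

```python
from typing import Optional


class ToyTimelineService:
    """Stands in for the metadata service; counts how often it is called."""

    def __init__(self) -> None:
        self.requests = 0
        self.latest_instant = "001"

    def get_latest_instant(self) -> str:
        self.requests += 1
        return self.latest_instant


class JobManagerCache:
    """One poller refreshes the cache; tasks read the cached value locally."""

    def __init__(self, service: ToyTimelineService) -> None:
        self._service = service
        self._cached: Optional[str] = None

    def refresh(self) -> None:
        self._cached = self._service.get_latest_instant()

    def latest_instant(self) -> str:
        if self._cached is None:
            self.refresh()
        return self._cached


NUM_TASKS = 100  # hypothetical number of parallel write tasks

# Direct model: every task polls the service itself.
direct = ToyTimelineService()
for _ in range(NUM_TASKS):
    direct.get_latest_instant()

# Cached model: a single refresh serves all tasks.
backing = ToyTimelineService()
cache = JobManagerCache(backing)
cache.refresh()
seen = [cache.latest_instant() for _ in range(NUM_TASKS)]
print(direct.requests, backing.requests)
```

Load on the service now scales with the number of jobs rather than the number of tasks, at the cost of slightly staler reads between refreshes.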

MergeOnRead column pruning: Log files are stored in columnar Parquet; column pruning is pushed down to the scan layer and applied before merging, reducing serialization overhead.
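The idea of pruning before merging can be shown with a toy columnar layout, where each file is a dictionary from column name to a list of values. This is a simplified model rather than Hudi's reader: it assumes log records carry full rows, and the `id`, `title`, and `payload` columns are hypothetical.

```python
from typing import Dict, List, Sequence

ColumnarFile = Dict[str, list]


def prune(columnar_file: ColumnarFile, columns: Sequence[str]) -> ColumnarFile:
    """Scan-layer pruning: select only the requested columns; others are never touched."""
    return {name: columnar_file[name] for name in columns}


def merge_on_read(
    base: ColumnarFile, logs: List[ColumnarFile], key: str, projection: Sequence[str]
) -> List[dict]:
    """Merge base and log files by key, reading only the key and projected columns."""
    needed = [key] + [c for c in projection if c != key]
    merged: Dict[object, dict] = {}
    for columnar_file in [base] + logs:        # later files override earlier ones
        pruned = prune(columnar_file, needed)  # prune first, then merge
        for i in range(len(pruned[key])):
            row = {c: pruned[c][i] for c in needed}
            merged[row[key]] = row
    return [{c: row[c] for c in projection} for row in merged.values()]


base = {"id": [1, 2], "title": ["a", "b"], "payload": ["x" * 1000, "y" * 1000]}
log = {"id": [2, 3], "title": ["B", "c"], "payload": ["z" * 1000, "w" * 1000]}

rows = merge_on_read(base, [log], key="id", projection=["id", "title"])
print(rows)
```

The wide `payload` column is never assembled into a row, which is where the saving comes from. With a row-oriented log format the whole record would have to be deserialized before the unwanted fields could be dropped.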

Parallel read: BaseFiles are split into multiple tasks, increasing read parallelism.

Combine Engine: Instead of Avro serialization, Spark InternalRow or Flink RowData are read directly, dramatically improving MergeOnRead and compaction performance.

Real‑Time Data Association

Multiple streams write to the same Hudi table without conflict at the file level. Column‑level conflicts are detected via the Hudi Metastore; conflicting writes are rejected, ensuring a consistent wide table after merge.
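One way to picture column-level conflict detection is a registry in which each stream declares the columns it writes. The article does not say how the Hudi Metastore detects conflicts, so the ownership rule, class names, and stream names below are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List


class ColumnConflictError(Exception):
    """Raised when a stream tries to write a column another stream already owns."""


class ToyMetastore:
    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}  # column name -> writer id

    def register(self, writer: str, columns: Iterable[str]) -> None:
        columns = list(columns)
        clashes = {
            c: self._owners[c]
            for c in columns
            if c in self._owners and self._owners[c] != writer
        }
        if clashes:
            raise ColumnConflictError(f"{writer} conflicts on {clashes}")
        for c in columns:
            self._owners[c] = writer


def merge_wide_row(key: object, partials: List[dict]) -> dict:
    """Combine per-stream partial rows; safe because their columns are disjoint."""
    row = {"key": key}
    for partial in partials:
        row.update(partial)
    return row


metastore = ToyMetastore()
metastore.register("orders_stream", ["amount", "status"])
metastore.register("users_stream", ["user_name"])

rejected = False
try:
    metastore.register("refund_stream", ["status"])  # overlaps with orders_stream
except ColumnConflictError:
    rejected = True

wide = merge_wide_row(1, [{"amount": 10, "status": "paid"}, {"user_name": "ann"}])
print(rejected, wide)
```

Once overlapping writers are rejected up front, the merge step can combine partial rows without any ordering rule, because no two streams ever supply the same column.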

Future Roadmap

Extensible hash index: A scalable hash‑index design to support re‑hashing and large‑scale updates.

Table Management Service: Automates compaction, cleaning, clustering, and index building, exposing a transparent service to users.

Enhanced metadata service: Adds schema evolution and concurrency control to the Hudi Metastore, while remaining Hive‑compatible.

Unified batch‑stream platform:

Unified SQL layer – same SQL works across Flink, Spark, and Presto.

Unified storage – Hudi serves as the single source of truth.

Unified catalog – centralized metadata management.

Key Q&A Highlights

Q1: MergeOnRead now uses columnar Parquet logs instead of row‑based Avro.

Q2: Async compaction is executed by a separate Compaction Service that pulls plans from the Hudi Metastore.

Q3: Hudi tables are managed via a Hudi Metastore compatible with Hive Metastore APIs.

Q4: Multi‑stream writes use separate log files; column‑level conflicts are detected and rejected by the Metastore.

Q5: Near‑real‑time Hudi capabilities may eventually replace Kafka stream tables for sub‑second use cases.

Q6: Lake‑inside computation is still in pilot phases for selected scenarios.

Q7: Bucket Index replaces Bloom Filter for large‑scale data, reducing false positives and resource waste.

Illustrations

Scenario Overview
Real‑Time Warehouse Exploration
Video Metadata Ingestion
Hudi Upsert Diagram
Compaction Service
SQL Simplification
Overall Hudi Pipeline
Real‑Time Multi‑Dimensional Aggregation
Stability Improvements
High‑Efficiency Index
Request Model Optimization
MergeOnRead Column Pruning
Parallel Read Optimization
Combine Engine
Real‑Time Data Analysis
Log Data Ingestion
Non‑Index Log Ingestion
Real‑Time Data Association
Extensible Hash Index
Table Management Service
Metadata Service Enhancements
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
