How ByteDance Scales Real‑Time Data Warehouses with Hudi and Flink
This article details ByteDance's practical experience building real‑time data warehouses on a data lake using Hudi, Flink, and related optimizations, covering scenario analysis, architecture, performance challenges, and future roadmap for scalable, low‑latency analytics.
Real‑Time Data Warehouse Scenarios
Three typical business scenarios drive the design of a real‑time data warehouse built on a data lake:
Scenario 1 – Short video & live‑streaming : massive log volume, batch‑stream reuse, latency requirement ≤ 5 minutes.
Scenario 2 – Live e‑commerce & related services : medium volume, sub‑minute latency, need for low‑cost back‑tracking and cold‑start.
Scenario 3 – E‑commerce & education : small data sets, second‑level latency, strong consistency and high QPS.
Challenges of Using a Data Lake for Real‑Time Warehousing
Traditional offline warehouses suffer from two major drawbacks:
Timeliness : data is refreshed only daily or hourly.
Update cost : partial updates require full‑partition rewrites, which is inefficient.
A data lake combined with Apache Hudi provides both low‑latency visibility and efficient incremental updates, enabling true batch‑stream reuse.
Implementation Steps
Video metadata ingestion – The original pipeline used three Hive tables (MySQL → Table 1, Redis → Table 2, then Join) with nightly dumps and deduplication, causing high peak resource usage and a 3.5‑hour readiness delay. By switching to Hudi upserts on an hourly basis, the pipeline reduces peak resource consumption by ~40 % and shortens data‑ready time by ~3.5 hours. -- Example Hudi upsert (SQL) Near‑real‑time validation – Previously, an hourly job dumped Kafka data to Hive for validation. After adopting Hudi, Flink writes directly to Hudi tables and Presto queries the tables for immediate validation, improving developer productivity and data quality.
SQL simplification – Original submissions required complex scripts with many parameters and DDL definitions. A unified catalog now auto‑detects schema and parameters, allowing pure‑SQL lake‑ingestion statements.
Overall Architecture
Data is ingested from MySQL and Kafka via Flink into Hudi tables. Lake‑side computation (e.g., dimension enrichment) can be performed before the data lands in Hudi. Analytics are served by Spark or Presto for ad‑hoc queries, while high‑QPS online services read from a KV store that is populated from Hudi.
Real‑Time Multi‑Dimensional Aggregation
Kafka streams are upserted into a lightweight Hudi aggregation layer. Presto performs heavy multi‑dimensional aggregations for dashboards. For high‑QPS use cases, pre‑computed results are materialized into a KV store; future work includes materialized view support.
Performance Optimizations
Write stability – Async compaction service : Flink only performs incremental writes and schedules compaction plans; a separate Compaction Service pulls pending plans from the Hudi Metastore and runs Spark compaction jobs, isolating write tasks from compaction.
Efficient update indexing : Hash‑based file location and hash filtering accelerate both writes and query pruning.
Request model optimization : Timeline polling is moved from Hudi Metastore to a JobManager cache, raising request‑per‑second capacity from hundreds of thousands to near ten million.
MergeOnRead column pruning : Log files are stored in columnar Parquet; column pruning is pushed down to the scan layer and applied before merging, reducing serialization overhead.
Parallel read : BaseFiles are split into multiple tasks, increasing read parallelism.
Combine Engine : Instead of Avro serialization, Spark InternalRow or Flink RowData are read directly, dramatically improving MergeOnRead and compaction performance.
Real‑Time Data Association
Multiple streams write to the same Hudi table without conflict at the file level. Column‑level conflicts are detected via the Hudi Metastore; conflicting writes are rejected, ensuring a consistent wide table after merge.
Future Roadmap
Extensible hash index : A scalable hash‑index design to support re‑hashing and large‑scale updates.
Table Management Service : Automates compaction, cleaning, clustering, and index building, exposing a transparent service to users.
Enhanced metadata service : Adds schema evolution and concurrency control to the Hudi Metastore, while remaining Hive‑compatible.
Unified batch‑stream platform :
Unified SQL layer – same SQL works across Flink, Spark, and Presto.
Unified storage – Hudi serves as the single source of truth.
Unified catalog – centralized metadata management.
Key Q&A Highlights
Q1: MergeOnRead now uses columnar Parquet logs instead of row‑based Avro.
Q2: Async compaction is executed by a separate Compaction Service that pulls plans from the Hudi Metastore.
Q3: Hudi tables are managed via a Hudi Metastore compatible with Hive Metastore APIs.
Q4: Multi‑stream writes use separate log files; column‑level conflicts are detected and rejected by the Metastore.
Q5: Near‑real‑time Hudi capabilities may eventually replace Kafka stream tables for sub‑second use cases.
Q6: Lake‑inside computation is still in pilot phases for selected scenarios.
Q7: Bucket Index replaces Bloom Filter for large‑scale data, reducing false positives and resource waste.
Illustrations
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
