Real-Time Data Warehouse Practices with Hudi at ByteDance
This presentation details ByteDance's real‑time data‑warehouse implementations using Apache Hudi, covering scenario classifications, challenges of traditional offline warehouses, practical solutions for ingestion, upsert, validation, indexing, query optimization, and future plans for extensible indexing and unified batch‑stream processing.
01 Real‑Time Data Warehouse Scenario Introduction
ByteDance groups its real-time warehousing workloads into three typical scenarios: (1) short-video and live-stream workloads with large data volumes, latency of about five minutes, and a need to reuse one pipeline for batch and streaming; (2) live-commerce and other e-commerce sub-scenarios with medium data volumes, latency of about one minute, and a need for low-cost backfill and cold start; (3) e-commerce and education workloads with small data volumes, second-level latency, strong consistency, and high QPS.
02 Real‑Time Data Warehouse Exploration
Traditional offline warehouses suffer from poor timeliness (data is ready only at day or hour granularity) and inefficient updates (changing a few rows means rewriting a full partition). A data lake built on Hudi provides both low-latency ingestion and efficient updates, which enables batch-stream convergence.
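The gap between a full partition rewrite and a keyed upsert can be illustrated with a toy in-memory model. This is a minimal sketch and not Hudi's API: the `upsert` function, the record fields, and the `ts` ordering field are all assumptions made for illustration.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def upsert(table: Dict[str, Record], changes: Iterable[Record],
           key: str = "id", ordering: str = "ts") -> int:
    """Apply keyed changes in place and return how many rows were written.

    Only rows whose key appears in `changes` are touched; a change is
    ignored when the stored row already carries a newer ordering value.
    """
    touched = 0
    for change in changes:
        current = table.get(change[key])
        if current is None or change[ordering] >= current[ordering]:
            table[change[key]] = change
            touched += 1
    return touched


# Toy "partition" of video metadata keyed by id (hypothetical data).
table = {
    "v1": {"id": "v1", "title": "cats", "ts": 1},
    "v2": {"id": "v2", "title": "dogs", "ts": 1},
    "v3": {"id": "v3", "title": "birds", "ts": 1},
}
changes = [
    {"id": "v2", "title": "dogs (edited)", "ts": 2},  # update of an existing row
    {"id": "v4", "title": "fish", "ts": 2},           # insert of a new row
    {"id": "v1", "title": "stale", "ts": 0},          # late event, older than stored row
]
touched = upsert(table, changes)
```

In this model the work is proportional to the number of changed keys, whereas a partition rewrite would write every row again regardless of how few of them changed.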
Solution steps include:
Video-metadata landing: three Hive tables were replaced with Hudi upserts, which made data ready about 3.5 hours earlier and cut peak resource usage by 40%.
Near-real-time validation: previously, hourly jobs dumped Kafka data to Hive before validation could run; with Hudi, Flink writes directly to Hudi and Presto queries give near-real-time visibility.
Usability: complex scripts and manual schema DDL were replaced by pure-SQL job submission, backed by a unified catalog that auto-loads schemas and parameters.
03 Typical Scenario Practice
The end-to-end pipeline is MySQL/Kafka → Flink → Hudi (storage) → Spark/Presto for interactive queries; high-QPS services read from KV stores instead.
Real-time multi-dimensional aggregation writes incremental data to a lightweight Hudi layer; heavy aggregation is performed in Presto, and materialized views feed KV stores for products that need low latency.
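The split between a lightweight write layer and query-time aggregation can be sketched in a few lines. This is an illustrative model only: the `aggregate` helper, the row fields, and the `gmv:` key format are hypothetical, and a plain dict stands in for the KV store.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def aggregate(rows: Iterable[Row], dims: List[str], metric: str) -> Dict[Tuple, float]:
    """Group detail rows by the requested dimensions and sum one metric."""
    totals: Dict[Tuple, float] = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[metric]
    return dict(totals)


# Lightweight layer: detail rows are appended as they arrive (hypothetical data).
detail_layer: List[Row] = []
detail_layer.append({"region": "east", "category": "books", "gmv": 10.0})
detail_layer.append({"region": "east", "category": "toys", "gmv": 5.0})
detail_layer.append({"region": "west", "category": "books", "gmv": 7.5})

# Heavy aggregation happens at query time, over any combination of dimensions.
by_region = aggregate(detail_layer, ["region"], "gmv")

# A materialized result can then be pushed to a key-value store for serving.
kv_store = {"gmv:" + ":".join(key): value for key, value in by_region.items()}
```

Because the write path stores only detail rows, new dimension combinations need no change on the ingestion side; the cost moves to the query engine and to refreshing the materialized results.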
Key problems identified:
Unstable writes (high resource usage, frequent restarts, delayed compaction).
Poor update performance, which causes back-pressure.
Limited concurrency because of load on the Hudi Metastore.
Slow queries (latency of up to 10 minutes, and some queries failing).
Targeted solutions:
Async compaction service: Flink handles only incremental writes and schedules compaction; a separate Compaction Service pulls pending plans from Hudi Metastore and runs Spark compaction jobs.
High‑efficiency indexing: hash‑based bucket index for fast file location and query pruning.
Request model optimization: cache write-task plans to reduce Metastore polling, raising supported RPS from hundreds of thousands to nearly ten million.
MergeOnRead column pruning: push column projection down to the scan layer and prune columns during the log merge, reducing serialization overhead.
Parallel read optimization: split large BaseFiles into multiple tasks to increase read parallelism.
Combine Engine: bypass Avro by reading engine‑native rows (Spark InternalRow, Flink RowData), dramatically improving MergeOnRead and compaction performance.
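Of the solutions above, the hash-based bucket index is the easiest to sketch in isolation. The following is a minimal sketch under stated assumptions, not Hudi's implementation: `bucket_for`, `BucketedTable`, and the choice of CRC32 as the hash are hypothetical, and each bucket stands in for one file group.

```python
import zlib
from typing import Dict, Optional


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


class BucketedTable:
    """Toy table in which each bucket stands in for one file group."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets
        self.buckets: Dict[int, Dict[str, dict]] = {}

    def upsert(self, record_key: str, row: dict) -> int:
        # The target file group is computed from the key, not looked up.
        bucket = bucket_for(record_key, self.num_buckets)
        self.buckets.setdefault(bucket, {})[record_key] = row
        return bucket

    def point_lookup(self, record_key: str) -> Optional[dict]:
        # Query pruning: only the one bucket that can hold the key is read.
        bucket = bucket_for(record_key, self.num_buckets)
        return self.buckets.get(bucket, {}).get(record_key)


table = BucketedTable(num_buckets=8)
table.upsert("order-1", {"amount": 10})
table.upsert("order-1", {"amount": 20})  # same key lands in the same bucket
latest = table.point_lookup("order-1")
```

The design point is that locating a record costs one hash computation on both the write path and the read path, with no per-file probing. The trade-off in this sketch is that the bucket count is fixed when the table is created.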
04 Future Planning
Planned enhancements include an extensible hash index for scalable updates, a self‑adaptive Table Management Service to automate compaction, cleaning, clustering, and index building, and a richer metadata service supporting schema evolution and concurrency control.
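The talk summary does not describe how the planned extensible hash index works. As a point of reference, the sketch below shows textbook extendible hashing, in which a full bucket is split on its own while all other buckets stay untouched. Every name here (`ExtendibleHashIndex`, `Bucket`, the CRC32 hash, the capacity of 2) is an assumption for illustration and should not be read as ByteDance's or Hudi's design.

```python
import zlib
from typing import Dict, List


class Bucket:
    def __init__(self, local_depth: int) -> None:
        self.local_depth = local_depth
        self.items: Dict[str, object] = {}


class ExtendibleHashIndex:
    """Toy extendible hash: buckets split one at a time as they fill."""

    def __init__(self, bucket_capacity: int = 2) -> None:
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.directory: List[Bucket] = [Bucket(0)]

    @staticmethod
    def _hash(key: str) -> int:
        return zlib.crc32(key.encode("utf-8"))

    def _slot(self, key: str) -> int:
        # The low `global_depth` bits of the hash select the directory slot.
        return self._hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key: str):
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key: str, value: object) -> None:
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory; the new slots alias the existing buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Re-point the directory slots whose new distinguishing bit is set.
        for slot, target in enumerate(self.directory):
            if target is bucket and slot & high_bit:
                self.directory[slot] = sibling
        # Move only the split bucket's keys; other buckets are not rewritten.
        for key in list(bucket.items):
            if self._hash(key) & high_bit:
                sibling.items[key] = bucket.items.pop(key)


index = ExtendibleHashIndex(bucket_capacity=2)
for i in range(20):
    index.put(f"key-{i:02d}", i)
```

Compared with a fixed bucket count, growth here is incremental: a split rewrites the contents of one bucket, which is the property that makes this family of indexes attractive for tables whose size is hard to predict.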
Batch‑stream integration roadmap:
Unified SQL layer across Flink, Spark, Presto.
Unified storage based on Hudi.
Unified catalog for consistent metadata.
05 Q&A
The Q&A covered the storage format used for column pruning (Avro vs Parquet), async compaction scheduling, Hudi Metastore management, conflict detection for multi-stream writes, the relationship between Kafka stream tables and Hudi, whether Hudi will be used for all streaming in the future, and the reasons for adopting the Bucket Index over the Bloom Filter index.
DataFunTalk
