Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big Data Technology & Architecture

Feb 6, 2023

1. Real-Time Data Warehouse Scenario Introduction

To roll out the data lake effectively, we worked with business teams and grouped their use cases into three typical scenarios:

Scenario 1: Short‑video and live‑streaming workloads with massive log volumes. These need batch‑stream reuse, tolerate minor inconsistencies, and require latency within five minutes.

Scenario 2: Live‑streaming or e‑commerce sub‑scenarios with medium data volume and minute‑level latency, where the focus is on low‑cost data backfill and cold start.

Scenario 3: Small‑scale e‑commerce and education workloads demanding second‑level latency, strong consistency, and high QPS for full‑volume calculations.

Based on these characteristics, we built a set of solutions on top of the data lake, which will be explored through concrete cases.

2. Initial Exploration of Real-Time Data Warehouse Scenarios

This section discusses the initial exploration of ByteDance's real‑time data warehouse, the problems encountered, and the solutions applied.

Initially, there was skepticism about the data lake supporting online production, so we started with a conservative approach, selecting scenarios where the data lake showed clear advantages over existing solutions.

Two major issues of offline warehouses were identified:

Timeliness – typically day‑ or hour‑level.

Update inefficiency – updating part of an hour’s data required re‑writing the entire partition.

A data lake addresses both issues: it shortens data latency and supports record‑level updates without rewriting an entire partition, while also letting batch and streaming jobs reuse the same tables.

1) Video Metadata‑Based Implementation

Our original design involved three Hive tables (Table 1, Table 2, Table 3). MySQL data was loaded into Table 1, Redis data into Table 2, and the two tables were joined. Two problems emerged:

High resource usage during peak hours due to large day‑level dumps concentrated at midnight.

Long readiness time because deduplication logic merged T‑1 day partitions with the current day, delaying availability.

By introducing Apache Hudi, we split the day‑level dump into hourly upserts. Hudi’s built‑in deduplication allowed Table 1 to act as a real‑time full dataset, so downstream joins could start as soon as the last hourly upsert of the day (e.g., hour 23) completed. This reduced readiness time by about 3.5 hours and cut peak‑hour resource consumption by ~40%.
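The effect of hourly upserts with deduplication can be illustrated with a minimal in-memory sketch. This is not Hudi's implementation or API: the table is a plain dictionary, and the names `upsert_batch`, `video_id`, and `updated_at` are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def upsert_batch(
    table: Dict[object, Row],
    batch: Iterable[Row],
    key_field: str = "video_id",        # hypothetical primary key
    ordering_field: str = "updated_at",  # hypothetical ordering column
) -> Dict[object, Row]:
    """Merge a batch into the table, keeping the newest row per key."""
    for row in batch:
        key = row[key_field]
        current = table.get(key)
        # Insert new keys; overwrite only when the incoming row is not older.
        if current is None or row[ordering_field] >= current[ordering_field]:
            table[key] = row
    return table


# Two hourly batches instead of one day-level dump.
table: Dict[object, Row] = {}
hour_22 = [
    {"video_id": "v1", "title": "draft", "updated_at": 100},
    {"video_id": "v2", "title": "intro", "updated_at": 101},
]
hour_23 = [
    {"video_id": "v1", "title": "final", "updated_at": 200},
    {"video_id": "v1", "title": "stale", "updated_at": 150},  # late, older record
]
upsert_batch(table, hour_22)
upsert_batch(table, hour_23)
```

After the second batch the dictionary already holds one current row per key, which is why a downstream join can start right after the last hourly batch without a separate deduplication pass.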

2) Near‑Real‑Time Data Validation

When real‑time jobs undergo frequent changes (e.g., metric additions), we need to validate that the output matches expectations. Our previous approach dumped an hour of data from Kafka to Hive and then performed full‑volume validation, which was too slow for urgent cases. After adopting Hudi, we use Flink to upsert data directly into Hudi tables and query them via Presto, achieving near‑real‑time visibility and validation while improving development efficiency and data quality.

For lake ingestion, we initially submitted SQL scripts with many parameters and DDL schemas, which became cumbersome for wide tables and unacceptable for business users.

We replaced the script‑based submission with a pure‑SQL interface, leveraging a unified catalog to automatically read schemas and required parameters, simplifying the lake‑ingestion SQL.

3. Typical Scenario Practices

The following illustrates ByteDance’s end‑to‑end real‑time data warehouse built on Hudi.

Data can be ingested in real time from MySQL or Kafka via Flink directly into Hudi. Lake‑internal computations (e.g., Flink‑based processing) can also write back to Hudi. For analytics, Spark and Presto provide interactive BI queries. For high‑QPS online services, data is first loaded into a KV store and then served to business systems.

1) Real‑Time Multi‑Dimensional Aggregation

Kafka increments are written to a lightweight aggregation layer in Hudi. Presto performs on‑demand heavy aggregations for analytical dashboards. For high‑QPS, low‑latency products, we pre‑compute multi‑dimensional aggregates with Presto and load them into a KV system; the longer‑term plan is to move toward materialized‑view‑based solutions.
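To make the pre-computation step concrete, here is a small sketch of multi-dimensional aggregation followed by a KV load. It is only a stand-in: in the pipeline described above the aggregation runs in Presto, whereas this example uses plain Python, a dictionary plays the role of the KV system, and the column names (`country`, `device`, `views`) and the function `precompute_aggregates` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# KV key: (dimension names, dimension values); value: summed metric.
KvKey = Tuple[Tuple[str, ...], Tuple[object, ...]]


def precompute_aggregates(
    rows: List[dict], dimensions: Sequence[str], metric: str
) -> Dict[KvKey, int]:
    """Sum the metric for every subset of the dimensions."""
    kv: Dict[KvKey, int] = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                kv[key] += row[metric]
    return dict(kv)


rows = [
    {"country": "US", "device": "ios", "views": 3},
    {"country": "US", "device": "android", "views": 5},
    {"country": "JP", "device": "ios", "views": 2},
]
kv_store = precompute_aggregates(rows, ("country", "device"), "views")

# An online request becomes a single key lookup instead of a scan.
us_views = kv_store[(("country",), ("US",))]
```

The trade-off is the usual one for pre-aggregation: lookups are cheap, but the number of stored keys grows with every added dimension, which is one reason a materialized-view approach can be attractive later.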

Problems encountered:

Write stability issues: Flink tasks consume many resources, frequently restart, and compaction delays affect queries.

Poor update performance causing severe back‑pressure.

Limited concurrency: Hudi Metastore Service stability impacts scaling.

Query latency up to ten minutes and frequent failures.

Solutions:

Write stability: Decouple compaction from Flink by introducing an asynchronous Compaction Service that pulls pending compaction plans from Hudi Metastore and runs Spark batch jobs.

Efficient update indexing: Use hash‑based file location and hash filtering to speed up writes and queries.

Request model optimization: Shift WriteTask’s timeline polling from Hudi Metastore to a cached JobManager view, boosting RPS from hundreds of thousands to nearly ten million.

MergeOnRead column pruning: Push column pruning to the scan layer and perform map‑based log merging, reducing serialization overhead.

Parallel read optimization: Split large BaseFiles into multiple tasks to increase read parallelism.

Combine Engine: Bypass Avro serialization by reading Spark InternalRow or Flink RowData directly, dramatically improving MergeOnRead and compaction performance.
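The idea behind hash-based file location in the list above can be sketched as follows. This is a simplified illustration, not Hudi's bucket index code: the bucket count, the use of CRC32 as the hash function, and the names `bucket_id`, `upsert_record`, and `point_lookup` are all assumptions made for the example.

```python
import zlib
from typing import Dict, List, Optional

Bucket = Dict[str, dict]  # stand-in for one file group


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Deterministically map a record key to a bucket."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def upsert_record(buckets: List[Bucket], record_key: str, row: dict) -> int:
    """Write path: the target bucket is computed, no index lookup is needed."""
    target = bucket_id(record_key, len(buckets))
    buckets[target][record_key] = row
    return target


def point_lookup(buckets: List[Bucket], record_key: str) -> Optional[dict]:
    """Read path: a key-equality filter only has to open one bucket."""
    target = bucket_id(record_key, len(buckets))
    return buckets[target].get(record_key)


buckets: List[Bucket] = [{} for _ in range(4)]  # assumed bucket count
first = upsert_record(buckets, "user-42", {"clicks": 1})
second = upsert_record(buckets, "user-42", {"clicks": 2})
```

Because the same key always hashes to the same bucket, an update never has to search for the file that holds the previous version, and a query filtering on the key can skip every other bucket.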

2) Real‑Time Data Analysis

We ingest detailed logs via Flink into Hudi and join them with dimension tables to produce wide tables. The two main requirements are efficient log ingestion and real‑time data association.

Log ingestion optimization: Implement a Non‑Index approach that appends logs directly to LogFiles without primary‑key deduplication, greatly improving write throughput.

Real‑time association: Enable storage‑layer joins by allowing different streams to write distinct columns to Hudi and merging them during read, with conflict detection handled by Hudi Metastore.
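A rough sketch of the storage-layer join: each stream writes only its own columns, and the partial rows are combined by key at read time. The conflict check here is a simple column-overlap test run locally; in the system described above that responsibility sits with Hudi Metastore, whose actual mechanism is not shown in this article. All names (`overlapping_columns`, `merge_partial_rows`, `order_id`, the stream and column names) are hypothetical.

```python
from typing import Dict, Iterable, Set


def overlapping_columns(stream_columns: Dict[str, Set[str]]) -> Set[str]:
    """Return columns claimed by more than one stream."""
    seen: Set[str] = set()
    overlap: Set[str] = set()
    for columns in stream_columns.values():
        overlap |= seen & columns
        seen |= columns
    return overlap


def merge_partial_rows(
    partial_rows: Iterable[dict], key_field: str = "order_id"
) -> Dict[str, dict]:
    """Combine partial rows that share a key into one wide row."""
    merged: Dict[str, dict] = {}
    for partial in partial_rows:
        key = partial[key_field]
        merged.setdefault(key, {key_field: key}).update(partial)
    return merged


# Each stream owns a disjoint set of columns.
stream_columns = {"orders": {"amount"}, "shipments": {"carrier"}}
log_records = [
    {"order_id": "o1", "amount": 30},      # written by the orders stream
    {"order_id": "o1", "carrier": "DHL"},  # written by the shipments stream
    {"order_id": "o2", "amount": 12},
]
conflicts = overlapping_columns(stream_columns)
wide_rows = merge_partial_rows(log_records)
```

A row such as `o2`, for which only one stream has written so far, simply lacks the other stream's columns until that data arrives.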

4. Future Plans

1) Elastic and Extensible Index System

We plan to develop an Extensible Hash Index to improve bucket index scalability and support re‑hashing for large data volumes.
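One way to picture re-hashing is bucket splitting by doubling the bucket count. The sketch below is only an illustration of that general idea, under the assumption of modulo-based bucketing with CRC32; it is not the planned Extensible Hash Index design, and the names `key_hash` and `split_bucket` are hypothetical.

```python
import zlib
from typing import List, Tuple


def key_hash(record_key: str) -> int:
    """Deterministic hash used for bucketing (CRC32 is an assumption)."""
    return zlib.crc32(record_key.encode("utf-8"))


def split_bucket(
    keys: List[str], bucket: int, old_count: int
) -> Tuple[List[str], List[str]]:
    """Split one bucket when the bucket count doubles.

    A key with hash % old_count == bucket can only land in `bucket`
    or `bucket + old_count` under the doubled count, so only this
    bucket's keys need to be examined.
    """
    new_count = old_count * 2
    stay: List[str] = []
    move: List[str] = []
    for key in keys:
        if key_hash(key) % new_count == bucket:
            stay.append(key)
        else:
            move.append(key)
    return stay, move


OLD_COUNT = 4
all_keys = [f"user-{i}" for i in range(200)]
bucket_0 = [k for k in all_keys if key_hash(k) % OLD_COUNT == 0]
stay, move = split_bucket(bucket_0, bucket=0, old_count=OLD_COUNT)
```

The useful property is locality: growing the index rewrites one bucket at a time instead of redistributing every record in the table.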

2) Adaptive Table Optimization Service

In collaboration with the community, we will launch a Table Management Service to automate compaction, cleaning, clustering, and index building, reducing user operational overhead.

3) Enhanced Metadata Service

Upcoming features include schema evolution support and concurrency control for simultaneous batch‑stream writes.

4) Unified Batch‑Stream Processing

Our roadmap aims for unified SQL across Flink, Spark, and Presto, unified storage via Hudi, and a unified catalog for metadata management.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
