Building ByteDance’s Real‑Time Data Warehouse with Hudi: Architecture & Solutions

This article explains how ByteDance designed and deployed a real‑time data warehouse on a data lake using Hudi. It covers three business scenarios, the challenges of latency, consistency and resource usage, and the engineering solutions, including upserts, compaction services, indexing, and future unified storage plans.

dbaplus Community

Jan 31, 2023

Real‑Time Data Warehouse Scenarios

ByteDance groups its real‑time warehouse workloads into three typical scenarios, based on data volume and latency requirements:

Scenario 1: Short‑video and live‑stream services generate massive log data. Required latency is about 5 minutes, and occasional data inconsistency is acceptable. The main goal is batch‑stream reuse.

Scenario 2: Live‑stream or e‑commerce sub‑scenarios with medium data volume. Required latency is under a minute, with a focus on low‑cost back‑fill and fast cold‑start.

Scenario 3: E‑commerce and education workloads with relatively small data volume but high QPS. Required latency is on the order of seconds, and strong consistency is required.

Initial Exploration of a Real‑Time Data Warehouse

Traditional offline warehouses suffer from two major drawbacks:

Timeliness limited to hour‑ or day‑level.

Update inefficiency: partial updates require full partition rewrites.

Hudi’s merge‑on‑read model provides both low‑latency visibility and efficient upserts, enabling batch‑and‑stream reuse on a data‑lake foundation.
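The merge‑on‑read idea can be illustrated with a minimal, in‑memory sketch. This is not Hudi's API or ByteDance's implementation: the `MorTable` class and its method names are hypothetical, and the sketch assumes the simplest merge rule, where the latest record for a key replaces the earlier one.

```python
from typing import Dict, List

Record = Dict[str, object]


class MorTable:
    """Toy merge-on-read table: a base snapshot plus an append-only log."""

    def __init__(self, key_field: str) -> None:
        self.key_field = key_field
        self.base: Dict[object, Record] = {}  # stands in for columnar base files
        self.log: List[Record] = []           # stands in for row-based log files

    def upsert(self, record: Record) -> None:
        # Writes only append to the log, so no base data is rewritten.
        self.log.append(dict(record))

    def read(self) -> Dict[object, Record]:
        # Readers merge base and log on the fly; the latest record per key wins.
        merged = dict(self.base)
        for record in self.log:
            merged[record[self.key_field]] = record
        return merged

    def compact(self) -> None:
        # Compaction folds the log into a new base so later reads are cheaper.
        self.base = self.read()
        self.log = []


table = MorTable("video_id")
table.upsert({"video_id": 1, "title": "first cut"})
table.upsert({"video_id": 1, "title": "final cut"})
print(table.read()[1]["title"])
```

The update is visible to readers as soon as it lands in the log, while the cost of rewriting the base is deferred to compaction. That deferral is what removes the full partition rewrite from the write path.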

Pilot steps:

Video‑metadata ingestion: The legacy pipeline used three Hive tables (MySQL → Table 1, Redis → Table 2, Join → Table 3). Daily dumps caused high peak resource usage and a 3.5‑hour data‑readiness delay.

Near‑real‑time validation: Hourly jobs dumped Kafka data to Hive for full‑data checks; urgent cases required sub‑hour checks.

By replacing daily dumps with hourly upsert operations on Hudi tables, peak resource consumption dropped by about 40% and data became ready about 3.5 hours earlier. For validation, Flink writes directly to Hudi and Presto queries the table, which eliminates the dump step and delivers near‑real‑time data quality checks.

Typical Scenario Implementations

1. Real‑Time Multi‑Dimensional Aggregation

Architecture:

Kafka streams are written to a lightweight Hudi aggregation layer.

Presto performs on‑demand heavy aggregations for dashboards.

For high‑QPS, low‑latency products, Presto pre‑computes aggregates and pushes results to a KV store.
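The split between a lightweight aggregation layer and on‑demand heavy aggregation can be sketched as two stages. This is only an illustration in plain Python: the event shape `(epoch_seconds, room_id, region, amount)`, the one‑minute grain, and both function names are assumptions, not details from the original talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Event = Tuple[int, str, str, int]  # (epoch_seconds, room_id, region, amount)
DIMENSIONS = ("minute", "room_id", "region")


def light_aggregate(events: Iterable[Event]) -> Dict[tuple, int]:
    """Stage 1 (write side): roll raw events up to (minute, room_id, region)."""
    out: Dict[tuple, int] = defaultdict(int)
    for ts, room_id, region, amount in events:
        out[(ts // 60, room_id, region)] += amount
    return dict(out)


def heavy_aggregate(light: Dict[tuple, int], dims: Sequence[str]) -> Dict[tuple, int]:
    """Stage 2 (query side): roll the light layer up to the requested dimensions."""
    positions = [DIMENSIONS.index(d) for d in dims]
    out: Dict[tuple, int] = defaultdict(int)
    for key, amount in light.items():
        out[tuple(key[p] for p in positions)] += amount
    return dict(out)


events = [(0, "r1", "cn", 5), (30, "r1", "cn", 7), (61, "r1", "us", 1), (65, "r2", "cn", 2)]
light = light_aggregate(events)
print(heavy_aggregate(light, ["region"]))
```

Because the light layer keeps every dimension, the same stored rows can answer different dashboard groupings. Results for high‑QPS products would be computed ahead of time with the second stage and written to the KV store.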

Challenges and engineering solutions:

Write stability: Decoupled asynchronous compaction via a dedicated Compaction Service, isolating Flink write tasks from Spark‑based compaction jobs.

Efficient update indexing: Hash‑based file location and hash‑filtering accelerate writes and queries.

Request‑model optimization: Shifted WriteTask timeline polling from the Hudi Metastore to a JobManager cache, raising requests‑per‑second capacity from hundreds of thousands to tens of millions.

MergeOnRead column pruning: Pushed column pruning down to the scan layer and stored key‑value pairs in a map during log merge, reducing serialization overhead.

Parallel read optimization: Split large BaseFiles into multiple tasks to increase read parallelism.

Combine Engine: Bypassed Avro serialization by reading Spark InternalRow or Flink RowData directly, improving MergeOnRead and compaction performance.
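The hash‑based file location and hash filtering mentioned in the list above can be sketched as follows. This is a simplified model, not Hudi's bucket index code: the fixed bucket count, the use of CRC32, and the `BucketedTable` class are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Optional

NUM_BUCKETS = 4  # assumed fixed number of buckets (file groups) per partition


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # CRC32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


class BucketedTable:
    """Toy table in which each bucket stands in for one file group."""

    def __init__(self, num_buckets: int = NUM_BUCKETS) -> None:
        self.num_buckets = num_buckets
        self.buckets: List[Dict[str, dict]] = [{} for _ in range(num_buckets)]
        self.buckets_scanned = 0

    def upsert(self, key: str, row: dict) -> None:
        # Write path: the target file group is computed from the key alone,
        # so no index lookup is needed to find where an existing record lives.
        self.buckets[bucket_of(key, self.num_buckets)][key] = row

    def get(self, key: str) -> Optional[dict]:
        # Read path: an equality predicate on the key prunes the scan to one bucket.
        self.buckets_scanned += 1
        return self.buckets[bucket_of(key, self.num_buckets)].get(key)


table = BucketedTable()
for i in range(100):
    table.upsert(f"k{i}", {"v": i})
print(table.get("k42"), table.buckets_scanned)
```

The trade‑off in this model is that the bucket count is fixed when the table is created, which is the limitation an extensible index would need to address.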

2. Real‑Time Data Analysis

Two sub‑scenarios are addressed:

Log‑type data ingestion: Adopted a NonIndex approach that appends records directly to log files without primary‑key deduplication, dramatically increasing write throughput.

Real‑time joins: Different streams write to separate Hudi tables; during compaction they are merged into a wide table. Conflict detection is performed via the Hudi Metastore, ensuring consistency across streams.
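Both sub‑scenarios can be illustrated with a short sketch: writers append without any key lookup, and a later compaction step merges the per‑stream logs into wide rows. The function names, the order/payment example data, and the rule that later records overwrite the columns they carry are assumptions. Conflict detection through the Hudi Metastore is not modeled here.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]


def append_log(log: List[Record], record: Record) -> None:
    # NonIndex-style write: no primary-key lookup or deduplication, just append.
    log.append(record)


def compact_to_wide(
    logs: Sequence[List[Record]], key_field: str, columns: Sequence[str]
) -> Dict[object, Record]:
    """Merge per-stream logs into one wide row per key."""
    wide: Dict[object, Record] = {}
    for log in logs:
        for record in log:
            # Start from an all-None row so columns missing from a stream stay None.
            row = wide.setdefault(record[key_field], {c: None for c in columns})
            row.update(record)  # later records overwrite the columns they carry
    return wide


orders: List[Record] = []
payments: List[Record] = []
append_log(orders, {"order_id": 1, "amount": 30})
append_log(payments, {"order_id": 1, "paid": True})
append_log(payments, {"order_id": 2, "paid": False})

wide = compact_to_wide([orders, payments], "order_id", ["order_id", "amount", "paid"])
print(wide[1])
```

Each stream only ever writes its own columns, so the join cost is paid once during compaction instead of on every incoming record.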

Future Planning

Planned enhancements aim to further reduce operational complexity and improve performance at scale:

Elastic Index System: Introduce an extensible hash index to support bucket‑index scalability and re‑hashing for large data volumes.

Adaptive Table Optimization Service: Provide a Table Management Service that automates compaction, cleaning, clustering, and index building, exposing a transparent API to users.

Metadata Service Enhancements: Add schema‑evolution support and concurrency control to enable safe batch‑and‑stream concurrent writes.

Unified Batch‑Stream Integration:

Unified SQL layer that allows Flink, Spark, and Presto to share the same query semantics.

Unified storage based on Hudi, offering a single source of truth for both batch and streaming data.

Unified catalog for consistent metadata across engines.
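Returning to the extensible hash index listed under Future Planning: one common way to grow a hash‑bucketed layout is to double the bucket count, because a key in old bucket b can then only land in bucket b or bucket b + N (where N is the old count). The sketch below shows that property. It is an assumption about how such re‑hashing could work, not a description of ByteDance's planned design, and `split_bucket` is a hypothetical helper.

```python
import zlib
from typing import Dict, List


def stable_hash(key: str) -> int:
    # CRC32 gives the same value in every process, which a persisted index needs.
    return zlib.crc32(key.encode("utf-8"))


def split_bucket(bucket_id: int, keys: List[str], old_n: int) -> Dict[int, List[str]]:
    """Split one bucket when the bucket count doubles from old_n to 2 * old_n.

    Since hash % (2 * old_n) is either hash % old_n or hash % old_n + old_n,
    every key either stays in bucket_id or moves to bucket_id + old_n.
    """
    stay: List[str] = []
    move: List[str] = []
    for key in keys:
        new_id = stable_hash(key) % (2 * old_n)
        (stay if new_id == bucket_id else move).append(key)
    return {bucket_id: stay, bucket_id + old_n: move}


old_n = 4
keys = [f"k{i}" for i in range(200)]
bucket_0 = [k for k in keys if stable_hash(k) % old_n == 0]
print({b: len(ks) for b, ks in split_bucket(0, bucket_0, old_n).items()})
```

Under this scheme each bucket can be split independently, so growing the table does not require reshuffling records across unrelated buckets.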
