How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

ITPUB

Mar 28, 2023

Current Data Warehouse Situation

Before adopting a lakehouse, the team used a Lambda architecture: a real‑time pipeline built on Kafka and Kudu, and an offline pipeline based on Hive and OLAP engines such as Greenplum (GP), ClickHouse, and StarRocks. Approximately 80% of workloads were offline (80,000+ daily tasks, 400,000+ Hive tables) and 20% were real‑time (4,000+ tasks). Over time, the Lambda model showed four major drawbacks:

Redundant data computation: the same data is ingested in real time and then merged again in nightly batches, which duplicates work and delays results.

Complex development and maintenance: two separate pipelines require duplicated logic and different skill sets.

Storage bloat: temporary and intermediate tables explode storage usage.

Growing compute pressure: nightly windows cannot keep up with daytime data accumulation.

Differences Between Data Lake and Data Warehouse

The team evaluated data‑lake technologies and identified two key dimensions of difference:

Computation model: Lakes support incremental, stream‑read updates, while warehouses rely on full‑load or partition‑overwrite approaches.

Data management: Lakes use fine‑grained statistics and indexing (e.g., Bloom, bucket, HBase) to manage files, enabling faster ingestion and query, whereas warehouses mainly manage data by partitions.

Lakes also provide features absent in traditional warehouses, such as snapshots, time‑travel, and schema evolution.

Why Apache Hudi and Its Core Concepts

Hudi was chosen because it offers essential lakehouse capabilities: ACID transactions, Merge‑On‑Read, bulk load, incremental queries, and time‑travel. It also includes built‑in data‑ingestion functions, automatic snapshot commits, expired‑snapshot cleanup, small‑file merging, periodic MOR compaction, and rollback support. Its key abstractions—record key and payload—handle CDC as well as regular messages, allowing updates and partial merges during the write phase.
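To make the record‑key and payload idea concrete, here is a minimal sketch in plain Python. It is not Hudi's API: the `Change` class, the `ordering` field, and the rule that `None` fields are skipped during a partial merge are illustrative assumptions, and Hudi's real payload classes are configurable and differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    key: str                 # record key: identifies the row being changed
    ordering: int            # ordering value, e.g. a source timestamp (assumed)
    fields: Optional[dict]   # payload; None stands for a CDC delete event


def merge(current: Optional[Change], incoming: Change) -> Optional[Change]:
    """Combine the stored record with an incoming change for the same key."""
    if current is not None and incoming.ordering < current.ordering:
        return current                      # late arrival: keep the newer stored version
    if incoming.fields is None:
        return None                         # delete removes the record
    merged = dict(current.fields) if current is not None else {}
    # Partial merge: only non-None incoming fields overwrite stored ones.
    merged.update({k: v for k, v in incoming.fields.items() if v is not None})
    return Change(incoming.key, incoming.ordering, merged)


def upsert(table: dict, batch: list) -> dict:
    """Apply a batch of changes to a table keyed by record key."""
    for change in batch:
        result = merge(table.get(change.key), change)
        if result is None:
            table.pop(change.key, None)
        else:
            table[change.key] = result
    return table


table = upsert({}, [
    Change("u1", 1, {"name": "Ann", "city": "Oslo"}),
    Change("u1", 3, {"name": None, "city": "Rome"}),   # partial update of city only
    Change("u1", 2, {"name": "Bob", "city": None}),    # out of order, ignored
    Change("u2", 1, {"name": "Cy"}),
    Change("u2", 2, None),                             # delete
])
# u1 keeps name "Ann" and takes city "Rome"; u2 is gone.
```

Because this toy version drops deleted records entirely, a late change that arrives after a delete would re‑insert the row; a production payload would need a policy for that case.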

Hudi sits between storage (HDFS or object storage) and query engines, exposing an incremental stream‑read path that enables real‑time warehousing.
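The incremental read path can be pictured with a small sketch, again in plain Python rather than Hudi's API. The `Table` class and its methods are hypothetical; the only idea borrowed from Hudi is that each commit carries a sortable, timestamp‑like instant, so a reader can ask for everything committed after the instant it last processed.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Each completed commit is (instant, {record_key: row}); instants sort as strings.
    commits: list = field(default_factory=list)

    def commit(self, instant: str, rows: dict) -> None:
        self.commits.append((instant, rows))

    def incremental_read(self, after_instant: str):
        """Return (changed rows, latest instant seen) for commits strictly after `after_instant`."""
        changed, latest = {}, after_instant
        for instant, rows in sorted(self.commits, key=lambda c: c[0]):
            if instant > after_instant:
                changed.update(rows)   # a later commit overwrites an earlier one for the same key
                latest = instant
        return changed, latest


orders = Table()
orders.commit("20230328000500", {"o1": {"amount": 10}})
orders.commit("20230328001000", {"o1": {"amount": 12}, "o2": {"amount": 5}})

# A downstream job that already processed the first commit only sees the second one.
rows, checkpoint = orders.incremental_read("20230328000500")
# Reading again from the returned checkpoint yields nothing new.
rows_again, _ = orders.incremental_read(checkpoint)
```

A downstream job that stores `checkpoint` between runs processes each change once instead of rescanning a whole partition, which is the property that separates this model from full‑load or partition‑overwrite processing.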

Write‑path modes: Copy‑On‑Write (write‑time merge, read‑optimized) and Merge‑On‑Read (read‑time merge, write‑optimized).

Indexing: Bloom, bucket, or HBase indexes enable efficient point‑lookups and query acceleration.

Timeline: Hudi maintains a timeline of actions (such as COMMIT and CLEAN), each moving through states (REQUESTED, INFLIGHT, COMPLETED); this timeline underpins snapshot reads and rollbacks.
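The trade‑off between the two table types in the list above can be shown with a toy model. This is a conceptual sketch in plain Python, not Hudi code: the two classes and their in‑memory dictionaries are assumptions standing in for base files and log files.

```python
class CopyOnWriteTable:
    """Merges at write time: each upsert rewrites the base data, reads are a plain scan."""

    def __init__(self):
        self.base = {}

    def upsert(self, rows: dict) -> None:
        self.base = {**self.base, **rows}   # write a new base version containing the merge

    def read(self) -> dict:
        return dict(self.base)


class MergeOnReadTable:
    """Appends changes to a log at write time and merges them on read or compaction."""

    def __init__(self):
        self.base = {}
        self.log = []

    def upsert(self, rows: dict) -> None:
        self.log.append(rows)               # cheap append, no rewrite of the base

    def read(self) -> dict:
        merged = dict(self.base)
        for rows in self.log:               # pay the merge cost at read time
            merged.update(rows)
        return merged

    def compact(self) -> None:
        self.base = self.read()             # fold the log into a new base
        self.log = []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in ({"k1": 1, "k2": 2}, {"k2": 20}, {"k3": 3}):
    cow.upsert(batch)
    mor.upsert(batch)
```

Both tables return the same snapshot; they differ in when the merge cost is paid. Periodic compaction, which the article lists among Hudi's built‑in services, moves a Merge‑On‑Read table back toward cheap reads.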

Lakehouse Integrated Practice

The team built a unified batch‑and‑stream architecture and developed a custom data‑integration solution. Highlights include:

Over 700 core ODS tables migrated to the lake.

ODS cleaning jobs now start at 00:05, reducing latency.

Data freshness improved from T+1 to minute‑level.

Multiple business lines have real‑time lakehouse scenarios in production.

Real‑time dimension joins are achieved via Hudi payload‑based partial updates.

Incremental statistics are realized with Flink’s cumulative windows feeding Hudi.
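The cumulative‑window statistics in the last highlight can be sketched as follows. This plain‑Python function imitates the semantics of a cumulative window (windows share a start and grow by a fixed step up to a maximum size); it is not Flink code, and the function name, the integer timestamps, and the idea of keying the result by window start are assumptions made for illustration.

```python
def cumulate(events, step, size):
    """Return sorted (window_start, window_end, total) tuples for cumulative windows.

    events: iterable of (timestamp, value) with integer timestamps.
    step and size use the same unit as the timestamps; size must be a multiple of step.
    """
    if size % step != 0:
        raise ValueError("size must be a multiple of step")
    totals = {}
    for ts, value in events:
        start = ts - ts % size
        # End of the smallest window [start, end) that contains ts.
        first_end = start + (ts - start) // step * step + step
        # The event also counts toward every larger window with the same start.
        for end in range(first_end, start + size + 1, step):
            totals[(start, end)] = totals.get((start, end), 0) + value
    return [(s, e, total) for (s, e), total in sorted(totals.items())]


# Minutes since midnight, hourly steps, three-hour maximum window.
events = [(10, 5), (70, 3), (130, 2), (190, 4)]
windows = cumulate(events, step=60, size=180)

# Upserting by window start keeps only the most recent running total per window.
latest = {}
for start, end, total in windows:
    latest[start] = total
```

Keying the output by window start means each step overwrites the previous running total, so a table that supports upserts holds one current figure per window rather than a row per step.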

Data Integration Architecture, Challenges, and Solutions

Two integration approaches were evaluated: Flink CDC (feature‑complete) and a self‑developed MQ‑based pipeline. The team chose the self‑developed solution for data‑security and MQ reuse reasons. Version 1 of the integration architecture handled moderate data volumes and supported online back‑fills, but full‑load upserts caused high I/O pressure and limited parallelism.

Version 2 is optimized for massive data volumes: it runs multiple tasks in parallel, abstracts resource provisioning and Flink‑Hudi parameter tuning, and provides a one‑click sync capability.

Metadata declaration emerged as a major pain point: Flink jobs, Hive connectors, and MQ tables each required separate metadata definitions, leading to duplication and lineage collection difficulties. The team solved this by extending the Hive‑Connector to expose native meta‑columns, allowing Flink to modify properties via LIKE statements and enabling unified metadata queries across Flink, Hive, Spark, and Presto.
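As a rough illustration of the LIKE approach described above, the helper below renders a Flink SQL `CREATE TABLE ... LIKE` statement that reuses an existing table's schema and overrides only selected options. The helper name and both table names are hypothetical, and the article does not show the team's actual statements; `read.streaming.enabled` is used here only as an example of a Hudi connector option.

```python
def derive_table_ddl(new_table: str, base_table: str, overrides: dict) -> str:
    """Render a Flink SQL statement that derives a table from an existing definition.

    The schema comes from `base_table`; only the options in `overrides` are replaced.
    Option values are inserted as-is, so they must not contain single quotes.
    """
    options = ",\n".join(f"  '{k}' = '{v}'" for k, v in sorted(overrides.items()))
    return (
        f"CREATE TABLE {new_table} WITH (\n{options}\n) "
        f"LIKE {base_table} (OVERWRITING OPTIONS)"
    )


# Hypothetical names: a streaming view of an ODS table already declared in the Hive catalog.
ddl = derive_table_ddl(
    "ods_orders_stream",
    "hive_catalog.ods.orders",
    {"read.streaming.enabled": "true"},
)
print(ddl)
```

The point of the pattern is that the column list lives in one place, so Flink jobs no longer repeat a schema that Hive, Spark, and Presto already share.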

For data masking, a custom Flink‑SQL preview tool running on a YARN session cluster provides on‑demand data sampling (sub‑5‑second latency) and supports user‑defined encryption functions for instant masking.

Data‑quality measures include nightly MQ sampling and quality checks before batch jobs run. Data‑loss incidents (e.g., HUDI‑4311, HUDI‑3912) were addressed with targeted bug fixes.

Operational tips shared include adjusting Flink checkpoint intervals, tuning Hudi merge memory, increasing off‑heap memory for Flink, and keeping the number of files under .hoodie in check to avoid RPC pressure.
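The masking step mentioned above might look something like the sketch below. The article does not describe the team's encryption functions, so a keyed hash (HMAC‑SHA256 from the Python standard library) stands in for them here; the function name, column names, and secret are all assumptions.

```python
import hashlib
import hmac


def mask_rows(rows, sensitive, secret: bytes):
    """Return copies of sampled rows with sensitive columns replaced by a keyed hash.

    The same input always maps to the same token, so masked columns can still be
    compared or joined, but the original value cannot be read from the preview.
    """
    def mask(value) -> str:
        digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]      # shortened token for display

    return [
        {col: mask(val) if col in sensitive else val for col, val in row.items()}
        for row in rows
    ]


sample = [
    {"order_id": 1, "phone": "13800000000", "amount": 9.5},
    {"order_id": 2, "phone": "13800000000", "amount": 3.0},
]
preview = mask_rows(sample, sensitive={"phone"}, secret=b"demo-secret")
```

A keyed hash is one‑way, so if previews must be reversible for authorised users, a real encryption function would replace `mask`.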

Future Planning

The roadmap focuses on improving lakehouse usability, expanding its application scope, and enhancing on‑lake analytical capabilities.
