How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi
This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.
Current Data Warehouse Situation
Before adopting a lakehouse, the team used a Lambda architecture: a real‑time pipeline built on Kafka and Kudu and an offline pipeline based on Hive and OLAP engines such as GP, ClickHouse, and StarRocks. Approximately 80% of workloads were offline (80,000+ daily tasks, 400,000+ Hive tables) and 20% were real‑time (4,000+ tasks). Over time, the Lambda model showed four major drawbacks:
Redundant data computation: the same data is ingested in real time and then merged again in nightly batch jobs, which duplicates work and delays results.
Complex development and maintenance: two separate pipelines require duplicated logic and different skill sets.
Storage bloat: temporary and intermediate tables explode storage usage.
Growing compute pressure: nightly windows cannot keep up with daytime data accumulation.
Differences Between Data Lake and Data Warehouse
The team evaluated data‑lake technologies and identified two key dimensions of difference:
Computation model: Lakes support incremental, stream‑read updates, while warehouses rely on full‑load or partition‑overwrite approaches.
Data management: Lakes use fine‑grained statistics and indexing (e.g., Bloom, bucket, HBase) to manage files, enabling faster ingestion and query, whereas warehouses mainly manage data by partitions.
Lakes also provide features absent in traditional warehouses, such as snapshots, time‑travel, and schema evolution.
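The difference in computation model can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: `ToyLakeTable` and its methods are hypothetical names, commits are held in memory, and nothing here reflects Hudi's actual storage format or API. It shows how keeping an ordered commit log makes both time‑travel reads and incremental reads possible without rescanning a whole partition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyLakeTable:
    """Toy model only: a table kept as an ordered list of commits."""

    # Each entry is (commit_time, {record_key: row}); commit times must increase.
    commits: list = field(default_factory=list)

    def commit(self, commit_time: int, rows: dict) -> None:
        # A commit stores only the rows it inserted or changed.
        self.commits.append((commit_time, dict(rows)))

    def snapshot(self, as_of: Optional[int] = None) -> dict:
        # Time travel: replay every commit up to and including `as_of`.
        state = {}
        for commit_time, rows in self.commits:
            if as_of is None or commit_time <= as_of:
                state.update(rows)
        return state

    def incremental(self, after: int) -> dict:
        # Incremental read: latest version of rows changed after `after`.
        changed = {}
        for commit_time, rows in self.commits:
            if commit_time > after:
                changed.update(rows)
        return changed


table = ToyLakeTable()
table.commit(1, {"u1": {"city": "Paris"}, "u2": {"city": "Rome"}})
table.commit(2, {"u1": {"city": "Lyon"}})
table.commit(3, {"u3": {"city": "Oslo"}})

print(table.snapshot(as_of=1))     # the table as it looked after commit 1
print(table.incremental(after=1))  # only what commits 2 and 3 touched
```

A downstream job that remembers the last commit it processed can call `incremental` repeatedly instead of reloading the full table, which is the essence of the stream‑read path described above.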
Why Apache Hudi and Its Core Concepts
Hudi was chosen because it offers essential lakehouse capabilities: ACID transactions, Merge‑On‑Read, bulk load, incremental queries, and time‑travel. It also includes built‑in data‑ingestion functions, automatic snapshot commits, expired‑snapshot cleanup, small‑file merging, periodic MOR compaction, and rollback support. Its key abstractions—record key and payload—handle CDC as well as regular messages, allowing updates and partial merges during the write phase.
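To make the record key and payload idea concrete, here is a minimal sketch of a partial merge at write time. It is a toy in plain Python, not Hudi's payload classes: `merge_partial` and `upsert` are hypothetical helpers, and the rule "a non‑None incoming field overwrites the stored field" is an assumption chosen for illustration.

```python
from typing import Optional


def merge_partial(existing: Optional[dict], incoming: dict) -> dict:
    """Toy payload rule: non-None incoming fields overwrite stored fields."""
    if existing is None:
        # First record for this key: store it as-is.
        return dict(incoming)
    merged = dict(existing)
    for column, value in incoming.items():
        if value is not None:
            merged[column] = value
    return merged


def upsert(table: dict, record_key: str, incoming: dict) -> None:
    # The record key locates the stored row; the payload rule merges the two.
    table[record_key] = merge_partial(table.get(record_key), incoming)


orders = {}
# The fact stream knows the amount but not the user name.
upsert(orders, "order-1", {"order_id": "order-1", "amount": 30, "user_name": None})
# The dimension stream later fills in the user name for the same key.
upsert(orders, "order-1", {"order_id": "order-1", "amount": None, "user_name": "alice"})

print(orders["order-1"])
```

Because two streams can each contribute their own columns to the same key, the merged row ends up complete without an explicit join step.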
Hudi sits between storage (HDFS or object storage) and query engines, exposing an incremental stream‑read path that enables real‑time warehousing.
Table types: Copy‑On‑Write (merges at write time, optimized for reads) and Merge‑On‑Read (merges at read time, optimized for writes).
Indexing: Bloom, bucket, or HBase indexes enable efficient point‑lookups and query acceleration.
Timeline: Hudi maintains a timeline of actions (such as COMMIT and CLEAN), each moving through states (REQUESTED, INFLIGHT, COMPLETED), which underpins snapshot reads and rollbacks.
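The trade‑off between Copy‑On‑Write and Merge‑On‑Read can be sketched with two toy classes. These are illustrative assumptions only: the class names, the in‑memory dictionaries standing in for base files, and the list standing in for log files are all hypothetical and do not mirror Hudi's file layout.

```python
class CopyOnWriteTable:
    """Toy COW: each write rewrites the base data, so reads need no merge."""

    def __init__(self):
        self.base = {}
        self.rewrites = 0

    def upsert(self, rows: dict) -> None:
        # Merge at write time: produce a new base containing the changes.
        self.base = {**self.base, **rows}
        self.rewrites += 1

    def read(self) -> dict:
        return dict(self.base)


class MergeOnReadTable:
    """Toy MOR: writes append to a log; reads merge base and log."""

    def __init__(self):
        self.base = {}
        self.log = []

    def upsert(self, rows: dict) -> None:
        # Cheap write: append the delta, leave the base untouched.
        self.log.append(dict(rows))

    def read(self) -> dict:
        # Merge at read time: apply log entries over the base in order.
        merged = dict(self.base)
        for delta in self.log:
            merged.update(delta)
        return merged

    def compact(self) -> None:
        # Compaction folds the log into the base so later reads are cheap.
        self.base = self.read()
        self.log.clear()


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in ({"k1": 1}, {"k2": 2}, {"k1": 10}):
    cow.upsert(batch)
    mor.upsert(batch)

same_before_compaction = cow.read() == mor.read()
pending_log_entries = len(mor.log)
mor.compact()
print(same_before_compaction, cow.rewrites, pending_log_entries, len(mor.log))
```

Both tables return the same rows; they differ in where the merge cost is paid. The toy `compact` step corresponds to the periodic MOR compaction mentioned earlier.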
Lakehouse Integrated Practice
The team built a unified batch‑and‑stream architecture and developed a custom data‑integration solution. Highlights include:
Over 700 core ODS tables migrated to the lake.
ODS cleaning jobs now start at 00:05, reducing latency.
Data freshness improved from T+1 to minute‑level.
Multiple business lines have real‑time lakehouse scenarios in production.
Real‑time dimension joins are achieved via Hudi payload‑based partial updates.
Incremental statistics are realized with Flink’s cumulative windows feeding Hudi.
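The last highlight can be sketched in plain Python. Cumulative windows share a start time and grow by a fixed step up to a maximum size, so each event contributes to every window that contains it. The code below is a simplified batch simulation, not Flink code: `cumulate_windows` and `cumulative_sums` are hypothetical helpers, timestamps are plain integers, and the final upsert keyed by window start is an assumption about how such results might be written to a table.

```python
from collections import defaultdict


def cumulate_windows(ts: int, step: int, size: int) -> list:
    """Return the (start, end) cumulative windows that contain timestamp ts."""
    if size % step != 0:
        raise ValueError("size must be a multiple of step")
    start = ts - ts % size
    # Smallest window end that is strictly greater than ts.
    first_end = start + (ts - start) // step * step + step
    return [(start, end) for end in range(first_end, start + size + 1, step)]


def cumulative_sums(events, step: int, size: int) -> dict:
    """Sum amounts per cumulative window; half-open windows [start, end)."""
    totals = defaultdict(int)
    for ts, amount in events:
        for window in cumulate_windows(ts, step, size):
            totals[window] += amount
    return dict(sorted(totals.items()))


events = [(5, 10), (25, 5), (45, 1), (65, 7)]  # (timestamp, amount)
window_totals = cumulative_sums(events, step=20, size=60)

# Upsert keyed by window start: a later (larger) window overwrites an
# earlier one, so the table always holds the freshest running total.
latest_per_start = {}
for (start, end), total in window_totals.items():
    latest_per_start[start] = total

print(window_totals)
print(latest_per_start)
```

Writing each window result as an upsert on the same key is what turns a stream of growing windows into a single, continuously refreshed statistic per period.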
Data Integration Architecture, Challenges, and Solutions
Two integration approaches were evaluated: Flink CDC (feature‑complete) and a self‑developed MQ‑based pipeline. The team chose the self‑developed solution for data‑security and MQ reuse reasons. Version 1 of the integration architecture handled moderate data volumes and supported online back‑fills, but full‑load upserts caused high I/O pressure and limited parallelism.
Version 2 targets massive data volumes: it runs multiple tasks in parallel, abstracts resource provisioning and Flink‑Hudi parameter tuning, and provides a one‑click sync capability.
Metadata declaration emerged as a major pain point: Flink jobs, Hive connectors, and MQ tables each required separate metadata definitions, leading to duplication and lineage collection difficulties. The team solved this by extending the Hive‑Connector to expose native meta‑columns, allowing Flink to modify properties via LIKE statements and enabling unified metadata queries across Flink, Hive, Spark, and Presto.
For data masking, a custom Flink‑SQL preview tool running on a YARN session cluster samples data on demand (results in under 5 seconds) and supports user‑defined encryption functions for instant masking.
On data quality, the team relies on nightly MQ sampling and quality checks ahead of batch processing. Data‑loss incidents (e.g., HUDI‑4311, HUDI‑3912) were addressed with targeted bug fixes.
Operational tips shared include adjusting Flink checkpoint intervals, tuning Hudi merge memory, increasing off‑heap memory for Flink, and managing .hoodie file counts to avoid RPC pressure.
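As a rough illustration of the operational tips, the helper below collects the kind of settings involved into one options dictionary. Everything here is an assumption rather than the team's configuration: the option names are the Flink and Hudi Flink keys as I understand them and should be checked against the versions in use, the values are placeholders, and linking ".hoodie file counts" to the cleaning and archival settings is my own reading. `build_flink_hudi_options` is a hypothetical helper.

```python
def build_flink_hudi_options(
    checkpoint_interval: str = "3min",
    task_off_heap: str = "512m",
    merge_max_memory_mb: int = 512,
    retain_commits: int = 30,
    archive_min: int = 40,
    archive_max: int = 50,
) -> dict:
    """Assemble illustrative tuning options (names and values are assumptions)."""
    # Archival should keep more commits than cleaning retains, and the
    # archival minimum should sit below the archival maximum.
    if not retain_commits < archive_min < archive_max:
        raise ValueError("expected retain_commits < archive_min < archive_max")
    return {
        # Flink: how often checkpoints (and thus Hudi commits) are triggered.
        "execution.checkpointing.interval": checkpoint_interval,
        # Flink: extra off-heap memory for tasks.
        "taskmanager.memory.task.off-heap.size": task_off_heap,
        # Hudi Flink writer: memory budget for merging, in MB.
        "write.merge.max_memory": str(merge_max_memory_mb),
        # Hudi Flink writer: cleaning and archival bounds, which limit how
        # many timeline files accumulate under .hoodie.
        "clean.retain_commits": str(retain_commits),
        "archive.min_commits": str(archive_min),
        "archive.max_commits": str(archive_max),
    }


options = build_flink_hudi_options()
for key, value in options.items():
    print(f"{key} = {value}")
```

Keeping these settings in one validated place makes it harder to ship a combination where archival would remove commits that cleaning still expects to find.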
Future Planning
The roadmap focuses on improving lakehouse usability, expanding its application scope, and enhancing on‑lake analytical capabilities.
ITPUB
