Big Data 18 min read

How to Build a Lakehouse with RocketMQ and Apache Hudi: A Step‑by‑Step Guide

This article explains the Lakehouse architecture, its required features, the evolution of big‑data stacks, and provides a detailed, hands‑on guide for constructing a Lakehouse using RocketMQ (Connector & Stream) and Apache Hudi, including configuration, deployment, and sample code.

Alibaba Cloud Native

Jan 26, 2022

How to Build a Lakehouse with RocketMQ and Apache Hudi: A Step‑by‑Step Guide

Background and Lakehouse Requirements

The Lakehouse concept, originally proposed by Databricks, demands eight key capabilities: ACID transaction support, schema enforcement and governance, open storage formats (e.g., Parquet), direct BI access, separation of storage and compute, support for diverse data types (image, video, text, semi‑structured), multi‑workload compatibility (SQL, ML, analytics), and end‑to‑end streaming.

Evolution of Big‑Data Architectures

In the era of massive data volumes, a wide range of open‑source projects have emerged: messaging (RocketMQ, Kafka), compute (Flink, Spark, Storm), and storage (HDFS, HBase, Redis, Elasticsearch, Hudi, DeltaLake). While these components address specific use cases, architects often face challenges such as complex multi‑layer stacks, high latency, duplicated data copies, and steep learning curves.

Advantages of multi‑layer stacks: broad scenario coverage.

Disadvantages: long processing chains, high latency, duplicated storage, and increased operational cost.

These drawbacks motivate a unified architecture that reduces layers.

Lakehouse as an Upgraded Multi‑Layer Architecture

The Lakehouse compresses the traditional multi‑layer stack into a single storage layer and merges the messaging and compute layers using RocketMQ Stream. Data flows through a RocketMQ connector, is processed by RocketMQ Stream, and finally lands in Hudi via a RocketMQ‑Hudi connector, where Hudi provides indexing and unified APIs for downstream consumption.

RocketMQ Overview

RocketMQ, an Apache‑incubated messaging system, offers financial‑grade reliability, a minimalist architecture (NameServer + Broker clusters), low operational overhead, rich message types (transactional, delayed, ordered), and high throughput with millisecond‑level latency. It supports both on‑premise and cloud‑native deployments.

Key Commands

nohup sh bin/mqnamesrv &</code><code>nohup sh bin/mqbroker -n localhost:9876 &

kubectl apply -f example/rocketmq_cluster.yaml

CLI tools such as mqadmin enable cluster health checks and dynamic configuration updates.

RocketMQ Connector & Stream

The connector follows the OpenMessaging standard and integrates with many ecosystem sources and sinks (ActiveMQ, Cassandra, Elasticsearch, JDBC, Kafka, MySQL, etc.). Its architecture consists of a Runtime, Workers, and a pluggable Source‑&‑Sink framework. The Stream component provides a Flink‑compatible SQL layer with operators like window, join, and materialized views.

Apache Hudi Overview

Hudi is a streaming data‑lake platform that adds transactional capabilities to object storage (OSS, S3, etc.). Core features include MVCC/OCC concurrency control, record‑level upserts/deletes, automatic small‑file management, and query‑optimised layouts. Hudi tables can be queried via Spark, Flink, Hive, Presto, Trino, or Athena.

Transactional writes with MVCC/OCC.

Native support for record‑level updates and deletes.

Automatic small‑file compaction and clustering.

Practical Lakehouse Construction

1. Preparation

Required versions:

RocketMQ 4.9.0

rocketmq‑connect‑hudi 0.0.1‑SNAPSHOT

Apache Hudi 0.8.0

2. Build the RocketMQ‑Hudi Connector

Clone the source repository:

git clone https://github.com/apache/rocketmq-externals.git

Configure the connector plugin path in

/data/lakehouse/rocketmq-externals/rocketmq-connect-runtime/target/distribution/conf/connect.conf

Compile the connector:

cd rocketmq-externals/rocketmq-connect-hudi</code><code>mvn clean install -DskipTest -U

The resulting

rocketmq-connect-hudi-0.0.1-SNAPSHOT-jar-with-dependencies.jar

is the connector artifact.

3. Run the Runtime and Worker

cd /data/lakehouse/rocketmq-externals/rocketmq-connect-runtime</code><code>sh ./run_worker.sh   # start one or more workers

4. Create Topics for the Connector

Initialize the following internal topics on the RocketMQ cluster: connector-cluster-topic (cluster metadata) connector-config-topic (configuration) connector-offset-topic (sink offsets) connector-position-topic (source progress)

5. Deploy the RocketMQ‑Hudi Sink Task

curl http://${runtime-ip}:${runtime-port}/connectors/rocketmq-hudi-sink?config='{"connector-class":"org.apache.rocketmq.connect.hudi.connector.HudiSinkConnector","topicNames":"testhudi1","tablePath":"file:///tmp/hudi_connector_test","tableName":"hudi_test","insertShuffleParallelism":"2","upsertShuffleParallelism":"2","deleteParallelism":"2","source-record-converter":"org.apache.rocketmq.connect.runtime.converter.RocketMQConverter","source-rocketmq":"127.0.0.1:9876","src-cluster":"DefaultCluster","refresh-interval":"10000","schemaPath":"/data/lakehouse/config/user.avsc"}'

Successful deployment logs “Open HoodieJavaWriteClient successfully”.

6. Verify Data Flow

Messages produced to the source topic are automatically written into the Hudi table. The table can be queried with Spark:

spark-shell \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._

val tableName = "hudi_test"
val basePath = "file:///tmp/hudi_connector_test"
val df = spark.read.format("hudi").load(basePath + "/*")
df.show()

Key Takeaways

The Lakehouse built with RocketMQ and Hudi offers a shorter data path, lower latency, reduced storage costs, and simplified operations, while still supporting diverse workloads such as BI, streaming analytics, and machine learning. However, it requires a reliable messaging layer and a robust data‑lake component with strong indexing capabilities.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Cloud Native Big Data RocketMQ Data Lake Lakehouse Apache Hudi

Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.