How to Build a Lakehouse with RocketMQ and Apache Hudi: A Step‑by‑Step Guide
This article explains the Lakehouse architecture, its required features, the evolution of big‑data stacks, and provides a detailed, hands‑on guide for constructing a Lakehouse using RocketMQ (Connector & Stream) and Apache Hudi, including configuration, deployment, and sample code.
Background and Lakehouse Requirements
The Lakehouse concept, originally proposed by Databricks, demands eight key capabilities: ACID transaction support, schema enforcement and governance, open storage formats (e.g., Parquet), direct BI access, separation of storage and compute, support for diverse data types (image, video, text, semi‑structured), multi‑workload compatibility (SQL, ML, analytics), and end‑to‑end streaming.
Evolution of Big‑Data Architectures
In the era of massive data volumes, a wide range of open‑source projects have emerged: messaging (RocketMQ, Kafka), compute (Flink, Spark, Storm), and storage (HDFS, HBase, Redis, Elasticsearch, Hudi, DeltaLake). While these components address specific use cases, architects often face challenges such as complex multi‑layer stacks, high latency, duplicated data copies, and steep learning curves.
Advantages of multi‑layer stacks: broad scenario coverage.
Disadvantages: long processing chains, high latency, duplicated storage, and increased operational cost.
These drawbacks motivate a unified architecture that reduces layers.
Lakehouse as an Upgraded Multi‑Layer Architecture
The Lakehouse compresses the traditional multi‑layer stack into a single storage layer and merges the messaging and compute layers using RocketMQ Stream. Data flows through a RocketMQ connector, is processed by RocketMQ Stream, and finally lands in Hudi via a RocketMQ‑Hudi connector, where Hudi provides indexing and unified APIs for downstream consumption.
RocketMQ Overview
RocketMQ, an Apache‑incubated messaging system, offers financial‑grade reliability, a minimalist architecture (NameServer + Broker clusters), low operational overhead, rich message types (transactional, delayed, ordered), and high throughput with millisecond‑level latency. It supports both on‑premise and cloud‑native deployments.
Key Commands
nohup sh bin/mqnamesrv &</code><code>nohup sh bin/mqbroker -n localhost:9876 & kubectl apply -f example/rocketmq_cluster.yamlCLI tools such as mqadmin enable cluster health checks and dynamic configuration updates.
RocketMQ Connector & Stream
The connector follows the OpenMessaging standard and integrates with many ecosystem sources and sinks (ActiveMQ, Cassandra, Elasticsearch, JDBC, Kafka, MySQL, etc.). Its architecture consists of a Runtime, Workers, and a pluggable Source‑&‑Sink framework. The Stream component provides a Flink‑compatible SQL layer with operators like window, join, and materialized views.
Apache Hudi Overview
Hudi is a streaming data‑lake platform that adds transactional capabilities to object storage (OSS, S3, etc.). Core features include MVCC/OCC concurrency control, record‑level upserts/deletes, automatic small‑file management, and query‑optimised layouts. Hudi tables can be queried via Spark, Flink, Hive, Presto, Trino, or Athena.
Transactional writes with MVCC/OCC.
Native support for record‑level updates and deletes.
Automatic small‑file compaction and clustering.
Practical Lakehouse Construction
1. Preparation
Required versions:
RocketMQ 4.9.0
rocketmq‑connect‑hudi 0.0.1‑SNAPSHOT
Apache Hudi 0.8.0
2. Build the RocketMQ‑Hudi Connector
Clone the source repository:
git clone https://github.com/apache/rocketmq-externals.gitConfigure the connector plugin path in
/data/lakehouse/rocketmq-externals/rocketmq-connect-runtime/target/distribution/conf/connect.conf.
Compile the connector:
cd rocketmq-externals/rocketmq-connect-hudi</code><code>mvn clean install -DskipTest -UThe resulting
rocketmq-connect-hudi-0.0.1-SNAPSHOT-jar-with-dependencies.jaris the connector artifact.
3. Run the Runtime and Worker
cd /data/lakehouse/rocketmq-externals/rocketmq-connect-runtime</code><code>sh ./run_worker.sh # start one or more workers4. Create Topics for the Connector
Initialize the following internal topics on the RocketMQ cluster: connector-cluster-topic (cluster metadata) connector-config-topic (configuration) connector-offset-topic (sink offsets) connector-position-topic (source progress)
5. Deploy the RocketMQ‑Hudi Sink Task
curl http://${runtime-ip}:${runtime-port}/connectors/rocketmq-hudi-sink?config='{"connector-class":"org.apache.rocketmq.connect.hudi.connector.HudiSinkConnector","topicNames":"testhudi1","tablePath":"file:///tmp/hudi_connector_test","tableName":"hudi_test","insertShuffleParallelism":"2","upsertShuffleParallelism":"2","deleteParallelism":"2","source-record-converter":"org.apache.rocketmq.connect.runtime.converter.RocketMQConverter","source-rocketmq":"127.0.0.1:9876","src-cluster":"DefaultCluster","refresh-interval":"10000","schemaPath":"/data/lakehouse/config/user.avsc"}'Successful deployment logs “Open HoodieJavaWriteClient successfully”.
6. Verify Data Flow
Messages produced to the source topic are automatically written into the Hudi table. The table can be queried with Spark:
spark-shell \
--packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
val tableName = "hudi_test"
val basePath = "file:///tmp/hudi_connector_test"
val df = spark.read.format("hudi").load(basePath + "/*")
df.show()Key Takeaways
The Lakehouse built with RocketMQ and Hudi offers a shorter data path, lower latency, reduced storage costs, and simplified operations, while still supporting diverse workloads such as BI, streaming analytics, and machine learning. However, it requires a reliable messaging layer and a robust data‑lake component with strong indexing capabilities.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
