How Apache Hudi Powers the Next‑Gen AI‑Native Lakehouse: Insights from the Asia Meetup
This article recaps the Apache Hudi Asia Meetup hosted by JD, covering community updates, JD's data‑lake challenges, the upcoming Hudi 1.1 release, JD's architectural redesign, Kuaishou's real‑time lake adoption, and Huawei Cloud's deep optimizations, all aimed at building an AI‑native, real‑time lakehouse.
Hudi Community Update and 1.1 Preview
The Apache Hudi project has reached a mature stage with the 1.0 GA release. Ongoing work in the 1.x series focuses on three pillars:
Performance improvements for Flink, including an asynchronous write path that reduces serialization overhead and GC pressure.
A new Trino connector.
A pluggable table‑format layer that enables “write‑once, read‑many‑formats”.
The future roadmap aims to turn Hudi into a storage engine that serves the full spectrum from BI to AI, adding support for unstructured data management and vector‑search capabilities.
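The "write‑once, read‑many‑formats" idea behind the pluggable table‑format layer can be sketched in miniature: one committed, format‑neutral write that multiple per‑engine adapters expose in their preferred shape. All names here (`NeutralTable`, the view classes) are illustrative, not Hudi APIs.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class NeutralTable:
    """Format-neutral committed data: a schema plus a list of row dicts."""
    schema: tuple
    rows: list

class TableFormat(Protocol):
    def read(self, table: NeutralTable) -> list: ...

class RowOrientedView:
    """Adapter for engines that consume whole rows (e.g. a streaming reader)."""
    def read(self, table: NeutralTable) -> list:
        return [tuple(row[c] for c in table.schema) for row in table.rows]

class ColumnOrientedView:
    """Adapter for columnar engines: the same data, transposed per column."""
    def read(self, table: NeutralTable) -> list:
        return [[row[c] for row in table.rows] for c in table.schema]

# One write, two format views of the same committed data.
table = NeutralTable(schema=("id", "price"),
                     rows=[{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}])
print(RowOrientedView().read(table))     # [(1, 9.5), (2, 3.0)]
print(ColumnOrientedView().read(table))  # [[1, 2], [9.5, 3.0]]
```

The point of the sketch is the separation: the write path produces one artifact, and the adapter, not the writer, decides the layout each engine sees.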
Key Technical Advances in Apache Hudi 1.1
The upcoming 1.1 release introduces several concrete enhancements:
Pluggable Table‑Format Architecture – decouples the storage layout from the write path, allowing a single write to be readable by multiple query engines (e.g., Spark, Trino, Flink).
Deep Flink Integration – an asynchronous writer converts Avro records to Flink RowData on‑the‑fly, achieving up to 3.5× higher streaming throughput compared with 1.0.
Native Writer – eliminates the Avro serialization step, directly emitting Parquet files and reducing GC churn.
AI‑Native Features – support for unstructured data, column‑group optimizations for sparse wide tables, and built‑in vector‑index (compatible with LanceDB) for similarity search.
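To make the vector‑index feature concrete, here is a brute‑force sketch of the similarity query such an index answers. A real index (such as the LanceDB‑compatible one mentioned above) would use an approximate nearest‑neighbor structure; this version only shows the query semantics, and the record keys and embeddings are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, vectors, k=2):
    """Return the k record keys whose embeddings are most similar to query."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [key for key, _ in scored[:k]]

embeddings = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], embeddings, k=2))  # ['doc-a', 'doc-b']
```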
JD.com Production Deployments
JD’s real‑time data platform redesigned its MoR (Merge‑On‑Read) tables around an LSM‑Tree‑based layout. The redesign replaces the legacy “Avro + Append” model with a “Parquet + Create” model, delivering lock‑free concurrent writes. Combined with Engine‑Native formats, a Remote Partitioner, and incremental compaction, the new design yields 2–10× higher read/write performance.
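The core of the "Parquet + Create" model is that every commit creates a new immutable run instead of rewriting an existing file, which is what makes concurrent writes lock‑free; readers then merge runs newest‑first per key. A minimal in‑memory sketch of that merge‑on‑read behavior (data structures are illustrative only):

```python
class LsmTable:
    def __init__(self):
        self.runs = []  # list of {key: value} dicts, oldest first

    def write_batch(self, records):
        # Each commit appends a new immutable run ("Parquet + Create"),
        # so concurrent writers never lock or rewrite earlier files.
        self.runs.append(dict(records))

    def read(self):
        # Merge-on-read: later runs shadow earlier versions of the same key.
        merged = {}
        for run in self.runs:
            merged.update(run)
        return merged

t = LsmTable()
t.write_batch({"sku-1": 10, "sku-2": 5})
t.write_batch({"sku-2": 7})   # newer version of sku-2, written as a new run
print(t.read())               # {'sku-1': 10, 'sku-2': 7}
```

Incremental compaction, in these terms, is periodically collapsing several runs into one so the read‑side merge stays cheap.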
For BI workloads, JD adopts a PartialUpdate‑style multi‑stream join that uses primary‑foreign key indexing, leveraging both inverted and forward indexes. An optional HBase backing provides low‑latency point‑lookup.
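The PartialUpdate semantics above can be shown with a toy merge: each stream carries only the columns it owns for a shared primary key, and the table merges them without a full join. Stream and column names here are invented for illustration; the index structures JD uses are not modeled.

```python
from collections import defaultdict

table = defaultdict(dict)  # primary key -> merged row

def apply_partial_update(key, columns):
    """Merge only the columns this stream carries; others stay untouched."""
    table[key].update(columns)

# Stream A carries order facts, stream B carries payment facts.
apply_partial_update("order-1", {"amount": 42, "sku": "A100"})
apply_partial_update("order-1", {"paid": True})   # from the other stream
apply_partial_update("order-2", {"amount": 7, "sku": "B200"})

print(dict(table["order-1"]))  # {'amount': 42, 'sku': 'A100', 'paid': True}
```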
In AI scenarios, JD built a native Hudi SDK (NativeIO) that offers:
Data‑access layer exposing Hudi tables as native objects.
Cross‑language transformation utilities.
View‑management and high‑performance query modules.
Applying these capabilities to the ADM traffic‑data warehouse increased write throughput from 45 M to 80 M records per minute and doubled compaction speed, achieving near‑real‑time SKU dimension consistency.
Kuaishou Real‑Time Ingestion
Kuaishou migrated its ingestion from a MySQL‑to‑Hive pipeline to a MySQL‑to‑Hudi “2.0” architecture. Key techniques include:
Hourly‑partitioned Hudi tables supporting full, incremental, and snapshot query modes.
A “Full Compact / Minor Compact” mechanism that optimizes data layout without locking.
Logical wide‑table column concatenation and an event‑time timeline metadata system that guarantees ordered data and lock‑free writes.
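The three query modes listed above differ only in how much of the commit timeline they merge. A sketch over a toy timeline (the structure here is illustrative, not Hudi's actual metadata layout):

```python
commits = [  # (commit_instant, {key: value}) in timeline order
    ("2024010110", {"u1": "a", "u2": "b"}),
    ("2024010111", {"u2": "c"}),
    ("2024010112", {"u3": "d"}),
]

def snapshot():
    """Snapshot query: the latest merged state across the whole timeline."""
    state = {}
    for _, batch in commits:
        state.update(batch)
    return state

def incremental(since):
    """Incremental query: only changes committed after a given instant."""
    changed = {}
    for instant, batch in commits:
        if instant > since:
            changed.update(batch)
    return changed

print(snapshot())                 # {'u1': 'a', 'u2': 'c', 'u3': 'd'}
print(incremental("2024010110"))  # {'u2': 'c', 'u3': 'd'}
```

A full read is the degenerate case of `incremental` with an instant before the first commit, which is why one table can serve all three modes.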
These changes reduced storage cost and cut data‑ready latency from days to minutes, while providing a unified lakehouse for both batch and streaming AI workloads.
Huawei Cloud Deep Optimizations
Huawei Cloud contributed three layers of enhancements to Apache Hudi:
Kernel‑Level Optimizations
RFC‑84/87 removed Avro serialization, delivering a 1–10× speedup for Flink writes and lowering GC pressure.
Introduced a LogIndex to eliminate streaming read bottlenecks on object storage.
Dynamic schema evolution supports CDC ingestion without downtime.
Column‑family support enables efficient handling of sparse wide tables (thousands of columns).
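The benefit of column families for sparse wide tables is that a query touching a few columns only reads the groups containing them. A minimal sketch, with the grouping and column names invented for illustration:

```python
# Columns are physically split into groups ("families").
column_groups = {
    "profile":  ["user_id", "age", "city"],
    "behavior": ["user_id", "clicks", "dwell_ms"],
}

# Each group is stored separately (here: one row list per group).
storage = {
    "profile":  [{"user_id": 1, "age": 30, "city": "Beijing"}],
    "behavior": [{"user_id": 1, "clicks": 5, "dwell_ms": 1200}],
}

def read_columns(wanted):
    """Open only the column groups that contain a wanted column."""
    touched = [g for g, cols in column_groups.items()
               if any(c in cols for c in wanted if c != "user_id")]
    rows = {}
    for g in touched:
        for r in storage[g]:
            rows.setdefault(r["user_id"], {}).update(
                {c: r[c] for c in wanted if c in r})
    return touched, rows

groups, rows = read_columns(["age"])
print(groups)  # ['profile']  -- the behavior group is never read
print(rows)    # {1: {'age': 30}}
```

With thousands of columns, skipping untouched groups is what keeps point reads from scanning the whole wide row.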
Hudi Native – rewritten in Rust with Arrow‑based I/O; exposed via a Java JNI interface compatible with Spark and Flink, providing a high‑performance I/O layer.
AI‑Oriented Extensions
Multimodal data management and ACID‑compliant storage of unstructured assets.
Vector search integration via LanceDB for similarity‑based AI applications.
These contributions illustrate how open‑source and enterprise efforts together advance lakehouse technology toward higher performance, richer data models, and AI‑native capabilities.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.