
How Apache Hudi 1.1 Powers AI‑Native Lakehouse and Real‑Time Data Lakes

The JD‑hosted Apache Hudi Meetup showcased the 1.1 release’s pluggable table format, Flink performance gains, LSM‑Tree MoR redesign, and AI‑native features such as vector indexing, illustrating how the open‑source lakehouse is evolving to meet BI and multimodal AI workloads.

JD Retail Technology

Community Roadmap and Vision

The Apache Hudi community is advancing the 1.x series with three primary focus areas: improving Flink write performance, releasing a new Trino connector, and introducing a pluggable table‑format layer that enables a single write to be read by multiple query engines. The long‑term goal is to evolve Hudi from a transactional table format on a lake into an AI‑native lakehouse that supports both structured and unstructured data, vector search, and end‑to‑end model‑training workflows.

Apache Hudi 1.1 Technical Highlights

Pluggable table‑format architecture – decouples storage format from write path, allowing "write once, read many" across formats such as Parquet, ORC, and Iceberg.
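The "write once, read many" idea can be pictured as a writer that commits data files once and then publishes per-format metadata through adapters. The sketch below is purely illustrative (the class and method names are invented, not Hudi's actual API):

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of a pluggable table-format layer: one commit,
# multiple format adapters exposing the same data files to different readers.
class FormatAdapter(ABC):
    @abstractmethod
    def publish(self, files: list) -> dict: ...

class HudiAdapter(FormatAdapter):
    def publish(self, files):
        return {"format": "hudi", "timeline_entry": files}

class IcebergAdapter(FormatAdapter):
    def publish(self, files):
        return {"format": "iceberg", "manifest": files}

class TableWriter:
    def __init__(self, adapters):
        self.adapters = adapters

    def commit(self, files):
        # Data files are written once; only lightweight metadata is
        # published per format, so every engine sees the same commit.
        return [a.publish(files) for a in self.adapters]

writer = TableWriter([HudiAdapter(), IcebergAdapter()])
snapshots = writer.commit(["part-0001.parquet"])
```

The key design point is that the expensive artifact (the data file) is shared, while each format contributes only its own metadata representation.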

Deep Flink integration – adds an asynchronous write generation mechanism and a native Flink writer that converts Avro records directly to RowData, reducing serialization overhead and garbage‑collection pressure. Benchmarks show a 3.5× increase in streaming‑to‑lake throughput compared with Hudi 1.0.
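The serialization saving can be illustrated abstractly: the older path encodes each record to an intermediate format and decodes it again before writing, while a native writer builds the engine's row directly. This toy comparison (JSON standing in for Avro, a tuple standing in for RowData) is not Hudi code, just the shape of the optimization:

```python
import json

# Illustrative only: "legacy" round-trips through an intermediate encoding,
# "native" builds the row directly, skipping the encode/decode and the
# short-lived garbage it creates.
def legacy_write(record: dict) -> tuple:
    intermediate = json.dumps(record)   # stand-in for Avro encoding
    parsed = json.loads(intermediate)   # extra decode step before writing
    return tuple(parsed[k] for k in sorted(parsed))

def native_write(record: dict) -> tuple:
    # Direct conversion: no intermediate representation is materialized.
    return tuple(record[k] for k in sorted(record))

row = {"id": 1, "sku": "A-42"}
assert legacy_write(row) == native_write(row)  # same row, fewer copies
```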

AI‑native extensions – native support for unstructured blobs, column‑group layouts optimized for multimodal datasets, built‑in vector indexing, and a unified storage layer that retains ACID guarantees and version control.
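To make the vector-indexing idea concrete, here is a minimal brute-force nearest-neighbour index over embeddings stored next to record keys. Production indexes use approximate structures (e.g. HNSW); this sketch only shows the contract a lake-resident vector index would expose, with all names assumed:

```python
import math

# Minimal sketch of a vector index: embeddings live alongside record keys,
# and a query returns the closest records by cosine similarity.
class VectorIndex:
    def __init__(self):
        self.rows = []  # (record_key, embedding) pairs

    def add(self, key, vec):
        self.rows.append((key, vec))

    def search(self, query, k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        # Brute force: rank every stored embedding against the query.
        return sorted(self.rows, key=lambda r: -cosine(query, r[1]))[:k]

idx = VectorIndex()
idx.add("img-1", [1.0, 0.0])
idx.add("img-2", [0.0, 1.0])
top = idx.search([0.9, 0.1], k=1)
```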

JD.com Production‑Grade Enhancements

Re‑engineered Merge‑On‑Read (MoR) tables using an LSM‑Tree layout and switched the update model from Avro + Append to Parquet + Create, achieving lock‑free concurrent writes.
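The lock-free property of "Parquet + Create" follows from LSM semantics: each writer creates a new immutable sorted run instead of appending to a shared file, and readers merge runs newest-first. A minimal sketch of that merge discipline (dicts standing in for Parquet files):

```python
# Sketch of the "Parquet + Create" update model: every write creates a new
# immutable run (no shared file to lock), and a read merges the runs with
# newer runs winning on duplicate keys, as in an LSM tree.
class LsmTable:
    def __init__(self):
        self.runs = []  # oldest first; each run is immutable once written

    def write(self, records: dict):
        # Creating a fresh file requires no lock on existing files.
        self.runs.append(dict(records))

    def read(self):
        merged = {}
        for run in self.runs:  # later (newer) runs overwrite earlier ones
            merged.update(run)
        return merged

t = LsmTable()
t.write({"sku1": 10, "sku2": 20})
t.write({"sku2": 25})  # a concurrent update simply lands in its own run
assert t.read() == {"sku1": 10, "sku2": 25}
```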

Combined Engine‑Native file format, Remote Partitioner, and streaming incremental compaction, delivering 2–10× read/write performance gains in internal benchmarks.

Implemented a primary‑foreign‑key index inspired by Hudi PartialUpdate, using forward and inverted indexes stored in HBase to enable real‑time dimensional joins.
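The forward/inverted pair works like this: the forward index maps a fact key to its dimension key, and the inverted index maps a dimension key back to all fact keys that reference it, so a dimension update can fan out immediately. Plain dicts stand in for the HBase tables in this assumed sketch:

```python
# Hypothetical sketch of a primary/foreign-key index pair for real-time
# dimensional joins. JD stores these in HBase; dicts stand in here.
forward = {}   # fact key (order_id) -> dimension key (sku_id)
inverted = {}  # dimension key (sku_id) -> set of fact keys (order_ids)

def link(order_id, sku_id):
    forward[order_id] = sku_id
    inverted.setdefault(sku_id, set()).add(order_id)

def affected_orders(sku_id):
    # On a dimension-table update, look up the fact rows to refresh.
    return sorted(inverted.get(sku_id, set()))

link("o1", "sku-9")
link("o2", "sku-9")
```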

Developed a Hudi NativeIO SDK with four modules – data call, cross‑language transformation, view management, and high‑performance query – allowing model training directly from lake tables.

Applied these capabilities to the ADM layer, raising write throughput from 45 M to 80 M records per minute, doubling compaction efficiency, and achieving real‑time SKU consistency, thereby moving from offline (T+1) to real‑time processing.

Contributed 109 merged pull requests to the upstream Hudi project.

Kuaishou Real‑Time Lake Upgrade for BI & AI

Migrated from MySQL→Hive to MySQL→Hudi 2.0, introducing hour‑level partitioned Hudi tables that support full, incremental, and snapshot queries.

Designed Full Compact and Minor Compact mechanisms together with heterogeneous bucket layouts, reducing ingestion resource consumption and extending table lifecycles.
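One plausible reading of the two-tier scheme: Minor Compact merges only the small recent files so steady-state ingestion stays cheap, while Full Compact rewrites the whole file group once small files accumulate. The policy below is an assumed illustration (thresholds and names are invented, not Kuaishou's actual values):

```python
# Illustrative two-tier compaction policy; thresholds are assumptions.
SMALL_FILE_MB = 64

def plan_compaction(file_sizes_mb, full_threshold=10):
    small = [s for s in file_sizes_mb if s < SMALL_FILE_MB]
    if len(small) >= full_threshold:
        return ("full", list(file_sizes_mb))  # rewrite the whole file group
    return ("minor", small)                   # cheap merge of small files only

mode, files = plan_compaction([128, 4, 8, 2])
```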

Cut data‑ready latency from days to minutes while lowering storage costs.

Built a unified lake that consolidates batch and streaming data, providing: (a) a single storage medium, (b) stream‑batch unified consumption, (c) logical wide‑table column concatenation, and (d) an event‑time timeline metadata system with lock‑free writes.
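Logical wide-table column concatenation, point (c) above, can be sketched as several streams each owning a column group for the same primary key, with reads stitching the groups together instead of running a physical join (names below are assumptions for illustration):

```python
# Sketch of logical wide-table column concatenation: each upstream owns a
# column group keyed by the same primary key; a read stitches them together.
def concat_columns(key, *column_groups):
    row = {"pk": key}
    for group in column_groups:  # each group maps pk -> partial columns
        row.update(group.get(key, {}))
    return row

orders = {"u1": {"orders": 3}}
clicks = {"u1": {"clicks": 42}}
row = concat_columns("u1", orders, clicks)
```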

Huawei Cloud Deep Optimizations & AI Exploration

Platform layer: Developed LDMS, a fully managed lake‑warehouse service offering table lifecycle management, intelligent data‑layout optimization, and cost‑based optimizer statistics.

Kernel optimizations:

RFC‑84/87 removed Avro serialization from the write path, improving Flink write performance by 1–10× depending on workload and reducing GC pressure.

Introduced LogIndex to eliminate streaming read bottlenecks on object storage.

Added dynamic schema evolution for flexible CDC ingestion.

Adopted column‑family storage to efficiently handle sparse wide tables.
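The column-family point is worth unpacking: in a sparse wide table, grouping columns into families means a row stores only the families it actually populates, so thousands of mostly-empty columns cost nothing. A minimal sketch of that layout (all names are illustrative assumptions):

```python
# Sketch of column-family storage for a sparse wide table: a row only
# materializes the families it has values for.
class ColumnFamilyStore:
    def __init__(self, families):
        self.families = families  # family name -> column names
        self.data = {}            # (row_key, family) -> column values

    def put(self, row_key, family, values):
        self.data[(row_key, family)] = values

    def get_row(self, row_key):
        row = {}
        for fam in self.families:
            # Absent families simply contribute nothing to the row.
            row.update(self.data.get((row_key, fam), {}))
        return row

store = ColumnFamilyStore({"profile": ["age"], "behavior": ["clicks"]})
store.put("u1", "profile", {"age": 30})  # "behavior" family never written
row = store.get_row("u1")
```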

Native I/O layer: Re‑implemented Parquet read/write in Rust, switched to Arrow as the in‑memory format, and exposed a high‑performance Java JNI interface, enabling seamless integration with Spark, Flink, and other engines.

Ecosystem integration: Integrated LanceDB for high‑throughput vector search, supporting document retrieval and intelligent QA use cases while preserving ACID guarantees for raw object‑store files.

Key Takeaways

The presented enhancements across Apache Hudi, JD.com, Kuaishou, and Huawei demonstrate concrete engineering solutions for the "stream‑batch integration" challenge, multimodal data management, and real‑time AI workloads. Open‑source contributions and enterprise‑level optimizations are converging to make Hudi a scalable, AI‑ready storage engine for modern lakehouse architectures.

Tags: Big Data · Flink · AI · real-time data · lakehouse · Apache Hudi
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
