How Data Lakes Empower AI: Insights from Industry Experts

In a panel discussion, experts from Kuaishou, Ping An, and Datastrato explain how data lake architectures, open table formats like Apache Iceberg, and vector‑enabled lake formats are improving feature management, supporting generative AI workloads, and accelerating machine‑learning pipelines.

DataFunTalk

Nov 6, 2024

The session opened with introductions of the speakers: Jin Guowei, Head of Data Business Platform at Kuaishou; Shao Saisai, co‑founder and CTO of Datastrato and a member of the Apache Software Foundation; Tang Langfei, Data Intelligence Platform lead at Ping An; and Zhang Jing, Big Data architect at Kuaishou.

Jin asked the panel how lake‑warehouse integration can aid AI development. Shao explained that AI can be divided into pre‑GenAI and post‑GenAI stages. Before GenAI, traditional machine‑learning workloads rely heavily on feature engineering, requiring wide tables for efficient feature storage and updates.

Shao described a migration at Tencent, around 2021 to 2022, from row‑oriented Protobuf storage to column‑oriented storage managed with Apache Iceberg. The move improved query performance, enabled schema evolution, and allowed column‑level lifecycle management, which reduced storage costs and improved the user experience.
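The difference between the two layouts can be illustrated with a minimal sketch in plain Python. It is not Tencent's implementation and does not use Iceberg; the table contents, the column names, and the helper functions are all hypothetical, chosen only to show why reading or expiring a single feature is cheaper when data is organized by column.

```python
from typing import Any

# Hypothetical wide feature table; every row carries the same keys.
ROWS = [
    {"user_id": 1, "age_bucket": 3, "clicks_7d": 12, "embedding_v1": [0.1, 0.2]},
    {"user_id": 2, "age_bucket": 5, "clicks_7d": 4, "embedding_v1": [0.3, 0.1]},
    {"user_id": 3, "age_bucket": 2, "clicks_7d": 9, "embedding_v1": [0.0, 0.7]},
]


def to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row records into one list per column (assumes uniform keys)."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def read_feature_row_store(rows: list[dict[str, Any]], name: str) -> list[Any]:
    # Row layout: every full record is visited to pull out one field.
    return [row[name] for row in rows]


def read_feature_column_store(columns: dict[str, list[Any]], name: str) -> list[Any]:
    # Column layout: only the requested column is touched.
    return columns[name]


def expire_column(columns: dict[str, list[Any]], name: str) -> dict[str, list[Any]]:
    # Column-level lifecycle: drop one feature, leave the others untouched.
    return {k: v for k, v in columns.items() if k != name}


columns = to_columns(ROWS)
clicks = read_feature_column_store(columns, "clicks_7d")
trimmed = expire_column(columns, "embedding_v1")
```

In a real lake the columns live in separate regions of a data file rather than in Python lists, but the access pattern is the same: a feature read or a feature expiry involves one column instead of every record.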

In the GenAI era, data is increasingly unstructured or semi‑structured, which has prompted support for variant types in table formats such as Iceberg and in engines such as Spark. Variant types allow semi‑structured data to be stored and read efficiently.
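As a rough illustration of what a variant column is for, the sketch below stores records whose nested payloads differ in shape and projects individual paths out of them. It uses only the Python standard library and does not reproduce the binary variant encoding used by Iceberg or Spark; the event data and the `get_path` helper are hypothetical.

```python
import json
from typing import Any, Optional

# Hypothetical semi-structured events: the payload shape varies per record.
RAW_EVENTS = [
    '{"user_id": 1, "payload": {"type": "video", "duration_s": 31}}',
    '{"user_id": 2, "payload": {"type": "image", "tags": ["cat", "pet"]}}',
    '{"user_id": 3, "payload": {"type": "video", "duration_s": 8, "hd": true}}',
]


def get_path(value: Any, path: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; return None if a key is absent."""
    current = value
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


events = [json.loads(line) for line in RAW_EVENTS]

# Project one nested field across records with differing shapes.
durations = [get_path(e, "payload.duration_s") for e in events]

# Materialize a frequently read path once so later filters skip the nested lookup.
type_column = [get_path(e, "payload.type") for e in events]
video_ids = [e["user_id"] for e, t in zip(events, type_column) if t == "video"]
```

Records that lack a path simply yield `None`, so no fixed schema has to be declared up front for the payload.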

For large language model (LLM) workloads, building Retrieval‑Augmented Generation (RAG) systems requires vector databases or vector‑enabled lake formats; specialized formats like Lance are emerging to meet these needs.
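The retrieval step of a RAG system comes down to a nearest‑neighbor search over embeddings. The sketch below shows a brute‑force version using cosine similarity in plain Python. The document IDs and the three‑dimensional vectors are made up for illustration, and a vector database or vector‑enabled lake format would normally use an approximate index instead of scanning every vector.

```python
import math

# Hypothetical document embeddings; real ones have hundreds of dimensions.
DOCS = {
    "doc_iceberg": [0.9, 0.1, 0.0],
    "doc_vectors": [0.1, 0.9, 0.1],
    "doc_streaming": [0.0, 0.2, 0.9],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k documents most similar to the query."""
    scored = [(cosine(query, vector), doc_id) for doc_id, vector in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# The retrieved IDs would be used to fetch text passed to the LLM as context.
hits = top_k([0.2, 0.8, 0.0], DOCS, k=2)
```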

Jin then invited Zhang to discuss Kuaishou’s approach to feature support. Zhang highlighted Kuaishou’s “sample lake” that unifies streaming and batch samples, enabling flexible column addition and deletion for rapid feature iteration, and supporting multimodal sample storage using variant types, a topic also discussed in the Apache Iceberg V3 community.
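Flexible column addition and deletion can be sketched as a metadata‑only operation. The toy table below is not Kuaishou's sample lake; the `SampleTable` class and the column names are hypothetical. It borrows one idea from Iceberg, namely that columns are tracked by a stable field ID rather than by name or position, so files written before a schema change stay readable without being rewritten.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SampleTable:
    """Toy table: schema maps column name -> field ID; files store values by ID."""

    schema: dict[str, int] = field(default_factory=dict)
    next_id: int = 1
    files: list[list[dict[int, Any]]] = field(default_factory=list)

    def add_column(self, name: str) -> None:
        # Metadata-only change: existing files are not rewritten.
        if name in self.schema:
            raise ValueError(f"column {name!r} already exists")
        self.schema[name] = self.next_id
        self.next_id += 1

    def drop_column(self, name: str) -> None:
        # Metadata-only change: the field ID is retired and never reused.
        del self.schema[name]

    def append(self, rows: list[dict[str, Any]]) -> None:
        # Write a new immutable "file" keyed by the current field IDs.
        self.files.append([{self.schema[k]: v for k, v in row.items()} for row in rows])

    def scan(self) -> list[dict[str, Any]]:
        # Resolve current columns by ID; values absent from older files read as None.
        return [
            {name: record.get(fid) for name, fid in self.schema.items()}
            for data_file in self.files
            for record in data_file
        ]


table = SampleTable()
table.add_column("user_id")
table.add_column("label")
table.append([{"user_id": 1, "label": 0}])
table.add_column("watch_time")  # new feature added after the first write
table.append([{"user_id": 2, "label": 1, "watch_time": 42.5}])
table.drop_column("label")  # old feature retired without touching files
rows = table.scan()
```

Because IDs are never reused, re‑adding a column named `label` later would get a fresh ID and would not resurrect the old values.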

Zhang also noted the importance of APIs: both Apache Iceberg and Delta Lake provide Rust and Python APIs that make data access easier for AI developers, and Apache Hudi is developing a Rust project to improve the lake experience for AI workloads.

The discussion concluded that data lakes empower AI by offering efficient feature management, supporting new data formats for generative AI, and providing robust APIs, thereby creating new opportunities and challenges for the GenAI era.
