
How Data Lakes Empower AI: Insights from Industry Experts

In a panel discussion, experts from Kuaishou, Ping An, and Datastrato explain how data lake architectures, open table formats like Apache Iceberg, and vector‑enabled lake formats improve feature management, support generative AI workloads, and accelerate machine‑learning pipelines.

DataFunTalk

The session opened with introductions of the speakers: Jin Guowei, Head of Data Business Platform at Kuaishou; Shao Saisai, co‑founder and CTO of Datastrato and a member of the Apache Software Foundation; Tang Langfei, Data Intelligence Platform lead at Ping An; and Zhang Jing, Big Data architect at Kuaishou.

Jin asked the panel how lake‑warehouse integration can aid AI development. Shao explained that AI can be divided into pre‑GenAI and post‑GenAI stages. Before GenAI, traditional machine‑learning workloads rely heavily on feature engineering, requiring wide tables for efficient feature storage and updates.

Shao described a migration at Tencent, around 2021‑2022, from row‑oriented Protobuf storage to column‑oriented storage managed as Apache Iceberg tables. The move improved query performance, enabled schema evolution, and allowed column‑level lifecycle management, which reduced storage costs and improved the user experience.
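The layout argument can be illustrated with a small sketch. It is a toy model in plain Python, not Iceberg's or Tencent's implementation, and the feature names and values are made up for illustration. It counts how many stored values are touched when a single feature is read from a wide table in each layout.

```python
# Toy comparison of row-oriented and column-oriented layouts for a wide
# feature table. All names and values are hypothetical.

rows = [
    {"user_id": 1, "age": 31, "clicks_7d": 12, "watch_time_7d": 340.5},
    {"user_id": 2, "age": 24, "clicks_7d": 3, "watch_time_7d": 87.0},
    {"user_id": 3, "age": 45, "clicks_7d": 27, "watch_time_7d": 910.2},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def read_feature_from_rows(records, feature):
    """Row layout: every full record is touched to extract one field."""
    values_touched = 0
    out = []
    for record in records:
        values_touched += len(record)
        out.append(record[feature])
    return out, values_touched


def read_feature_from_columns(columns, feature):
    """Column layout: only the requested column is touched."""
    out = list(columns[feature])
    return out, len(out)


columns = to_columns(rows)
row_values, row_cost = read_feature_from_rows(rows, "clicks_7d")
col_values, col_cost = read_feature_from_columns(columns, "clicks_7d")

# Retiring a feature is a single key removal in the column layout, which is
# the intuition behind column-level lifecycle management.
expired = columns.pop("watch_time_7d")

print(row_values, row_cost)
print(col_values, col_cost)
```

Both reads return the same values, but the row layout touches every field of every record while the column layout touches only the requested column. The gap grows with the width of the table.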

In the GenAI era, data becomes more unstructured or semi‑structured, prompting the adoption of variant types in table formats such as Iceberg and in engines such as Spark. A variant type lets semi‑structured data be stored and queried efficiently and improves read performance.

For large language model (LLM) workloads, building Retrieval‑Augmented Generation (RAG) systems requires vector databases or vector‑enabled lake formats; specialized formats like Lance are emerging to meet these needs.
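The retrieval step that a vector database or vector‑enabled lake format serves in a RAG system can be sketched as follows. This is a minimal brute‑force example in plain Python, assuming embeddings are stored as a column next to a document ID. The table contents, the `top_k` helper, and the three‑dimensional vectors are hypothetical, and real systems typically use an approximate nearest‑neighbor index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, table, k=2):
    """Scan every row and return the doc_ids of the k most similar embeddings."""
    scored = [
        (cosine_similarity(query, row["embedding"]), row["doc_id"])
        for row in table
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical table: each row pairs a document ID with its embedding.
table = [
    {"doc_id": "iceberg-intro", "embedding": [0.9, 0.1, 0.0]},
    {"doc_id": "feature-store", "embedding": [0.1, 0.9, 0.1]},
    {"doc_id": "vector-search", "embedding": [0.8, 0.2, 0.1]},
]

query = [1.0, 0.0, 0.0]
result = top_k(query, table, k=2)
print(result)
```

The retrieved documents would then be passed to the LLM as context. Keeping the embedding column in the same table as the source data is what lets a lake format serve this step without a separate system.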

Jin then invited Zhang to discuss Kuaishou’s approach to feature support. Zhang highlighted Kuaishou’s “sample lake”, which unifies streaming and batch samples and lets columns be added and deleted flexibly for rapid feature iteration. It also stores multimodal samples using variant types, a topic that is also under discussion in the Apache Iceberg V3 community.
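One way column addition and deletion can stay cheap is to make them metadata‑only operations, in the spirit of the ID‑based schema evolution that Apache Iceberg uses. The sketch below is a toy model under that assumption, not Kuaishou's sample lake or Iceberg's actual code; the `ToyTable` class and the column names are hypothetical.

```python
class ToyTable:
    """Toy table: the schema maps column names to field IDs, and data files
    store values keyed by field ID, so schema changes never rewrite files."""

    def __init__(self):
        self.schema = {}   # column name -> field ID
        self.next_id = 1   # field IDs are never reused
        self.files = []    # each file is a list of {field_id: value} records

    def add_column(self, name):
        if name in self.schema:
            raise ValueError(f"column {name!r} already exists")
        self.schema[name] = self.next_id
        self.next_id += 1

    def drop_column(self, name):
        # Metadata-only: old files keep the values, readers stop projecting them.
        del self.schema[name]

    def append(self, records):
        # Write a new file using the field IDs of the current schema.
        self.files.append([
            {self.schema[name]: value for name, value in record.items()}
            for record in records
        ])

    def scan(self):
        # Project every stored record onto the current schema; columns that a
        # file predates come back as None.
        for data_file in self.files:
            for record in data_file:
                yield {name: record.get(fid) for name, fid in self.schema.items()}


table = ToyTable()
table.add_column("sample_id")
table.add_column("label")
table.append([{"sample_id": 1, "label": 0}])

table.add_column("image_embedding")   # new feature, no rewrite of old files
table.append([{"sample_id": 2, "label": 1, "image_embedding": [0.1, 0.2]}])

table.drop_column("label")            # retired feature, also metadata-only
result = list(table.scan())
print(result)
```

Because field IDs are never reused, re‑adding a column named `label` later would not resurrect the old values, which is the property that makes rapid add/drop cycles safe.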

Zhang also noted the importance of APIs: both Apache Iceberg and Delta Lake provide Rust and Python APIs, which make data access easier for AI developers, while Apache Hudi is developing a Rust project to improve the lake experience for AI workloads.

The discussion concluded that data lakes empower AI by offering efficient feature management, supporting new data formats for generative AI, and providing robust APIs, thereby creating new opportunities and challenges for the GenAI era.
