How Data Lakes Empower AI: Expert Insights on Feature Management, Columnar Storage, and Vector Formats
In a panel discussion, experts explain how lake‑warehouse integration, columnar table formats such as Apache Iceberg, and emerging variant types enable efficient feature engineering, support large‑language‑model workloads, and provide flexible vector storage, driving AI's evolution from traditional ML into the GenAI era.
Experts debated how data lakes can accelerate AI development. Moderator Jin Guowei asked for views on lake‑warehouse integration in AI.
Shaosai Sai explained that AI development can be divided into pre‑GenAI and post‑GenAI stages. Before GenAI, traditional machine learning relied heavily on feature engineering, requiring wide tables for efficient feature access and updates.
Using a complex Tencent business as an example, she described how nested tables with thousands of columns, stored in row‑oriented formats such as Protobuf, suffered from low query efficiency and high storage and compute costs, because retrieving specific fields required reading and deserializing entire rows.
In 2021–2022, the system was migrated from Protobuf to Apache Iceberg, moving from row‑oriented to column‑oriented storage and dramatically improving query performance. Iceberg's schema evolution allowed features to be added or removed flexibly, while column‑level lifecycle management tracked usage and cleaned up unused columns, improving the user experience, raising compression ratios, and reducing costs.
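The row‑versus‑column trade‑off described above can be sketched in a few lines. This is an illustrative toy model, not Tencent's actual system: it contrasts a row‑oriented layout, where reading one field still materializes whole records, with a column‑oriented layout, where only the requested column is touched.

```python
# Row-oriented layout: each record is stored (and must be read) as a unit,
# as with Protobuf-serialized rows.
rows = [
    {"user_id": 1, "ctr": 0.12, "age": 31, "country": "CN"},
    {"user_id": 2, "ctr": 0.07, "age": 25, "country": "US"},
    {"user_id": 3, "ctr": 0.33, "age": 40, "country": "DE"},
]

def read_ctr_row_oriented(records):
    # Every full record is materialized just to pick out one field.
    return [record["ctr"] for record in records]

# Column-oriented layout: each column is stored contiguously, so a query
# that needs only "ctr" never reads "age" or "country".
columns = {
    "user_id": [1, 2, 3],
    "ctr": [0.12, 0.07, 0.33],
    "age": [31, 25, 40],
    "country": ["CN", "US", "DE"],
}

def read_ctr_columnar(cols):
    # One contiguous list is fetched; other columns stay untouched.
    return cols["ctr"]

assert read_ctr_row_oriented(rows) == read_ctr_columnar(columns)
```

With thousands of columns, the columnar layout also makes per‑column lifecycle management natural: an unused column can be dropped without rewriting every record.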
With the advent of GenAI, data increasingly arrives in unstructured or semi‑structured forms. Data‑lake formats now support a variant type for semi‑structured data; both Apache Iceberg and Spark have added this capability, enabling better handling of such data and faster reads.
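To make the idea concrete, here is a hedged sketch of what a variant column enables: rows whose semi‑structured payloads need not share a fixed schema, with fields extracted by path at read time. The `get_path` helper and the table layout are illustrative inventions, not the actual Iceberg or Spark variant API.

```python
import json

# A "variant-like" column: each payload is semi-structured JSON with its
# own shape; no fixed schema is imposed across rows.
table = [
    {"event_id": 1, "payload": json.dumps({"device": {"os": "ios"}, "clicks": 3})},
    {"event_id": 2, "payload": json.dumps({"device": {"os": "android"}})},
    {"event_id": 3, "payload": json.dumps({"referrer": "search"})},
]

def get_path(variant_json, *path, default=None):
    """Extract a nested field from a variant-like payload by path."""
    value = json.loads(variant_json)
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value

# Read one nested field across heterogeneous rows; missing paths yield None.
oses = [get_path(row["payload"], "device", "os") for row in table]
print(oses)  # ['ios', 'android', None]
```

Real variant implementations store an efficient binary encoding rather than JSON text, which is where the faster reads come from, but the access pattern is the same.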
For large language model (LLM) workloads, Retrieval‑Augmented Generation (RAG) systems require vector databases or vector‑type data‑lake formats. Specialized lake formats like Lens are designed to meet the vector storage needs of the LLM era.
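The retrieval step of a RAG system reduces to nearest‑neighbor search over stored embeddings. The following minimal sketch shows that core operation with cosine similarity over tiny placeholder vectors; production systems delegate this to a vector database or a vector‑capable lake format, and the documents and 4‑dimensional vectors here are invented for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, store, k=2):
    # Rank stored embeddings by similarity to the query; return top-k docs.
    scored = sorted(store, key=lambda item: cosine(query, item["vec"]), reverse=True)
    return [item["doc"] for item in scored[:k]]

# Toy embedding store: in practice these vectors come from an embedding model.
store = [
    {"doc": "iceberg schema evolution", "vec": [0.9, 0.1, 0.0, 0.2]},
    {"doc": "vector search for RAG",    "vec": [0.1, 0.9, 0.3, 0.0]},
    {"doc": "protobuf row storage",     "vec": [0.8, 0.0, 0.1, 0.3]},
]

query = [0.85, 0.05, 0.05, 0.25]
print(top_k(query, store))
```

A vector‑native lake format matters here because this brute‑force scan does not scale; columnar vector storage plus approximate indexes keep retrieval fast at LLM‑era data volumes.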
Overall, the integration of data lakes and warehouses enhances AI by improving feature management efficiency and supporting new data formats, presenting both opportunities and challenges for the GenAI era.
Later, moderator Jin invited Zhang Jing to share Kuaishou’s approach to machine‑learning feature support. Zhang described Kuaishou’s “sample lake,” a unified stream‑batch storage that holds real‑time and offline samples and facilitates flexible feature engineering.
In offline research, users can dynamically concatenate or drop columns to evaluate feature impact, with multiple users adding columns in parallel for rapid iteration—an advantage provided by the data lake.
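The iteration pattern described above can be sketched as a keyed join: a base sample table is keyed by sample ID, and each researcher publishes a candidate feature as an independent column keyed the same way, so columns can be concatenated or dropped without rewriting the samples. All names and values here are illustrative, not Kuaishou's actual schema.

```python
# Base sample table, keyed by sample_id.
base_samples = {
    101: {"label": 1},
    102: {"label": 0},
    103: {"label": 1},
}

# Two users add candidate feature columns in parallel, each as its own map.
feature_a = {101: 0.7, 102: 0.1, 103: 0.9}  # e.g. a watch-time ratio
feature_b = {101: 3, 103: 5}                # e.g. a sparse follow count

def attach(samples, name, column, default=None):
    """Concatenate one feature column onto the samples by key."""
    return {
        sid: {**row, name: column.get(sid, default)}
        for sid, row in samples.items()
    }

# Each attach is independent, so feature columns can be evaluated,
# reordered, or dropped without touching the base samples.
training_view = attach(attach(base_samples, "feat_a", feature_a), "feat_b", feature_b)
print(training_view[102])  # {'label': 0, 'feat_a': 0.1, 'feat_b': None}
```

Because each feature lives in its own column keyed by sample ID, dropping a feature that did not help is just removing one `attach` call, which mirrors the parallel, low‑cost iteration the sample lake is described as enabling.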
Kuaishou is also exploring multimodal sample storage, introducing variant types for efficient handling of unstructured data, a topic under discussion in Delta Lake and Apache Iceberg V3.
API support is another focus: Apache Iceberg and Delta Lake offer Rust and Python APIs, allowing users to access and process data directly from Python; Apache Hudi is also launching a Rust project (hudi-rs) to improve the AI‑focused lake experience.
The discussion concluded that data lakes are crucial for AI, prompting the question of how they can best support AI workloads.
DataFun’s data‑lake workshop includes a chapter on AI vector computation, supporting efficient training and inference for large models.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.