
AI‑Era Multimodal Data Lake Infrastructure: TBDS Design, Storage, Compute, and Governance

The article analyzes how Tencent Cloud's TBDS platform tackles the AI era's multimodal data lake challenges through a native storage format (Lance), elastic Ray‑based compute, standardized metadata with Gravitino, and automated governance via Lakekeeper, citing architecture details, performance numbers, and real‑world deployments.

DataFunSummit

Challenges in AI‑era multimodal data lakes

AI reshapes data processing, creating a two‑way relationship between “AI for data” and “Data for AI”. Modern multimodal data—text, images, video, embeddings, annotations and versions—requires high‑performance random access, semantic retrieval and fine‑grained version control, exposing the limits of traditional data lakes.

TBDS platform overview

TBDS integrates foundational engines (HDFS, S3, Spark, Flink, Presto) under a unified metadata and permission framework, lowering the barrier for building AI‑centric data foundations.

Key innovations

Storage – Lance format

TBDS deeply integrates Lance, a storage format designed for AI workloads within a storage–compute‑separated architecture. Lance's columnar layout uses fixed‑size offset structures, so both fixed‑ and variable‑length multimodal values can be fetched with O(1) random access. Built‑in high‑performance vector indexes (IVF, HNSW) provide millisecond‑level semantic search. Combined with the distributed cache TBDS‑FS, Lance delivers superior throughput and cost efficiency for massive vector and multimodal workloads.
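The offset idea behind that O(1) access can be illustrated with a small pure‑Python sketch (this shows the encoding concept only, not Lance's actual on‑disk layout): variable‑length blobs are stored back to back, and a fixed‑size offsets array turns element lookup into two array reads instead of a scan.

```python
# Sketch of offset-based O(1) random access for variable-length data.
# Illustrates the idea behind AI-oriented columnar formats like Lance;
# this is NOT the real Lance implementation or file layout.

class VarLenColumn:
    """Variable-length values stored contiguously, plus a fixed-size
    offsets array: element i spans bytes [offsets[i], offsets[i+1])."""

    def __init__(self, values: list[bytes]):
        self.buffer = b"".join(values)
        self.offsets = [0]
        for v in values:
            self.offsets.append(self.offsets[-1] + len(v))

    def get(self, i: int) -> bytes:
        # Two O(1) lookups -- no scan over preceding elements.
        return self.buffer[self.offsets[i]:self.offsets[i + 1]]

col = VarLenColumn([b"text", b"a longer annotation", b"img-bytes"])
print(col.get(1))  # b'a longer annotation'
```

Because the offsets array has fixed‑width entries, the same trick that makes fixed‑length rows addressable by index extends to variable‑length multimodal payloads.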

Compute – Ray framework

TBDS extends the Ray framework to schedule heterogeneous CPU/GPU resources across Kubernetes clusters. Ray's Python‑native API matches how AI developers already work, letting preprocessing, vectorization and model training run in one unified framework. Custom optimizations enable dynamic scaling across multiple K8s clusters, reducing data movement between preprocessing and training and improving overall AI development efficiency.
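The fan‑out/fan‑in shape of such a pipeline can be sketched locally (here using the standard library's `concurrent.futures` as a stand‑in for Ray's `@ray.remote` tasks; the function names and the toy embedding are illustrative, not TBDS's implementation):

```python
# Local stand-in for a Ray-style pipeline: in Ray, preprocess/vectorize
# would be @ray.remote tasks scheduled over CPU/GPU nodes; here the same
# fan-out/fan-in structure runs on a thread pool.
from concurrent.futures import ThreadPoolExecutor

def preprocess(doc: str) -> str:
    # e.g. cleaning and chunking; here just normalization
    return doc.strip().lower()

def vectorize(text: str) -> list[float]:
    # placeholder embedding: vowel-frequency features
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

def pipeline(docs: list[str]) -> list[list[float]]:
    with ThreadPoolExecutor() as pool:
        cleaned = list(pool.map(preprocess, docs))   # stage 1: preprocess
        return list(pool.map(vectorize, cleaned))    # stage 2: vectorize

embeddings = pipeline(["  Hello Lake  ", "Multimodal DATA"])
print(len(embeddings), len(embeddings[0]))  # 2 5
```

In the real deployment the two stages would be Ray tasks, so the scheduler can place preprocessing on CPU nodes and downstream training on GPU nodes without materializing intermediates outside the cluster.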

Metadata – Gravitino

TBDS co‑developed Gravitino (internal “Meet Service”), a lightweight IBC‑enabled catalog that replaces Hive Metastore. Gravitino standardizes metadata for new formats such as Lance, supports discovery of tables, functions, file sets and models, and enforces strong permission control under high concurrency, avoiding lock contention in multi‑tenant scenarios.

Governance – Lakekeeper

Lakekeeper automates incremental data monitoring, vector‑index health evaluation, small‑file merging and old‑version cleanup. Leveraging Ray’s distributed execution and TBDS‑FS caching, it provides continuous high‑performance retrieval while preventing storage bloat, turning complex governance tasks into background processes.
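One of those background tasks, small‑file merging, can be sketched as a simple compaction planner (thresholds, names, and logic are illustrative assumptions, not Lakekeeper's actual API):

```python
# Sketch of a small-file compaction planner: files under a size
# threshold are grouped into merge batches capped at a target output
# size. Illustrative only -- not Lakekeeper's actual logic.

def plan_compaction(
    file_sizes: dict[str, int],
    small_threshold: int = 32 << 20,  # files below 32 MiB count as "small"
    target_size: int = 256 << 20,     # aim for ~256 MiB merged outputs
) -> list[list[str]]:
    small = sorted(
        (name for name, size in file_sizes.items() if size < small_threshold),
        key=lambda n: file_sizes[n],
    )
    batches: list[list[str]] = []
    current, current_size = [], 0
    for name in small:
        if current and current_size + file_sizes[name] > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += file_sizes[name]
    if len(current) > 1:  # merging a single leftover file gains nothing
        batches.append(current)
    return batches

# Two tiny files get merged; the 500 MiB file is left alone.
print(plan_compaction({"a": 1 << 20, "b": 2 << 20, "c": 500 << 20}))
```

A governance service would run such a planner periodically and hand the batches to distributed workers (Ray, in TBDS's case) for the actual rewrite.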

Architecture

Storage layer: TBDS‑FS abstracts HDFS, S3 and supports Iceberg and Lance formats.

Metadata & compute layer: Gravitino and Ray combine unified metadata management with heterogeneous CPU/GPU compute and power intelligent governance.

API layer: Rich interfaces for data engineering, AI analytics and Data Agents cover the full spectrum from structured analysis to multimodal AI applications.

Practical deployment – Retrieval‑Augmented Generation (RAG)

In a RAG pipeline, raw text, images and embeddings are preprocessed and vectorized by Ray, stored in Lance tables, and queried through a dual‑path (keyword + vector) recall. This improves knowledge‑base answer quality by combining precise keyword matching with deep semantic understanding.
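The dual‑path recall can be sketched as blending a keyword‑overlap score with a vector‑similarity score (a pure‑Python illustration; the scoring functions and the blending weight are hypothetical, not TBDS's implementation):

```python
# Sketch of dual-path (keyword + vector) recall: each document gets a
# keyword-overlap score and a cosine-similarity score, blended into one
# ranking. Illustrative only.
import math

def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dual_path_recall(query, query_vec, corpus, alpha=0.5, top_k=2):
    # corpus: list of (text, embedding); alpha weights the keyword path
    scored = [
        (alpha * keyword_score(query, text)
         + (1 - alpha) * cosine(query_vec, vec), text)
        for text, vec in corpus
    ]
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]

corpus = [("loan interest rates", [1.0, 0.0]),
          ("cat pictures", [0.0, 1.0])]
print(dual_path_recall("interest rates", [1.0, 0.0], corpus, top_k=1))
```

The keyword path catches exact terminology (product names, regulation IDs) that embeddings can blur, while the vector path recalls semantically related passages with no lexical overlap; blending the two is what lifts answer quality.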

A large financial institution processed >10 TB of data and >10 billion text slices. The storage‑compute separation and elastic scaling reduced hardware costs by >70 % while delivering a high‑quality internal knowledge base for large models.

Technical insights

Converging storage and search is inevitable in the AI era. TBDS’s separation architecture maintains massive storage capacity while delivering search efficiency and cost advantages beyond traditional engines. It favors strong single‑table performance to simplify data modeling and keeps the stack open to evolving AI algorithms.

Conclusion

By combining Lance (storage), Ray (compute), Gravitino (metadata) and Lakekeeper (governance), TBDS constructs an AI‑native multimodal data lake that reduces cost, accelerates AI workflows and supports enterprise‑scale deployments.

big data · AI infrastructure · Gravitino · TBDS · Lance format · multimodal data lake · Lakekeeper · Ray framework
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
