How Big Data and AI Converge: Insights from Alibaba Cloud’s 2023 Conference

The talk outlines the evolution from model‑centric to data‑centric AI development, explains Alibaba Cloud’s integrated big data‑AI platform, showcases real‑world use cases like knowledge‑base QA and personalized recommendation, and details the underlying cloud‑native services that enable seamless data and AI collaboration.

Alibaba Cloud Big Data AI Platform

Speaker and Topic

Speaker: Lin Wei, Alibaba Cloud researcher and chief architect of the AI platform PAI and DataWorks. Topic: Interpretation of big data‑AI integration.

AI Explosion and Engineering Challenges

2023 marks a year of AI explosion, driven by large language models. Training such models requires massive compute, abundant data, and efficient tools for rapid iteration.

From Model‑Centric to Data‑Centric Development

Historically, AI development centered on model architecture, because data and compute were limited. Today the paradigm is shifting to data‑centric development, which emphasizes large‑scale unsupervised training together with extensive data cleaning, validation, and quality assessment. High‑quality data is crucial; poor data harms model performance.

Big Data‑AI Integrated Platform

Alibaba Cloud unifies data and AI at the infrastructure layer, offering CPU‑based clusters for big data and RDMA‑enabled heterogeneous clusters for large‑model training. The platform integrates data ingestion, large‑scale offline analysis, streaming computation, and AI model training (PAI). Model deployment leverages vector databases such as Hologres.

Use Cases

Knowledge‑base enhanced LLM QA : Clean and shard knowledge‑base data, embed it into vectors, store in a vector database, retrieve relevant vectors for query, and constrain the LLM to improve answer accuracy.
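The steps in this use case can be sketched in a few lines. This is a minimal illustration only, not the platform's implementation: the bag‑of‑words `embed` function stands in for a real embedding model, the in‑memory `VectorStore` stands in for a vector database, and all function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: token counts (assumption: a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class VectorStore:
    """In-memory stand-in for a vector database."""
    def __init__(self):
        self.rows = []

    def add(self, text):
        self.rows.append((embed(text), text))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(question, passages):
    """Constrain the LLM to the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below. If the answer is not there, say so.\n"
            f"Context:\n{context}\nQuestion: {question}")

# Toy knowledge base (illustrative sentences only).
docs = [
    "MaxCompute is a data warehouse service for large-scale offline analysis.",
    "Hologres serves real-time queries and can store vectors.",
    "PAI is the platform used for AI model training.",
]
store = VectorStore()
for doc in docs:
    for piece in chunk(doc):
        store.add(piece)

question = "Which service is used for model training?"
top = store.search(question, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

The retrieval step narrows the prompt to passages that share vocabulary with the question, so the model answers from the knowledge base rather than from its parameters alone.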

Personalized recommendation : Continuously update models with real‑time behavior data, combining offline batch processing and online feature extraction to generate model deltas applied daily.
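One way to picture a "model delta" is as the set of weight changes learned from fresh behavior events, kept separate from the offline base model and merged on a schedule. The sketch below assumes a sparse logistic‑regression click model; the talk does not specify the model type, and every name here is hypothetical.

```python
import math
from collections import defaultdict

def predict(weights, features):
    """Click probability from a sparse logistic-regression model."""
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def compute_delta(base, events, lr=0.1):
    """Run SGD over fresh events and return only the weight changes."""
    delta = defaultdict(float)
    for features, label in events:
        # Score with base weights plus the changes accumulated so far.
        current = {f: base.get(f, 0.0) + delta[f] for f in features}
        err = predict(current, features) - label
        for f, v in features.items():
            delta[f] -= lr * err * v
    return dict(delta)

def apply_delta(base, delta):
    """Merge a delta into the base model, producing the next serving model."""
    merged = dict(base)
    for f, d in delta.items():
        merged[f] = merged.get(f, 0.0) + d
    return merged

# Base model from the offline batch job (toy values).
base = {"user:1|cat:shoes": 0.0, "user:1|cat:books": 0.0}
# Real-time behavior: the user clicked shoes twice and ignored books once.
events = [
    ({"user:1|cat:shoes": 1.0}, 1),
    ({"user:1|cat:shoes": 1.0}, 1),
    ({"user:1|cat:books": 1.0}, 0),
]
delta = compute_delta(base, events)
updated = apply_delta(base, delta)
```

Shipping only `delta` keeps the update small compared with republishing the full model.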

Technical Foundations

Unified workspace : PAI aggregates diverse compute resources (ECS, streaming, GPU clusters, container services) into a single development environment.

Flow framework : Connects data processing and model training steps via static graphs, SDKs, or visual interfaces.
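A static graph of this kind can be thought of as named steps plus their upstream dependencies, executed in dependency order. The following is a generic sketch using Python's standard `graphlib` module, not the PAI SDK; the step names and the `run_flow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def run_flow(steps, deps):
    """Run each step after its upstream steps.

    steps: step name -> callable taking the dict of results so far
    deps:  step name -> set of upstream step names
    """
    results = {}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        results[name] = steps[name](results)
    return order, results

# A toy pipeline: data processing steps feed a "training" step.
steps = {
    "clean": lambda r: [x for x in [1.0, None, 3.0] if x is not None],
    "features": lambda r: [x * 2 for x in r["clean"]],
    "train": lambda r: sum(r["features"]) / len(r["features"]),  # "model" = mean
    "evaluate": lambda r: max(abs(x - r["train"]) for x in r["features"]),
}
deps = {
    "clean": set(),
    "features": {"clean"},
    "train": {"features"},
    "evaluate": {"train"},
}
order, results = run_flow(steps, deps)
```

Because the graph is declared up front, a scheduler can inspect it before execution, for example to route data steps and training steps to different compute resources.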

Serverless cloud‑native services : Sharing compute resources at the hardware, container, and service layers reduces cost, at the price of added system complexity.

Unified scheduling : Enhances Kubernetes to handle heterogeneous workloads, supports high‑concurrency short‑duration big‑data tasks, and provides network‑topology‑aware scheduling for AI training.

Multi‑tenant security isolation : Implements robust isolation at storage and network layers to safely co‑locate big data and AI services.

Container Compute Service (ACS) : Allows fine‑grained resource allocation between big data and AI workloads on a unified substrate.

Multi‑level quota : Enables administrators to reallocate resources dynamically for large‑scale model training.
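A multi‑level quota can be modeled as a tree in which each node's quota is carved out of its parent's, and an administrator moves idle quota between siblings. The sketch below is an assumption about how such a scheme might look, not the platform's actual data model; `QuotaNode` and `transfer` are hypothetical names.

```python
class QuotaNode:
    """A node in a quota tree; child quotas may not exceed the parent's quota."""

    def __init__(self, name, gpus, parent=None):
        self.name = name
        self.gpus = gpus      # quota assigned to this node
        self.used = 0         # GPUs currently in use by this node's jobs
        self.children = []
        self.parent = parent
        if parent is not None:
            if parent.allocated() + gpus > parent.gpus:
                raise ValueError(f"{parent.name} has no room for {name}")
            parent.children.append(self)

    def allocated(self):
        """Quota already handed down to child nodes."""
        return sum(child.gpus for child in self.children)

    def idle(self):
        """Quota that is neither in use nor handed down to children."""
        return self.gpus - max(self.used, self.allocated())

def transfer(src, dst, gpus):
    """Move idle quota from one node to a sibling under the same parent."""
    if src.parent is None or src.parent is not dst.parent:
        raise ValueError("quota can only move between siblings")
    if src.idle() < gpus:
        raise ValueError(f"{src.name} has only {src.idle()} idle GPUs")
    src.gpus -= gpus
    dst.gpus += gpus

org = QuotaNode("org", 64)
analytics = QuotaNode("analytics", 32, parent=org)
llm = QuotaNode("llm-training", 32, parent=org)
analytics.used = 8

# Free up capacity for a large training run.
transfer(analytics, llm, 16)
```

The parent total is unchanged by a transfer, so reallocation never oversubscribes the cluster.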

Topology‑aware scheduling : Optimizes All‑Reduce communication patterns, achieving 30‑40% performance gains.
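The idea behind topology‑aware placement can be illustrated with a simple heuristic: put a job's workers under as few network switches as possible so that collective communication crosses fewer hops. This is a simplified sketch under that assumption, not the scheduler's actual algorithm; the function and node names are hypothetical.

```python
def place_workers(free_nodes, n_workers):
    """Choose nodes for a job so that they span as few switches as possible.

    free_nodes: switch name -> list of free node names under that switch
    """
    total = sum(len(nodes) for nodes in free_nodes.values())
    if n_workers > total:
        raise ValueError("not enough free nodes")

    # Best case: one switch can host the whole job. Prefer the tightest fit
    # so that larger switches stay free for larger jobs.
    fits = [s for s, nodes in free_nodes.items() if len(nodes) >= n_workers]
    if fits:
        best = min(fits, key=lambda s: len(free_nodes[s]))
        return free_nodes[best][:n_workers]

    # Otherwise fill from the largest switches first, which minimizes the
    # number of switches the job spans.
    chosen = []
    for switch in sorted(free_nodes, key=lambda s: len(free_nodes[s]), reverse=True):
        chosen.extend(free_nodes[switch][: n_workers - len(chosen)])
        if len(chosen) == n_workers:
            break
    return chosen

free = {
    "sw-a": ["a1", "a2"],
    "sw-b": ["b1", "b2", "b3", "b4"],
    "sw-c": ["c1", "c2", "c3"],
}
small_job = place_workers(free, 3)  # fits under a single switch
large_job = place_workers(free, 6)  # must span two switches
```

A production scheduler would also weigh link bandwidth and the ring or tree order of the All‑Reduce itself; this sketch only shows the locality principle.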

MaxCompute 4.0 Data+AI : Introduces MaxFrame, a distributed Python computing framework, to bridge data management and AI, and integrates Flink‑Paimon for streaming + online ML.

DatasetAcc : Caches remote data‑warehouse files close to the training compute, which accelerates AI training pipelines.
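The benefit of a local cache is easiest to see across training epochs, where the same files are read repeatedly. The sketch below uses a plain LRU cache in front of a slow remote read; it is an illustration of the caching principle only, and `CachedReader`, `fetch_remote`, and the shard paths are hypothetical, not the DatasetAcc API.

```python
from collections import OrderedDict

class CachedReader:
    """LRU cache in front of a slow remote read function."""

    def __init__(self, fetch_remote, capacity):
        self.fetch_remote = fetch_remote
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)    # mark as most recently used
            self.hits += 1
            return self.cache[path]
        self.misses += 1
        data = self.fetch_remote(path)
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        return data

remote_calls = []

def fetch_remote(path):
    """Stand-in for reading a file from a remote data warehouse."""
    remote_calls.append(path)
    return f"contents of {path}"

reader = CachedReader(fetch_remote, capacity=3)
shards = ["shard-0", "shard-1", "shard-2"]
for epoch in range(2):
    for shard in shards:
        reader.read(shard)
```

With a cache large enough to hold the dataset, only the first epoch touches remote storage; later epochs are served locally.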

DataWorks Copilot : A code assistant that translates natural‑language queries into SQL, fine‑tuned on domain‑specific data, improving developer productivity by ~30%.

DataWorks AI‑enhanced analysis : Uses AI to automatically generate data insights, accelerating data understanding.

Conclusion

The speaker emphasizes that big data and AI mutually reinforce each other, and the integrated platform aims to accelerate intelligent data applications.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
