Unlocking Big Data Ops with Large Models: Opportunities, Challenges, and Design
This article summarizes a Cloud Summit talk where Alibaba Cloud’s AI expert Zhang Yingying explains how large language models can enhance big‑data intelligent operations, covering opportunities, challenges, RAG‑based Q&A, multi‑agent diagnostics, and the engineering architecture needed for reliable, scalable deployment.
Speaker and Agenda
Speaker: Zhang Yingying, Algorithm Expert, Alibaba Cloud Intelligent Group
Main Topics:
Opportunities and challenges of large models for big‑data intelligent operations
RAG‑based intelligent Q&A
Multi‑Agent based intelligent diagnosis
Large‑model application deployment architecture
1. Opportunities and Challenges
Large models, sparked by the 2022 ChatGPT release, bring natural interaction, extensive knowledge, reasoning, and self-reflection capabilities. Set against the two long-standing pain points of big-data ops (fine-grained maintenance and massive data analysis), two practical scenarios emerge: user-facing intelligent Q&A and platform-level intelligent diagnosis.
Applying a model directly to these scenarios, however, runs into three obstacles: hallucination, slow knowledge updates, and high fine-tuning costs.
2. RAG‑Based Intelligent Q&A
To mitigate hallucination and knowledge‑staleness, Retrieval‑Augmented Generation (RAG) is introduced. An external vector database stores relevant documents; the model retrieves and incorporates them as context, turning a closed‑book exam into an open‑book one.
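The open-book analogy reduces to a retrieve-then-prompt loop. The sketch below is a minimal illustration, not the production pipeline: a toy bag-of-words similarity stands in for the real embedding model and vector store (OpenSearch in the talk), and the document contents and prompt wording are illustrative assumptions.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call a vector model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

DOCS = [  # illustrative knowledge-base entries
    "Flink checkpoint failures are often caused by backpressure.",
    "OpenSearch stores vectorized knowledge for retrieval.",
]
INDEX = [(d, embed(d)) for d in DOCS]

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(INDEX, key=lambda p: cosine(q, p[1]), reverse=True)
    return [d for d, _ in ranked][:k]

def build_prompt(query):
    # Retrieved passages become the "open book" context for the model.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key property is that the model only sees retrieved context, so updating the knowledge base updates the answers without any fine-tuning.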
Knowledge construction uses a multi‑granularity extraction framework: the model first summarizes the whole document, then each top‑level heading, and finally extracts dialogue‑style knowledge with a fine‑tuned Doc2QA algorithm. The extracted knowledge is vectorized and stored in OpenSearch.
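The multi-granularity idea can be sketched as a pipeline that produces one knowledge unit per level: whole document, each top-level heading, and QA pairs. Here `summarize` and `doc2qa` are placeholders for model calls (the talk's fine-tuned Doc2QA is not public), and the markdown-style `#` heading convention is an assumption.

```python
import re

def split_by_headings(doc):
    """Split a document into (heading, body) sections at top-level '# ' headings."""
    parts = re.split(r"(?m)^# ", doc)
    sections = []
    for part in parts[1:]:  # parts[0] is any preamble before the first heading
        heading, _, body = part.partition("\n")
        sections.append((heading.strip(), body.strip()))
    return sections

def build_knowledge(doc, summarize, doc2qa):
    """Multi-granularity extraction: doc summary, per-section summaries, QA pairs."""
    units = [{"level": "document", "text": summarize(doc)}]
    for heading, body in split_by_headings(doc):
        units.append({"level": "section", "heading": heading, "text": summarize(body)})
        units.extend({"level": "qa", "q": q, "a": a} for q, a in doc2qa(body))
    return units
```

Each unit would then be vectorized and written to the store; keeping the level tag lets retrieval prefer fine-grained QA units for specific questions and summaries for broad ones.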
During retrieval, user queries are semantically expanded, followed by a hybrid of vector and sparse search, re‑ranking, and a graph‑based approach (RAG on Graph) that leverages document hierarchy and citation relationships to improve recall for difficult queries.
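Merging vector and sparse result lists needs a fusion step before re-ranking. Reciprocal rank fusion (RRF) is one common choice, shown below as an assumption about how the hybrid could work rather than the talk's exact method:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.
    Each document scores sum(1 / (k + rank)) over every list it appears in,
    so items ranked well by multiple retrievers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both the vector list and the sparse list outranks one that tops only a single list, which is exactly the behavior hybrid search is after.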
Prompt engineering and safety checks (e.g., hyperlink and code handling, sensitive‑word detection) are applied to ensure answer quality.
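The post-processing stage can be sketched as a small sanitizer. The blocklist and the link-masking policy here are hypothetical stand-ins for the real sensitive-word detection and hyperlink handling:

```python
import re

SENSITIVE = {"password", "secret_token"}  # hypothetical blocklist

def sanitize_answer(text):
    """Post-process a model answer: mask bare hyperlinks, flag sensitive words."""
    # Replace raw URLs so untrusted or hallucinated links are not shown verbatim.
    cleaned = re.sub(r"https?://\S+", "[link removed]", text)
    hits = [w for w in SENSITIVE if w in cleaned.lower()]
    return cleaned, hits
```

In a real deployment the flagged answer would be blocked or rewritten rather than returned with the hit list.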
3. Multi‑Agent Intelligent Diagnosis
The diagnosis system mimics a fault‑response team. Each Agent consists of Memory, Tools, and Planning modules. Multiple specialized Agents (e.g., metric anomaly detection, log anomaly detection, historical fault similarity) are coordinated by a System Agent.
Core tools include:
Metric anomaly detection covering five typical patterns (mean shift, variance change, spikes, cliffs, trend alerts).
Log anomaly detection using high‑performance clustering and knowledge‑base matching.
Historical fault similarity search that retrieves both numeric and textual features from a database.
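The spike pattern from the first tool can be illustrated with a trailing-window z-score detector. This is a minimal sketch for intuition, not the production algorithm, and the window and threshold values are assumptions:

```python
import statistics

def spike_points(series, window=5, threshold=3.0):
    """Flag indices where a point deviates from the trailing-window mean
    by more than `threshold` standard deviations (the spike pattern)."""
    spikes = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.mean(past)
        sd = statistics.pstdev(past)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            spikes.append(i)
    return spikes
```

The other patterns (mean shift, variance change, cliffs, trends) need different statistics, which is why the production tool models them separately.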
A forward‑feedback workflow follows the dependency topology, while a backward‑feedback loop allows agents to refine their conclusions after receiving information from peers. The final aggregated result is presented to SREs with detailed per‑agent analysis.
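The forward pass plus backward refinement can be sketched with stub agents: each analyzes once, then repeatedly revises its finding after seeing peers' findings. The dict-based agent interface and the fixed round count are simplifying assumptions:

```python
def diagnose(agents, rounds=2):
    """Forward pass: every agent analyzes independently.
    Backward feedback: each agent refines its finding given peers' findings."""
    findings = {name: agent["analyze"]() for name, agent in agents.items()}
    for _ in range(rounds):
        peers = dict(findings)  # snapshot so all agents see the same round
        for name, agent in agents.items():
            findings[name] = agent["refine"](peers)
    return findings
```

A System Agent would aggregate the final `findings` dict into the per-agent report shown to SREs.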
4. Deployment Architecture
The stack consists of a data layer (metrics, logs, topology, documents, events), an algorithm service layer (hosted on PAI or Flink, deployed as trigger‑based or resident services), and a large‑model service layer (model, prompt, workflow, conversation management). Agents are instantiated per cluster, with full observability for tool calls, reasoning chains, and multi‑agent workflows.
Development follows a decoupled model: algorithm developers work locally with LangChain, then deploy to the cloud. A visual UI allows role configuration, chain design, and tool management.
Observability dashboards expose each Agent’s tool inputs/outputs, enabling rapid debugging when model outputs deviate from expectations.
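One lightweight way to capture tool inputs/outputs is a tracing decorator; the record fields below are an assumed schema, not the platform's actual dashboard format:

```python
import functools
import json
import time

def observed(tool):
    """Record each tool call's inputs, output, and latency for later inspection."""
    trace = []

    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = tool(*args, **kwargs)
        trace.append({
            "tool": tool.__name__,
            "inputs": json.dumps({"args": args, "kwargs": kwargs}, default=str),
            "output": str(result),
            "seconds": round(time.perf_counter() - start, 4),
        })
        return result

    wrapper.trace = trace  # dashboard reads this call log
    return wrapper
```

When a model's conclusion looks wrong, the trace shows whether a tool returned bad data or the reasoning chain misused a good result.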
5. Summary and Outlook
The talk recaps the background, opportunities, and challenges of applying large models to intelligent ops, demonstrates RAG‑based Q&A and multi‑agent diagnosis, and shares the supporting engineering architecture. Future work includes stronger model bases, more natural human‑AI interaction, flexible workflow orchestration, and agile MLOps for continuous improvement.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
