
Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment

This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment across offline and real‑time Spark/Flink diagnostics, and future outlooks.

DataFunSummit

Background

Bilibili operates a massive video‑sharing platform whose big‑data platform supports AI, commerce, and many other lines of business. The platform follows a "five‑layer integrated" plus "storage‑compute separation" architecture, with a distributed file system, intelligent scheduling, Spark/Flink engines, Kafka, ClickHouse, and various custom tools.

Daily workloads include 270,000 offline tasks, ~20,000 ad‑hoc queries and 7,000 real‑time jobs, generating thousands of user consultations per week that require dedicated troubleshooting for task failures and slowdowns.

User Problems

Why does a task fail? (kernel bugs, component upgrades, data quality, memory issues, etc.)

Why does a task become slow? (hardware aging, resource scheduling pressure, data skew, etc.)

These issues are complex and time‑consuming to diagnose manually, motivating the need for an intelligent assistant.

Intelligent Diagnosis Assistant Goals

Answer private‑domain queries about internal tools and SQL.

Diagnose task failures or performance regressions.

Provide historical ("time‑machine") analysis.

Integrate various agents (host, network, Spark, Flink, etc.) using a ReAct‑style observation‑action loop.

Principle Analysis

Architecture Overview

The system consists of two main components: a knowledge base storing historical cases and solutions, and a user‑diagnosis engine that receives appid/jobid information and invokes multiple agents (Spark, Flink, host, network) via a ReAct mechanism to iteratively reach a satisfactory answer.

Knowledge‑Base Construction

Knowledge‑base construction runs in two stages: data indexing (chunk documents, vectorize the chunks, and store them in a vector database) and user query (embed the query, retrieve similar chunks, and use a large language model with in‑context learning to generate the final answer).
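The two stages can be sketched in a few lines. This is a minimal, self‑contained illustration: the bag‑of‑words "embedding" and the in‑memory index are toy stand‑ins for the real embedding model and vector database, and the prompt assembly is our own example, not Bilibili's actual template.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the production system would call an
    # embedding model and persist dense vectors in a vector database.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: data indexing -- chunk documents, vectorize, store in the index.
documents = [
    "Spark task failed with OutOfMemoryError: increase executor memory",
    "Flink checkpoint timeout: check state backend and network latency",
]
index = [(embed(doc), doc) for doc in documents]

# Stage 2: user query -- embed the query, retrieve similar chunks, and
# assemble them into an in-context-learning prompt for the LLM.
def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda p: cosine(qv, p[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

context = retrieve("Spark job OutOfMemoryError")[0]
prompt = f"Context:\n{context}\n\nQuestion: why did my Spark job fail?"
```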

Precision Challenges

Embed only the questions, not the answers, so that stored vectors match how users phrase their queries and retrieval relevance improves.

Perform semantic chunking rather than simple character or sentence splitting.
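Both precision tactics are easy to show in miniature. In the sketch below (names and sample Q&A pairs are ours, and the term‑count "embedding" stands in for a real model), only the question half of each pair is embedded while the answer rides along as payload, and documents are chunked at section boundaries rather than fixed character windows.

```python
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy term-count "embedding"; stands in for a real embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def semantic_chunks(doc: str) -> list[str]:
    # Split at blank-line section boundaries instead of fixed-size
    # character windows, so each chunk stays one coherent topic.
    return [c.strip() for c in doc.split("\n\n") if c.strip()]

# Embed ONLY the question of each Q&A pair; the answer is attached as
# payload and never embedded, so vectors match user phrasing.
qa_pairs = [
    ("Why does my Spark job report data skew?",
     "Check key distribution and enable AQE skew-join handling."),
    ("How do I raise the Flink checkpoint timeout?",
     "Increase execution.checkpointing.timeout in the job config."),
]
index = [(embed(q), a) for q, a in qa_pairs]

def answer_for(query: str) -> str:
    qv = embed(query)
    best = max(index, key=lambda p: sum((qv & p[0]).values()))
    return best[1]
```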

Recall Challenges

Apply metadata filtering to the knowledge base.

Use top‑k retrieval followed by reranking (e.g., bge‑reranker‑v2‑m3) to reduce hallucinations.
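The recall pipeline composes the two steps: filter candidates by metadata, take a generous first‑stage top‑k, then rerank. In this sketch the entry schema is hypothetical and `rerank_score` is a toy token‑overlap stand‑in for the cross‑encoder (bge‑reranker‑v2‑m3 in the real system), which scores each (query, chunk) pair jointly.

```python
import re

def metadata_filter(entries, engine):
    # Narrow the candidate set with structured metadata before any
    # vector math; fewer candidates means fewer spurious matches.
    return [e for e in entries if e["engine"] == engine]

def rerank_score(query, text):
    # Toy stand-in for a cross-encoder reranker score.
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / max(len(q), 1)

def retrieve(entries, query, engine, k=2):
    candidates = metadata_filter(entries, engine)   # metadata filter
    topk = candidates[: k * 5]                      # generous first stage
    topk.sort(key=lambda e: rerank_score(query, e["text"]), reverse=True)
    return topk[:k]                                 # reranked final top-k

entries = [
    {"engine": "spark", "text": "Executor OOM: raise spark.executor.memory"},
    {"engine": "spark", "text": "Shuffle stall: check disk and network IO"},
    {"engine": "flink", "text": "Checkpoint timeout tuning guide"},
]
```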

Prompt Engineering

Follow OpenAI’s guidelines, structuring prompts with clear subject‑verb‑object patterns to improve LLM performance.
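A structured template in that spirit might look like the following. The section names and wording are our own illustration, not Bilibili's production prompt: an explicit role, a direct subject‑verb‑object instruction, delimited context, and an output constraint.

```python
def build_prompt(question: str, context: str) -> str:
    # Structured prompt: role, direct instruction, delimited context,
    # and an explicit fallback constraint to curb guessing.
    return (
        "Role: you are a big-data platform diagnosis assistant.\n"
        "Task: answer the user's question using only the context below.\n"
        "If the context is insufficient, say so instead of guessing.\n\n"
        f"### Context\n{context}\n\n"
        f"### Question\n{question}\n\n"
        "### Answer\n"
    )

prompt = build_prompt("Why is my ETL job slow?", "Stage 3 shows data skew.")
```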

Agent Paradigm

Agents act as routing mechanisms that direct user intents to the appropriate backend (offline or real‑time diagnostics) and maintain state to iteratively refine answers until a confidence threshold is met.
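The routing‑and‑iteration loop can be sketched as below. The agent functions and their canned observations are hypothetical placeholders; the real agents would query Spark/Flink/host/network backends, and the termination check would be a confidence threshold rather than a simple done flag.

```python
# Minimal ReAct-style loop: route the intent to an agent, take an action,
# record the observation, and stop once the agent reports a final answer.
def spark_agent(state):
    # Placeholder: a real agent would inspect Spark event logs/metrics.
    return {"observation": "stage 7 shows data skew", "done": True}

def flink_agent(state):
    # Placeholder: a real agent would inspect Flink job metrics.
    return {"observation": "checkpointing is healthy", "done": True}

AGENTS = {"offline": spark_agent, "realtime": flink_agent}

def diagnose(intent, max_steps=5):
    state = {"intent": intent, "trace": []}
    for _ in range(max_steps):
        agent = AGENTS[intent]                          # route intent
        result = agent(state)                           # action
        state["trace"].append(result["observation"])    # observation
        if result["done"]:                              # termination check
            return state
    return state
```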

Technical Implementation

Architecture Design

The assistant is accessed via enterprise WeChat, the data platform UI, and a diagnostic system, offering both consulting (RAG‑based) and diagnosis (agent‑based) capabilities.

Offline Diagnosis

Collect Spark and Kyuubi event logs, extract ~20 diagnostic metrics (data skew, shuffle stalls, ANSI compliance, memory usage, etc.), store them in a meta‑warehouse, and expose them through a real‑time service that feeds the assistant.
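One of the ~20 metrics, data skew, reduces to simple arithmetic over per‑task durations extracted from event logs. The exact formula Bilibili uses is not stated; a common heuristic, shown here as an assumption, compares the slowest task in a stage to the median task.

```python
from statistics import median

def skew_ratio(task_durations_ms: list[float]) -> float:
    # Heuristic skew metric: slowest task divided by the median task
    # in a stage. A large ratio suggests one partition is doing far
    # more work than its peers, i.e. data skew.
    m = median(task_durations_ms)
    return max(task_durations_ms) / m if m else float("inf")

# Nine tasks take ~100 ms; one straggler takes 1000 ms.
stage_durations = [100] * 9 + [1000]
```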

Real‑time Diagnosis

Stream Flink job metrics and errors into ClickHouse, poll them every minute, cache the results, and let the assistant retrieve and explain anomalies on demand.
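The poll‑and‑cache step can be sketched as a small TTL cache: the fetch callable (standing in for a ClickHouse query) runs at most once per interval, and the assistant reads the cached result on demand. Class and parameter names are ours.

```python
import time

class MetricCache:
    # Cache Flink metrics polled from the store (ClickHouse in the real
    # system) so the assistant reads the cache instead of re-querying
    # on every user request.
    def __init__(self, fetch, ttl_s: float = 60.0):
        self.fetch = fetch          # e.g. runs a SELECT against ClickHouse
        self.ttl_s = ttl_s          # poll interval: once a minute
        self._data = None
        self._stamp = 0.0

    def get(self):
        now = time.monotonic()
        if self._data is None or now - self._stamp >= self.ttl_s:
            self._data = self.fetch()   # refresh from the metrics store
            self._stamp = now
        return self._data
```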

Target Audience

SRE engineers needing rapid root‑cause analysis of host or network issues.

Component operators (Spark, Flink) troubleshooting resource contention or job failures.

Data‑warehouse developers diagnosing slow ETL SQL jobs.

General internal users seeking guidance on SQL syntax or internal tools.

Challenges and Outlook

Current Challenges

Improving answer precision to avoid missing critical configuration details.

Handling heterogeneous, low‑quality documentation for knowledge‑base ingestion.

Addressing complex, multi‑faceted user queries that require deep causal analysis.

Future Directions

Building multi‑expert systems that chain agents across components (e.g., Spark → HMS → host).

Reducing inference latency by distributing LLM computation.

Enhancing product UX with richer visualizations and interactive feedback loops.

In summary, Bilibili’s intelligent assistant combines a large language model, vector‑based retrieval, and a suite of specialized agents to automate troubleshooting and knowledge sharing across its extensive big‑data ecosystem, while continuously addressing precision, recall, and user‑experience challenges.

Tags: Big Data, Flink, RAG, agent, large language model, Spark, intelligent assistant
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
