Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment
This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering the background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment for offline and real‑time Spark/Flink diagnostics, and future directions.
Background
Bilibili operates a massive video‑sharing service whose big‑data platform supports AI, commerce, and many other business lines. The platform follows a "five‑layer integrated" plus "storage‑compute separation" architecture, with a distributed file system, intelligent scheduling, Spark/Flink engines, Kafka, ClickHouse, and various custom tools.
Daily workloads include 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and 7,000 real‑time jobs. Together they generate thousands of user consultations per week, each requiring dedicated troubleshooting of task failures and slowdowns.
User Problems
Why does a task fail? (kernel bugs, component upgrades, data quality, memory issues, etc.)
Why does a task become slow? (hardware aging, resource scheduling pressure, data skew, etc.)
These issues are complex and time‑consuming to diagnose manually, motivating the need for an intelligent assistant.
Intelligent Diagnosis Assistant Goals
Answer private‑domain queries about internal tools and SQL.
Diagnose task failures or performance regressions.
Provide historical ("time‑machine") analysis.
Integrate various agents (host, network, Spark, Flink, etc.) using a ReAct‑style observation‑action loop.
Principle Analysis
Architecture Overview
The system consists of two main components: a knowledge base storing historical cases and solutions, and a user‑diagnosis engine that receives appid/jobid information and invokes multiple agents (Spark, Flink, host, network) via a ReAct mechanism to iteratively reach a satisfactory answer.
Knowledge‑Base Construction
The pipeline has two stages: data indexing (chunking documents, vectorizing the chunks, and storing them in a vector DB) and user query (embedding the query, retrieving similar chunks, and using a large language model for in‑context learning to generate the final answer).
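The two stages can be sketched end to end. Everything below is an illustrative stand‑in: the bag‑of‑words `embed`, the in‑memory `KnowledgeBase`, and the sample chunks substitute for a real embedding model, vector DB, and document corpus.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class KnowledgeBase:
    """In-memory stand-in for the vector DB used in the indexing stage."""
    def __init__(self):
        self.index = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        self.index.append((embed(chunk), chunk))

    def retrieve(self, query, k=2):
        qv = embed(query)
        ranked = sorted(self.index, key=lambda p: cosine(qv, p[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

# Indexing stage: chunk documents, vectorize, store.
kb = KnowledgeBase()
kb.add("Spark task fails with OOM: increase executor memory or repartition")
kb.add("Flink checkpoint timeout: inspect state backend latency")

# Query stage: retrieve similar chunks, then hand them to the LLM as context.
context = kb.retrieve("Spark executor OOM failure", k=1)
prompt = f"Context:\n{context[0]}\n\nQuestion: why did my Spark task fail?"
```

In production each piece is swapped for the real component; the control flow (index once, retrieve per query, prompt the LLM with retrieved context) stays the same.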
Precision Challenges
Only embed questions, not answers, to improve relevance.
Perform semantic chunking rather than simple character or sentence splitting.
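Both precision measures can be illustrated in a few lines. The heading‑based splitter and the FAQ example below are illustrative rules, not Bilibili's actual chunker:

```python
import re

def semantic_chunks(doc):
    # Split on markdown-style headings so each chunk covers one coherent
    # topic, instead of cutting at fixed character offsets.
    parts = re.split(r"(?m)^## ", doc)
    return ["## " + p.strip() for p in parts if p.strip()]

doc = """## OOM errors
Raise spark.executor.memory or repartition the input.
## Data skew
Salt the hot keys before the join."""

chunks = semantic_chunks(doc)

# For Q&A documents, vectorize only the question text; the answer rides
# along as payload so its wording never dilutes the similarity match.
faq = [("Why does my Spark job OOM?", "Raise executor memory or repartition.")]
to_embed = [question for question, _answer in faq]
```

The payoff is that a user query about OOM matches a chunk (or stored question) about OOM specifically, rather than a fixed‑size window that straddles two unrelated topics.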
Recall Challenges
Apply metadata filtering to the knowledge base.
Use top‑k retrieval followed by reranking (e.g., bge‑reranker‑v2‑m3) to reduce hallucinations.
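The two recall measures compose naturally: filter first, then rescore the survivors. The `overlap_score` below is a toy lexical scorer that only keeps the sketch runnable; in practice it would be a cross‑encoder such as bge‑reranker‑v2‑m3, and the metadata field names are assumptions:

```python
def metadata_filter(docs, **criteria):
    # Narrow the candidate pool before vector search, e.g. by engine or
    # document type (field names here are illustrative).
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

def rerank(query, passages, score_fn, top_n=1):
    # score_fn stands in for a cross-encoder reranker, which scores each
    # (query, passage) pair jointly rather than via separate embeddings.
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)[:top_n]

def overlap_score(query, passage):
    # Toy lexical-overlap scorer, used only to make the sketch executable.
    return len(set(query.lower().split()) & set(passage.lower().split()))

docs = [
    {"engine": "flink", "text": "Flink checkpoint failed after state backend timeout"},
    {"engine": "spark", "text": "Spark shuffle spill caused by memory pressure"},
    {"engine": "flink", "text": "Flink job restarts traced to checkpoint timeout"},
]

candidates = [d["text"] for d in metadata_filter(docs, engine="flink")]
best = rerank("flink checkpoint timeout", candidates, overlap_score, top_n=1)
```

Retrieving a generous top‑k and then reranking trades a little latency for context that is actually on topic, which is what keeps the LLM from hallucinating around irrelevant passages.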
Prompt Engineering
Follow OpenAI’s guidelines, structuring prompts with clear subject‑verb‑object patterns to improve LLM performance.
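One way to apply that guideline is a template with an explicit role, task, context slot, and answer format. The wording and fields below are illustrative, not Bilibili's actual prompt:

```python
# Hypothetical diagnosis prompt: role, task, context, and a fixed answer
# shape, each stated as a short subject-verb-object sentence.
DIAGNOSIS_PROMPT = """You are a diagnosis assistant for a big-data platform.
Task: explain why the given job failed or slowed down.
Context:
{context}
Question: {question}
Answer with: (1) root cause, (2) supporting evidence, (3) recommended fix."""

prompt = DIAGNOSIS_PROMPT.format(
    context="Stage 3 spilled 40 GB of shuffle data to disk.",
    question="Why is appid 12345 running slowly?",
)
```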
Agent Paradigm
Agents act as routing mechanisms that direct user intents to the appropriate backend (offline or real‑time diagnostics) and maintain state to iteratively refine answers until a confidence threshold is met.
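The observation‑action loop can be sketched with a deterministic stand‑in for the LLM judge; the agent names, observations, and the fixed plan are all illustrative:

```python
def react_loop(question, agents, judge, max_steps=5):
    # ReAct-style loop: the judge (the LLM's role) picks the next agent to
    # consult, accumulates observations, and stops when satisfied.
    observations = []
    for _ in range(max_steps):
        action = judge.next_action(question, observations)
        if action == "finish":
            break
        observations.append((action, agents[action](question, observations)))
    return judge.answer(question, observations)

class ToyJudge:
    # Deterministic stand-in for the LLM: follows a fixed plan, then
    # summarizes. A real judge would reason over the observations.
    def __init__(self, plan):
        self.plan = list(plan)

    def next_action(self, question, observations):
        return self.plan.pop(0) if self.plan else "finish"

    def answer(self, question, observations):
        return "; ".join(f"{name}: {obs}" for name, obs in observations)

agents = {
    "spark": lambda q, obs: "stage 3 shows heavy shuffle spill",
    "host": lambda q, obs: "node-17 disk I/O is saturated",
}
result = react_loop("why is the job slow?", agents, ToyJudge(["spark", "host"]))
```

The `max_steps` bound plays the role of the confidence threshold's backstop: the loop either converges on sufficient evidence or stops after a fixed number of agent calls.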
Technical Implementation
Architecture Design
The assistant is accessed via enterprise WeChat, the data platform UI, and a diagnostic system, offering both consulting (RAG‑based) and diagnosis (agent‑based) capabilities.
Offline Diagnosis
Collect Spark and Kyuubi event logs, extract ~20 diagnostic metrics (data skew, shuffle stalls, ANSI compliance, memory usage, etc.), store them in a meta‑warehouse, and expose them through a real‑time service that feeds the assistant.
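One of those metrics, data skew, can be derived from per‑task durations in the event logs with a simple rule. The threshold‑on‑median rule below is illustrative, not the platform's actual formula:

```python
from statistics import median

def detect_skew(task_durations_ms, ratio=2.0):
    # Flag a stage as skewed when its slowest task runs `ratio` times
    # longer than the median task (illustrative rule).
    med = median(task_durations_ms)
    return med > 0 and max(task_durations_ms) / med >= ratio

detect_skew([1000, 1100, 950, 9000])   # one straggler task -> skewed
detect_skew([1000, 1100, 950, 1050])   # uniform durations -> not skewed
```

Each of the ~20 metrics would be a rule or statistic of this kind, precomputed into the meta‑warehouse so the assistant can cite concrete evidence rather than re‑parsing event logs per question.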
Real‑time Diagnosis
Stream Flink job metrics and errors into ClickHouse, poll them every minute, cache the results, and let the assistant retrieve and explain anomalies on demand.
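The poll‑and‑cache step amounts to a TTL cache in front of the store. In the sketch below, `fetch` stands in for the minute‑interval ClickHouse query, and the metric names are illustrative:

```python
import time

class MetricCache:
    # Serves the latest per-job metrics from memory so the assistant does
    # not hit the backing store on every question; `fetch` stands in for
    # the minute-interval ClickHouse poll.
    def __init__(self, fetch, ttl_s=60.0):
        self.fetch = fetch
        self.ttl_s = ttl_s
        self.data = {}
        self.stamp = None

    def get(self, job_id):
        now = time.monotonic()
        if self.stamp is None or now - self.stamp >= self.ttl_s:
            self.data = self.fetch()  # refresh the whole snapshot
            self.stamp = now
        return self.data.get(job_id)

cache = MetricCache(lambda: {"job-42": {"restarts": 3, "consumer_lag": 120000}})
metrics = cache.get("job-42")
```

A 60‑second TTL matches the one‑minute polling cadence, so the assistant always explains anomalies from data at most one polling interval old without adding query load.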
Target Audience
SRE engineers needing rapid root‑cause analysis of host or network issues.
Component operators (Spark, Flink) troubleshooting resource contention or job failures.
Data‑warehouse developers diagnosing slow ETL SQL jobs.
General internal users seeking guidance on SQL syntax or internal tools.
Challenges and Outlook
Current Challenges
Improving answer precision to avoid missing critical configuration details.
Handling heterogeneous, low‑quality documentation for knowledge‑base ingestion.
Addressing complex, multi‑faceted user queries that require deep causal analysis.
Future Directions
Building multi‑expert systems that chain agents across components (e.g., Spark → HMS → host).
Reducing inference latency by distributing LLM computation.
Enhancing product UX with richer visualizations and interactive feedback loops.
In summary, Bilibili’s intelligent assistant combines a large language model, vector‑based retrieval, and a suite of specialized agents to automate troubleshooting and knowledge sharing across its extensive big‑data ecosystem, while continuously addressing precision, recall, and user‑experience challenges.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.