Bilibili’s AI Assistant: Using Large Language Models to Tackle Massive Data Tasks

This article explains how Bilibili leverages a large‑language‑model‑based intelligent agent to diagnose and resolve failures and slowdowns in its massive big‑data platform, detailing the platform architecture, workload scale, common user issues, and the need for automated assistance.

DataFunTalk

Introduction

This article shares how Bilibili built an intelligent agent assistant, powered by large language models, to help users troubleshoot problems on its big‑data platform.

Architecture overview

Background

Bilibili is a video‑sharing platform that generates massive volumes of data. Its big‑data platform supports many business lines, including AI and commerce. The platform follows a “five‑layer integrated + storage‑compute separation” architecture: a distributed file system at the bottom, an intelligent scheduling layer in the middle, and above them compute engines (Spark, Flink), real‑time streaming (Kafka), and OLAP engines (ClickHouse), together with client tools and custom CI/CD pipelines.

Platform components

Scale and Challenges

Every day the platform processes about 270,000 offline tasks, 20,000 ad‑hoc queries, and 7,000 real‑time jobs. User support volume is also high: users submit thousands of inquiries each week, and each team spends roughly three person‑days per week on troubleshooting.

User Issues

Users mainly encounter two problems with offline jobs: failures and performance degradation.

Why tasks fail

Kernel defects – upgrades without sufficient testing can cause large‑scale failures.

Dependency component issues – bugs or upgrades in shared components propagate failures.

Data quality problems – corrupted or malformed data leads to failures.

Other factors such as memory errors.
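
To make the four failure categories above concrete, here is a minimal sketch of a rule‑based first pass that maps a log excerpt to one of them. The keyword lists, the `categorize` function, and the category names are illustrative assumptions and are not taken from Bilibili's system; a production assistant would likely combine such rules with model‑based analysis.

```python
from enum import Enum


class FailureCategory(Enum):
    KERNEL_DEFECT = "kernel defect"
    DEPENDENCY_ISSUE = "dependency component issue"
    DATA_QUALITY = "data quality problem"
    OTHER = "other"  # includes memory errors, as in the list above


# Hypothetical keyword rules, chosen only for illustration.
RULES = [
    (FailureCategory.DATA_QUALITY, ("malformed", "corrupt", "parse error")),
    (FailureCategory.DEPENDENCY_ISSUE, ("connection refused", "service unavailable", "timeout")),
    (FailureCategory.KERNEL_DEFECT, ("internal error", "assertion failed")),
]


def categorize(log_excerpt: str) -> FailureCategory:
    """Return the first category whose keywords appear in the log excerpt."""
    text = log_excerpt.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return FailureCategory.OTHER
```

Rules are checked in order, so an excerpt that matches several categories is assigned to the first one listed; anything unmatched falls through to `OTHER`.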

Why tasks become slow

Hardware aging – large storage arrays degrade over time, reducing I/O speed.

Resource scheduling pressure – massive user load and mixed‑deployment policies cause contention.

Data skew or distribution problems – uneven data partitions slow processing.
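
Of the slowdown causes listed above, data skew is the easiest to illustrate numerically. The sketch below compares the largest partition with the median partition size; the function names and the threshold of 5.0 are assumptions made for this example, not values from the article.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the median partition size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    med = median(partition_sizes)
    if med == 0:
        # All-empty input is treated as balanced; otherwise skew is unbounded.
        return float("inf") if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / med


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a stage as skewed when the ratio exceeds the (assumed) threshold."""
    return skew_ratio(partition_sizes) > threshold
```

For example, partition sizes of `[100, 100, 100, 1000]` give a ratio of 10.0, so the slowest task would handle about ten times the typical amount of data.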

Need for Intelligent Assistance

Diagnosing these failures and slowdowns is labor‑intensive and time‑consuming, prompting the exploration of AI‑driven tools to automate root‑cause analysis and provide actionable guidance.

Intelligent assistant workflow

User Query Characteristics

Typical queries are concise and engineering‑focused, often consisting of a brief problem description plus a link or screenshot.
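
Since a typical query is a short description plus a link, an assistant could split the two before further analysis. The following is a minimal sketch under that assumption; `ParsedQuery`, `parse_query`, and the example URL are hypothetical, and screenshots are not handled here.

```python
import re
from dataclasses import dataclass

# Simple pattern for http(s) links; an assumption, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class ParsedQuery:
    description: str
    links: list


def parse_query(raw: str) -> ParsedQuery:
    """Separate the free-text problem description from any links in the query."""
    links = URL_PATTERN.findall(raw)
    # Remove the links, then collapse leftover whitespace.
    description = " ".join(URL_PATTERN.sub(" ", raw).split())
    return ParsedQuery(description=description, links=links)
```

A query such as `"Job failed last night https://example.com/task/123 please check"` would yield the description `"Job failed last night please check"` and a single link, which could then be used to look up the task's logs.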

Tags: Large Language Model, AI Operations, Bilibili, big data platform, Intelligent Assistant, task troubleshooting