How Bilibili Leverages Large Language Models to Automate Big Data Task Troubleshooting

This article explains Bilibili's large‑scale data platform architecture, the common offline‑task failures and slowdowns users encounter, and how a large language model‑driven intelligent assistant is being built to automatically diagnose and resolve these engineering problems.

DataFunTalk

Background Introduction

This article shares Bilibili's experience building an intelligent assistant (agent) based on large language models.

1. Overall Architecture and Scale

Bilibili is a video‑sharing platform that generates data at massive scale. Its big‑data platform supports many business lines, including AI and commerce.

The platform follows a “five‑layer integrated, storage‑compute separated” architecture: a distributed file system at the bottom, an intelligent scheduling layer in the middle, and compute engines such as Spark and Flink. These are complemented by client tools, real‑time data streams (Kafka), an OLAP engine (ClickHouse), custom tools, and a CI/CD platform.

2. User Problems

Users of the offline compute system mainly face two issues: why tasks fail and why tasks become slower.

Why tasks fail

Kernel defects, especially after kernel upgrades without sufficient testing.

Dependency component problems; many tasks depend on shared components that may contain bugs or be upgraded underneath them.

Data quality issues that cause failures.

Other reasons such as memory problems.
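
The failure categories above can be turned into a first‑pass triage step. The following is a minimal sketch, not Bilibili's implementation: the function name, the category labels, and every keyword pattern are illustrative assumptions. Kernel defects are deliberately left out of the rules because they rarely show up as a single keyword, so such logs fall through to "unclassified" for deeper analysis.

```python
import re

# Illustrative keyword rules only (assumptions, not the platform's real rules).
# Each entry maps a failure category from the text to a log pattern.
RULES = [
    ("memory", re.compile(r"OutOfMemoryError|exceeding memory limits", re.IGNORECASE)),
    ("dependency", re.compile(r"Connection refused|ClassNotFoundException|NoSuchMethodError", re.IGNORECASE)),
    ("data_quality", re.compile(r"NumberFormatException|malformed|corrupt", re.IGNORECASE)),
]


def classify_failure(log_text: str) -> str:
    """Return the first matching category, or 'unclassified' if no rule matches."""
    for category, pattern in RULES:
        if pattern.search(log_text):
            return category
    return "unclassified"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
```

A rule table like this is cheap to run before any model call, and anything it cannot classify is a natural candidate to hand to the LLM‑based assistant.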

Why tasks become slower

Hardware aging; large‑scale storage disks degrade over time, reducing read/write speed.

Resource scheduling pressure; mixed deployment across departments can cause contention during peak periods.

Data distribution problems, including data skew.
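
Of the slowdown causes above, data skew is the easiest to check numerically. A common heuristic compares the slowest task in a stage with the median task. The sketch below assumes per‑task durations are already available as a list of seconds; the function names and the threshold of 5.0 are hypothetical choices for illustration, not values from the source.

```python
from statistics import median


def skew_ratio(task_seconds):
    """Ratio of the slowest task duration to the median task duration."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    mid = median(task_seconds)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_seconds) / mid


def looks_skewed(task_seconds, threshold=5.0):
    """Flag a stage when one task runs far longer than the typical task.

    The default threshold is an arbitrary illustrative value.
    """
    return skew_ratio(task_seconds) >= threshold


if __name__ == "__main__":
    # One straggler among otherwise similar tasks.
    print(looks_skewed([10, 11, 9, 10, 120]))
```

A high ratio only suggests skew. Slow disks or resource contention on one node can also produce a straggler, so the result is best treated as a hint that narrows the search.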

Diagnosing these failures or slowdowns is complex and time‑consuming, which prompted the team to explore intelligent methods to assist with diagnosis.

User Query Characteristics

Typical user queries are highly engineering‑focused, often consisting of a brief problem description plus a link or screenshot, with little additional context.
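
Because such queries carry so little context, an assistant would likely need to separate the free‑text description from any links before it can look anything up. The following is a minimal sketch under that assumption: the `TroubleshootingQuery` class, the `parse_query` function, and the example URL are all hypothetical, and screenshots are not handled.

```python
import re
from dataclasses import dataclass, field

# Matches http(s) links up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class TroubleshootingQuery:
    description: str
    links: list = field(default_factory=list)

    @property
    def needs_more_context(self) -> bool:
        # Without a link there is nothing to look up, so ask a follow-up question.
        return not self.links


def parse_query(raw: str) -> TroubleshootingQuery:
    """Split a raw user message into its description and the links it contains."""
    links = URL_RE.findall(raw)
    description = " ".join(URL_RE.sub(" ", raw).split())
    return TroubleshootingQuery(description=description, links=links)


if __name__ == "__main__":
    query = parse_query("Why did my task fail? https://example.com/job/123")
    print(query.description, query.links, query.needs_more_context)
```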

Tags: AI Assistant, Bilibili, big data platform, task troubleshooting
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
