How Bilibili Uses LLMs to Diagnose Big Data Platform Issues

This article explains how Bilibili leverages a large‑language‑model‑driven assistant to diagnose and resolve failures and slowdowns in its massive big‑data platform, detailing the platform’s five‑layer architecture, common task issues, and the need for intelligent troubleshooting tools.

DataFunTalk

Background

This article shares Bilibili’s practice of building an intelligent assistant powered by large language models (LLMs) to help troubleshoot its massive big‑data platform.

1. Overall Architecture and Scale

Bilibili is a video‑sharing platform with huge data volumes. Its big‑data platform supports many business lines, including AI and commerce, and follows a five‑layer, storage‑compute‑separated architecture: a distributed file system at the bottom; an intelligent scheduling layer above it; compute engines such as Spark and Flink; real‑time streams (Kafka) and an OLAP engine (ClickHouse); and, at the top, client tools and custom CI/CD pipelines.

The platform runs about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and 7,000 critical real‑time jobs daily. Support teams field thousands of inquiries each week; answering questions about task failures and slowdowns costs each small team roughly three person‑days of ticket work and requires dedicated staff.

2. Users’ Problems

For offline computation, users mainly ask two questions: why a task fails and why it becomes slow.

Why tasks fail

Kernel defects – upgrades without sufficient testing can cause large‑scale failures.

Dependency component issues – bugs or upgrades in dependent services propagate failures.

Data quality problems – corrupted or invalid input data leads to failures.

Other causes, such as memory‑related errors.
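The failure categories above lend themselves to a first‑pass triage before any model is involved. The sketch below is a minimal, hypothetical rule table (the log patterns are illustrative assumptions, not Bilibili's actual rules) that maps common error‑log signatures to the categories listed in this section:

```python
import re

# Hypothetical rule table mapping log patterns to the failure
# categories described above. Patterns are illustrative only.
FAILURE_RULES = [
    (re.compile(r"OutOfMemoryError|killed by YARN for exceeding memory", re.I),
     "memory"),
    (re.compile(r"corrupt|malformed|parse error", re.I),
     "data quality"),
    (re.compile(r"connection refused|read timed out", re.I),
     "dependency component"),
]

def classify_failure(log_text: str) -> str:
    """Return a coarse failure category for a task's error log."""
    for pattern, category in FAILURE_RULES:
        if pattern.search(log_text):
            return category
    return "unknown"  # hand off to a human or the LLM assistant
```

Rules like these only cover the easy, recurring cases; everything that falls through to `"unknown"` is where intelligent assistance earns its keep.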

Why tasks become slow

Hardware aging – disk wear reduces read/write speed over time.

Resource scheduling pressure – massive user load and mixed‑deployment policies cause contention.

Data skew – uneven data distribution or problematic data slows processing.
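Of the slowdown causes above, data skew is the most mechanically detectable: a task is suspect when one partition is far larger than the rest. A minimal sketch, assuming per‑partition row counts are available and using an arbitrary ratio threshold:

```python
from statistics import median

def detect_skew(partition_rows, threshold=10.0):
    """Flag data skew when the largest partition is much bigger than
    the median partition. The 10x threshold is an assumption; tune it
    per workload."""
    med = median(partition_rows)
    return med > 0 and max(partition_rows) / med >= threshold
```

Hardware aging and scheduling pressure need other signals (disk I/O latency trends, queue wait times), which is part of why diagnosis is labor‑intensive.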

Because diagnosing these causes is time‑consuming, Bilibili is exploring LLM‑based methods to assist its engineers.
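The core idea of such assistance is to hand the model the same context a support engineer would gather by hand. The following is a hedged sketch of one plausible shape for that step (the helper name, prompt wording, and field names are assumptions, not Bilibili's implementation):

```python
def build_diagnosis_prompt(task_id, error_log, metrics):
    """Assemble an LLM prompt from a task's id, the tail of its error
    log, and a few runtime metrics. All field names are hypothetical."""
    log_tail = "\n".join(error_log.splitlines()[-20:])
    metric_lines = "\n".join(f"- {k}: {v}" for k, v in metrics.items())
    return (
        "You are a big-data platform support engineer.\n"
        f"Task {task_id} failed or ran slowly. Diagnose the likely cause.\n"
        f"Recent log lines:\n{log_tail}\n"
        f"Runtime metrics:\n{metric_lines}\n"
        "Answer with a root-cause category and a suggested fix."
    )
```

The prompt is then sent to whatever LLM backend the platform uses; the value is in the context assembly, which automates the tedious evidence‑gathering step.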

Typical User Queries

Queries are usually terse, often just a one‑line problem description plus a link or a screenshot, with little of the context needed for diagnosis.
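Terse queries like these still carry some structure worth extracting automatically, such as links to the failing task. A minimal sketch (the function and return fields are assumptions for illustration):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def extract_context(query: str) -> dict:
    """Pull whatever structured hints a terse user query contains:
    embedded links, a screenshot mention, and the remaining free text."""
    return {
        "links": URL_RE.findall(query),
        "has_screenshot": "screenshot" in query.lower(),
        "text": URL_RE.sub("", query).strip(),
    }
```

Extracted links can then be resolved into task IDs and logs, turning a vague question into the rich context the diagnosis step needs.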

This excerpt is taken from the e‑book “A Plain‑spoken Large‑Model Handbook.”

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
