How Bilibili Leverages Large Language Models to Solve Big Data Platform Failures

This article describes the architecture of Bilibili's big data platform, its daily workload of offline and real‑time tasks, the problems users report most often (task failures and slowdowns) and their root causes, and how a large language model assistant is being used to automate troubleshooting.

DataFunSummit

Background Introduction

Bilibili is a video sharing platform that generates massive volumes of data. Its big data platform supports many business lines, including AI and commerce.

1. Overall Architecture and Scale

The platform follows a “five‑layer integrated” plus “storage‑compute separation” architecture. The bottom layer is a distributed file system. The middle layers include an intelligent scheduling layer and compute engines such as Spark and Flink, along with clients, real‑time streaming (Kafka), an OLAP engine (ClickHouse), custom tools, and CI/CD pipelines.

Each day the platform handles roughly 270,000 offline tasks and about 20,000 ad‑hoc queries, alongside 7,000 critical real‑time jobs. The support team receives thousands of tickets per week, and each sub‑team spends about three person‑days answering inquiries.

Architecture diagram

2. User Issues

For offline computation, users mainly ask why tasks fail or become slow.

Why tasks fail

Kernel defects, especially after untested upgrades.

Dependency component bugs; many tasks depend on shared resources that may break after upgrades.

Data quality problems.

Other reasons such as memory issues.
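
As a rough illustration of how the failure categories above might be used for a first automated triage, the sketch below tags a log excerpt with one of the four causes using keyword rules. The rule set, the label names, and the `classify_failure` function are assumptions made for this example; the source does not describe how Bilibili actually maps logs to causes.

```python
import re

# Hypothetical keyword rules. The categories mirror the list above, but the
# patterns are illustrative assumptions, not rules taken from the source.
RULES = [
    ("memory", re.compile(r"OutOfMemoryError|Container killed.*memory", re.I)),
    ("data_quality", re.compile(r"malformed|corrupt|schema mismatch", re.I)),
    ("dependency", re.compile(r"connection refused|service unavailable|timed? ?out", re.I)),
    ("engine_defect", re.compile(r"NullPointerException|internal error", re.I)),
]


def classify_failure(log_text: str) -> str:
    """Return the label of the first rule that matches, or "unknown"."""
    for label, pattern in RULES:
        if pattern.search(log_text):
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
    print(classify_failure("Connection refused: metastore host"))
```

Rules are checked in order, so the first match wins. A real triage step would likely need far richer signals than keywords, which is part of the motivation for a language model assistant.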

Why tasks become slow

Hardware aging; large storage volumes degrade read/write speed over time.

Resource scheduling pressure, especially with mixed deployment across departments.

Data skew or inherent data problems.
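
Of the slowdown causes above, data skew is the easiest to check numerically. The sketch below is one simple heuristic, assuming per-task input sizes for a stage are available: it compares the largest task input with the median. The function names and the threshold of 5.0 are assumptions for illustration, not values from the source.

```python
from statistics import median


def skew_ratio(task_input_bytes):
    """Largest per-task input divided by the median; 1.0 means perfectly even."""
    if not task_input_bytes:
        raise ValueError("need at least one task size")
    med = median(task_input_bytes)
    largest = max(task_input_bytes)
    if med == 0:
        # Median of zero: any non-empty task dominates the stage.
        return float("inf") if largest > 0 else 1.0
    return largest / med


def is_skewed(task_input_bytes, threshold=5.0):
    """Flag a stage as skewed when one task reads far more than the typical task.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return skew_ratio(task_input_bytes) >= threshold


if __name__ == "__main__":
    print(is_skewed([100, 110, 90, 2000]))
    print(is_skewed([100, 100, 100, 100]))
```

The median is used instead of the mean so that the outlier task does not inflate the baseline it is compared against.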

Because diagnosing these issues is time‑consuming, Bilibili is exploring intelligent methods to assist with troubleshooting.

Typical user queries are concise, often just a problem description with a link or screenshot.
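
Since a query is often just a short description plus a link, an assistant would plausibly need to separate the two before looking anything up. The sketch below shows one minimal way to do that. The `parse_user_query` function, its output fields, and the example URL are hypothetical and not taken from the source.

```python
import re

# Simple URL matcher for this sketch; it treats any run of non-space
# characters after http:// or https:// as part of the link.
URL_PATTERN = re.compile(r"https?://\S+")


def parse_user_query(text: str) -> dict:
    """Split a free-text query into its description and any links it contains."""
    links = URL_PATTERN.findall(text)
    # Remove the links, then collapse the leftover whitespace.
    description = " ".join(URL_PATTERN.sub("", text).split())
    return {"description": description, "links": links}


if __name__ == "__main__":
    query = "My task failed https://example.com/job/123 please help"
    print(parse_user_query(query))
```

Screenshots, which the text also mentions, are out of scope here; handling them would need a separate image or OCR step.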

User query example
Tags: Large Language Model, AI assistance, Bilibili, task troubleshooting
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
