How Bilibili Uses Large Language Models to Tackle Big Data Platform Issues

This article explains how Bilibili, a video platform with massive data volumes, uses a large‑language‑model‑driven assistant to diagnose and resolve offline task failures and slowdowns within its five‑layer, storage‑compute‑separated big data architecture, improving operational efficiency for the thousands of support queries it handles.

DataFunSummit

Background Introduction

Bilibili is a video‑sharing platform with massive data volumes, and its big data platform supports many critical business lines such as AI and commerce.

The platform follows a “five‑layer integrated” plus “storage‑compute separation” architecture: a distributed file system at the bottom, an intelligent scheduling layer, various compute engines (Spark, Flink), client tools, real‑time streams (Kafka), an OLAP engine (ClickHouse), and custom CI/CD tools.

Task volume is huge: about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and 7,000 important real‑time jobs run each day. The support team fields thousands of user questions weekly, and each sub‑team spends about three person‑days per week on troubleshooting, with dedicated staff assigned to task failures and slowdowns.

Architecture overview

1. Overall Architecture and Scale

The platform’s architecture is described above, emphasizing the layered design and the massive scale of daily computations.

2. User Problems

Users mainly encounter two issues with offline computation: task failures and task slowdowns.

Why tasks fail

Kernel defects – upgrades without sufficient testing can cause large‑scale failures.

Dependency component issues – bugs or upgrades in dependent services propagate failures.

Data quality problems – corrupted or invalid input data leads to failures.

Other factors, such as memory constraints, also cause failures.
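As a rough illustration of how failure causes like these might be triaged automatically before (or alongside) an LLM, the sketch below maps error‑log text to the categories above. The patterns and category names are assumptions for demonstration, not Bilibili's actual rules:

```python
import re

# Illustrative mapping from error-log patterns to the failure causes
# described above; these regexes are assumptions, not Bilibili's rules.
FAILURE_PATTERNS = [
    (r"OutOfMemoryError|Container killed .* memory", "memory constraints"),
    (r"FileNotFoundException|corrupt(ed)? (block|record)", "data quality problem"),
    (r"Connection refused|Service unavailable", "dependency component issue"),
]

def classify_failure(log_text: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for pattern, category in FAILURE_PATTERNS:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return category
    return "unknown"

print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
# "memory constraints"
```

A real system would feed unmatched or ambiguous logs to the LLM assistant rather than stopping at "unknown".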

Why tasks become slow

Hardware aging – large storage volumes cause disk read/write speed degradation over time.

Resource scheduling pressure – high user load and mixed‑deployment mechanisms cause contention.

Data distribution issues – data skew or problematic data sets slow processing.
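Data skew, the last cause above, is often detectable from per‑task runtimes within a stage: one straggler task runs far longer than its peers. A minimal sketch, with an illustrative threshold that is purely an assumption:

```python
# Minimal data-skew check over per-task durations (seconds) in a stage.
# The 5x ratio threshold is illustrative, not a production value.
def is_skewed(task_durations: list[float], ratio: float = 5.0) -> bool:
    """Flag a stage as skewed when the slowest task takes far longer
    than the median task."""
    if not task_durations:
        return False
    ordered = sorted(task_durations)
    median = ordered[len(ordered) // 2]
    return median > 0 and max(ordered) / median >= ratio

print(is_skewed([10, 11, 9, 12, 95]))  # True: one straggler task
```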

Because diagnosing these causes is time‑consuming, Bilibili explores intelligent, LLM‑based assistance to help users quickly identify and resolve failures or slowdowns.
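In outline, such an assistant gathers task context, builds a prompt listing the known cause categories, and asks a model for a ranked diagnosis. The sketch below shows that flow; the prompt wording and the `call_llm` stub are hypothetical stand‑ins, since the talk does not disclose Bilibili's actual implementation:

```python
# Hypothetical sketch of an LLM-assisted diagnosis flow. `call_llm` is
# a stub standing in for whatever model endpoint a real deployment uses.

def call_llm(prompt: str) -> str:
    # Placeholder: a real deployment would call an LLM service here.
    return "Likely cause: data skew in stage 3; suggest repartitioning."

def diagnose(task_id: str, status: str, log_tail: str) -> str:
    """Build a diagnosis prompt from task context and query the model."""
    prompt = (
        "You are a big-data support assistant.\n"
        f"Task {task_id} is {status}.\n"
        f"Recent log lines:\n{log_tail}\n"
        "Identify the most likely cause (kernel defect, dependency issue, "
        "data quality, memory, hardware aging, scheduling pressure, or "
        "data skew) and suggest a fix."
    )
    return call_llm(prompt)

print(diagnose("job_20240101_123", "slow", "stage 3: task 47 running 40 min"))
```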

Practical Insight

User queries are typically concise, often just a problem description with a link or screenshot, making automated assistance especially valuable.

Task failure illustration
User query example
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
