How Bilibili Uses Large Language Models to Solve Big Data Platform Issues

This article explains Bilibili's massive data platform architecture, the common offline‑task failures and slowdowns users encounter, and how the company applies a large‑language‑model‑driven intelligent assistant to diagnose and resolve these engineering problems efficiently.

DataFunTalk

Introduction

This article shares Bilibili's practice of building an LLM-based intelligent assistant to diagnose and resolve problems on its big-data platform.

Background

Bilibili is a video‑sharing platform with massive data volumes. Its big‑data platform supports many business lines, including AI and commerce.

Overall Architecture and Scale

The platform follows a “five-layer integrated” plus “storage-compute separation” design. The bottom layer is a distributed file system; the middle layer includes an intelligent scheduler and compute engines such as Spark and Flink, along with clients, real-time streams (Kafka), OLAP engines (ClickHouse), and in-house tools and CI/CD pipelines.

Every day the platform processes about 270,000 offline tasks, around 20,000 ad-hoc queries, and roughly 7,000 critical real-time jobs. The support team receives thousands of inquiries weekly; each sub-team spends roughly three person-days per week on tickets, requiring dedicated staff to address task failures and slowdowns.

User Problems

For offline computation, users mainly ask two questions: why a task failed and why it became slow.

Why Tasks Fail

Kernel defects – upgrades without sufficient testing can cause large‑scale failures.

Dependency issues – component upgrades or bugs in shared resources can break dependent tasks.

Data quality problems – corrupted or malformed data leads to failures.

Other causes – miscellaneous issues such as memory errors.
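As a concrete illustration of how such failure categories could be triaged automatically, below is a minimal, hypothetical rule-based first pass over error logs. The patterns and category names are illustrative assumptions, not Bilibili's actual diagnosis rules.

```python
import re

# Hypothetical log patterns mapped to the failure categories above.
# These regexes are illustrative, not Bilibili's real rules.
FAILURE_RULES = [
    ("data quality", re.compile(r"malformed|corrupt|parse error", re.IGNORECASE)),
    ("memory", re.compile(r"outofmemory|memory limit", re.IGNORECASE)),
    ("dependency", re.compile(r"classnotfound|nosuchmethod|connection refused", re.IGNORECASE)),
    ("kernel defect", re.compile(r"assertion failed|internal error", re.IGNORECASE)),
]

def classify_failure(log_text: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for category, pattern in FAILURE_RULES:
        if pattern.search(log_text):
            return category
    return "unknown"
```

A pass like this can handle the long tail of well-known errors cheaply, leaving only ambiguous cases for deeper (e.g. LLM-driven) analysis.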

Why Tasks Slow Down

Hardware aging – massive storage disks wear out, reducing read/write speed.

Resource scheduling pressure – high user volume and mixed deployment cause contention.

Data skew – uneven data distribution or problematic data slows processing.
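Of these slowdown causes, data skew is the most amenable to automated detection. The sketch below flags a skewed stage from per-partition row counts (as could be pulled from engine task metrics); the median-ratio heuristic loosely mirrors Spark AQE's `skewedPartitionFactor` idea, and the threshold is an assumption for illustration.

```python
import statistics

def detect_skew(partition_rows: list[int], factor: float = 5.0) -> bool:
    """Flag skew when the largest partition exceeds the median by `factor`.

    Loosely inspired by Spark AQE's skewed-partition heuristic; the
    default factor here is illustrative.
    """
    if not partition_rows:
        return False
    med = statistics.median(partition_rows)
    return med > 0 and max(partition_rows) / med >= factor
```

For example, `detect_skew([100, 120, 95, 2000])` reports skew because one partition holds roughly 18x the median row count, while `detect_skew([100, 120, 95, 110])` does not.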

Diagnosing these causes manually is labor-intensive, which prompted the team to explore intelligent methods to assist.

Need for Intelligent Assistance

Users typically submit concise, engineering‑focused questions, often just a problem description with a link or screenshot, lacking detailed context.
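Because the user's message rarely carries enough context on its own, a diagnostic assistant must first enrich it, e.g. by pulling identifiers out of pasted links and fetching the corresponding logs and metrics. A tiny sketch of that first step is below; the URL shape is a hypothetical example, not Bilibili's real scheduler URL.

```python
import re
from typing import Optional

def extract_task_id(user_message: str) -> Optional[str]:
    """Pull a task ID out of a pasted platform link.

    Assumes a hypothetical '/tasks/<id>' URL shape for illustration.
    """
    match = re.search(r"/tasks/(\d+)", user_message)
    return match.group(1) if match else None
```

With the task ID recovered, the assistant can gather the structured context (logs, metrics, lineage) the user's terse question omits before any LLM reasoning happens.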

Tags: Large Language Model · AI assistance · Bilibili · big data platform · task failure analysis
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
