How Bilibili Uses LLM‑Powered Assistants to Tackle Big‑Data Task Failures

Bilibili’s big‑data platform uses a five‑layer architecture with storage‑compute separation and runs hundreds of thousands of tasks every day. The team is now exploring large‑language‑model assistants to help diagnose frequent task failures and performance slowdowns.

DataFunSummit

Background

Bilibili is a video‑sharing platform with massive data; its big‑data platform supports AI, commerce and other critical services.

Overall Architecture and Scale

The platform follows a “five‑layer integrated” plus “storage‑compute separation” design: a distributed file system at the bottom, an intelligent scheduling layer, compute engines such as Spark and Flink, client tools, real‑time streams (Kafka), an OLAP engine (ClickHouse), and custom tools and CI/CD pipelines.

It processes roughly 270,000 offline tasks daily, about 20,000 ad‑hoc queries, and 7,000 critical real‑time jobs. The support team receives thousands of tickets each week, with each sub‑team handling about three person‑days of inquiries.

User Problems

Users mainly ask why offline tasks fail or become slower.

Why tasks fail

Kernel defects, especially after untested upgrades.

Issues in dependent components: bugs or upgrades in shared resources cause cascading failures.

Data‑quality problems.

Other factors such as memory limits.
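
As an illustration of how the four failure categories above could serve as a first triage step, here is a minimal sketch of a keyword-based classifier. It is not Bilibili's method: the category names, the `RULES` table, and every keyword in it are hypothetical placeholders, and real rules would have to be derived from the platform's actual logs.

```python
from enum import Enum


class FailureCause(Enum):
    KERNEL_DEFECT = "kernel defect"
    DEPENDENCY_ISSUE = "dependent component issue"
    DATA_QUALITY = "data-quality problem"
    RESOURCE_LIMIT = "resource limit (e.g., memory)"
    UNKNOWN = "unknown"


# Hypothetical keyword rules, checked in order. These strings are
# illustrative assumptions, not messages taken from the source article.
RULES = [
    (FailureCause.RESOURCE_LIMIT, ("outofmemoryerror", "memory limit")),
    (FailureCause.DATA_QUALITY, ("malformed record", "schema mismatch")),
    (FailureCause.DEPENDENCY_ISSUE, ("connection refused", "service unavailable")),
    (FailureCause.KERNEL_DEFECT, ("internal error", "assertion failed")),
]


def classify_failure(log_text: str) -> FailureCause:
    """Return the first category whose keywords appear in the log text."""
    text = log_text.lower()
    for cause, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return cause
    return FailureCause.UNKNOWN
```

A rule table like this only covers known patterns; anything that falls through to `UNKNOWN` is the kind of case that would still need a human or an LLM-based assistant.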

Why tasks become slower

Hardware aging, e.g., disk wear affecting read/write speed.

Resource scheduling pressure under massive load and mixed‑deployment policies.

Data skew or problematic data distribution.
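
Of the slowdown causes above, data skew is the easiest to check numerically. The sketch below shows one common heuristic, comparing the largest partition to the median partition. The function names and the default threshold of 5.0 are assumptions chosen for illustration, not values from the talk.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the median size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    mid = median(partition_sizes)
    if mid == 0:
        # Avoid division by zero: any non-empty partition counts as extreme skew.
        return float("inf") if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / mid


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a stage as skewed when the ratio reaches the (assumed) threshold."""
    return skew_ratio(partition_sizes) >= threshold
```

For example, `skew_ratio([100, 100, 100, 1000])` is 10.0, so that distribution is flagged under the default threshold, while evenly sized partitions give a ratio of 1.0.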

Diagnosing these causes is labor‑intensive, prompting the exploration of intelligent, LLM‑driven assistance.

Nature of User Queries

Queries are typically terse, often just a problem description with a link or screenshot.
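
Because such queries carry little text, an assistant would probably need to separate the description from any attached link before looking up task details. The following is a minimal sketch of that preprocessing step. `ParsedQuery` and `parse_user_query` are hypothetical names, and the article does not describe how Bilibili actually handles this.

```python
import re
from dataclasses import dataclass, field

# Simple URL matcher: a scheme followed by non-whitespace characters.
# Trailing punctuation attached to a link would be captured as well.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class ParsedQuery:
    description: str
    links: list = field(default_factory=list)


def parse_user_query(message: str) -> ParsedQuery:
    """Split a terse user message into free-text description and links."""
    links = URL_PATTERN.findall(message)
    without_links = URL_PATTERN.sub("", message)
    # Collapse the whitespace left behind after removing the links.
    description = " ".join(without_links.split())
    return ParsedQuery(description=description, links=links)
```

Screenshots are out of scope for this sketch; handling them would require an image-understanding step that the source does not detail.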

[Figure: Architecture diagram]
[Figure: System components diagram]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
