How Bilibili Leverages Large Language Models to Automate Big Data Operations

This article explores Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant that helps troubleshoot massive offline and real‑time data processing tasks, detailing the platform’s five‑layer architecture, common failure causes, and how AI can streamline issue resolution.

DataFunTalk

Introduction

This article shares Bilibili’s practice of building an intelligent assistant based on large language models.

Background

Bilibili is a video‑sharing platform that generates massive volumes of data. Its big‑data platform supports many business lines, including AI and commerce.

The platform follows a “five‑layer integrated” architecture with separate storage and compute. Its components include a distributed file system at the bottom, an intelligent scheduling layer, compute engines such as Spark and Flink, client tools, real‑time streams (Kafka), an OLAP engine (ClickHouse), and custom CI/CD tools.

The daily workload includes 270,000 offline tasks, about 20,000 ad‑hoc queries, and roughly 7,000 critical real‑time jobs. The support team handles thousands of inquiries every week, and each sub‑team spends about three person‑days per week troubleshooting failed or slow tasks.

User Issues

Users mainly ask two questions about offline jobs: why a task fails and why it becomes slow.

Why tasks fail

Kernel defects, especially after untested upgrades.

Problems in dependent components; bugs or upgrades in shared resources can cause failures.

Data quality issues.

Other reasons such as memory problems.
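As a rough illustration of how the failure categories above might be used for a first triage pass, the sketch below matches log text against keyword patterns. The category names follow the list above; the patterns, the `FAILURE_PATTERNS` table, and the `classify_failure` function are hypothetical and not taken from Bilibili's system. Real diagnosis would need far more context than a keyword match.

```python
import re

# Hypothetical keyword patterns per failure category. The categories mirror
# the list in the article; the patterns themselves are illustrative
# assumptions, not rules from the article.
FAILURE_PATTERNS = {
    "memory": [r"OutOfMemoryError", r"exceeding memory limits"],
    "data_quality": [r"NumberFormatException", r"malformed record"],
    "dependency": [r"Connection refused", r"FileNotFoundException"],
    "kernel_defect": [r"IllegalStateException", r"AssertionError"],
}


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern appears in the log text."""
    for category, patterns in FAILURE_PATTERNS.items():
        if any(re.search(p, log_text, re.IGNORECASE) for p in patterns):
            return category
    # No pattern matched: leave the case for a human or a deeper analysis.
    return "unknown"
```

A keyword table like this is easy to audit and extend, which is why it is a common starting point before handing ambiguous cases to a language model.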

Why tasks become slow

Hardware aging, e.g., disk wear affecting read/write speed.

Resource scheduling pressure and cross‑department resource shuffling.

Data skew or inherent data problems.
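Of the slowdown causes above, data skew is the easiest to check mechanically. One common heuristic, shown here as a minimal sketch, compares the slowest task in a stage with the median task. The function names and the threshold of 5.0 are assumptions chosen for illustration; the article does not describe how Bilibili detects skew.

```python
from statistics import median


def skew_ratio(durations_s):
    """Return the slowest task duration divided by the median duration."""
    if not durations_s:
        raise ValueError("durations_s must not be empty")
    mid = median(durations_s)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(durations_s) / mid


def looks_skewed(durations_s, threshold=5.0):
    """Flag a stage when one task runs far longer than the typical task.

    The default threshold is an illustrative assumption, not a tuned value.
    """
    return skew_ratio(durations_s) >= threshold
```

For example, a stage with task durations of 10, 12, 11, 9, and 120 seconds has a median of 11 seconds, so the ratio is about 10.9 and the stage would be flagged.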

Because diagnosing these causes is time‑consuming, an intelligent assistant is needed.

Nature of User Queries

Queries are typically terse, often just a problem description with a link or a screenshot.
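Because a terse query carries so little text, an assistant would likely need to separate the description from any links before it can fetch more context. The sketch below shows one way to do that split. The `ParsedQuery` type, the `parse_query` function, and the example URL are hypothetical; screenshots are not handled here.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches http(s) links up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class ParsedQuery:
    description: str   # the user's free-text problem statement
    links: List[str]   # any URLs found in the message


def parse_query(text: str) -> ParsedQuery:
    """Split a raw user message into its description and its links."""
    links = URL_RE.findall(text)
    # Remove the links, then collapse the leftover whitespace.
    description = " ".join(URL_RE.sub(" ", text).split())
    return ParsedQuery(description=description, links=links)
```

Calling `parse_query("task failed, please check https://example.com/job/123")` yields the description `task failed, please check` and a single link, which a later step could resolve to job metadata and logs.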
