How Bilibili Uses Large Language Models to Build an Intelligent Assistant

This article explains Bilibili's large‑language‑model‑based intelligent assistant, describing the platform's five‑layer architecture, massive daily task load, common failure and slowdown causes, and the need for AI‑driven troubleshooting to improve reliability and performance.

DataFunTalk

Introduction

This article shares Bilibili's practice of building an intelligent agent assistant based on large language models.

Background

Bilibili is a video-sharing platform that generates massive amounts of data. Its big-data platform supports many business lines, such as AI and commerce.

Architecture diagram

1. Overall Architecture and Scale

The platform combines a “five‑layer integrated” architecture with separation of storage and compute. At the bottom is a distributed file system. Above it sits an intelligent scheduling layer, followed by compute engines such as Spark and Flink. Clients, real‑time streams (Kafka), an OLAP engine (ClickHouse), custom tools, and CI/CD platforms complete the stack.

The daily workload is large: 270,000 offline tasks, about 20,000 ad‑hoc queries, and roughly 7,000 critical real‑time jobs. The support team receives thousands of inquiries every week, and each small team spends about three person‑days handling tickets, so dedicated staff are needed to answer questions about task failures and slowdowns.

2. Users' Problems

For offline computation, users mainly ask why tasks fail and why they become slow.

Why tasks fail

Kernel defects, especially after untested kernel upgrades.

Issues in dependent components; upgrades or bugs in shared resources cause failures.

Data quality problems; corrupted or invalid input data leads to failure.

Other reasons such as memory errors.
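To make the categories above concrete, here is a minimal sketch of a keyword-based first pass over a failed task's log text. It is not Bilibili's method, and the category names, the regular expressions, and the function name classify_failure are all hypothetical, chosen only for illustration. Kernel defects are left out because they rarely reduce to a single keyword.

```python
import re

# Hypothetical rules: each pairs a failure category from the list above
# with a regular expression. The patterns are illustrative assumptions,
# not real log signatures taken from Bilibili's platform.
RULES = [
    ("memory error", r"OutOfMemoryError|out of memory"),
    ("data quality", r"corrupt|malformed|invalid (record|input)"),
    ("dependent component", r"connection refused|service unavailable"),
]


def classify_failure(log_text):
    """Return the first matching category, or "unknown" if none match."""
    for category, pattern in RULES:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
```

A fixed rule list like this only catches failures that someone has already seen and named, which is one reason a more flexible, model-assisted approach is attractive at this scale.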

Why tasks become slow

Hardware aging; large storage fleets experience degraded read/write speed over time.

Resource scheduling pressure; massive user volume and mixed‑deployment mechanisms cause contention.

Data skew or problematic data distribution.
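Of the causes above, data skew is the easiest to illustrate numerically. The sketch below uses a common heuristic: compare the largest partition with the median partition. The function names skew_ratio and is_skewed and the default threshold of 5.0 are assumptions for illustration, not values from Bilibili's platform.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the median size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    mid = median(partition_sizes)
    largest = max(partition_sizes)
    if mid == 0:
        # A zero median with a non-empty partition is extreme skew.
        return float("inf") if largest > 0 else 1.0
    return largest / mid


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a task when one partition is far larger than the typical one.

    The threshold is a hypothetical default and would need tuning.
    """
    return skew_ratio(partition_sizes) >= threshold


if __name__ == "__main__":
    sizes = [100, 100, 100, 1000]  # made-up record counts per partition
    print(skew_ratio(sizes), is_skewed(sizes))
```

The median is used instead of the mean because a single oversized partition inflates the mean and hides the very imbalance the check is looking for.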

Because diagnosing these causes is time‑consuming, Bilibili explores intelligent methods to assist in troubleshooting.

Assistant workflow diagram