Bilibili's In-House Role-Playing Large Language Model: Architecture, Training Stages, Evaluation, and Demonstrations
Bilibili's in-house role-playing large language model is built on the Index architecture and refined through pre-training, supervised fine-tuning, and preference optimization (both PPO and DPO were explored). It achieved the top overall score on the Chinese CharacterEval benchmark, surpassing competing products, and incorporates safety alignment; the post closes with examples of consistent, personality-driven dialogue.
In recent years, rapid advances in large‑model algorithms and computing power have brought unprecedented attention to general artificial intelligence technologies, spawning a wide range of application scenarios. Among them, role‑playing AI has become a hot field, with many companies launching dialogue products that showcase their AIGC capabilities. Bilibili (B‑Station) has built a role‑playing model on top of its Index large model.
Evaluation of the Role‑Playing Model
The model was assessed with CharacterEval, a Chinese-scenario benchmark containing 77 character profiles extracted from novels and films and 1,785 dialogues. The benchmark evaluates three major aspects (dialogue ability, character consistency, and role-playing attractiveness) across 12 fine-grained dimensions. Index-70B achieved the highest overall score and ranked first in 7 of the 12 sub-dimensions, outperforming competing products such as CharacterYuyan, Minimax, and Baichuan. The open-source Index-1.9B likewise outperformed other models of similar scale.
Technical Overview
The development pipeline consists of three stages: Pre‑Training (PT), Supervised Fine‑Tuning (SFT), and Preference Optimization (PO).
Pre‑Training
Bilibili’s Index base model is continuously refined from years of internal research. During PT, the model learns from massive corpora that include publicly available books, encyclopedias, papers, STEM data, and a large volume of user‑generated dialogues, especially from the anime and entertainment domains. Data cleaning employs heuristic rules and classifier‑based filtering.
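The heuristic-plus-classifier cleaning described above can be sketched as a two-pass filter. The rules, thresholds, and the `quality_score` stand-in below are illustrative assumptions, not Bilibili's actual pipeline:

```python
import re

def passes_heuristics(doc: str) -> bool:
    """First pass: cheap rule-based filters (illustrative thresholds)."""
    if len(doc) < 50:                      # too short to be useful
        return False
    alnum = sum(ch.isalnum() for ch in doc)
    if alnum / len(doc) < 0.6:             # mostly symbols/noise
        return False
    if re.search(r"(.)\1{9,}", doc):       # long runs of one repeated char
        return False
    return True

def quality_score(doc: str) -> float:
    """Stand-in for a trained quality classifier; a real pipeline would
    score documents with a fastText- or BERT-style model."""
    return 1.0 if len(doc) > 200 else 0.4

def clean_corpus(docs, threshold=0.5):
    """Second pass: keep documents that survive the rules AND score well."""
    return [d for d in docs if passes_heuristics(d) and quality_score(d) >= threshold]
```

In practice the two passes are ordered this way because the rule checks are orders of magnitude cheaper than running a classifier over every document.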
Supervised Fine‑Tuning (SFT)
SFT aligns the generic model to the specific role‑playing task. High‑quality role‑description and role‑dialogue data are constructed. Role descriptions cover attributes such as gender, age, height, nickname, personality, background, speaking style, catchphrases, etc. Role dialogues capture language behavior that reflects personality, preferences, dialect, and stylistic quirks. Example role description for a character named “萌萌酱” and a sample dialogue are provided in the source.
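Structurally, each SFT sample pairs a role card with a dialogue turn. The field names and prompt template below are a hypothetical sketch of that format, not the exact schema used in the post:

```python
# Hypothetical role card: fields mirror the attributes listed above.
role_card = {
    "name": "萌萌酱",
    "gender": "female",
    "age": 16,
    "personality": "cheerful, a little clingy",
    "speaking_style": "casual, sprinkles in catchphrases",
    "catchphrase": "喵~",
}

def build_sft_prompt(card: dict, user_turn: str) -> str:
    """Flatten the role card into a system prompt, then append the user turn."""
    profile = "\n".join(f"{k}: {v}" for k, v in card.items())
    return (
        "You are role-playing the following character. Stay in character.\n"
        f"{profile}\n\nUser: {user_turn}\nCharacter:"
    )

prompt = build_sft_prompt(role_card, "今天过得怎么样？")
```

The target side of each SFT pair would then be a reply written in the character's voice, so the model learns to condition its style on the card.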
Preference Optimization (PO)
After SFT, the model is further refined with preference-learning methods. Both Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are explored. PPO involves four models (Actor, Critic, Reward, Reference) and therefore requires roughly four times the computational resources; DPO learns directly from human-ranked response pairs, cutting resource consumption while still improving alignment with human preferences.
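DPO's objective can be written directly over the log-probabilities a policy and a frozen reference model assign to the chosen and rejected responses of a pair. A minimal numeric sketch, with scalar log-probs standing in for summed sequence log-probs:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)]).

    Each argument is the summed log-probability of a full response under
    the policy or the reference; beta controls how far the policy may
    drift from the reference.
    """
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(logits)) computed stably as softplus(-logits)
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits
```

When the policy prefers the chosen response more strongly than the reference does, `logits` is positive and the loss falls below log 2; at initialization (policy equals reference) the loss is exactly log 2, since no separate reward model is involved.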
Safety and Alignment
Before deployment, content‑safety risks are considered. The model is taught to refuse disallowed queries and to follow human values, leveraging the SFT + DPO pipeline for alignment.
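One common way to fold safety into the same SFT + DPO pipeline is to build preference pairs in which a refusal is the chosen response for a disallowed query. The data format and refusal text below are a hypothetical illustration:

```python
def make_safety_pair(unsafe_query: str, unsafe_answer: str) -> dict:
    """Build a DPO preference pair that rewards refusal on a disallowed
    query (hypothetical data format)."""
    refusal = "抱歉，这个问题我不能回答。"  # "Sorry, I can't answer that."
    return {
        "prompt": unsafe_query,
        "chosen": refusal,          # preferred: polite refusal
        "rejected": unsafe_answer,  # dispreferred: compliant answer
    }

pair = make_safety_pair("如何绕过内容审核？", "你可以这样做")
```

Pairs like this are then mixed into the ordinary preference data, so safety behavior is learned by the same optimization step rather than a separate system.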
Framework Diagram
[Figure: the overall training framework, from pre-training through SFT to preference optimization.]
Dialogue Demonstration
An example character profile (三三, a 14‑year‑old Bilibili mascot) is shown, illustrating the model’s ability to generate consistent, personality‑driven responses.
Outlook
The in‑house role‑playing model has achieved strong benchmark results and is being explored in internal business scenarios. Future work aims to further strengthen model capabilities, expand data sources, and collaborate with external partners.
References
PPO vs DPO alignment discussion: https://mp.weixin.qq.com/s/nQXSkMeUhFTob9GKTD4_lA
NetEase Fuxi "Yisheng Zhuxiang" multimodal model, language component: https://zhuanlan.zhihu.com/p/690626399
CharacterEval paper: https://arxiv.org/abs/2401.01275
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.