Bilibili's In-House Role-Playing Large Language Model: Architecture, Training Stages, Evaluation, and Demonstrations
Bilibili's in-house role-playing large language model is built on the Index architecture and refined through pre-training, supervised fine-tuning, and preference optimization (both PPO and DPO were explored). It achieved the top overall score on the Chinese CharacterEval benchmark, surpassing competing products, and incorporates safety alignment; the post closes with examples of consistent, personality-driven dialogue.
In recent years, rapid advances in large‑model algorithms and computing power have brought unprecedented attention to general artificial intelligence technologies, spawning a wide range of application scenarios. Among them, role‑playing AI has become a hot field, with many companies launching dialogue products that showcase their AIGC capabilities. Bilibili (B‑Station) has built a role‑playing model on top of its Index large model.
Evaluation of the Role‑Playing Model
The model was assessed with CharacterEval, a Chinese-scenario benchmark containing 77 character profiles extracted from novels and films and 1,785 dialogues. The benchmark evaluates three major aspects (dialogue ability, character consistency, and role-playing attractiveness) across 12 fine-grained dimensions. Index-70B achieved the highest overall score and ranked first in 7 of the 12 sub-dimensions, outperforming competing products such as CharacterYuyan, Minimax, and Baichuan. The open-source Index-1.9B likewise outperformed other models of similar scale.
Technical Overview
The development pipeline consists of three stages: Pre‑Training (PT), Supervised Fine‑Tuning (SFT), and Preference Optimization (PO).
Pre‑Training
Bilibili’s Index base model is continuously refined from years of internal research. During PT, the model learns from massive corpora that include publicly available books, encyclopedias, papers, STEM data, and a large volume of user‑generated dialogues, especially from the anime and entertainment domains. Data cleaning employs heuristic rules and classifier‑based filtering.
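The heuristic-plus-classifier cleaning described above can be sketched as a two-pass filter. The rules, thresholds, and the `quality_score` stand-in below are illustrative assumptions, not Bilibili's actual pipeline:

```python
import re

def passes_heuristics(doc: str) -> bool:
    """First pass: cheap rule-based filters (illustrative thresholds)."""
    if len(doc) < 50:                      # too short to be useful
        return False
    alnum = sum(ch.isalnum() for ch in doc)
    if alnum / len(doc) < 0.6:             # mostly symbols/noise
        return False
    if re.search(r"(.)\1{9,}", doc):       # long runs of one repeated char
        return False
    return True

def quality_score(doc: str) -> float:
    """Stand-in for a trained quality classifier; a real pipeline would
    score documents with a fastText- or BERT-style model."""
    return 1.0 if len(doc) > 200 else 0.4

def clean_corpus(docs, threshold=0.5):
    """Second pass: keep documents that survive the rules AND score well."""
    return [d for d in docs if passes_heuristics(d) and quality_score(d) >= threshold]
```

In practice the two passes are ordered this way because the rule checks are orders of magnitude cheaper than running a classifier over every document.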
Supervised Fine‑Tuning (SFT)
SFT aligns the generic model to the specific role‑playing task. High‑quality role‑description and role‑dialogue data are constructed. Role descriptions cover attributes such as gender, age, height, nickname, personality, background, speaking style, catchphrases, etc. Role dialogues capture language behavior that reflects personality, preferences, dialect, and stylistic quirks. Example role description for a character named “萌萌酱” and a sample dialogue are provided in the source.
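Structurally, each SFT sample pairs a role card with a dialogue turn. The field names and prompt template below are a hypothetical sketch of that format, not the exact schema used in the post:

```python
# Hypothetical role card: fields mirror the attributes listed above.
role_card = {
    "name": "萌萌酱",
    "gender": "female",
    "age": 16,
    "personality": "cheerful, a little clingy",
    "speaking_style": "casual, sprinkles in catchphrases",
    "catchphrase": "喵~",
}

def build_sft_prompt(card: dict, user_turn: str) -> str:
    """Flatten the role card into a system prompt, then append the user turn."""
    profile = "\n".join(f"{k}: {v}" for k, v in card.items())
    return (
        "You are role-playing the following character. Stay in character.\n"
        f"{profile}\n\nUser: {user_turn}\nCharacter:"
    )

prompt = build_sft_prompt(role_card, "今天过得怎么样？")
```

The target side of each SFT pair would then be a reply written in the character's voice, so the model learns to condition its style on the card.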
Preference Optimization (PO)
After SFT, the model is further refined with preference-learning methods. Both Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are explored. PPO involves four models (Actor, Critic, Reward, Reference) and therefore requires roughly four times the computational resources; DPO learns directly from human-ranked response pairs, cutting resource consumption while still improving alignment with human preferences.
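DPO's objective can be written directly over the log-probabilities a policy and a frozen reference model assign to the chosen and rejected responses of a pair. A minimal numeric sketch, with scalar log-probs standing in for summed sequence log-probs:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)]).

    Each argument is the summed log-probability of a full response under
    the policy or the reference; beta controls how far the policy may
    drift from the reference.
    """
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(logits)) computed stably as softplus(-logits)
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits
```

When the policy prefers the chosen response more strongly than the reference does, `logits` is positive and the loss falls below log 2; at initialization (policy equals reference) the loss is exactly log 2, since no separate reward model is involved.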
Safety and Alignment
Before deployment, content‑safety risks are considered. The model is taught to refuse disallowed queries and to follow human values, leveraging the SFT + DPO pipeline for alignment.
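One common way to fold safety into the same SFT + DPO pipeline is to build preference pairs in which a refusal is the chosen response for a disallowed query. The data format and refusal text below are a hypothetical illustration:

```python
def make_safety_pair(unsafe_query: str, unsafe_answer: str) -> dict:
    """Build a DPO preference pair that rewards refusal on a disallowed
    query (hypothetical data format)."""
    refusal = "抱歉，这个问题我不能回答。"  # "Sorry, I can't answer that."
    return {
        "prompt": unsafe_query,
        "chosen": refusal,          # preferred: polite refusal
        "rejected": unsafe_answer,  # dispreferred: compliant answer
    }

pair = make_safety_pair("如何绕过内容审核？", "你可以这样做")
```

Pairs like this are then mixed into the ordinary preference data, so safety behavior is learned by the same optimization step rather than a separate system.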
Framework Diagram
[Figure: the overall training framework, from pre-training through SFT to preference optimization.]
Dialogue Demonstration
An example character profile (三三, a 14‑year‑old Bilibili mascot) is shown, illustrating the model’s ability to generate consistent, personality‑driven responses.
Outlook
The in‑house role‑playing model has achieved strong benchmark results and is being explored in internal business scenarios. Future work aims to further strengthen model capabilities, expand data sources, and collaborate with external partners.
References
PPO vs DPO alignment discussion: https://mp.weixin.qq.com/s/nQXSkMeUhFTob9GKTD4_lA
NetEase Fuxi "Yisheng Zhuxiang" multimodal model, language component: https://zhuanlan.zhihu.com/p/690626399
CharacterEval paper: https://arxiv.org/abs/2401.01275
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.