A New UGC Video Evaluation Paradigm Built on 17 Billion Real User Interactions
The paper introduces CASTER, a multimodal AI system that uses Social‑CoT reasoning and the MEDEA framework to simulate the reactions of diverse audiences. Evaluated on the large‑scale CASTER‑Bench dataset, it outperforms GPT‑5.2, Claude‑4.5‑Opus, and traditional VQA methods, and it is already deployed on Bilibili.
Let AI Think From the Audience Perspective
Traditional video quality assessment (VQA) focuses on visual clarity, but on Bilibili, quality is judged by community consensus rather than by pixel fidelity. CASTER takes a video's multimodal information (cover, keyframes, title, tags, ASR transcript) and simulates the reactions of different audience personas to infer how likely the community is to approve of it.
Social-CoT: Social Cognitive Reasoning
Social‑CoT is the core reasoning mechanism, distinct from logical Chain‑of‑Thought. It proceeds in three steps:
Instantiate diverse audience personas (enthusiast, casual passerby, newcomer, critical veteran) representing typical community viewpoints.
Simulate emotional response paths for each persona, reasoning about feelings, memorable segments, and likely comments.
Aggregate the persona reactions into a community‑level judgment via a Skellam Scoring consensus to decide whether the content will generate positive resonance.
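The article does not define Skellam Scoring, so the following is only one plausible reading, offered as a minimal sketch. It assumes each persona contributes a positive and a negative reaction rate, treats the community's positive and negative reaction counts as independent Poisson variables, and scores a video by the probability that positives outnumber negatives (the difference of two Poisson counts follows a Skellam distribution). The persona rates, the function names, and the truncation limit are all hypothetical.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """P(N = k) for N ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def prob_positive_margin(mu_pos: float, mu_neg: float, max_count: int = 60) -> float:
    """P(N_pos - N_neg > 0) for independent Poisson counts.

    The difference N_pos - N_neg is Skellam-distributed. The sum is
    truncated at max_count, which is adequate for small rates.
    """
    total = 0.0
    cdf_neg = 0.0  # running P(N_neg < k)
    for k in range(max_count + 1):
        total += poisson_pmf(k, mu_pos) * cdf_neg
        cdf_neg += poisson_pmf(k, mu_neg)
    return total


# Hypothetical (positive_rate, negative_rate) pairs produced by the persona simulation.
personas = {
    "enthusiast": (3.0, 0.5),
    "casual passerby": (1.0, 1.0),
    "newcomer": (1.5, 0.5),
    "critical veteran": (0.5, 2.0),
}

# Sums of independent Poisson counts are Poisson, so the rates simply add.
mu_pos = sum(pos for pos, _ in personas.values())
mu_neg = sum(neg for _, neg in personas.values())
consensus_score = prob_positive_margin(mu_pos, mu_neg)
print(f"P(positive resonance) = {consensus_score:.3f}")
```

Under this reading, a tie counts as "no positive resonance", so two equal rates give a score below 0.5 rather than exactly 0.5.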
(Figure: a concrete Social‑CoT example; image not reproduced here.)
MEDEA Framework
Social‑CoT is realized as a trainable system called MEDEA (Multimodal Engagement‑Driven Evaluation Architecture). It consists of three stages:
Harvest real community wisdom: a teacher model (Gemini) is applied to Bilibili data to convert community insight into structured Social‑CoT reasoning paths, yielding 54K annotated samples.
Supervised fine‑tuning (SFT): teach the model the Social‑CoT structure, aligning visual cues and textual information with their social interpretation.
RL alignment with human community standards: the GRPO algorithm optimizes a composite reward with four components:
Format reward: enforce structured output.
Label reward: correctness of predictions.
Cognitive diversity constraint: avoid repetitive comments and explore the full distribution.
Social alignment reward: semantic similarity between simulated and high‑up‑vote real comments.
The social alignment reward prevents degenerate outputs like generic praise and enables concrete empathetic interpretations (e.g., describing an Icelandic vlog’s wind‑blown hair as “the raw power of nature”).
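The RL stage above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four reward components are assumed to be precomputed scores in [0, 1], the weights are hypothetical (the article gives none), and the advantage step uses the group‑relative normalization that GRPO is known for, here with the population standard deviation (implementations differ on this detail).

```python
from statistics import mean, pstdev


def composite_reward(fmt: float, label: float, diversity: float, social: float,
                     weights: tuple = (0.1, 0.4, 0.2, 0.3)) -> float:
    """Weighted sum of the four reward components (weights are assumptions)."""
    w_fmt, w_label, w_div, w_soc = weights
    return w_fmt * fmt + w_label * label + w_div * diversity + w_soc * social


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for the same video:
# (format, label, diversity, social alignment)
group = [
    (1.0, 1.0, 0.8, 0.7),  # well-formed, correct, varied, close to real comments
    (1.0, 1.0, 0.2, 0.1),  # correct label but generic, repetitive praise
    (1.0, 0.0, 0.6, 0.4),  # wrong prediction
    (0.0, 0.0, 0.5, 0.3),  # malformed output
]
rewards = [composite_reward(*components) for components in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

In this toy group, the second response has the correct label yet receives a lower reward than the first, which is the effect the diversity and social alignment terms are meant to produce.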
CASTER‑Bench: Community Resonance Benchmark
To support the CASTER task, we release CASTER‑Bench, which contains 1,485 UGC videos across 30 categories (life, knowledge, gaming, food, tech, dance, etc.). The videos average 442 s in length (182.5 h in total), and each comes with full multimodal information: video, cover, title, tags, partition (Bilibili's content zone), and ASR transcript.
Experiments: Outperforming GPT‑5.2 and Claude‑4.5‑Opus
On CASTER‑Bench, MEDEA achieves the best results: F1 = 0.650, precision = 0.603, recall = 0.705. This is a 17.1% relative improvement in F1 over the strongest baseline (GPT‑5.2 with reasoning, F1 = 0.555). Detailed failure‑mode analysis shows:
Traditional VQA methods (FastVQA, DOVER, MaxVQA) obtain low F1 (0.33‑0.41) because they assess visual quality rather than community resonance.
Standard large models (GPT‑5.2, Claude‑4.5‑Opus) have high recall (>90%) but low precision (~30%) because of a "generous bias": they find something positive in any video but cannot single out truly excellent content.
Reasoning‑enhanced models improve only slightly (max F1 = 0.555), because logical reasoning does not equal social cognition.
Using Social‑CoT prompts without fine‑tuning yields F1 = 0.508, indicating prompting helps but dedicated training is required to internalize community standards.
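The reported numbers can be checked with the standard F1 formula, the harmonic mean of precision and recall. The sketch below recomputes MEDEA's F1 and its relative gain from the figures quoted above. The "generous bias" case uses illustrative values of 0.30 precision and 0.90 recall, since the article gives only approximate bounds for those models.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MEDEA's reported precision and recall.
medea_f1 = f1_score(0.603, 0.705)        # about 0.650
relative_gain = medea_f1 / 0.555 - 1     # about 0.171, i.e. 17.1% over the baseline

# Illustrative "generous bias" profile: high recall cannot rescue low precision.
generous_f1 = f1_score(0.30, 0.90)       # 0.45

print(f"MEDEA F1 = {medea_f1:.3f}, relative gain = {relative_gain:.1%}")
print(f"Generous-bias F1 = {generous_f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, which is why a model that labels nearly everything as positive still scores poorly.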
Deployment on Bilibili
CASTER is not only a paper; it has been deployed in Bilibili's content ecosystem. Integrated into the content distribution pipeline, it can identify high‑resonance videos at a very early stage, before any comments appear, so that high‑quality creators gain exposure faster instead of waiting for organic diffusion.
Developer Outreach
CASTER will be presented as a poster on July 5, 2026, in San Diego, USA, with MEDEA swag available for attendees.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
