How We Built an LLM‑Powered AI Hub to Read and Analyze Community Chats

This article details the design and deployment of a multi‑layer LLM system that automatically reads massive creator group chats, extracts structured insights, mitigates hallucinations with dual‑model verification, uses few‑shot prompting for stable output, and delivers real‑time risk alerts and operational reports.

Bilibili Tech

Why an AI that can read group chats?

Operating large creator communities on Bilibili generates massive daily chat traffic. Manual review is slow, error‑prone, and misses weak signals. Simple keyword filters cannot capture context or emerging topics, and free‑text feedback lacks the structure needed for analysis.

Overall Architecture: LLM‑driven AI middle platform

The solution is organized into four layers: Data Collection → AI Structuring → Group Analysis → Operational Insight.
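In code, the four layers can be sketched as a chain of small functions. This is an illustrative skeleton under assumed message shapes, not the production implementation; the LLM step is stubbed out.

```python
# Illustrative skeleton of the four-layer pipeline; the message shape and
# per-layer logic are assumptions, with the LLM step stubbed out.

def collect(raw_messages):
    """Data Collection: normalize raw chat lines into message dicts."""
    return [{"group": g, "text": t} for g, t in raw_messages]

def structure(messages):
    """AI Structuring: stand-in for the LLM extraction step."""
    return [{**m, "feedback_type": "problem"} for m in messages if m["text"].strip()]

def analyze(records):
    """Group Analysis: aggregate structured records per group."""
    counts = {}
    for r in records:
        counts[r["group"]] = counts.get(r["group"], 0) + 1
    return counts

def insight(counts):
    """Operational Insight: turn aggregates into report lines."""
    return [f"{group}: {n} feedback items" for group, n in sorted(counts.items())]

report = insight(analyze(structure(collect([
    ("groupA", "the app crashes on upload"),
    ("groupA", "review is slow"),
    ("groupB", "love the new update"),
]))))
```

Each layer consumes only the previous layer's output, which keeps every stage independently testable and replaceable.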

System architecture diagram

Layered Prompt Engineering

Four prompt stages ensure high recall and precision while producing stable, structured output.

Information Extraction Layer: extracts all plausible user feedback with high recall, outputs a fixed schema, and embeds business semantics (feedback type, tag system, sentiment).

Content Governance Layer: performs high‑precision validation, removes hallucinations and noise, filters fuzzy statements, merges duplicate feedback, and discards operational/test messages.

Semantic Clustering Layer: builds topic clusters automatically using LLM semantic understanding, unifies tag names, and creates new tags when novel semantics appear.

Insight Generation Layer: produces a 100‑character community hotspot summary and generates daily/weekly reports with trend and risk indicators.

These layers yield a controllable, highly available, end‑to‑end explainable output pipeline.
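The fixed schema the extraction layer emits can be modeled as a small record type with validation. The field names and allowed values below are illustrative assumptions mirroring the fields named above (feedback type, tag, sentiment); the original source text travels with every record for traceability.

```python
from dataclasses import dataclass

# Allowed values are illustrative assumptions, not the production taxonomy.
FEEDBACK_TYPES = {"problem", "suggestion", "inquiry", "satisfaction"}
SENTIMENTS = {"positive", "neutral", "negative"}

@dataclass
class FeedbackRecord:
    feedback_id: str
    source_text: str   # original user utterance, kept for traceability
    feedback_type: str
    tag: str
    sentiment: str

    def validate(self) -> bool:
        """Reject records that drift from the fixed schema."""
        return (self.feedback_type in FEEDBACK_TYPES
                and self.sentiment in SENTIMENTS
                and bool(self.source_text.strip()))
```

Validating at the schema boundary means malformed LLM output is caught before it reaches clustering or reporting.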

Dual‑model collaboration for recall vs. precision

Group chat text is unstructured, colloquial, and context‑heavy. The system adopts a “light model for recall → heavy model for verification” strategy:

LLM A: excels at high‑recall information mining.

LLM B: excels at rigorous judgment and hallucination reduction.
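A minimal sketch of the recall-then-verify pattern. Both model calls are stubs: a real system would prompt LLM A for extraction and LLM B for verification, while the toy stand-ins here just keep the example runnable.

```python
# Sketch of "light model for recall, heavy model for verification".
# call_llm_a / call_llm_b are stubs for the two real model calls.

def call_llm_a(message):
    """High-recall extraction: return every candidate feedback item."""
    # Toy stand-in: treat any message mentioning "slow" as feedback.
    return [{"source_text": message, "tag": "performance"}] if "slow" in message else []

def call_llm_b(candidate, original_message):
    """High-precision verification: keep only candidates grounded in the source."""
    # Toy stand-in for the anti-hallucination check.
    return candidate["source_text"] in original_message

def extract_verified(message):
    return [c for c in call_llm_a(message) if call_llm_b(c, message)]
```

The asymmetry is deliberate: the cheap first pass can over-extract freely, because nothing reaches downstream consumers without passing the stricter second pass.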

Dual‑model workflow

Hallucination case and mitigation

Early single‑model parsing sometimes invented feedback that never existed, a serious hallucination risk. After introducing a strict verification node (LLM A → LLM B) and anti‑hallucination rules, the hallucination rate dropped from 8–12% to below 1% and fabricated feedback disappeared.

Never fabricate user utterances.

If no original text is available, output “No feedback”.

Structured data must map one‑to‑one with source text.

Fuzzy language (e.g., “seems”, “maybe”) triggers a high‑risk flag and manual review.

Missing fields or mismatched tags are marked “Must Fix”.
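These rules translate naturally into a deterministic audit run on each structured record before it enters any report. A sketch, with illustrative field names, fuzzy markers, and verdict labels:

```python
# Deterministic audit implementing the rules above. Field names, fuzzy
# markers, and verdict labels are illustrative assumptions.

FUZZY_MARKERS = ("seems", "maybe", "possibly")
REQUIRED_FIELDS = ("feedback_id", "source_text", "tag")

def audit(record, chat_log):
    """Return 'must_fix', 'high_risk' (manual review), or 'ok'."""
    if any(record.get(f) in (None, "") for f in REQUIRED_FIELDS):
        return "must_fix"    # missing fields
    if record["source_text"] not in chat_log:
        return "must_fix"    # no one-to-one mapping back to source text
    if any(m in record["source_text"].lower() for m in FUZZY_MARKERS):
        return "high_risk"   # fuzzy language: flag for manual review
    return "ok"
```

Because the audit is plain code rather than another LLM call, it cannot itself hallucinate, which is what makes the evidence chain trustworthy.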

Few‑shot prompting for stable structured output

To prevent format drift in table‑like outputs, a lightweight few‑shot prompt supplies two to three exemplar tables before the real data, so the model locks onto the exact schema without extra validators.

You are a member of the community operations team. You are about to read a batch of user feedback; each item carries a unique "Feedback ID".
Your tasks are as follows:
Topic clustering - semantically cluster all feedback, grouping items with similar meaning under one topic (e.g., "system lag" and "server crash" belong together).
Keep a single unified tag per topic; do not generate similar but differently named tags.
Tag naming - follow the style of Weibo trending topics: concise, punchy tags (2-8 characters); avoid long descriptions.
Tags must cover multiple types (satisfaction, inquiries, problems, etc.), not only problem feedback.
Event distillation - for each tag, summarize in one sentence what users are focused on; describe it as an event, not a list of categories.
Popularity count - count the number of feedback items under each tag as its popularity score.
Feedback-ID annotation - list up to 5 feedback IDs related to the tag, separated by commas.
Sorted output - output the top 10 tags in descending order of popularity.
The output table format is as follows; do not return any extra commentary:
| Tag | Trending Event | Popularity | Feedback IDs |
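Downstream code then turns the pipe-delimited table the prompt requests into structured rows. A minimal parser sketch, with a sample table whose column order (tag, event, popularity, feedback IDs) follows the prompt:

```python
# Minimal parser for the pipe-delimited table the clustering prompt requests.
# Column order (tag, event, popularity, feedback IDs) follows the prompt.

def parse_tag_table(output):
    rows = []
    for line in output.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row and anything that is not a well-formed data row.
        if len(cells) != 4 or not cells[2].isdigit():
            continue
        tag, event, heat, ids = cells
        rows.append({"tag": tag, "event": event, "heat": int(heat),
                     "feedback_ids": [i.strip() for i in ids.split(",")]})
    return rows

rows = parse_tag_table(
    "| Tag | Trending Event | Popularity | Feedback IDs |\n"
    "| Slow Review | Review delays frustrate creators | 42 | 101, 102, 103 |\n"
    "| App Crash | Upload page crashes on submit | 17 | 104, 105 |"
)
```

Skipping malformed lines instead of raising makes the parser tolerant of the occasional stray line a model emits despite the few-shot examples.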

Semantic clustering for evolving community language

Creators use diverse expressions (e.g., “审核慢” and “卡审核”, both meaning the review process is slow or stuck), synonyms, and constantly coin new terms. The LLM judges semantic similarity instead of relying on static keywords, generates unified tags, and automatically creates new topics when content does not fit existing clusters, enabling lightweight yet effective topic grouping.
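The match-or-create behavior can be sketched as follows. `judge_same_topic` stands in for an LLM semantic-similarity call; the toy version uses word overlap purely so the example runs.

```python
# Match-or-create tagging. judge_same_topic stands in for an LLM
# semantic-similarity call; the toy version uses word overlap so it runs.

def judge_same_topic(text, tag):
    """Stub for an LLM call answering: does `text` belong under `tag`?"""
    return any(word in text for word in tag.split())

def assign_tag(text, existing_tags):
    for tag in existing_tags:
        if judge_same_topic(text, tag):
            return tag           # merge into an existing topic
    existing_tags.append(text)   # novel semantics: open a new topic
    return text
```

The tag set grows only when no existing topic fits, which is what lets the system track newly coined community slang without a maintained keyword list.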

Risk alert system driven by semantic signals

Combining clustering results with feedback metrics yields a lightweight alert logic:

Volume analysis

Growth rate analysis

Negative‑feedback ratio

Emotion spike detection

Cross‑group consistency check

When a topic shows a sudden surge in negative sentiment, a rapid period‑over‑period increase, or simultaneous spikes across multiple groups, the system flags a “sudden event” with a risk level.
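A sketch of how these signals might combine into a risk level. The thresholds and the two-of-three escalation rule are illustrative assumptions, not the production values:

```python
# Combine the alert signals into a risk level. Thresholds and the
# escalation rule are illustrative assumptions, not production values.

def risk_level(today_count, yesterday_count, negative_count, groups_affected):
    growth = (today_count - yesterday_count) / max(yesterday_count, 1)
    neg_ratio = negative_count / max(today_count, 1)
    signals = sum([
        growth >= 1.0,          # volume at least doubled period-over-period
        neg_ratio >= 0.6,       # majority-negative sentiment
        groups_affected >= 3,   # simultaneous spike across groups
    ])
    return "high" if signals >= 2 else "medium" if signals == 1 else "low"
```

Requiring multiple signals to co-occur before escalating keeps a single noisy metric from paging the on-call operator.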

Risk alert flowchart

Business impact and benefits

After deploying the AI analysis system, daily effective feedback rose from roughly 50 items (manual) to roughly 600 (automated), a more than tenfold increase in coverage. The hallucination rate dropped below 1%, and every structured record includes a traceable original quote, forming an evidence chain.

Structured data now powers daily/weekly briefs, informs product prioritization, auto‑generates TAPD tickets, and serves as a knowledge asset for the product team, reducing communication overhead.

Performance metrics

AI advantages summary

Efficiency boost: automated analysis frees team to focus on critical issues.

Comprehensive coverage: semantic understanding captures weak signals, new topics, and nuanced expressions.

Quantified sentiment: emotion becomes an observable metric that enables timely risk alerts.

Topic aggregation: semantic clustering merges duplicate opinions and surfaces long‑tail problems.

Closed‑loop workflow: structured feedback flows into reports and issue‑tracking systems, accelerating resolution.

The platform has now moved from reactive “manual monitoring” to proactive “AI‑driven insight”, turning creator voices into scalable, actionable intelligence.

Tags: LLM, prompt engineering, Few‑Shot Learning, AI Operations, Risk Detection, semantic clustering
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.