
Evaluation Framework and Methodology for OPPO XiaoBu AI Assistant

This article presents a comprehensive evaluation framework for OPPO's XiaoBu AI assistant, covering evaluation concepts, objectives, five key elements, sampling methods, dimension selection, annotation scoring, report generation, and a detailed Q&A that illustrates practical metrics and processes for voice and search services.


Guest: Li Ermin (OPPO) | Editor: Wu Qiyao (University of California) | Platform: DataFunTalk

Introduction – Evaluation has become a frequent part of daily life, from buying houses and cars to selecting digital products. This talk introduces the evaluation system of the XiaoBu assistant.

01. Evaluation Concept and Purpose

Evaluation is purpose‑driven, ranging from small to large in scope. It can be defined as "assessment + measurement": quantifying observed phenomena according to predefined rules.

Compared with commodity evaluation, internet‑product and AI evaluation share the same methodology (samples, dimensions, methods) but differ in three respects: more varied usage scenarios, massive data volumes with quantifiable metrics, and a focus on user experience rather than purely commercial goals.

The two main purposes are: (1) discover common user‑perceived problems from large samples to guide product and algorithm improvements, and (2) validate iteration effects and provide risk assessment before launch.

02. Evaluation Elements

The five elements are: evaluation method, data selection (sampling), evaluation dimensions & scoring rules, annotation scoring, and evaluation report. Each is described in detail.

2.1 Evaluation Method

Two industries are used as examples:

Search industry – methods include overall satisfaction (Per‑page), side‑by‑side (SBS) comparison, single‑item scoring (PI) with NDCG, and recall/accuracy with F‑score.
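As an illustration of single‑item (PI) scoring with NDCG, below is a minimal sketch in Python. The graded relevance labels (0–3), the log2 discount, and the cutoff k are common conventions assumed here for illustration; the talk does not specify the exact formula variant used.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=5):
    """NDCG: DCG of the ranked list divided by the DCG of its ideal reordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: annotator-assigned relevance grades (0-3) for one query's ranked results.
print(ndcg([3, 2, 0, 1, 2], k=5))  # ~0.98
```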

Voice‑assistant industry – four core bottlenecks (wake‑up, hearing, understanding, speaking) are evaluated via wake‑up rate, ASR error rates, intent recall/accuracy, and MOS for TTS.

2.2 Data Selection

Four sampling strategies are used: random sampling (log‑based), deduplication sampling, stratified sampling (high‑frequency, medium, tail), and vertical sampling (domain‑specific). Each has advantages and trade‑offs for coverage and representativeness.
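To make the stratified strategy concrete, here is a minimal sketch assuming queries have already been bucketed by log frequency into head, medium, and tail strata; the stratum names and quotas are illustrative, not values from the talk.

```python
import random

def stratified_sample(queries_by_stratum, quotas, seed=42):
    """Draw a fixed quota of queries from each frequency stratum."""
    rng = random.Random(seed)
    return {
        stratum: rng.sample(queries, min(quotas.get(stratum, 0), len(queries)))
        for stratum, queries in queries_by_stratum.items()
    }

# Toy strata; in practice these come from bucketing query logs by frequency.
strata = {
    "head":   [f"head_q{i}" for i in range(50)],
    "medium": [f"mid_q{i}" for i in range(200)],
    "tail":   [f"tail_q{i}" for i in range(1000)],
}
picked = stratified_sample(strata, {"head": 10, "medium": 20, "tail": 40})
print({stratum: len(queries) for stratum, queries in picked.items()})
```

Giving the tail a larger absolute quota is one way to surface long‑tail problems, which the Q&A below also recommends addressing through stratified sampling.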

2.3 Evaluation Dimensions and Rules

Typical dimensions include legality, spam/low‑quality, intent understanding, relevance, timeliness, ranking, diversity, authority, convenience, and richness. Rules are defined per product and purpose to ensure consistent human annotation.

2.4 Annotation Scoring

Query intent is judged by four methods: direct understanding, everyday experience, deep thinking, and search‑engine reference. Result satisfaction is then assessed for relevance and dimension coverage, using tiered scoring.
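As a hedged illustration of tiered scoring, the sketch below maps annotator tiers to numeric scores and averages them per result; the tier names and score values are hypothetical, since the actual rubric is defined per product and purpose.

```python
from statistics import mean

# Hypothetical tier-to-score mapping; real rubrics are defined per product and purpose.
TIER_SCORES = {"perfect": 3, "good": 2, "fair": 1, "bad": 0}

def result_satisfaction(labels):
    """Average the tiered labels given by several annotators for one query's result."""
    return mean(TIER_SCORES[label] for label in labels)

print(result_satisfaction(["good", "perfect", "good"]))  # -> 2.33...
```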

2.5 Evaluation Report

The report contains three core parts: audience‑specific presentation, key metrics & statistics, and background information (purpose, method, dimensions, formulas). The "One‑page" principle ensures the most important information is front‑loaded.

03. General Evaluation Process

Steps: clarify requirements with stakeholders → draft evaluation plan → discuss and finalize → prepare data & environment → trial annotation → formal evaluation → quality check → data statistics → report output. The process ends with product optimization based on findings.

04. XiaoBu Assistant Evaluation System

XiaoBu is OPPO’s AI assistant on smartphones, IoT devices, and other 5G‑enabled products. It supports hundreds of skills (life services, travel, queries, system control, entertainment) and is continuously expanded.

The framework aligns with the four core bottlenecks (wake‑up, hearing, understanding, speaking) across devices (phone, watch, TV). Specific evaluations include:

"Hear‑Clear": audio quality distribution and ASR error rates (lab and online).

"Hear‑Understand": intent recall/accuracy, session satisfaction, and GSB pre‑launch tests.

"Speak‑Clear": MOS for TTS, optional recommendation‑based or binary objective tests.

Special attention is given to Cantonese mode and multi‑device scenarios.

05. Summary and Outlook

The evaluation system evolves with user needs and product iterations, extending to multi‑terminal, cross‑service, and cross‑scenario assessments, including vision, environment perception, and learning capabilities.

06. Q&A Highlights

ASR metrics: character error rate (CER) and sentence error rate (SER); TTS metric: MOS.

Long‑tail issues are best addressed by stratified sampling.

Relevance and timeliness are manually judged with detailed rules.

Recall metrics are usually domain‑specific; satisfaction and PI focus on top‑N results.

Core metrics depend on the model and evaluation goal; not all dimensions are required for every test.

Dialect evaluation requires native speakers; MOS scoring involves at least five raters per audio.

Key indicators vary: ASR – CER/SER, TTS – MOS, NLP – user satisfaction.

Richness is judged by answer length and substance.

Accent‑induced ASR errors are labeled but evaluated against the same standards; dialect support can be toggled on or off.

Thank you for attending.
