How Ant Group’s AI Multimodal Evaluation Transforms Image, Speech, and Video Quality Testing
In a QECon 2025 talk, Ant Group’s AI team detailed a comprehensive multimodal evaluation framework that leverages large‑model metrics, custom pipelines, and benchmark datasets to assess image generation, speech recognition, and video quality, while also contributing to industry standards and academic research.
Lu Jun (Ant nickname "No Pass") from Ant Group's Technology Department presented the talk "AI Multimodal Evaluation Based on Large Models" at QECon 2025 in Shenzhen.
Background
With the explosion of artificial intelligence, multimodal capabilities have rapidly advanced, enabling agents to accept and output text, audio, images, and video. Ant Group has deployed many AI multimodal applications and now shares evaluation work on three fronts: images (AIGC generation), audio (ASR/TTS), and video quality.
AIGC Image Evaluation
New challenges arise with AI‑generated images: textual‑image consistency, anomaly detection, and aesthetic judgment. After extensive research and standard‑setting discussions, Ant defined a metric suite focusing on consistency, anomalies, aesthetics, and basic image quality.
Traditional quality metrics (e.g., PSNR, SSIM) cover basic image quality, but text–image consistency requires new approaches. Consistency is evaluated with CLIP embeddings: the cosine similarity between the text and image vectors yields a CLIP Score. Complementary scorers such as PickScore and ImageReward are combined with it, and an upstream LLM translates Chinese prompts into English for more accurate assessment.
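As a minimal sketch of how such a CLIP Score can be computed (assuming the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers as a stand-in for Ant's internal setup):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: the public CLIP checkpoint stands in for whatever encoder is used internally.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image_path: str) -> float:
    """Cosine similarity between text and image embeddings (higher = more consistent)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()

# The prompt would typically be translated to English by the upstream LLM first.
# print(clip_score("a red sports car on a mountain road", "generated.png"))
```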
For anomaly detection, models such as Microsoft’s GIT and LLaVA were explored, ultimately selecting LLaVA with instruction‑prompt isolation and supervised fine‑tuning (SFT + LoRA) to classify specific anomaly types.
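The talk does not give the exact fine-tuning recipe; the sketch below shows the general shape of SFT + LoRA on a LLaVA-style model using the peft library, with illustrative hyperparameters and target modules that are assumptions, not Ant's configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Assumption: the public LLaVA-1.5 checkpoint; rank, alpha and target modules are illustrative.
base_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are updated during SFT
```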
Aesthetic scoring was upgraded using a CLIP‑based encoder plus a linear MLP, enhanced with two ReLU correction layers, achieving a 23% improvement over the open‑source IAP model and aligning closely with human ratings. This suite is named VQA‑GPT.
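A scoring head of this shape (frozen CLIP image features in, a linear MLP with two ReLU correction layers, a scalar aesthetic score out) might look like the following sketch; the layer widths are assumptions, not values from the talk:

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Maps a frozen CLIP image embedding to a scalar aesthetic score."""
    def __init__(self, clip_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.ReLU(),               # first correction layer
            nn.Linear(hidden, hidden),
            nn.ReLU(),               # second correction layer
            nn.Linear(hidden, 1),    # regression to a single score
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(clip_embedding)

# Training would regress against human aesthetic ratings (e.g. with an MSE loss).
head = AestheticHead()
scores = head(torch.randn(4, 768))   # batch of 4 CLIP embeddings -> 4 scores
```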
AI Product Image Evaluation Pipeline
The end‑to‑end pipeline consists of two modules:
Efficient‑SAM‑based module: Combines a segmentation model with a product‑consistency evaluator to detect issues such as incomplete cut‑outs, product deformation, and background mismatch. It uses PSNR, SSIM, area calculations, and adjustable IoU/min‑distance thresholds; optimizations such as feature pre‑loading cut average latency from 40 s to 4 s (see the sketch after this list).
Reward Model: Trained on <original, rejected, approved> triplets with attention mechanisms to improve interpretability. The model balances a ranking loss and a classification loss, boosting AUC by over 1%.
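As an illustration of the first module's rule-based checks, standard image metrics and a mask IoU threshold can approximate the decision logic; the sketch below uses scikit-image for PSNR/SSIM and plain NumPy for IoU, with made-up threshold values:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoU between two boolean segmentation masks (e.g. SAM cut-out vs. reference)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union else 0.0

def product_consistency_check(orig: np.ndarray, gen: np.ndarray,
                              orig_mask: np.ndarray, gen_mask: np.ndarray,
                              iou_thresh: float = 0.85, ssim_thresh: float = 0.7) -> dict:
    """Flag incomplete cut-outs / deformation; thresholds are illustrative and adjustable."""
    iou = mask_iou(orig_mask, gen_mask)
    ssim = structural_similarity(orig, gen, channel_axis=-1)
    return {
        "psnr": peak_signal_noise_ratio(orig, gen),
        "ssim": ssim,
        "mask_iou": iou,
        "pass": iou >= iou_thresh and ssim >= ssim_thresh,
    }
```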
To address label noise and class imbalance, semi‑supervised filtering, clustering‑based data balancing, and multi‑task learning (ranking + classification) were applied.
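A combined objective of that shape, pairwise ranking on <rejected, approved> scores plus a classification term, can be sketched as follows; the weighting and margin are assumptions, since the talk only states that the two losses are balanced:

```python
import torch
import torch.nn.functional as F

def reward_multitask_loss(score_approved: torch.Tensor,
                          score_rejected: torch.Tensor,
                          logits: torch.Tensor,
                          labels: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Ranking loss (approved should outscore rejected) blended with a classification loss."""
    ranking = F.margin_ranking_loss(
        score_approved, score_rejected,
        target=torch.ones_like(score_approved), margin=1.0)
    classification = F.binary_cross_entropy_with_logits(logits, labels.float())
    return alpha * ranking + (1.0 - alpha) * classification
```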
Interpretability is enhanced via attention heatmaps that highlight image regions influencing rejection decisions.
Speech (ASR) Evaluation
ASR metrics include character error rate (CER) and sentence error rate (SER), extended with ITN (inverse text normalization) accuracy and punctuation accuracy for finer granularity. The evaluation set covers language variations (Chinese, English, Cantonese), domains (travel, medical, ordering), and acoustic conditions (steady noise, babble, reverberation, multi‑speaker, silence). Data sources comprise recorded sessions, synthetic noise augmentation, online feedback loops, and public datasets.
To handle numeric and symbol conversions, a Text Normalization (TN) step aligns ground‑truth and hypothesis texts before error calculation. An ITN‑specific test set evaluates numeral transcription accuracy, mitigating cases where TN introduces errors.
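A minimal CER computation of that kind, with a toy text-normalization step applied to both reference and hypothesis before the edit distance, might look like this (real TN rules are far richer than the single digit-width mapping shown):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution
            prev = cur
    return dp[-1]

def cer(reference: str, hypothesis: str, normalize=None) -> float:
    """Character error rate; `normalize` is a TN function applied to both sides first."""
    if normalize:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# Toy TN rule: unify full-width digits so "１２３" and "123" compare as equal.
toy_tn = lambda s: s.translate(str.maketrans("０１２３４５６７８９", "0123456789"))
print(cer("打车去机场１２３号门", "打车去机场123号门", normalize=toy_tn))  # 0.0 after TN
```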
TTS Evaluation
TTS is assessed across four primary dimensions and twelve secondary metrics, covering intelligibility, prosody, emotion, and naturalness. Datasets are carefully aligned with production distributions regarding length, language, and symbol ratios. Initially evaluated by human raters, the process later incorporated model‑based scoring.
Pitch accuracy uses a fine‑tuned Whisper model to extract phonemes, enabling evaluation of tone, speed, and pauses. Audio quality leverages an AST model (a Vision Transformer adapted to spectrograms) to detect artifacts such as electrical noise or popping. Timbre consistency employs a fine‑tuned HuBERT model for voice‑print features.
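For timbre consistency, a common recipe is to compare pooled self-supervised speech embeddings; the sketch below uses the public facebook/hubert-base-ls960 checkpoint as a stand-in for Ant's fine-tuned voice-print model:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertModel

# Assumption: the public base checkpoint, not the fine-tuned internal one.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def timbre_embedding(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Mean-pooled HuBERT hidden states as a crude voice-print vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state      # (1, frames, 768)
    return hidden.mean(dim=1)                            # (1, 768)

def timbre_consistency(ref_wav: torch.Tensor, tts_wav: torch.Tensor) -> float:
    """Cosine similarity between reference speaker audio and synthesized audio."""
    return torch.nn.functional.cosine_similarity(
        timbre_embedding(ref_wav), timbre_embedding(tts_wav)).item()
```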
Human‑like affect is measured via an emotion‑quadrant method: textual emotion labels are mapped to a 2‑D space (valence × arousal) using open‑source sentiment models, while speech embeddings from a fine‑tuned WavLM model provide the audio counterpart. The Euclidean distance between the two points yields an emotion‑consistency score.
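The quadrant comparison itself is simple once both modalities are mapped onto the valence × arousal plane; the coordinates below are hypothetical, and the upstream mappings (text sentiment model, fine-tuned WavLM) are applied beforehand:

```python
import math

def emotion_consistency(text_va: tuple, audio_va: tuple,
                        max_dist: float = math.sqrt(8)) -> float:
    """Euclidean distance between text and speech points in (valence, arousal) space,
    folded into a 0-1 consistency score (1.0 = perfectly matched emotion).
    Assumes both coordinates lie in [-1, 1] on each axis."""
    dist = math.dist(text_va, audio_va)
    return 1.0 - dist / max_dist

# Hypothetical example: text labeled as excited, speech rendered flat and calm.
print(emotion_consistency(text_va=(0.6, 0.7), audio_va=(0.1, -0.2)))  # noticeably below 1.0
```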
Video Evaluation
Video quality assessment tackles both objective metrics and subjective human evaluation. The core model combines Swin‑B (hierarchical vision transformer), SlowFast (dual‑path spatio‑temporal network), and Q‑Align (LLM‑driven no‑reference quality scoring) to produce a reference‑free quality score.
Training incorporates diverse video feature extractors, data augmentation, ROI detection, sandwich‑type pre‑checks, and multi‑level MOS constraints. Inference standardizes inputs, concatenates module outputs, and passes them through a two‑layer MLP to regress the final quality score.
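The fusion stage described above, concatenating the per-branch features and regressing a score through a two-layer MLP, might be sketched like this; the feature dimensions and hidden width are assumptions:

```python
import torch
import torch.nn as nn

class VideoQualityHead(nn.Module):
    """Fuses features from the spatial (Swin-B), temporal (SlowFast) and
    LLM-scoring (Q-Align) branches and regresses a single MOS-like score."""
    def __init__(self, swin_dim=1024, slowfast_dim=2304, qalign_dim=1, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(                 # the two-layer MLP regressor
            nn.Linear(swin_dim + slowfast_dim + qalign_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, swin_feat, slowfast_feat, qalign_score):
        fused = torch.cat([swin_feat, slowfast_feat, qalign_score], dim=-1)
        return self.mlp(fused)

head = VideoQualityHead()
score = head(torch.randn(2, 1024), torch.randn(2, 2304), torch.randn(2, 1))  # batch of 2 videos
```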
AIGC video evaluation introduces additional dimensions such as temporal coherence and adherence to physical realism, requiring new benchmark capabilities.
Summary and Outlook
Ant Group has built a multimodal evaluation benchmark covering images, speech, and video, which will be reused internally and shared with the industry. The team contributed to national standards and authored a paper selected for AAAI 2025, highlighting their leadership in AI multimodal assessment.