Timbre‑Guided TG‑Critic and Transformer‑Based TrOMR: AI Advances in Music Evaluation
This article reviews two recent AI research papers from NetEase Cloud Music Lab: TG‑Critic, a timbre‑guided, reference‑free singing evaluation model that classifies vocal performance using only audio, and TrOMR, a Transformer‑based end‑to‑end polyphonic optical music recognition system that improves note‑sequence prediction and dataset realism.
TG‑Critic: Timbre‑Guided Reference‑Free Singing Evaluation
Paper: TG‑CRITIC: A Timbre‑Guided Model for Reference‑Independent Singing Evaluation. arXiv: https://arxiv.org/abs/2305.09127
Problem
Assess a singer’s performance using only the recorded vocal audio, without any reference melody, score, or pre‑recorded template. This matches how human experts can judge a completely unfamiliar song.
Input and Output
Input: a single non‑rap singing audio clip.
Output: either a three‑level quality label (good, medium, poor) or a continuous quality score in the range 0–1.
Evaluation granularity: the model can produce segment‑level scores, enabling analysis of quality variation within a song.
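The two output forms above are naturally related: a continuous score can be quantized into the three quality levels. A minimal sketch of such a mapping follows; the 0.4/0.7 thresholds are illustrative assumptions, as the paper does not publish its cut‑off points.

```python
def score_to_label(score: float,
                   poor_max: float = 0.4,
                   medium_max: float = 0.7) -> str:
    """Map a continuous quality score in [0, 1] to a three-level label.

    Thresholds are illustrative assumptions, not the paper's values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score <= poor_max:
        return "poor"
    if score <= medium_max:
        return "medium"
    return "good"
```

Applied per segment, the same mapping yields a coarse quality profile across a song.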
Key Technical Contributions
Timbre‑guided modeling: The network explicitly extracts timbre embeddings (e.g., using a high‑resolution convolutional front‑end) and fuses them with spectral features to capture vocal timbre characteristics that correlate with perceived quality.
High‑resolution network architecture: A HRNet‑style backbone processes mel‑spectrograms at multiple resolutions, preserving fine‑grained frequency details that are important for vocal assessment.
Cyclic automatic data annotation: A self‑training loop generates pseudo‑labels for unlabeled recordings, iteratively refines the model, and dramatically reduces the need for manual annotation.
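The cyclic annotation idea can be sketched as a standard self‑training loop: train on the labeled pool, pseudo‑label the unlabeled clips the model is confident about, fold those back into the training set, and repeat. Everything below (the toy model, the confidence threshold, the round count) is an illustrative assumption, not the paper's exact procedure.

```python
import random


class ToyCritic:
    """Stand-in for TG-Critic: returns a (label, confidence) pair."""

    def fit(self, examples):
        self.n_seen = len(examples)  # pretend to train

    def predict(self, clip):
        random.seed(clip)            # deterministic toy output per clip
        return random.choice(["good", "medium", "poor"]), random.random()


def cyclic_annotation(model, labeled, unlabeled, rounds=3, conf_min=0.8):
    """Sketch of a cyclic pseudo-labeling loop (assumed mechanics).

    Each round: retrain, pseudo-label confident clips, grow the labeled
    pool, and defer low-confidence clips to the next round.
    """
    train_pool, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model.fit(train_pool)
        still_unlabeled = []
        for clip in pool:
            label, conf = model.predict(clip)
            if conf >= conf_min:
                train_pool.append((clip, label))   # accept pseudo-label
            else:
                still_unlabeled.append(clip)       # retry next round
        pool = still_unlabeled
    return train_pool, pool
```

In practice the confidence gate is what keeps label noise from compounding across rounds; clips that never clear it simply remain unlabeled.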
Results
Experiments on a large internal singing dataset show that TG‑Critic achieves higher classification accuracy and lower mean absolute error on continuous scores than prior state‑of‑the‑art reference‑free methods. The model runs end‑to‑end, eliminating handcrafted template creation.
TrOMR: Transformer‑Based Polyphonic Optical Music Recognition
Paper: TrOMR: Transformer‑Based Polyphonic Optical Music Recognition. arXiv PDF: https://arxiv.org/pdf/2308.09370.pdf
Problem
Optical Music Recognition (OMR) converts scanned or photographed sheet‑music images into symbolic notation. Traditional pipelines rely on object‑detection stages, require costly annotated datasets, and struggle with dense polyphonic scores.
Technical Contributions
Transformer backbone: The TrOMR network replaces conventional CNN‑RNN pipelines with a pure Transformer encoder‑decoder that models long‑range dependencies, enabling prediction of longer note sequences and improving accuracy on complex scores.
Re‑defined annotation schema: Instead of a single “duration + note value” label, annotations are split into three components—global score symbol representation, local symbol representation, and pitch information—facilitating more effective learning.
Realistic data‑capture pipeline: A smartphone‑based acquisition process captures sheet‑music under varied lighting, including printed pages and screenshots on displays. This yields a large, authentic dataset that better reflects real‑world usage.
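The annotation split can be illustrated by decomposing a combined note token into the three streams the model predicts separately. The `"note-C4_quarter"` token format below is an illustrative assumption, not the paper's actual vocabulary.

```python
def split_annotation(token: str) -> dict:
    """Split a combined note token into global symbol, local symbol,
    and pitch streams (token format is an illustrative assumption)."""
    if token.startswith("note-"):
        pitch, duration = token[len("note-"):].split("_")
        return {"global": "note", "local": duration, "pitch": pitch}
    # non-note symbols (clefs, barlines, rests) carry no pitch
    return {"global": token, "local": None, "pitch": None}
```

Predicting the three streams separately shrinks each output vocabulary compared with one token per (pitch, duration) combination, which is the stated motivation for the schema.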
Results
Benchmarks on the newly collected dataset demonstrate that TrOMR outperforms previous OMR systems, especially on densely notated polyphonic passages, achieving higher symbol‑level and note‑level accuracy.