Timbre‑Guided TG‑Critic and Transformer‑Based TrOMR: AI Advances in Music Evaluation
This article reviews two recent AI research papers from NetEase Cloud Music Lab: TG‑Critic, a timbre‑guided, reference‑free singing evaluation model that classifies vocal performance using only audio, and TrOMR, a Transformer‑based end‑to‑end polyphonic optical music recognition system that improves note‑sequence prediction and dataset realism.
TG‑Critic: Timbre‑Guided Reference‑Free Singing Evaluation
Paper: TG‑CRITIC: A Timbre‑Guided Model for Reference‑Independent Singing Evaluation. arXiv: https://arxiv.org/abs/2305.09127
Problem
Assess a singer’s performance using only the recorded vocal audio, without any reference melody, score, or pre‑recorded template. This matches how human experts can judge a completely unfamiliar song.
Input and Output
Input: a single non‑rap singing audio clip.
Output: either a three‑level quality label (good, medium, poor) or a continuous quality score in the range 0–1.
Evaluation granularity: the model can produce segment‑level scores, enabling analysis of quality variation within a song.
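The two output forms above are naturally related: a continuous score can be quantized into the three quality levels. A minimal sketch of such a mapping follows; the 0.4/0.7 thresholds are illustrative assumptions, as the paper does not publish its cut‑off points.

```python
def score_to_label(score: float,
                   poor_max: float = 0.4,
                   medium_max: float = 0.7) -> str:
    """Map a continuous quality score in [0, 1] to a three-level label.

    Thresholds are illustrative assumptions, not the paper's values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score <= poor_max:
        return "poor"
    if score <= medium_max:
        return "medium"
    return "good"
```

Applied per segment, the same mapping yields a coarse quality profile across a song.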
Key Technical Contributions
Timbre‑guided modeling: The network explicitly extracts timbre embeddings (e.g., using a high‑resolution convolutional front‑end) and fuses them with spectral features to capture vocal timbre characteristics that correlate with perceived quality.
High‑resolution network architecture: A HRNet‑style backbone processes mel‑spectrograms at multiple resolutions, preserving fine‑grained frequency details that are important for vocal assessment.
Cyclic automatic data annotation: A self‑training loop generates pseudo‑labels for unlabeled recordings, iteratively refines the model, and dramatically reduces the need for manual annotation.
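The cyclic annotation idea can be sketched as a standard self‑training loop: train on the labeled pool, pseudo‑label the unlabeled clips the model is confident about, fold those back into the training set, and repeat. Everything below (the toy model, the confidence threshold, the round count) is an illustrative assumption, not the paper's exact procedure.

```python
import random


class ToyCritic:
    """Stand-in for TG-Critic: returns a (label, confidence) pair."""

    def fit(self, examples):
        self.n_seen = len(examples)  # pretend to train

    def predict(self, clip):
        random.seed(clip)            # deterministic toy output per clip
        return random.choice(["good", "medium", "poor"]), random.random()


def cyclic_annotation(model, labeled, unlabeled, rounds=3, conf_min=0.8):
    """Sketch of a cyclic pseudo-labeling loop (assumed mechanics).

    Each round: retrain, pseudo-label confident clips, grow the labeled
    pool, and defer low-confidence clips to the next round.
    """
    train_pool, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model.fit(train_pool)
        still_unlabeled = []
        for clip in pool:
            label, conf = model.predict(clip)
            if conf >= conf_min:
                train_pool.append((clip, label))   # accept pseudo-label
            else:
                still_unlabeled.append(clip)       # retry next round
        pool = still_unlabeled
    return train_pool, pool
```

In practice the confidence gate is what keeps label noise from compounding across rounds; clips that never clear it simply remain unlabeled.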
Results
Experiments on a large internal singing dataset show that TG‑Critic achieves higher classification accuracy and lower mean absolute error on continuous scores than prior state‑of‑the‑art reference‑free methods. The model runs end‑to‑end, eliminating handcrafted template creation.
TrOMR: Transformer‑Based Polyphonic Optical Music Recognition
Paper: TrOMR: Transformer‑Based Polyphonic Optical Music Recognition. arXiv PDF: https://arxiv.org/pdf/2308.09370.pdf
Problem
Optical Music Recognition (OMR) converts scanned or photographed sheet‑music images into symbolic notation. Traditional pipelines rely on object‑detection stages, require costly annotated datasets, and struggle with dense polyphonic scores.
Technical Contributions
Transformer backbone: The TrOMR network replaces conventional CNN‑RNN pipelines with a pure Transformer encoder‑decoder that models long‑range dependencies, enabling prediction of longer note sequences and improving accuracy on complex scores.
Re‑defined annotation schema: Instead of a single “duration + note value” label, annotations are split into three components—global score symbol representation, local symbol representation, and pitch information—facilitating more effective learning.
Realistic data‑capture pipeline: A smartphone‑based acquisition process captures sheet‑music under varied lighting, including printed pages and screenshots on displays. This yields a large, authentic dataset that better reflects real‑world usage.
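The annotation split can be illustrated by decomposing a combined note token into the three streams the model predicts separately. The `"note-C4_quarter"` token format below is an illustrative assumption, not the paper's actual vocabulary.

```python
def split_annotation(token: str) -> dict:
    """Split a combined note token into global symbol, local symbol,
    and pitch streams (token format is an illustrative assumption)."""
    if token.startswith("note-"):
        pitch, duration = token[len("note-"):].split("_")
        return {"global": "note", "local": duration, "pitch": pitch}
    # non-note symbols (clefs, barlines, rests) carry no pitch
    return {"global": token, "local": None, "pitch": None}
```

Predicting the three streams separately shrinks each output vocabulary compared with one token per (pitch, duration) combination, which is the stated motivation for the schema.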
Results
Benchmarks on the newly collected dataset demonstrate that TrOMR outperforms previous OMR systems, especially on densely notated polyphonic passages, achieving higher symbol‑level and note‑level accuracy.