Training an Audio Quality Detection Model Using Synthetic Noise and PESQ Scoring
This article explains how to generate low‑quality audio samples from clean speech by randomly inserting noise at various SNR levels, compute objective PESQ scores as ground‑truth labels, and use the resulting paired data to train a neural‑network model for reference‑free audio quality assessment.
Training an audio quality detection model runs into two obstacles: degraded‑audio datasets are scarce, and assigning reliable quality scores to them is difficult. To address this, a method is proposed that creates low‑quality audio solely from a clean high‑quality speech corpus and labels each sample with a score from the PESQ algorithm.
Subjective evaluation of speech quality typically relies on MOS (Mean Opinion Score), which requires many listeners to rate audio on a 1‑5 scale; scores above 4 indicate good quality, while below 3 denote unacceptable quality.
Objective evaluation methods fall into two categories: reference‑based (e.g., PESQ) and no‑reference (e.g., P.563). The presented approach adopts the PESQ algorithm, which compares a degraded signal with its clean reference, aligns them, applies auditory transformations, measures spectral distortion, and maps the result to a MOS‑like PESQ score ranging from –0.5 to 4.5.
Data generation steps:
Randomly select start and end positions for inserting noise into the clean audio.
Calculate the noise scaling factor based on a specified signal‑to‑noise ratio (SNR).
Insert the chosen noise segment into the selected portion of the clean audio.
Compute the PESQ score for the resulting degraded audio.
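The scaling factor in step 2 follows directly from the SNR definition SNR = 10·log10(P_signal / P_noise): solve for the target noise power, then scale the noise so it reaches exactly that power. A minimal, dependency‑free sketch (the function and variable names here are illustrative, not from the article's code):

```python
import math
import random

def snr_scale(signal, noise, snr_db):
    # k such that signal + k * noise has the requested SNR in dB
    p_signal = sum(s * s for s in signal)
    p_noise_target = p_signal / 10 ** (snr_db / 10)
    p_noise = sum(n * n for n in noise)
    return math.sqrt(p_noise_target / p_noise)

random.seed(0)
signal = [random.uniform(-1, 1) for _ in range(1000)]
noise = [random.uniform(-0.1, 0.1) for _ in range(1000)]

# -10 dB means the scaled noise carries 10x the signal power
k = snr_scale(signal, noise, -10)
p_s = sum(s * s for s in signal)
p_n = sum((k * n) ** 2 for n in noise)
realized = 10 * math.log10(p_s / p_n)
print(round(realized, 6))  # -10.0
```

Because k is derived algebraically from the two measured powers, the realized SNR matches the requested one exactly, regardless of the noise's original amplitude.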
Implementation details (Python):
```python
import random

import numpy as np
import soundfile as sf
from pesq import pesq


def random_sample(n1, n2):
    # Pick a random [start, end) span that fits within the shorter signal
    if n1 < n2:
        start = random.randint(0, n1)
        end = random.randint(start, n1)
    else:
        start = random.randint(0, n2)
        end = random.randint(start, n2)
    return start, end


def add_noise(x, d, SNR):
    # Scaling factor k so that x + k * d has the requested SNR (in dB)
    P_signal = np.sum(np.abs(x) ** 2)
    P_d = np.sum(np.abs(d) ** 2)
    P_noise = P_signal / 10 ** (SNR / 10)
    k = np.sqrt(P_noise / P_d)
    return k


def make_noise_data(high_wave_data, noise_sample_data, sr):
    # Choose where to corrupt the clean audio and which noise slice to use
    c_start, c_end = random_sample(len(high_wave_data), len(noise_sample_data))
    n_start = random.randint(0, len(noise_sample_data) - (c_end - c_start))
    n_end = n_start + (c_end - c_start)
    k = add_noise(high_wave_data, noise_sample_data[n_start:n_end], -10)
    convert_data = high_wave_data[c_start:c_end] + k * noise_sample_data[n_start:n_end]
    new_wave_data = np.concatenate((high_wave_data[:c_start], convert_data,
                                    high_wave_data[c_end:]))
    # librosa.output.write_wav was removed in librosa 0.8; soundfile is the
    # recommended replacement for writing WAV files
    sf.write("noise.wav", new_wave_data, sr)
    return new_wave_data


# high_wave_data, noise_sample_data, and sr come from loading the clean and
# noise corpora beforehand (e.g. with librosa.load)
low_wave_data = make_noise_data(high_wave_data, noise_sample_data, sr)
score = pesq(sr, high_wave_data, low_wave_data, 'nb')
```

In real‑world scenarios, obtaining a clean reference for every degraded audio segment is impractical, making direct PESQ computation infeasible. By training a neural network on the synthetically generated paired data and their PESQ scores, the model learns to predict audio quality without needing a reference signal, enabling scalable, reference‑free quality assessment.
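The article does not describe the network architecture, so the following is a deliberately tiny, dependency‑free stand‑in for the regression setup: a single linear unit trained by SGD to map per‑frame log‑energy features of a degraded waveform to a quality label. The labels here are simulated rather than real PESQ scores, and a practical model would use spectral features and a deeper network:

```python
import math
import random

random.seed(1)

def frame_features(wave, frame=160):
    # Per-frame log-energy: a crude stand-in for the spectral features
    # a real quality model would consume
    feats = []
    for i in range(0, len(wave) - frame + 1, frame):
        e = sum(x * x for x in wave[i:i + frame]) / frame
        feats.append(math.log10(e + 1e-10))
    return feats

def make_pair():
    # Simulated (degraded waveform, quality label) pair; in the real
    # pipeline the label comes from pesq() on (clean, degraded)
    label = random.uniform(1.0, 4.5)
    noise_amp = (4.5 - label) / 4.5          # noisier audio -> lower label
    wave = [random.uniform(-0.2, 0.2) + random.uniform(-noise_amp, noise_amp)
            for _ in range(1600)]
    return wave, label

data = [make_pair() for _ in range(200)]

# One linear unit: mean log-energy feature -> predicted score
w, b, lr = 0.0, 0.0, 0.05

def mse():
    err = 0.0
    for wave, label in data:
        f = sum(frame_features(wave)) / 10   # 1600 samples -> 10 frames
        err += (w * f + b - label) ** 2
    return err / len(data)

loss_before = mse()
for _ in range(50):
    for wave, label in data:
        f = sum(frame_features(wave)) / 10
        g = 2 * (w * f + b - label)          # d(squared error)/d(prediction)
        w -= lr * g * f
        b -= lr * g
loss_after = mse()
print(loss_before > loss_after)  # training reduces the MSE
```

The point of the sketch is the data flow: degraded audio in, a single scalar quality estimate out, trained against objective scores with a plain regression loss and no reference signal at inference time.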
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.