Non-Reference Audio Quality Assessment Using a Bidirectional LSTM Deep Learning Model
This article presents a non‑reference audio quality assessment method that leverages a bidirectional LSTM network to predict perceptual scores from spectral features extracted via FFT, describing the system workflow, technical advantages, data preparation, loss design, and TensorFlow implementation details.
Background : With the rapid development of network technologies, audio‑visual products have proliferated and user expectations for audio quality have risen. Objective audio quality assessment methods are divided into reference‑based and non‑reference approaches; the latter are more practical because they do not require a high‑quality reference signal.
Inspired by humans’ ability to judge audio quality without a reference, the authors propose a non‑reference solution that trains a neural network to emulate this mechanism, using a bidirectional LSTM (BiLSTM) model.
System workflow : (1) The client uploads audio data to the server; nginx forwards the request to a web server. (2) The web server packages the data and forwards it to an AI server. (3) The AI server extracts spectral features via Fast Fourier Transform (FFT). (4) The extracted features are fed into a BiLSTM network, which outputs frame‑level scores and an overall quality score, then returns the result to the client.
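As a minimal sketch of step (3), spectral features can be extracted by framing the waveform and taking the FFT magnitude of each frame. The frame length, hop size, and log compression below are illustrative assumptions, not the article's exact parameters:

```python
import numpy as np

def extract_spectral_features(audio, frame_len=512, hop=256):
    """Frame the signal and take the log-magnitude FFT of each frame."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum of one frame
        frames.append(np.log(spectrum + 1e-8))  # log compression for dynamic range
    return np.stack(frames)                     # shape: [n_frames, frame_len // 2 + 1]
```

The resulting [frames, bins] matrix is the kind of variable-length sequence a recurrent model can consume directly.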
Technical advantages : (1) No reference audio is needed for evaluation. (2) The method handles audio of arbitrary length. (3) Using spectral features reduces computational load and improves prediction accuracy.
Core technology implementation :
Network model : The BiLSTM captures both past and future context, integrating global information to produce two outputs – per‑frame scores and a final quality rating.
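A minimal Keras sketch of such a two-output model follows; the layer sizes, head names, and the use of average pooling for the utterance-level head are illustrative assumptions, not the article's published architecture. A model of this shape could play the role of the `model` object referenced in the training code below:

```python
import tensorflow as tf

def build_bilstm_model(feature_dim=257, hidden_units=128):
    """BiLSTM over frames with two heads: an overall score and per-frame scores."""
    inputs = tf.keras.Input(shape=(None, feature_dim))  # variable-length sequences
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden_units, return_sequences=True))(inputs)
    # Head 1: pool over time for a single utterance-level quality score.
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
    overall_score = tf.keras.layers.Dense(1, name="overall_score")(pooled)
    # Head 2: one score per frame.
    frame_scores = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1), name="frame_scores")(x)
    return tf.keras.Model(inputs, [overall_score, frame_scores])
```

Because `return_sequences=True` keeps every timestep and the pooling head collapses them, the same trunk serves both outputs and handles audio of arbitrary length.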
Data preprocessing : The clean Chinese speech dataset ST‑CMDS is mixed with randomly selected noise from 100 types at various SNR levels to simulate real‑world conditions; PESQ scores are computed for reference.
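Mixing clean speech with noise at a target SNR can be sketched as follows; the scaling formula is the standard power-ratio derivation, while the dataset loading and noise selection are out of scope here:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean/noise power ratio equals snr_db, then add."""
    noise = noise[:len(clean)]  # assume the noise clip is at least as long as the speech
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Solve clean_power / (scale**2 * noise_power) = 10**(snr_db / 10) for scale.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Repeating this over the 100 noise types and a range of SNR values yields the synthetic noisy corpus whose PESQ scores serve as training targets.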
Loss calculation : The loss combines overall mean‑squared error (MSE) between predicted and true scores and a frame‑level MSE weighted to emphasize frames with higher impact.
import numpy as np
import tensorflow as tf

def frame_mse_tf(y_true, y_pred):
    # y_true: [batch, frames, 1]; every frame carries the clip's PESQ score,
    # so the first frame's value is the clip-level target.
    true_pesq = y_true[:, 0, :]
    loss = tf.constant(0, dtype=tf.float32)
    for i in range(y_true.shape[0]):
        # Weight each clip's frame-level MSE by 10**(PESQ - 4.5), so clips whose
        # PESQ is close to the 4.5 maximum contribute more to the loss.
        loss += (10 ** (true_pesq[i] - 4.5)) * tf.reduce_mean(tf.math.square(y_true[i] - y_pred[i]))
    return loss / tf.constant(y_true.shape[0], dtype=tf.float32)
def train_loop(features, labels1, labels2):
    loss_object = tf.keras.losses.MeanSquaredError()
    with tf.GradientTape() as tape:
        # The model's two heads: an utterance-level score and per-frame scores.
        predictions1, predictions2 = model(features)
        loss1 = loss_object(labels1, predictions1)   # overall MSE
        loss2 = frame_mse_tf(labels2, predictions2)  # weighted frame-level MSE
        loss = loss1 + loss2
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
def read_npy_file(filename):
    # tf.py_function passes the filename as a tensor; decode it to a Python string.
    data = np.load(filename.numpy().decode())
    return data.astype(np.float32)

def data_preprocessing(feature):
    # The right-hand side is evaluated before reassignment, so label1 is taken
    # from the original array before feature is reduced to its first channel.
    feature, label1 = feature[..., 0], feature[0][0]
    # Broadcast the clip-level score into one label per frame.
    label2 = label1[0] * np.ones([feature.shape[0], 1])
    return feature, label1, label2

def read_feature(filename):
    [feature, ] = tf.py_function(read_npy_file, [filename], [tf.float32, ])
    data, label1, label2 = tf.py_function(
        data_preprocessing, [feature], [tf.float32, tf.float32, tf.float32])
    return data, label1, label2
def generate_data(file_path):
    list_ds = tf.data.Dataset.list_files(file_path + '*.npy')
    feature_ds = list_ds.map(read_feature, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return feature_ds
Model training : The prepared dataset is shuffled, padded into batches of equal length, and fed to the BiLSTM model, which is optimized with RMSprop.

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
W = model.layers[0].get_weights()  # optional: inspect the first layer's initial weights
ds = generate_data(train_file_path)
ds = ds.shuffle(buffer_size=1000).padded_batch(
    BATCH_SIZE,
    padded_shapes=([None, None], [None], [None, None])).prefetch(
    tf.data.experimental.AUTOTUNE)
for step, (x, y, z) in enumerate(ds):
    loss = train_loop(x, y, z)
Conclusion : The proposed solution builds an automatic, reference‑free audio quality assessment system based on a BiLSTM network, trained on a synthetic noisy dataset derived from ST‑CMDS, achieving accurate quality predictions without needing a clean reference.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.