Non-Reference Audio Quality Assessment Using a Bidirectional LSTM Deep Learning Model
This article presents a non‑reference audio quality assessment method that leverages a bidirectional LSTM network to predict perceptual scores from spectral features extracted via FFT, describing the system workflow, technical advantages, data preparation, loss design, and TensorFlow implementation details.
Background : With the rapid development of network technologies, audio‑visual products have proliferated and user expectations for audio quality have risen. Objective audio quality assessment methods are divided into reference‑based and non‑reference approaches; the latter are more practical because they do not require a high‑quality reference signal.
Inspired by humans’ ability to judge audio quality without a reference, the authors propose a non‑reference solution that trains a neural network to emulate this mechanism, using a bidirectional LSTM (BiLSTM) model.
System workflow : (1) The client uploads audio data to the server; nginx forwards the request to a web server. (2) The web server packages the data and forwards it to an AI server. (3) The AI server extracts spectral features via Fast Fourier Transform (FFT). (4) The extracted features are fed into a BiLSTM network, which outputs frame‑level scores and an overall quality score, then returns the result to the client.
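As a minimal sketch of step (3), spectral features can be extracted by framing the waveform and taking the FFT magnitude of each frame. The frame length, hop size, and log compression below are illustrative assumptions, not the article's exact parameters:

```python
import numpy as np

def extract_spectral_features(audio, frame_len=512, hop=256):
    """Frame the signal and take the log-magnitude FFT of each frame."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum of one frame
        frames.append(np.log(spectrum + 1e-8))  # log compression for dynamic range
    return np.stack(frames)                     # shape: [n_frames, frame_len // 2 + 1]
```

The resulting [frames, bins] matrix is the kind of variable-length sequence a recurrent model can consume directly.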
Technical advantages : (1) No reference audio is needed for evaluation. (2) The method handles audio of arbitrary length. (3) Using spectral features reduces computational load and improves prediction accuracy.
Core technology implementation :
Network model : The BiLSTM captures both past and future context, integrating global information to produce two outputs – per‑frame scores and a final quality rating.
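A minimal Keras sketch of such a two-output model follows; the layer sizes, head names, and the use of average pooling for the utterance-level head are illustrative assumptions, not the article's published architecture. A model of this shape could play the role of the `model` object referenced in the training code below:

```python
import tensorflow as tf

def build_bilstm_model(feature_dim=257, hidden_units=128):
    """BiLSTM over frames with two heads: an overall score and per-frame scores."""
    inputs = tf.keras.Input(shape=(None, feature_dim))  # variable-length sequences
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden_units, return_sequences=True))(inputs)
    # Head 1: pool over time for a single utterance-level quality score.
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
    overall_score = tf.keras.layers.Dense(1, name="overall_score")(pooled)
    # Head 2: one score per frame.
    frame_scores = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1), name="frame_scores")(x)
    return tf.keras.Model(inputs, [overall_score, frame_scores])
```

Because `return_sequences=True` keeps every timestep and the pooling head collapses them, the same trunk serves both outputs and handles audio of arbitrary length.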
Data preprocessing : The clean Chinese speech dataset ST‑CMDS is mixed with randomly selected noise from 100 types at various SNR levels to simulate real‑world conditions; PESQ scores are computed for reference.
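Mixing clean speech with noise at a target SNR can be sketched as follows; the scaling formula is the standard power-ratio derivation, while the dataset loading and noise selection are out of scope here:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean/noise power ratio equals snr_db, then add."""
    noise = noise[:len(clean)]  # assume the noise clip is at least as long as the speech
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Solve clean_power / (scale**2 * noise_power) = 10**(snr_db / 10) for scale.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Repeating this over the 100 noise types and a range of SNR values yields the synthetic noisy corpus whose PESQ scores serve as training targets.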
Loss calculation : The loss combines overall mean‑squared error (MSE) between predicted and true scores and a frame‑level MSE weighted to emphasize frames with higher impact.
import numpy as np
import tensorflow as tf

def frame_mse_tf(y_true, y_pred):
    # y_true: [batch, frames, 1]; every frame carries the clip's PESQ score,
    # so the first frame's value is the clip-level target.
    true_pesq = y_true[:, 0, :]
    loss = tf.constant(0, dtype=tf.float32)
    for i in range(y_true.shape[0]):
        # Weight each clip's frame-level MSE by 10**(PESQ - 4.5), so clips whose
        # PESQ is close to the 4.5 maximum contribute more to the loss.
        loss += (10 ** (true_pesq[i] - 4.5)) * tf.reduce_mean(tf.math.square(y_true[i] - y_pred[i]))
    return loss / tf.constant(y_true.shape[0], dtype=tf.float32)
def train_loop(features, labels1, labels2):
    loss_object = tf.keras.losses.MeanSquaredError()
    with tf.GradientTape() as tape:
        # The model's two heads: an utterance-level score and per-frame scores.
        predictions1, predictions2 = model(features)
        loss1 = loss_object(labels1, predictions1)   # overall MSE
        loss2 = frame_mse_tf(labels2, predictions2)  # weighted frame-level MSE
        loss = loss1 + loss2
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
def read_npy_file(filename):
    # tf.py_function passes the filename as a tensor; decode it to a Python string.
    data = np.load(filename.numpy().decode())
    return data.astype(np.float32)

def data_preprocessing(feature):
    # The right-hand side is evaluated before reassignment, so label1 is taken
    # from the original array before feature is reduced to its first channel.
    feature, label1 = feature[..., 0], feature[0][0]
    # Broadcast the clip-level score into one label per frame.
    label2 = label1[0] * np.ones([feature.shape[0], 1])
    return feature, label1, label2

def read_feature(filename):
    [feature, ] = tf.py_function(read_npy_file, [filename], [tf.float32, ])
    data, label1, label2 = tf.py_function(
        data_preprocessing, [feature], [tf.float32, tf.float32, tf.float32])
    return data, label1, label2
def generate_data(file_path):
    list_ds = tf.data.Dataset.list_files(file_path + '*.npy')
    feature_ds = list_ds.map(read_feature, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return feature_ds
Model training : The prepared dataset is shuffled, padded into batches of equal length, and fed to the BiLSTM model, which is optimized with RMSprop.

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
W = model.layers[0].get_weights()  # optional: inspect the first layer's initial weights
ds = generate_data(train_file_path)
ds = ds.shuffle(buffer_size=1000).padded_batch(
    BATCH_SIZE,
    padded_shapes=([None, None], [None], [None, None])).prefetch(
    tf.data.experimental.AUTOTUNE)
for step, (x, y, z) in enumerate(ds):
    loss = train_loop(x, y, z)
Conclusion : The proposed solution builds an automatic, reference‑free audio quality assessment system based on a BiLSTM network, trained on a synthetic noisy dataset derived from ST‑CMDS, achieving accurate quality predictions without needing a clean reference.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.