Zero‑Basis Food Sound Recognition with ASR: Theory, Workflow, and Complete Python Code
This article introduces the fundamentals of automatic speech recognition (ASR) for food‑sound classification, explains key audio representations and modeling approaches, and provides a fully runnable Python implementation using librosa, TensorFlow/Keras, and classic machine‑learning tools to train and predict on the Tianchi competition dataset.
Background
The tutorial is aimed at practitioners of intelligent voice interaction and uses the Tianchi competition "Zero‑Basis Introduction to Speech Recognition: Food Sound Recognition" as a concrete example to illustrate ASR concepts and provide end‑to‑end code.
Audio Representation Basics
Audio signals can be visualized through several representations: the waveform (e.g., 16,000 samples per second at a 16 kHz sampling rate), zoomed‑in sample points, the spectrogram (a time‑frequency energy map computed via the short‑time Fourier transform), and frame‑level feature vectors (the basic modeling unit, analogous to text tokens, extracted by acoustic front ends such as mel filter banks).
Overall Solution Strategies
Two main ASR solution families are described: (1) a hybrid acoustic‑model + language‑model pipeline, where audio is first transformed into acoustic features and then decoded with a language model; (2) end‑to‑end approaches (CTC, seq2seq, RNN‑Transducer, Transformer) that directly map audio frames to text sequences.
Key Terminology
Acoustic models (e.g., HMM, GMM, DNN‑HMM), language models (n‑gram, RNN‑based), decoders (search using weighted finite‑state transducers), and end‑to‑end methods (seq2seq + CTC, RNN‑Transducer, Transformer) are briefly defined.
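To make the end‑to‑end family concrete, here is a toy sketch of CTC greedy decoding, which turns frame‑level predictions into a token sequence by collapsing repeats and dropping blanks (the blank index and token IDs below are illustrative assumptions, not part of the baseline):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blanks: [0,1,1,0,2,2] -> [1,2]."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Per-frame argmax IDs over eight audio frames (0 = blank)
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0]))  # [1, 2]

# A blank between identical tokens keeps them distinct
print(ctc_greedy_decode([1, 0, 1]))                 # [1, 1]
```

This is why CTC can emit a short text sequence from a much longer sequence of audio frames.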
Dataset and Baseline Idea
The competition provides 20 categories of chewing sounds. The baseline converts each audio file to a mel spectrogram using librosa and trains a CNN classifier on these features.
Data Acquisition
The following shell commands download and unzip the training and test sets:
# Download the dataset
!wget http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531887/train_sample.zip
!wget http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531887/test_a.zip
!unzip -qq train_sample.zip
!rm train_sample.zip
!unzip -qq test_a.zip
!rm test_a.zip
Environment Requirements
TensorFlow ≥ 2.0
Keras
scikit‑learn
librosa
Basic Library Imports
# Core libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
# Deep-learning framework
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.utils import to_categorical
# Classical machine-learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Audio processing
import librosa
import librosa.display
import glob
import os
Feature Extraction and Dataset Construction
# Label mapping
label_dict = {'aloe':0,'burger':1,'cabbage':2,'candied_fruits':3,'carrots':4,'chips':5,'chocolate':6,'drinks':7,'fries':8,'grapes':9,'gummies':10,'ice-cream':11,'jelly':12,'noodles':13,'pickles':14,'pizza':15,'ribs':16,'salmon':17,'soup':18,'wings':19}
label_dict_inv = {v:k for k,v in label_dict.items()}
from tqdm import tqdm
def extract_features(parent_dir, sub_dirs, max_file=10, file_ext="*.wav"):
    label, feature = [], []
    for sub_dir in sub_dirs:
        for fn in tqdm(glob.glob(os.path.join(parent_dir, sub_dir, file_ext))[:max_file]):
            label_name = fn.split('/')[-2]  # category name = parent folder name
            label.append(label_dict[label_name])
            X, sr = librosa.load(fn, res_type='kaiser_fast')
            # Mean over time -> one 128-dim mel vector per clip
            mels = np.mean(librosa.feature.melspectrogram(y=X, sr=sr).T, axis=0)
            feature.append(mels)
    return [feature, label]
parent_dir = './train_sample/'
sub_dirs = np.array(['aloe','burger','cabbage','candied_fruits','carrots','chips','chocolate','drinks','fries','grapes','gummies','ice-cream','jelly','noodles','pickles','pizza','ribs','salmon','soup','wings'])
features, labels = extract_features(parent_dir, sub_dirs, max_file=100)
X = np.vstack(features)
X = X.reshape(-1, 16, 8, 1)  # 128 mel bins -> a 16x8 single-channel "image" per clip
Y = np.array(labels)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, stratify=Y)
# One-hot encode the integer labels for categorical_crossentropy
Y_train = to_categorical(Y_train, num_classes=20)
Y_test = to_categorical(Y_test, num_classes=20)
Model Construction
# Build the CNN
model = Sequential()
input_dim = (16, 8, 1)  # the 128 mel bins reshaped to 16x8 with a single channel
model.add(Conv2D(64, (3,3), padding='same', activation='tanh', input_shape=input_dim))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(128, (3,3), padding='same', activation='tanh'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(1024, activation='tanh'))
model.add(Dense(20, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
# Train the model
model.fit(X_train, Y_train, epochs=20, batch_size=15, validation_data=(X_test, Y_test))
Prediction on Test Set
# Redefines the training helper for the unlabeled, flat test directory
def extract_features(test_dir, file_ext="*.wav"):
    feature = []
    for fn in tqdm(glob.glob(os.path.join(test_dir, file_ext))):
        X, sr = librosa.load(fn, res_type='kaiser_fast')
        mels = np.mean(librosa.feature.melspectrogram(y=X, sr=sr).T, axis=0)
        feature.append(mels)
    return feature
X_test_feat = np.vstack(extract_features('./test_a/'))
X_test_feat = X_test_feat.reshape(-1, 16, 8, 1)  # match the CNN input shape
predictions = model.predict(X_test_feat)
preds = np.argmax(predictions, axis=1)
preds = [label_dict_inv[x] for x in preds]
# Same glob pattern as in extract_features, so file order matches the predictions
paths = glob.glob('./test_a/*.wav')
result = pd.DataFrame({'name': paths, 'label': preds})
result['name'] = result['name'].apply(lambda x: x.split('/')[-1])
result.to_csv('submit.csv', index=None)
The script finishes by counting the test files and the lines in the submission file with standard Unix utilities, confirming one row per clip.
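That final check might look like the following (demonstrated on a throwaway file, since the real `submit.csv` and `test_a/` contents depend on your run):

```shell
# Demonstrate the row-count check on a throwaway two-line submission
printf 'name,label\n0.wav,aloe\n' > /tmp/demo_submit.csv
wc -l < /tmp/demo_submit.csv    # 2: one header line plus one prediction row
# Against the real files, compare: wc -l submit.csv  vs  ls ./test_a/*.wav | wc -l
```

The line count should exceed the clip count by exactly one, accounting for the CSV header.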
This guide covers ASR theory, audio preprocessing, feature extraction, model design, training, and inference, providing a ready‑to‑run example for food‑sound classification.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.