Zero‑Basis Food Sound Recognition with ASR: Theory, Workflow, and Complete Python Code
This article introduces the fundamentals of automatic speech recognition (ASR) for food‑sound classification, explains key audio representations and modeling approaches, and provides a fully runnable Python implementation using librosa, TensorFlow/Keras, and classic machine‑learning tools to train and predict on the Tianchi competition dataset.
Background
The tutorial is aimed at practitioners of intelligent voice interaction and uses the Tianchi competition "Zero‑Basis Introduction to Speech Recognition: Food Sound Recognition" as a concrete example to illustrate ASR concepts and provide end‑to‑end code.
Audio Representation Basics
Audio signals can be visualized through several representations: the waveform (e.g., 16,000 samples per second at a 16 kHz sampling rate), zoomed‑in sample points, the spectrogram (a time‑frequency energy map computed via the short‑time Fourier transform), and frame‑level feature vectors (the basic modeling unit, analogous to text tokens, extracted by acoustic front ends such as mel filter banks).
Overall Solution Strategies
Two main ASR solution families are described: (1) a hybrid acoustic‑model + language‑model pipeline, where audio is first transformed into acoustic features and then decoded with a language model; (2) end‑to‑end approaches (CTC, seq2seq, RNN‑Transducer, Transformer) that directly map audio frames to text sequences.
Key Terminology
Acoustic models (e.g., HMM, GMM, DNN‑HMM), language models (n‑gram, RNN‑based), decoders (search using weighted finite‑state transducers), and end‑to‑end methods (seq2seq + CTC, RNN‑Transducer, Transformer) are briefly defined.
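To make the end‑to‑end family concrete, here is a toy sketch of CTC greedy decoding, which turns frame‑level predictions into a token sequence by collapsing repeats and dropping blanks (the blank index and token IDs below are illustrative assumptions, not part of the baseline):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blanks: [0,1,1,0,2,2] -> [1,2]."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Per-frame argmax IDs over eight audio frames (0 = blank)
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0]))  # [1, 2]

# A blank between identical tokens keeps them distinct
print(ctc_greedy_decode([1, 0, 1]))                 # [1, 1]
```

This is why CTC can emit a short text sequence from a much longer sequence of audio frames.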
Dataset and Baseline Idea
The competition provides 20 categories of chewing sounds. The baseline converts each audio file to a mel spectrogram using librosa and trains a CNN classifier on these features.
Data Acquisition
The following shell commands download and unzip the training and test sets:
# Download the dataset
!wget http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531887/train_sample.zip
!wget http://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531887/test_a.zip
!unzip -qq train_sample.zip
!rm train_sample.zip
!unzip -qq test_a.zip
!rm test_a.zip
Environment Requirements
TensorFlow ≥ 2.0
Keras
scikit‑learn
librosa
Basic Library Imports
# Core libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
# Deep-learning framework
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.utils import to_categorical
# Classical machine-learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Audio processing
import librosa
import librosa.display
import glob
import os
Feature Extraction and Dataset Construction
# Label mapping
label_dict = {'aloe':0,'burger':1,'cabbage':2,'candied_fruits':3,'carrots':4,'chips':5,'chocolate':6,'drinks':7,'fries':8,'grapes':9,'gummies':10,'ice-cream':11,'jelly':12,'noodles':13,'pickles':14,'pizza':15,'ribs':16,'salmon':17,'soup':18,'wings':19}
label_dict_inv = {v:k for k,v in label_dict.items()}
from tqdm import tqdm
def extract_features(parent_dir, sub_dirs, max_file=10, file_ext="*.wav"):
    label, feature = [], []
    for sub_dir in sub_dirs:
        for fn in tqdm(glob.glob(os.path.join(parent_dir, sub_dir, file_ext))[:max_file]):
            label_name = fn.split('/')[-2]  # category name = parent folder name
            label.append(label_dict[label_name])
            X, sr = librosa.load(fn, res_type='kaiser_fast')
            # Mean over time -> one 128-dim mel vector per clip
            mels = np.mean(librosa.feature.melspectrogram(y=X, sr=sr).T, axis=0)
            feature.append(mels)
    return [feature, label]
parent_dir = './train_sample/'
sub_dirs = np.array(['aloe','burger','cabbage','candied_fruits','carrots','chips','chocolate','drinks','fries','grapes','gummies','ice-cream','jelly','noodles','pickles','pizza','ribs','salmon','soup','wings'])
features, labels = extract_features(parent_dir, sub_dirs, max_file=100)
X = np.vstack(features)
X = X.reshape(-1, 16, 8, 1)  # 128 mel bins -> a 16x8 single-channel "image" per clip
Y = np.array(labels)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, stratify=Y)
# One-hot encode the integer labels for categorical_crossentropy
Y_train = to_categorical(Y_train, num_classes=20)
Y_test = to_categorical(Y_test, num_classes=20)
Model Construction
# Build the CNN
model = Sequential()
input_dim = (16, 8, 1)  # the 128 mel bins reshaped to 16x8 with a single channel
model.add(Conv2D(64, (3,3), padding='same', activation='tanh', input_shape=input_dim))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(128, (3,3), padding='same', activation='tanh'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(1024, activation='tanh'))
model.add(Dense(20, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
# Train the model
model.fit(X_train, Y_train, epochs=20, batch_size=15, validation_data=(X_test, Y_test))
Prediction on Test Set
# Redefines the training helper for the unlabeled, flat test directory
def extract_features(test_dir, file_ext="*.wav"):
    feature = []
    for fn in tqdm(glob.glob(os.path.join(test_dir, file_ext))):
        X, sr = librosa.load(fn, res_type='kaiser_fast')
        mels = np.mean(librosa.feature.melspectrogram(y=X, sr=sr).T, axis=0)
        feature.append(mels)
    return feature
X_test_feat = np.vstack(extract_features('./test_a/'))
X_test_feat = X_test_feat.reshape(-1, 16, 8, 1)  # match the CNN input shape
predictions = model.predict(X_test_feat)
preds = np.argmax(predictions, axis=1)
preds = [label_dict_inv[x] for x in preds]
# Same glob pattern as in extract_features, so file order matches the predictions
paths = glob.glob('./test_a/*.wav')
result = pd.DataFrame({'name': paths, 'label': preds})
result['name'] = result['name'].apply(lambda x: x.split('/')[-1])
result.to_csv('submit.csv', index=None)
The script finishes by counting the test files and the lines in the submission file with standard Unix utilities, confirming one row per clip.
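That final check might look like the following (demonstrated on a throwaway file, since the real `submit.csv` and `test_a/` contents depend on your run):

```shell
# Demonstrate the row-count check on a throwaway two-line submission
printf 'name,label\n0.wav,aloe\n' > /tmp/demo_submit.csv
wc -l < /tmp/demo_submit.csv    # 2: one header line plus one prediction row
# Against the real files, compare: wc -l submit.csv  vs  ls ./test_a/*.wav | wc -l
```

The line count should exceed the clip count by exactly one, accounting for the CSV header.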
This guide covers ASR theory, audio preprocessing, feature extraction, model design, training, and inference, providing a ready‑to‑run example for food‑sound classification.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.