Using SPTM in qa_match for the 58 City AI Competition: Data Preparation, Model Training, and Prediction
This article is a step‑by‑step guide to the 58 City AI Algorithm Contest: preparing the data, pre‑training the lightweight SPTM model, fine‑tuning a text‑classification model with qa_match, and generating a competition‑ready submission, with all required shell commands and parameter explanations.
Intelligent customer service built on AI is now widely deployed; it depends on accurate text matching and classification to retrieve the right knowledge‑base answers. This article shows how to use the open‑source tool qa_match and its Simple Pre‑trained Model (SPTM) to participate in the 58 City AI Algorithm Competition.
Background: The competition provides a dataset for matching user queries to standard questions. Participants are encouraged to use the SPTM model for pre‑training and fine‑tuning.
1. Download SPTM source code
Ensure Python 3 and a 1.x version of TensorFlow (1.8 or later, below 2.0) are installed in a Linux environment with a GPU.
git clone https://github.com/wuba/qa_match.git
cd qa_match

2. Download competition data
Download the competition data from the homepage and unzip it into the data_demo folder, resulting in a data directory with the expected file structure.
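Before launching any long-running job, it is worth confirming the inputs are where the later commands expect them. The file names below are taken from the commands used throughout this article (pre_train_data, vocab, train_data, test_data); adjust the list if your unpacked archive differs.

```shell
# Check that the files referenced later in this guide are in place
# (names taken from the training/prediction commands; adjust as needed)
for f in pre_train_data vocab train_data test_data; do
  test -f "../data_demo/data/$f" && echo "ok: $f" || echo "missing: $f"
done
```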
SPTM Model Training
1. Pre‑training the language model
Run the following command to pre‑train SPTM using the provided pre‑training data:
nohup python run_pretraining.py --train_file="../data_demo/data/pre_train_data" \
--vocab_file="../data_demo/data/vocab" \
--model_save_dir="./model/pretrain" \
--batch_size=512 \
--print_step=100 \
--weight_decay=0 \
--embedding_dim=1000 \
--lstm_dim=500 \
--layer_num=1 \
--train_step=100000 \
--warmup_step=10000 \
--learning_rate=5e-5 \
--dropout_rate=0.1 \
--max_predictions_per_seq=10 \
--clip_norm=1.0 \
--max_seq_len=100 \
--use_queue=0 > pretrain.log 2>&1 &

Key parameters are listed below:
vocab: Vocabulary file provided by the competition
train_file / valid_data: Training and validation datasets
lstm_dim: Number of LSTM hidden units
embedding_dim: Dimension of word embeddings
dropout_rate: Dropout probability
layer_num: Number of LSTM layers
weight_decay: Adam weight decay coefficient
max_predictions_per_seq: Maximum number of masked tokens per sentence
clip_norm: Gradient clipping threshold
use_queue: Whether to use a queue for generating pre‑training data
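The warmup_step and learning_rate flags together define the learning-rate schedule: the effective rate typically ramps up linearly over the first warmup_step updates before decaying, in the BERT style. The exact schedule SPTM implements lives in run_pretraining.py; the awk one-liner below is only a sketch of the linear warmup phase, using the values from the command above.

```shell
# Linear warmup sketch: lr ramps from 0 to --learning_rate (5e-5)
# over the first --warmup_step (10000) updates. Illustrative only;
# see run_pretraining.py for SPTM's actual schedule.
awk 'BEGIN { lr = 5e-5; warmup = 10000;
  for (s = 2000; s <= 10000; s += 4000)
    printf "step %5d -> lr %.2e\n", s, lr * s / warmup }'
```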
2. Train the classification model
First split train_data into training and validation sets:
shuf ../data_demo/data/train_data | tr -d "\r" > ../data_demo/data/train_data_shuf
head -n1000 ../data_demo/data/train_data_shuf > ../data_demo/data/valid_data_final
tail -n+1001 ../data_demo/data/train_data_shuf > ../data_demo/data/train_data_final

Then train the classifier:
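The same shuf/head/tail recipe can be rehearsed on a synthetic file to confirm what it does: 1000 shuffled lines go to validation and the remainder to training, with no overlap. Here seq stands in for train_data and all file names are throwaway.

```shell
# Rehearse the split on 5000 synthetic lines (all names are placeholders)
seq 1 5000 > train_data_demo
shuf train_data_demo | tr -d "\r" > train_data_shuf_demo
head -n1000  train_data_shuf_demo > valid_demo   # first 1000 shuffled lines
tail -n+1001 train_data_shuf_demo > train_demo   # remaining 4000 lines
wc -l valid_demo train_demo
```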
python run_classifier.py --output_id2label_file="model/id2label.has_init" \
--vocab_file="../data_demo/data/vocab" \
--train_file="../data_demo/data/train_data_final" \
--dev_file="../data_demo/data/valid_data_final" \
--model_save_dir="model/finetune" \
--lstm_dim=500 \
--embedding_dim=1000 \
--opt_type=adam \
--batch_size=256 \
--epoch=20 \
--learning_rate=1e-4 \
--seed=1 \
--max_len=100 \
--print_step=10 \
--dropout_rate=0.1 \
--layer_num=1 \
--init_checkpoint="model/pretrain/lm_pretrain.ckpt-500000"

Note that the checkpoint step suffix must match the final step of your own pre‑training run (with --train_step=100000 as above, it would end in -100000). If you prefer not to use the pre‑trained model, omit the --init_checkpoint argument.
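One common stumbling block: --init_checkpoint takes a checkpoint prefix, not a single file. A TensorFlow 1.x checkpoint is written as three files sharing one prefix (.index, .meta, and a .data shard). The listing below simulates such a checkpoint with empty files, purely to show how the shared prefix relates to the files on disk.

```shell
# Simulate a saved TF 1.x checkpoint: three files, one shared prefix.
# The value passed to --init_checkpoint is that prefix (demo names only).
mkdir -p model/pretrain
touch model/pretrain/lm_pretrain.ckpt-100000.index \
      model/pretrain/lm_pretrain.ckpt-100000.meta \
      model/pretrain/lm_pretrain.ckpt-100000.data-00000-of-00001
# Recover the shared prefix by stripping the last extension:
ls model/pretrain | sed 's/\.[a-z0-9-]*$//' | sort -u
```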
SPTM Model Prediction
1. Score the test set
Use the trained classifier to score test_data:
python run_prediction.py --input_file="../data_demo/data/test_data" \
--vocab_file="../data_demo/data/vocab" \
--id2label_file="model/id2label.has_init" \
--model_dir="model/finetune" > "../data_demo/data/result_test_raw"

2. Generate the final submission file
Extract the extended question IDs and predicted standard question IDs, then combine them:
awk '{print $2}' test_data > ext_id
awk -F',' '{print $1}' result_test_raw | awk -F'|' '{print $1}' | awk -F'__' '{print $3}' > std_id
echo ext_id,std_id > 58cop.csv
paste -d"," ext_id std_id >> 58cop.csv

After uploading the submission file, the competition score achieved was 0.6424.
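The std_id pipeline assumes each line of result_test_raw begins with a fastText-style label of the form __label__<std_id>|<score>, with lower-ranked candidates after commas; inspect your own output file to confirm. On one fabricated line (values invented for illustration), the three awk stages peel the fields apart like this:

```shell
# Stage-by-stage extraction on a fabricated prediction line
line='__label__1001|0.92,__label__2002|0.05'
echo "$line" | awk -F',' '{print $1}'              # top candidate with score
echo "$line" | awk -F',' '{print $1}' \
  | awk -F'|' '{print $1}'                         # label without score
echo "$line" | awk -F',' '{print $1}' \
  | awk -F'|' '{print $1}' \
  | awk -F'__' '{print $3}'                        # bare standard-question id
```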
Author: Wang Yong, AI Lab Algorithm Architect at 58.com, MSc from Beijing Institute of Technology, former video recommendation researcher at Youku, currently focusing on NLP algorithms.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.