Using SPTM in qa_match for the 58 City AI Competition: Data Preparation, Model Training, and Prediction
This article is a step‑by‑step guide to the 58 City AI Algorithm Contest: preparing the data, pre‑training the lightweight SPTM model, fine‑tuning a text‑classification model with qa_match, and generating a competition‑ready submission, with all required shell commands and parameter explanations.
Intelligent customer service built on AI is now widely deployed; it depends on accurate text matching and classification to retrieve the right knowledge‑base answers. This article shows how to use the open‑source tool qa_match and its Simple Pre‑trained Model (SPTM) to participate in the 58 City AI Algorithm Competition.
Background: The competition provides a dataset for matching user queries to standard questions. Participants are encouraged to use the SPTM model for pre‑training and fine‑tuning.
1. Download SPTM source code
Ensure Python 3 and a 1.x version of TensorFlow (1.8 or later, below 2.0) are installed in a Linux environment with a GPU.
git clone https://github.com/wuba/qa_match.git
cd qa_match

2. Download competition data
Download the competition data from the homepage and unzip it into the data_demo folder, resulting in a data directory with the expected file structure.
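Before launching any long-running job, it is worth confirming the inputs are where the later commands expect them. The file names below are taken from the commands used throughout this article (pre_train_data, vocab, train_data, test_data); adjust the list if your unpacked archive differs.

```shell
# Check that the files referenced later in this guide are in place
# (names taken from the training/prediction commands; adjust as needed)
for f in pre_train_data vocab train_data test_data; do
  test -f "../data_demo/data/$f" && echo "ok: $f" || echo "missing: $f"
done
```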
SPTM Model Training
1. Pre‑training the language model
Run the following command to pre‑train SPTM using the provided pre‑training data:
nohup python run_pretraining.py --train_file="../data_demo/data/pre_train_data" \
--vocab_file="../data_demo/data/vocab" \
--model_save_dir="./model/pretrain" \
--batch_size=512 \
--print_step=100 \
--weight_decay=0 \
--embedding_dim=1000 \
--lstm_dim=500 \
--layer_num=1 \
--train_step=100000 \
--warmup_step=10000 \
--learning_rate=5e-5 \
--dropout_rate=0.1 \
--max_predictions_per_seq=10 \
--clip_norm=1.0 \
--max_seq_len=100 \
--use_queue=0 > pretrain.log 2>&1 &

Key parameters are listed below:
vocab: Vocabulary file provided by the competition
train_file / valid_data: Training and validation datasets
lstm_dim: Number of LSTM hidden units
embedding_dim: Dimension of word embeddings
dropout_rate: Dropout probability
layer_num: Number of LSTM layers
weight_decay: Adam weight decay coefficient
max_predictions_per_seq: Maximum number of masked tokens per sentence
clip_norm: Gradient clipping threshold
use_queue: Whether to use a queue for generating pre‑training data
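The warmup_step and learning_rate flags together define the learning-rate schedule: the effective rate typically ramps up linearly over the first warmup_step updates before decaying, in the BERT style. The exact schedule SPTM implements lives in run_pretraining.py; the awk one-liner below is only a sketch of the linear warmup phase, using the values from the command above.

```shell
# Linear warmup sketch: lr ramps from 0 to --learning_rate (5e-5)
# over the first --warmup_step (10000) updates. Illustrative only;
# see run_pretraining.py for SPTM's actual schedule.
awk 'BEGIN { lr = 5e-5; warmup = 10000;
  for (s = 2000; s <= 10000; s += 4000)
    printf "step %5d -> lr %.2e\n", s, lr * s / warmup }'
```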
2. Train the classification model
First split train_data into training and validation sets:
shuf ../data_demo/data/train_data | tr -d "\r" > ../data_demo/data/train_data_shuf
head -n1000 ../data_demo/data/train_data_shuf > ../data_demo/data/valid_data_final
tail -n+1001 ../data_demo/data/train_data_shuf > ../data_demo/data/train_data_final

Then train the classifier:
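The same shuf/head/tail recipe can be rehearsed on a synthetic file to confirm what it does: 1000 shuffled lines go to validation and the remainder to training, with no overlap. Here seq stands in for train_data and all file names are throwaway.

```shell
# Rehearse the split on 5000 synthetic lines (all names are placeholders)
seq 1 5000 > train_data_demo
shuf train_data_demo | tr -d "\r" > train_data_shuf_demo
head -n1000  train_data_shuf_demo > valid_demo   # first 1000 shuffled lines
tail -n+1001 train_data_shuf_demo > train_demo   # remaining 4000 lines
wc -l valid_demo train_demo
```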
python run_classifier.py --output_id2label_file="model/id2label.has_init" \
--vocab_file="../data_demo/data/vocab" \
--train_file="../data_demo/data/train_data_final" \
--dev_file="../data_demo/data/valid_data_final" \
--model_save_dir="model/finetune" \
--lstm_dim=500 \
--embedding_dim=1000 \
--opt_type=adam \
--batch_size=256 \
--epoch=20 \
--learning_rate=1e-4 \
--seed=1 \
--max_len=100 \
--print_step=10 \
--dropout_rate=0.1 \
--layer_num=1 \
--init_checkpoint="model/pretrain/lm_pretrain.ckpt-500000"

Note that the checkpoint step suffix must match the final step of your own pre‑training run (with --train_step=100000 as above, it would end in -100000). If you prefer not to use the pre‑trained model, omit the --init_checkpoint argument.
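One common stumbling block: --init_checkpoint takes a checkpoint prefix, not a single file. A TensorFlow 1.x checkpoint is written as three files sharing one prefix (.index, .meta, and a .data shard). The listing below simulates such a checkpoint with empty files, purely to show how the shared prefix relates to the files on disk.

```shell
# Simulate a saved TF 1.x checkpoint: three files, one shared prefix.
# The value passed to --init_checkpoint is that prefix (demo names only).
mkdir -p model/pretrain
touch model/pretrain/lm_pretrain.ckpt-100000.index \
      model/pretrain/lm_pretrain.ckpt-100000.meta \
      model/pretrain/lm_pretrain.ckpt-100000.data-00000-of-00001
# Recover the shared prefix by stripping the last extension:
ls model/pretrain | sed 's/\.[a-z0-9-]*$//' | sort -u
```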
SPTM Model Prediction
1. Score the test set
Use the trained classifier to score test_data:
python run_prediction.py --input_file="../data_demo/data/test_data" \
--vocab_file="../data_demo/data/vocab" \
--id2label_file="model/id2label.has_init" \
--model_dir="model/finetune" > "../data_demo/data/result_test_raw"

2. Generate the final submission file
Extract the extended question IDs and predicted standard question IDs, then combine them:
awk '{print $2}' test_data > ext_id
awk -F',' '{print $1}' result_test_raw | awk -F'|' '{print $1}' | awk -F'__' '{print $3}' > std_id
echo ext_id,std_id > 58cop.csv
paste -d"," ext_id std_id >> 58cop.csv

After uploading the submission file, the competition score achieved was 0.6424.
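The std_id pipeline assumes each line of result_test_raw begins with a fastText-style label of the form __label__<std_id>|<score>, with lower-ranked candidates after commas; inspect your own output file to confirm. On one fabricated line (values invented for illustration), the three awk stages peel the fields apart like this:

```shell
# Stage-by-stage extraction on a fabricated prediction line
line='__label__1001|0.92,__label__2002|0.05'
echo "$line" | awk -F',' '{print $1}'              # top candidate with score
echo "$line" | awk -F',' '{print $1}' \
  | awk -F'|' '{print $1}'                         # label without score
echo "$line" | awk -F',' '{print $1}' \
  | awk -F'|' '{print $1}' \
  | awk -F'__' '{print $3}'                        # bare standard-question id
```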
Author: Wang Yong, AI Lab Algorithm Architect at 58.com, MSc from Beijing Institute of Technology, former video recommendation researcher at Youku, currently focusing on NLP algorithms.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.