Guide to Using SPTM (Simple Pre-trained Model) with qa_match for an AI Competition
This article is a step‑by‑step tutorial on using the open‑source qa_match toolkit (which supports QA over one‑ and two‑layer knowledge bases) in the 58.com AI algorithm competition: preparing the data, pre‑training the SPTM language model, fine‑tuning a text‑classification model, generating predictions, and producing a submission file. It opens with a brief look at AI‑based intelligent customer service, where text matching and text classification are core NLP techniques.
Background
The first 58.com AI algorithm competition has been announced, with 159 teams registered. This guide explains how to use the SPTM (Simple Pre‑trained Model) component of qa_match in the contest.
Model and Data Preparation
1. Download SPTM source code
git clone https://github.com/wuba/qa_match.git
2. Enter the qa_match directory
cd qa_match
3. Download the competition data and unzip it into qa_match/data_demo, creating a data folder.
SPTM Model Pre‑training
Navigate to the sptm folder:
cd sptm
Create a directory for the pre‑trained model:
mkdir -p model/pretrain
Run the pre‑training script (TensorFlow 1.8–2.0, Python 3, GPU environment):
nohup python run_pretraining.py \
--train_file="../data_demo/data/pre_train_data" \
--vocab_file="../data_demo/data/vocab" \
--model_save_dir="./model/pretrain" \
--batch_size=512 \
--print_step=100 \
--weight_decay=0 \
--embedding_dim=1000 \
--lstm_dim=500 \
--layer_num=1 \
--train_step=100000 \
--warmup_step=10000 \
--learning_rate=5e-5 \
--dropout_rate=0.1 \
--max_predictions_per_seq=10 \
--clip_norm=1.0 \
--max_seq_len=100 \
--use_queue=0 > pretrain.log 2>&1 &
Key parameters are explained in a table (vocab file, train/valid data, LSTM dimension, embedding size, dropout rate, layer number, weight decay, max predictions per sequence, gradient clipping, queue usage, etc.). After training completes successfully, the model checkpoint is saved.
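The interaction between --warmup_step and --learning_rate follows the linear-warmup pattern common in BERT-style pre-training. The exact schedule depends on SPTM's implementation, but a minimal sketch of linear warmup followed by linear decay (an assumption for illustration, not read from the qa_match source) looks like:

```python
def warmup_lr(step, base_lr=5e-5, warmup_step=10000, train_step=100000):
    """Linear warmup to base_lr, then linear decay to 0 (assumed schedule).

    Defaults mirror the flags passed to run_pretraining.py above.
    """
    if step < warmup_step:
        # Ramp up proportionally during the warmup phase.
        return base_lr * step / warmup_step
    # Decay linearly over the remaining training steps.
    return base_lr * (train_step - step) / (train_step - warmup_step)
```

With these defaults the learning rate peaks at 5e-5 at step 10,000 and falls back to 0 by step 100,000, which is why --warmup_step is typically a small fraction of --train_step.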
Training the Classification Model
Split the provided train_data into training and validation sets:
shuf ../data_demo/data/train_data | tr -d "\r" > ../data_demo/data/train_data_shuf
head -n1000 ../data_demo/data/train_data_shuf > ../data_demo/data/valid_data_final
tail -n+1001 ../data_demo/data/train_data_shuf > ../data_demo/data/train_data_final
Train the classifier using the pre‑trained checkpoint:
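The shuffle-and-split pipeline above can also be reproduced in Python if shuf is unavailable (for example on macOS). This sketch assumes train_data is a plain text file with one sample per line:

```python
import random

def split_train_valid(path, valid_size=1000, seed=None):
    """Shuffle lines, strip carriage returns, and hold out the first
    valid_size lines for validation (mirrors shuf | tr -d, head, tail)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\r\n") for line in f]
    random.Random(seed).shuffle(lines)
    return lines[:valid_size], lines[valid_size:]
```

Passing a fixed seed makes the split reproducible across runs, which the shuf-based pipeline is not.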
python run_classifier.py \
--output_id2label_file="model/id2label.has_init" \
--vocab_file="../data_demo/data/vocab" \
--train_file="../data_demo/data/train_data_final" \
--dev_file="../data_demo/data/valid_data_final" \
--model_save_dir="model/finetune" \
--lstm_dim=500 \
--embedding_dim=1000 \
--opt_type=adam \
--batch_size=256 \
--epoch=20 \
--learning_rate=1e-4 \
--seed=1 \
--max_len=100 \
--print_step=10 \
--dropout_rate=0.1 \
--layer_num=1 \
--init_checkpoint="model/pretrain/lm_pretrain.ckpt-500000"
If you prefer not to use the pre‑trained model, omit the --init_checkpoint argument.
SPTM Model Prediction
Score the competition test set with the fine‑tuned classifier:
python run_prediction.py \
--input_file="../data_demo/data/test_data" \
--vocab_file="../data_demo/data/vocab" \
--id2label_file="model/id2label.has_init" \
--model_dir="model/finetune" > ../data_demo/data/result_test_raw
The output contains lines like __label__xx, where xx is the predicted standard question ID.
Generating the Competition Submission File
Extract the extended question IDs and predicted standard IDs, then combine them:
awk '{print $2}' test_data > ext_id
awk -F',' '{print $1}' result_test_raw | \
awk -F'|' '{print $1}' | \
awk -F'__' '{print $3}' > std_id
echo ext_id,std_id > 58cop.csv
paste -d"," ext_id std_id >> 58cop.csv
Upload the 58cop.csv file; the competition score achieved was 0.6424.
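The awk/paste pipeline can equally be written as a short Python script. The raw-prediction format assumed here (a first comma-separated field like __label__12|0.98) is inferred from the field separators in the awk commands, not from documented qa_match output:

```python
import csv

def extract_std_id(raw_line):
    """Pull the standard question ID out of a raw prediction line,
    mirroring the awk -F',' / -F'|' / -F'__' field extraction."""
    first_field = raw_line.split(",")[0].split("|")[0]  # e.g. "__label__12"
    return first_field.split("__")[2]                   # e.g. "12"

def build_submission(ext_ids, raw_lines, out_path):
    """Write the ext_id,std_id CSV expected by the competition."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ext_id", "std_id"])
        for ext_id, raw in zip(ext_ids, raw_lines):
            writer.writerow([ext_id, extract_std_id(raw)])
```

A Python version makes it easier to add sanity checks, such as asserting that ext_id and the prediction file have the same number of lines before pairing them.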
Author
Wang Yong is an algorithm architect in 58.com's AI Lab. He holds a master's degree from Beijing Institute of Technology, previously worked on video recommendation at Youku, and now focuses on NLP algorithm research.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.