Comprehensive Guide to Fine‑Tuning BERT on Chinese Datasets
This article provides a step‑by‑step guide for fine‑tuning Google’s open‑source BERT on Chinese datasets, covering model download, processor customization, code examples, training commands, and insights into the underlying TensorFlow estimator architecture and deployment considerations.
Since early November 2018, Google Research has been open‑sourcing successive versions of BERT. The released code is wrapped with TensorFlow's high‑level tf.estimator API, so adapting it to a new dataset only requires modifying the processor part of the code.
The code follows the paper: a pre‑training entry point run_pretraining.py and fine‑tuning entry points for different tasks, such as run_classifier.py for classification (e.g., CoLA, MRPC, MultiNLI) and run_squad.py for machine‑reading‑comprehension tasks like SQuAD.
Pre‑training demands substantial compute resources, but Google provides pretrained Chinese BERT checkpoints (12‑layer, 768‑hidden, 12‑head, 110 M parameters) that can be directly fine‑tuned on custom data.
The checkpoint archive contains the model variables (bert_model.ckpt*), the vocabulary file (vocab.txt), and the configuration file (bert_config.json).
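For orientation, bert_config.json holds the architecture hyper‑parameters. A minimal sketch of inspecting it with Python's json module; the field names mirror the released config files, but the literal values below (in particular vocab_size) are illustrative assumptions, so check your own file:

```python
import json

# Sample contents resembling bert_config.json for the 12-layer Chinese
# checkpoint described above; vocab_size here is an assumed value.
config_text = """
{
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "max_position_embeddings": 512,
  "vocab_size": 21128
}
"""

# Parse the config the same way you would parse the real file on disk.
config = json.loads(config_text)
```

The three architecture fields match the "12‑layer, 768‑hidden, 12‑head" description of the released Chinese checkpoint.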
To run fine‑tuning on your own data you must customize the processor. Create a class that inherits from DataProcessor and override get_labels, get_train_examples, get_dev_examples, and get_test_examples. These methods are called by the main function according to the flags FLAGS.do_train, FLAGS.do_eval, and FLAGS.do_predict.
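The overall shape of such a processor can be sketched as follows; the DataProcessor stand‑in is included only to make the sketch self‑contained, and the method bodies are the ones the article fills in with concrete examples:

```python
class DataProcessor:
    """Stand-in for the base class defined in run_classifier.py."""

class SelfProcessor(DataProcessor):
    """Skeleton of a custom processor. run_classifier.py's main() calls
    get_train_examples when FLAGS.do_train is set, get_dev_examples for
    FLAGS.do_eval, and get_test_examples for FLAGS.do_predict."""

    def get_labels(self):
        # One string per class; the order fixes the label -> id mapping.
        return ['0', '1']

    def get_train_examples(self, data_dir):
        raise NotImplementedError  # see the concrete example below

    def get_dev_examples(self, data_dir):
        raise NotImplementedError

    def get_test_examples(self, data_dir):
        raise NotImplementedError
```

Each of the three example methods should return a list of InputExample objects built from the corresponding split of your dataset.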
Example of get_train_examples for a CSV file containing sentence pairs and a binary label:
```python
def get_train_examples(self, data_dir):
    file_path = os.path.join(data_dir, 'train.csv')
    with open(file_path, 'r', encoding='utf-8') as f:
        reader = f.readlines()
    examples = []
    for index, line in enumerate(reader):
        guid = 'train-%d' % index
        # Assumes 'label,sentence_a,sentence_b' rows whose fields
        # contain no commas; use the csv module for quoted fields.
        split_line = line.strip().split(',')
        text_a = tokenization.convert_to_unicode(split_line[1])
        text_b = tokenization.convert_to_unicode(split_line[2])
        label = split_line[0]
        examples.append(InputExample(guid=guid, text_a=text_a,
                                     text_b=text_b, label=label))
    return examples
```

The corresponding get_labels for a binary similarity task:
```python
def get_labels(self):
    return ['0', '1']
```

After implementing the four example methods, add your processor class to the processors dictionary in the main script:
```python
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "selfsim": SelfProcessor,  # custom processor
}
```

Run fine‑tuning with a command similar to the following (adjust paths and hyper‑parameters as needed):
```shell
export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12
export MY_DATASET=/path/to/xnli

python run_classifier.py \
  --task_name=selfsim \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=$MY_DATASET \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=/tmp/selfsim_output/
```

Beyond the processor, the source code includes file_based_convert_examples_to_features for converting examples to TFRecord format and create_model (defined in run_classifier.py, building on modeling.BertModel) for constructing the BERT backbone and computing the task‑specific loss. For TPU‑optimized runs the code uses tf.contrib.tpu.TPUEstimator, which can be swapped for tf.estimator.Estimator for GPU/CPU execution.
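The core of the feature conversion — truncating the pair, adding the [CLS] and [SEP] markers, and padding to max_seq_length — can be sketched in plain Python. This is a simplified illustration, not the repository's implementation: tokens_a/tokens_b stand in for WordPiece output and vocab for the mapping loaded from vocab.txt.

```python
def convert_pair_to_feature(tokens_a, tokens_b, vocab, max_seq_length):
    """Sketch of BERT's pair encoding: [CLS] a [SEP] b [SEP], then pad.

    tokens_a/tokens_b are lists of tokens; vocab maps token -> id.
    """
    # Trim the longer sequence until the pair fits, leaving room for
    # [CLS] and two [SEP] markers.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS] + sentence A + [SEP]; segment 1 covers
    # sentence B + its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)

    # Zero-pad all three sequences out to max_seq_length.
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding
```

The real code additionally serializes these three sequences (plus the label id) into TFRecord Examples for the estimator's input pipeline.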
Discussions in the repository's GitHub issues point to practical applications: BERT reached roughly 79% accuracy on the AI‑Challenger machine‑reading‑comprehension track, and community members have built ZeroMQ‑based serving services and explored multi‑GPU performance.
In summary, Google’s open‑source BERT and its pretrained Chinese checkpoint provide a powerful foundation for a wide range of NLP tasks; by customizing processors and fine‑tuning with the provided scripts, practitioners can quickly adapt BERT to their own datasets and research objectives.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.