Comprehensive Guide to Fine‑Tuning BERT on Chinese Datasets
This article provides a step‑by‑step guide for fine‑tuning Google’s open‑source BERT on Chinese datasets, covering model download, processor customization, code examples, training commands, and insights into the underlying TensorFlow estimator architecture and deployment considerations.
Since early November 2018, Google Research has been open‑sourcing successive versions of BERT. The released code is wrapped with TensorFlow's high‑level tf.estimator API, so adapting it to a new dataset only requires modifying the processor part of the code.
The code follows the paper: a pre‑training entry point run_pretraining.py and fine‑tuning entry points for different tasks, such as run_classifier.py for classification (e.g., CoLA, MRPC, MultiNLI) and run_squad.py for machine‑reading‑comprehension tasks like SQuAD.
Pre‑training demands substantial compute resources, but Google provides pretrained Chinese BERT checkpoints (12‑layer, 768‑hidden, 12‑head, 110 M parameters) that can be directly fine‑tuned on custom data.
The checkpoint archive contains the model variables (bert_model.ckpt*), the vocabulary file (vocab.txt), and the configuration file (bert_config.json).
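For orientation, bert_config.json holds the architecture hyper‑parameters. A minimal sketch of inspecting it with Python's json module; the field names mirror the released config files, but the literal values below (in particular vocab_size) are illustrative assumptions, so check your own file:

```python
import json

# Sample contents resembling bert_config.json for the 12-layer Chinese
# checkpoint described above; vocab_size here is an assumed value.
config_text = """
{
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "max_position_embeddings": 512,
  "vocab_size": 21128
}
"""

# Parse the config the same way you would parse the real file on disk.
config = json.loads(config_text)
```

The three architecture fields match the "12‑layer, 768‑hidden, 12‑head" description of the released Chinese checkpoint.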
To run fine‑tuning on your own data you must customize the processor. Create a class that inherits from DataProcessor and override get_labels, get_train_examples, get_dev_examples, and get_test_examples. These methods are called by the main function according to the flags FLAGS.do_train, FLAGS.do_eval, and FLAGS.do_predict.
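The overall shape of such a processor can be sketched as follows; the DataProcessor stand‑in is included only to make the sketch self‑contained, and the method bodies are the ones the article fills in with concrete examples:

```python
class DataProcessor:
    """Stand-in for the base class defined in run_classifier.py."""

class SelfProcessor(DataProcessor):
    """Skeleton of a custom processor. run_classifier.py's main() calls
    get_train_examples when FLAGS.do_train is set, get_dev_examples for
    FLAGS.do_eval, and get_test_examples for FLAGS.do_predict."""

    def get_labels(self):
        # One string per class; the order fixes the label -> id mapping.
        return ['0', '1']

    def get_train_examples(self, data_dir):
        raise NotImplementedError  # see the concrete example below

    def get_dev_examples(self, data_dir):
        raise NotImplementedError

    def get_test_examples(self, data_dir):
        raise NotImplementedError
```

Each of the three example methods should return a list of InputExample objects built from the corresponding split of your dataset.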
Example of get_train_examples for a CSV file containing sentence pairs and a binary label:
```python
def get_train_examples(self, data_dir):
    file_path = os.path.join(data_dir, 'train.csv')
    with open(file_path, 'r', encoding='utf-8') as f:
        reader = f.readlines()
    examples = []
    for index, line in enumerate(reader):
        guid = 'train-%d' % index
        # Assumes 'label,sentence_a,sentence_b' rows whose fields
        # contain no commas; use the csv module for quoted fields.
        split_line = line.strip().split(',')
        text_a = tokenization.convert_to_unicode(split_line[1])
        text_b = tokenization.convert_to_unicode(split_line[2])
        label = split_line[0]
        examples.append(InputExample(guid=guid, text_a=text_a,
                                     text_b=text_b, label=label))
    return examples
```

The corresponding get_labels for a binary similarity task:
```python
def get_labels(self):
    return ['0', '1']
```

After implementing the four example methods, add your processor class to the processors dictionary in the main script:
```python
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "selfsim": SelfProcessor,  # custom processor
}
```

Run fine‑tuning with a command similar to the following (adjust paths and hyper‑parameters as needed):
```shell
export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12
export MY_DATASET=/path/to/xnli

python run_classifier.py \
  --task_name=selfsim \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=$MY_DATASET \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=/tmp/selfsim_output/
```

Beyond the processor, the source code includes file_based_convert_examples_to_features for converting examples to TFRecord format and create_model (defined in run_classifier.py, building on modeling.BertModel) for constructing the BERT backbone and computing the task‑specific loss. For TPU‑optimized runs the code uses tf.contrib.tpu.TPUEstimator, which can be swapped for tf.estimator.Estimator for GPU/CPU execution.
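The core of the feature conversion — truncating the pair, adding the [CLS] and [SEP] markers, and padding to max_seq_length — can be sketched in plain Python. This is a simplified illustration, not the repository's implementation: tokens_a/tokens_b stand in for WordPiece output and vocab for the mapping loaded from vocab.txt.

```python
def convert_pair_to_feature(tokens_a, tokens_b, vocab, max_seq_length):
    """Sketch of BERT's pair encoding: [CLS] a [SEP] b [SEP], then pad.

    tokens_a/tokens_b are lists of tokens; vocab maps token -> id.
    """
    # Trim the longer sequence until the pair fits, leaving room for
    # [CLS] and two [SEP] markers.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS] + sentence A + [SEP]; segment 1 covers
    # sentence B + its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)

    # Zero-pad all three sequences out to max_seq_length.
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding
```

The real code additionally serializes these three sequences (plus the label id) into TFRecord Examples for the estimator's input pipeline.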
Discussions in the repository's GitHub issues point to practical applications: BERT reached roughly 79% accuracy on the AI‑Challenger machine‑reading‑comprehension track, and community members have built ZeroMQ‑based serving services and explored multi‑GPU performance.
In summary, Google’s open‑source BERT and its pretrained Chinese checkpoint provide a powerful foundation for a wide range of NLP tasks; by customizing processors and fine‑tuning with the provided scripts, practitioners can quickly adapt BERT to their own datasets and research objectives.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.