How CKBERT Boosts Chinese NLP with Knowledge‑Enhanced Pretraining
CKBERT is a Chinese knowledge-enhanced BERT developed by Alibaba's EasyNLP team. It integrates external knowledge graphs and internal linguistic cues through novel pre-training tasks, ships in three model sizes compatible with HuggingFace and PAI, outperforms comparable models on CLUE and NER benchmarks, and deploys easily on cloud platforms.
1. CKBERT Model Overview
CKBERT (Chinese Knowledge-enhanced BERT) is a Chinese pre-training model developed in-house by Alibaba's EasyNLP team. It injects two types of knowledge, external knowledge graphs and internal linguistic information, into the BERT architecture without changing the model structure, which keeps knowledge integration scalable.
Model Configurations
Base model: 151M parameters, 12 layers, 12 attention heads, hidden size 768, sequence length 128.
Large model: 428M parameters, 24 layers, 16 attention heads, hidden size 1024, sequence length 128.
Huge model: 1.3B parameters, 24 layers, 8 attention heads, hidden size 2048, sequence length 128.
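Since CKBERT keeps the stock BERT layout, these sizes map onto ordinary BERT hyperparameters. A rough sketch for the base variant follows; the intermediate size of 4x the hidden size is an assumption, and 128 is the pre-training sequence length rather than a config field.

from transformers import BertConfig

# Rough sketch of the base configuration listed above, not an official config file.
base_config = BertConfig(
    num_hidden_layers=12,    # 12 transformer layers
    num_attention_heads=12,  # 12 attention heads
    hidden_size=768,         # hidden size 768
    intermediate_size=3072,  # assumption: the conventional 4 x hidden_size
)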
2. Implementation Details
Data preprocessing uses the Harbin Institute of Technology LTP platform for tokenization, NER, dependency parsing, and semantic role labeling. Knowledge triples are extracted from sentence entities and sampled as positive and negative examples. Core Python functions include:
from typing import Dict, List, Union

import numpy as np
import torch
from ltp import LTP
from torch import nn

def ltp_process(ltp: LTP, data: List[Dict[str, Union[str, List[Union[int, str]]]]]):
    """Use LTP to segment, tag and parse every sentence in the batch."""
    # Strip spaces so LTP sees contiguous Chinese text.
    new_data = list(map(lambda x: x['text'][0].replace(" ", ""), data))
    seg, hiddens = ltp.seg(new_data)
    result = {}
    result['seg'] = seg
    result['ner'] = ltp.ner(hiddens)  # named entities
    result['dep'] = ltp.dep(hiddens)  # dependency parses
    result['sdp'] = ltp.sdp(hiddens)  # semantic dependency parses
    # Write the per-sentence analyses back onto the input records.
    for index in range(len(data)):
        data[index]['text'][0] = data[index]['text'][0].replace(" ", "")
        data[index]['seg'] = result['seg'][index]
        data[index]['ner'] = result['ner'][index]
        data[index]['dep'] = result['dep'][index]
        data[index]['sdp'] = result['sdp'][index]
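A minimal invocation, assuming the LTP 4 interface used above and an illustrative input record, might look like this:

ltp = LTP()  # loads the default pretrained LTP model
samples = [{'text': ['阿里巴巴 总部 位于 杭州']}]  # illustrative record in the expected shape
ltp_process(ltp, samples)
print(samples[0]['seg'], samples[0]['ner'])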
"""mask the `mask_labels` for sdp and dep and record the maskers for each mask item"""
np.random.shuffle(left_numbers)
for item_ in left_numbers:
target_item = data_new[item_]
seg_ids = np.array(target_item[:2]) - 1
delete_ids = np.where(seg_ids < 1)[0]
seg_ids = np.delete(seg_ids, delete_ids)
temp_ids = seg2id(seg_ids)
ids = []
for item in temp_ids:
ids += item.copy()
if check_ids(ids):
length_ = len(ids)
if number_ > length_:
for id_ in ids:
mask_labels[id_] = 1
# additional processing omitted for brevity
selected_numbers_.add(item_)
return length_
else:
return 0
return 0 def get_positive_and_negative_examples(self, ner_data: str, negative_level: int = 3) -> Union[bool, Dict[str, List[str]]]:
"""get the positive examples and negative examples for the ner data"""
knowledge = self.Knowledge_G
common_used = set()
def get_data(key: str, data: Dict[str, str], results: List[str], deep: int, insert_flag: bool = False):
common_used.add(key)
if deep == 0:
return
for key_item in data:
if data[key_item] not in common_used and insert_flag:
results.append(data[key_item])
if data[key_item] in knowledge and data[key_item] not in common_used:
get_data(data[key_item], knowledge[data[key_item]], results, deep - 1, True)
all_examples = {'ner': ner_data, 'positive_examples': [], 'negative_examples': []}
if ner_data in knowledge:
tp_data = knowledge[ner_data]
if '描述' in tp_data:
positive_example = tp_data['描述']
else:
keys = list(tp_data.keys())
choice = np.random.choice(range(len(keys)), 1)[0]
positive_example = tp_data[keys[choice]]
if ner_data in positive_example:
all_examples['positive_examples'].append(positive_example)
else:
all_examples['positive_examples'].append(ner_data + positive_example)
negative_examples = []
get_data(ner_data, tp_data, negative_examples, negative_level)
negative_examples = [ner_data + x if ner_data not in x else x for x in negative_examples]
all_examples['negative_examples'] = negative_examples
return all_examples
return False model.backbone.resize_token_embeddings(len(train_dataset.tokenizer))
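As a rough illustration of the input this method expects, `self.Knowledge_G` maps an entity string to a dictionary of attribute-value pairs built from the knowledge-graph triple file. A hypothetical fragment (illustrative content, not taken from the released triples) looks like:

knowledge_fragment = {
    '杭州': {'描述': '浙江省省会城市', '别称': '杭城'},
    '杭城': {'描述': '杭州的别称'},
}
# With self.Knowledge_G set to such a mapping, get_positive_and_negative_examples('杭州')
# returns one positive example built from the '描述' value and negative examples
# gathered from attribute values of neighbors reachable within `negative_level` hops.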
The token embedding matrix of the backbone is then resized so that its vocabulary matches the (possibly extended) tokenizer:
model.backbone.resize_token_embeddings(len(train_dataset.tokenizer))
model.config.vocab_size = len(train_dataset.tokenizer)
The contrastive (SimCSE-style) objective then scores the original sentence representation against the encoded positive and negative examples, with the positive expected at index 0:
def compute_simcse(self, original_outputs: torch.Tensor, forward_outputs: torch.Tensor) -> float:
    # Add a broadcast dimension so each anchor is compared with every candidate.
    original_hidden_states = original_outputs['hidden_states'].unsqueeze(-2)
    loss = nn.CrossEntropyLoss()
    # Average each candidate example over its token dimension.
    forward_outputs = torch.mean(forward_outputs, dim=-2)
    # Cosine similarity between the anchor and the candidate examples.
    cos_result = self.CosSim(original_hidden_states, forward_outputs)
    cos_result = cos_result.view(-1, cos_result.size(-1))
    # The positive example sits at index 0, so every row's target label is 0.
    labels = torch.zeros(cos_result.size(0), device=original_outputs['hidden_states'].device).long()
    loss_ = loss(cos_result, labels)
    return loss_

3. Pre-training Acceleration on PAI
To reduce training time, TorchAccelerator is combined with Automatic Mixed Precision (AMP), yielding a speed-up of more than 40%. The launch script is:
gpu_number=1
negative_e_number=4
negative_e_length=16
base_dir=$PWD
checkpoint_dir=$base_dir/checkpoints
resources=$base_dir/resources
local_kg=$resources/ownthink_triples_small.txt
local_train_file=$resources/train_small.txt
remote_kg=https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/ckbert/ownthink_triples_small.txt
remote_train_file=https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/ckbert/train_small.txt
if [ ! -d $checkpoint_dir ]; then mkdir $checkpoint_dir; fi
if [ ! -d $resources ]; then mkdir $resources; fi
if [ ! -f $local_kg ]; then wget -P $resources $remote_kg; fi
if [ ! -f $local_train_file ]; then wget -P $resources $remote_train_file; fi
python -m torch.distributed.launch --nproc_per_node=$gpu_number \
--master_port=52349 \
$base_dir/main.py \
--mode=train \
--worker_gpu=$gpu_number \
--tables=$local_train_file, \
--learning_rate=5e-5 \
--epoch_num=5 \
--logging_steps=10 \
--save_checkpoint_steps=2150 \
--sequence_length=256 \
--train_batch_size=20 \
--checkpoint_dir=$checkpoint_dir \
--app_name=language_modeling \
--use_amp \
--save_all_checkpoints \
--user_defined_parameters="pretrain_model_name_or_path=hfl/macbert-base-zh external_mask_flag=True contrast_learning_flag=True negative_e_number=${negative_e_number} negative_e_length=${negative_e_length} kg_path=${local_kg}"4. Experimental Results
CKBERT consistently outperforms classic BERT and other knowledge‑enhanced models on the CLUE benchmark and NER datasets. Larger model sizes benefit more from heterogeneous knowledge injection.
5. Usage Tutorial
CKBERT can be used through EasyNLP, HuggingFace Transformers, or directly on Alibaba Cloud PAI.
Installation
pip install easynlp

Pre-training
gpu_number=1
# (script shown in section 3) ...

Fine-tuning with EasyNLP
easynlp \
--mode=train \
--worker_gpu=1 \
--tables=train.tsv,dev.tsv \
--input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
--first_sequence=sent1 \
--label_name=label \
--label_enumerate_values=0,1 \
--checkpoint_dir=./classification_model \
--epoch_num=1 \
--sequence_length=128 \
--app_name=text_classify \
--user_defined_parameters='pretrain_model_name_or_path=alibaba-pai/pai-ckbert-base-zh'
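The --input_schema above implies tab-separated files with five columns per row (only sent1 is actually consumed here, since --first_sequence=sent1). A hypothetical line of train.tsv, with illustrative content and tab-separated columns, would look like:

0	qid_001	qid_002	这款手机的续航怎么样	这部手机的电池耐用吗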
Inference with EasyNLP

easynlp \
--mode=predict \
--tables=dev.tsv \
--outputs=dev.pred.tsv \
--input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
--output_schema=predictions,probabilities,logits,output \
--append_cols=label \
--first_sequence=sent1 \
--checkpoint_path=./classification_model \
--app_name=text_classify
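To inspect the predictions programmatically, a sketch like the following can be used, assuming dev.pred.tsv keeps the four --output_schema columns followed by the appended label column (worth verifying against the actual file):

import pandas as pd

# Column names follow the assumption stated above.
pred = pd.read_csv('dev.pred.tsv', sep='\t', header=None,
                   names=['predictions', 'probabilities', 'logits', 'output', 'label'])
print((pred['predictions'].astype(str) == pred['label'].astype(str)).mean())  # rough accuracy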
Using HuggingFace Pipeline

from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline
tokenizer = AutoTokenizer.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
unmasker = FillMaskPipeline(model, tokenizer)
print(unmasker("巴黎是[MASK]国的首都。", top_k=5))
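The pipeline returns one dictionary per candidate with the standard fill-mask fields, so the top predictions can also be printed more compactly, for example:

for candidate in unmasker("巴黎是[MASK]国的首都。", top_k=5):
    # Each candidate carries the filled sequence, the predicted token and its score.
    print(candidate["token_str"], round(candidate["score"], 4))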
Loading Model Directly

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
text = "巴黎是[MASK]国的首都。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
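To turn the raw masked-language-model output into a predicted character for the [MASK] slot, a short follow-up along these lines works (standard Transformers usage, not specific to CKBERT):

import torch

# Locate the [MASK] position and take the highest-scoring vocabulary entry there.
mask_index = (encoded_input['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = output.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))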
6. Deployment on Alibaba Cloud PAI

CKBERT models are available in PAI's Data Science Workshop (DSW) Gallery, with ready-to-run notebooks for Chinese NER and other tasks.
7. Future Outlook
The EasyNLP team plans to integrate more Chinese knowledge models and SOTA architectures, extending support to multimodal tasks and encouraging community contributions.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.