How CKBERT Boosts Chinese NLP with Knowledge‑Enhanced Pretraining
CKBERT is a Chinese knowledge-enhanced BERT developed by Alibaba's EasyNLP team. It integrates external knowledge graphs and internal linguistic cues through novel pre-training tasks, ships in three model sizes compatible with HuggingFace and PAI, outperforms comparable models on CLUE and NER benchmarks, and deploys easily on cloud platforms.
1. CKBERT Model Overview
CKBERT (Chinese Knowledge-enhanced BERT) is a Chinese pre-training model developed in-house by Alibaba's EasyNLP team. It injects two types of knowledge, external knowledge graphs and internal linguistic information, into the BERT architecture without changing the model structure, which keeps knowledge integration scalable.
Model Configurations
Base model: 151M parameters, 12 layers, 12 attention heads, hidden size 768, sequence length 128.
Large model: 428M parameters, 24 layers, 16 attention heads, hidden size 1024, sequence length 128.
Huge model: 1.3B parameters, 24 layers, 8 attention heads, hidden size 2048, sequence length 128.
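Since CKBERT keeps the stock BERT layout, these sizes map onto ordinary BERT hyperparameters. A rough sketch for the base variant follows; the intermediate size of 4x the hidden size is an assumption, and 128 is the pre-training sequence length rather than a config field.

from transformers import BertConfig

# Rough sketch of the base configuration listed above, not an official config file.
base_config = BertConfig(
    num_hidden_layers=12,    # 12 transformer layers
    num_attention_heads=12,  # 12 attention heads
    hidden_size=768,         # hidden size 768
    intermediate_size=3072,  # assumption: the conventional 4 x hidden_size
)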
2. Implementation Details
Data preprocessing uses the Harbin Institute of Technology LTP platform for tokenization, NER, dependency parsing, and semantic role labeling. Knowledge triples are extracted from sentence entities and sampled as positive and negative examples. Core Python functions include:
from typing import Dict, List, Union

import numpy as np
import torch
from ltp import LTP
from torch import nn

def ltp_process(ltp: LTP, data: List[Dict[str, Union[str, List[Union[int, str]]]]]):
    """Use LTP to segment, tag and parse every sentence in the batch."""
    # Strip spaces so LTP sees contiguous Chinese text.
    new_data = list(map(lambda x: x['text'][0].replace(" ", ""), data))
    seg, hiddens = ltp.seg(new_data)
    result = {}
    result['seg'] = seg
    result['ner'] = ltp.ner(hiddens)  # named entities
    result['dep'] = ltp.dep(hiddens)  # dependency parses
    result['sdp'] = ltp.sdp(hiddens)  # semantic dependency parses
    # Write the per-sentence analyses back onto the input records.
    for index in range(len(data)):
        data[index]['text'][0] = data[index]['text'][0].replace(" ", "")
        data[index]['seg'] = result['seg'][index]
        data[index]['ner'] = result['ner'][index]
        data[index]['dep'] = result['dep'][index]
        data[index]['sdp'] = result['sdp'][index]
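A minimal invocation, assuming the LTP 4 interface used above and an illustrative input record, might look like this:

ltp = LTP()  # loads the default pretrained LTP model
samples = [{'text': ['阿里巴巴 总部 位于 杭州']}]  # illustrative record in the expected shape
ltp_process(ltp, samples)
print(samples[0]['seg'], samples[0]['ner'])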
"""mask the `mask_labels` for sdp and dep and record the maskers for each mask item"""
np.random.shuffle(left_numbers)
for item_ in left_numbers:
target_item = data_new[item_]
seg_ids = np.array(target_item[:2]) - 1
delete_ids = np.where(seg_ids < 1)[0]
seg_ids = np.delete(seg_ids, delete_ids)
temp_ids = seg2id(seg_ids)
ids = []
for item in temp_ids:
ids += item.copy()
if check_ids(ids):
length_ = len(ids)
if number_ > length_:
for id_ in ids:
mask_labels[id_] = 1
# additional processing omitted for brevity
selected_numbers_.add(item_)
return length_
else:
return 0
return 0 def get_positive_and_negative_examples(self, ner_data: str, negative_level: int = 3) -> Union[bool, Dict[str, List[str]]]:
"""get the positive examples and negative examples for the ner data"""
knowledge = self.Knowledge_G
common_used = set()
def get_data(key: str, data: Dict[str, str], results: List[str], deep: int, insert_flag: bool = False):
common_used.add(key)
if deep == 0:
return
for key_item in data:
if data[key_item] not in common_used and insert_flag:
results.append(data[key_item])
if data[key_item] in knowledge and data[key_item] not in common_used:
get_data(data[key_item], knowledge[data[key_item]], results, deep - 1, True)
all_examples = {'ner': ner_data, 'positive_examples': [], 'negative_examples': []}
if ner_data in knowledge:
tp_data = knowledge[ner_data]
if '描述' in tp_data:
positive_example = tp_data['描述']
else:
keys = list(tp_data.keys())
choice = np.random.choice(range(len(keys)), 1)[0]
positive_example = tp_data[keys[choice]]
if ner_data in positive_example:
all_examples['positive_examples'].append(positive_example)
else:
all_examples['positive_examples'].append(ner_data + positive_example)
negative_examples = []
get_data(ner_data, tp_data, negative_examples, negative_level)
negative_examples = [ner_data + x if ner_data not in x else x for x in negative_examples]
all_examples['negative_examples'] = negative_examples
return all_examples
return False model.backbone.resize_token_embeddings(len(train_dataset.tokenizer))
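As a rough illustration of the input this method expects, `self.Knowledge_G` maps an entity string to a dictionary of attribute-value pairs built from the knowledge-graph triple file. A hypothetical fragment (illustrative content, not taken from the released triples) looks like:

knowledge_fragment = {
    '杭州': {'描述': '浙江省省会城市', '别称': '杭城'},
    '杭城': {'描述': '杭州的别称'},
}
# With self.Knowledge_G set to such a mapping, get_positive_and_negative_examples('杭州')
# returns one positive example built from the '描述' value and negative examples
# gathered from attribute values of neighbors reachable within `negative_level` hops.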
The token embedding matrix of the backbone is then resized so that its vocabulary matches the (possibly extended) tokenizer:
model.backbone.resize_token_embeddings(len(train_dataset.tokenizer))
model.config.vocab_size = len(train_dataset.tokenizer)
The contrastive (SimCSE-style) objective then scores the original sentence representation against the encoded positive and negative examples, with the positive expected at index 0:
def compute_simcse(self, original_outputs: torch.Tensor, forward_outputs: torch.Tensor) -> float:
    # Add a broadcast dimension so each anchor is compared with every candidate.
    original_hidden_states = original_outputs['hidden_states'].unsqueeze(-2)
    loss = nn.CrossEntropyLoss()
    # Average each candidate example over its token dimension.
    forward_outputs = torch.mean(forward_outputs, dim=-2)
    # Cosine similarity between the anchor and the candidate examples.
    cos_result = self.CosSim(original_hidden_states, forward_outputs)
    cos_result = cos_result.view(-1, cos_result.size(-1))
    # The positive example sits at index 0, so every row's target label is 0.
    labels = torch.zeros(cos_result.size(0), device=original_outputs['hidden_states'].device).long()
    loss_ = loss(cos_result, labels)
    return loss_

3. Pre-training Acceleration on PAI
To reduce training time, TorchAccelerator is combined with Automatic Mixed Precision (AMP), yielding a speed-up of more than 40%. The launch script is:
gpu_number=1
negative_e_number=4
negative_e_length=16
base_dir=$PWD
checkpoint_dir=$base_dir/checkpoints
resources=$base_dir/resources
local_kg=$resources/ownthink_triples_small.txt
local_train_file=$resources/train_small.txt
remote_kg=https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/ckbert/ownthink_triples_small.txt
remote_train_file=https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/ckbert/train_small.txt
if [ ! -d $checkpoint_dir ]; then mkdir $checkpoint_dir; fi
if [ ! -d $resources ]; then mkdir $resources; fi
if [ ! -f $local_kg ]; then wget -P $resources $remote_kg; fi
if [ ! -f $local_train_file ]; then wget -P $resources $remote_train_file; fi
python -m torch.distributed.launch --nproc_per_node=$gpu_number \
--master_port=52349 \
$base_dir/main.py \
--mode=train \
--worker_gpu=$gpu_number \
--tables=$local_train_file, \
--learning_rate=5e-5 \
--epoch_num=5 \
--logging_steps=10 \
--save_checkpoint_steps=2150 \
--sequence_length=256 \
--train_batch_size=20 \
--checkpoint_dir=$checkpoint_dir \
--app_name=language_modeling \
--use_amp \
--save_all_checkpoints \
--user_defined_parameters="pretrain_model_name_or_path=hfl/macbert-base-zh external_mask_flag=True contrast_learning_flag=True negative_e_number=${negative_e_number} negative_e_length=${negative_e_length} kg_path=${local_kg}"4. Experimental Results
CKBERT consistently outperforms classic BERT and other knowledge‑enhanced models on the CLUE benchmark and NER datasets. Larger model sizes benefit more from heterogeneous knowledge injection.
5. Usage Tutorial
CKBERT can be used through EasyNLP, HuggingFace Transformers, or directly on Alibaba Cloud PAI.
Installation
pip install easynlp

Pre-training
gpu_number=1
# (script shown in section 3) ...

Fine-tuning with EasyNLP
easynlp \
--mode=train \
--worker_gpu=1 \
--tables=train.tsv,dev.tsv \
--input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
--first_sequence=sent1 \
--label_name=label \
--label_enumerate_values=0,1 \
--checkpoint_dir=./classification_model \
--epoch_num=1 \
--sequence_length=128 \
--app_name=text_classify \
--user_defined_parameters='pretrain_model_name_or_path=alibaba-pai/pai-ckbert-base-zh'
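The --input_schema above implies tab-separated files with five columns per row (only sent1 is actually consumed here, since --first_sequence=sent1). A hypothetical line of train.tsv, with illustrative content and tab-separated columns, would look like:

0	qid_001	qid_002	这款手机的续航怎么样	这部手机的电池耐用吗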
Inference with EasyNLP

easynlp \
--mode=predict \
--tables=dev.tsv \
--outputs=dev.pred.tsv \
--input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
--output_schema=predictions,probabilities,logits,output \
--append_cols=label \
--first_sequence=sent1 \
--checkpoint_path=./classification_model \
--app_name=text_classify
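To inspect the predictions programmatically, a sketch like the following can be used, assuming dev.pred.tsv keeps the four --output_schema columns followed by the appended label column (worth verifying against the actual file):

import pandas as pd

# Column names follow the assumption stated above.
pred = pd.read_csv('dev.pred.tsv', sep='\t', header=None,
                   names=['predictions', 'probabilities', 'logits', 'output', 'label'])
print((pred['predictions'].astype(str) == pred['label'].astype(str)).mean())  # rough accuracy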
Using HuggingFace Pipeline

from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline
tokenizer = AutoTokenizer.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
unmasker = FillMaskPipeline(model, tokenizer)
print(unmasker("巴黎是[MASK]国的首都。", top_k=5))
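The pipeline returns one dictionary per candidate with the standard fill-mask fields, so the top predictions can also be printed more compactly, for example:

for candidate in unmasker("巴黎是[MASK]国的首都。", top_k=5):
    # Each candidate carries the filled sequence, the predicted token and its score.
    print(candidate["token_str"], round(candidate["score"], 4))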
Loading Model Directly

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("alibaba-pai/pai-ckbert-base-zh", use_auth_token=True)
text = "巴黎是[MASK]国的首都。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
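To turn the raw masked-language-model output into a predicted character for the [MASK] slot, a short follow-up along these lines works (standard Transformers usage, not specific to CKBERT):

import torch

# Locate the [MASK] position and take the highest-scoring vocabulary entry there.
mask_index = (encoded_input['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = output.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))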
6. Deployment on Alibaba Cloud PAI

CKBERT models are available in PAI's Data Science Workshop (DSW) Gallery, with ready-to-run notebooks for Chinese NER and other tasks.
7. Future Outlook
The EasyNLP team plans to integrate more Chinese knowledge models and SOTA architectures, extending support to multimodal tasks and encouraging community contributions.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.