Boosting Chinese NER Accuracy with Crowdsourced Data and Adversarial Learning

This paper proposes a Chinese named entity recognition method that leverages noisy crowdsourced annotations through adversarial training with dual BiLSTM modules, demonstrating consistent F1 improvements on dialogue and e‑commerce datasets.

Alibaba Cloud Developer

Article Purpose

Named Entity Recognition (NER) is a crucial NLP task that identifies entities such as person names and place names in text. To obtain new labeled data at low cost, we use crowdsourced annotation and investigate how to improve Chinese NER accuracy despite the inherent noise.

Method Overview

We introduce a model that employs two bidirectional LSTM modules: a private BiLSTM that captures annotator‑specific information and a common BiLSTM that learns shared features across annotators via adversarial learning, treating each annotator as a classification target. The outputs are combined and fed to a CRF layer for final NER tagging.

Data Used

We evaluate the model on two datasets:

Dialogue data from Gowild (20,000 sentences annotated by 43 workers for person and song names).

E‑commerce data from Alibaba, where five workers annotate titles and user queries for five entity types (brand, product, model, specification, material), each sentence being labeled by two workers.

Both datasets contain noisy annotations due to varying annotator expertise and occasional grammatical errors in the dialogue corpus.
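One way to picture this setup is a record per (sentence, worker) pair, with each worker mapped to a class id for the annotator classifier described later. The following is a minimal sketch under that assumption: the `CrowdExample` layout, the `build_annotator_index` helper, the worker ids `w1` and `w2`, the `BRAND` tag name, and the sample query are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CrowdExample:
    """One sentence as labeled by one worker (hypothetical record layout)."""
    chars: List[str]   # the sentence, one character per position
    tags: List[str]    # that worker's tag for each character
    annotator: str     # worker identifier


def build_annotator_index(examples: List[CrowdExample]) -> Dict[str, int]:
    """Map each worker identifier to a class id, in order of first appearance."""
    index: Dict[str, int] = {}
    for ex in examples:
        if ex.annotator not in index:
            index[ex.annotator] = len(index)
    return index


# Two workers label the same made-up query and disagree on the entity boundary.
sentence = list("红色小米手机")  # six characters
examples = [
    CrowdExample(sentence, ["O", "O", "B-BRAND", "E-BRAND", "O", "O"], "w1"),
    CrowdExample(sentence, ["O", "O", "B-BRAND", "I-BRAND", "I-BRAND", "E-BRAND"], "w2"),
]
annotator_ids = build_annotator_index(examples)
print(annotator_ids)  # {'w1': 0, 'w2': 1}
```

Keeping the two conflicting label sequences as separate examples, rather than merging them by vote, is what lets a model treat the annotator as a prediction target.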

Baseline

We use a BIOE tagging scheme and first train a CRF model as a traditional baseline. Then we replace handcrafted features with a character‑level BiLSTM encoder, followed by a CRF decoder.
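As an illustration of the tagging scheme, the sketch below converts entity spans into per-character BIOE tags. The function name `spans_to_bioe`, the `PER` type name, and the sample sentence are assumptions made for this example. The handling of single-character entities (tagged with `B-` only) is also an assumption, since the summary does not say how the paper treats them.

```python
from typing import List, Tuple


def spans_to_bioe(length: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans into per-character BIOE tags.

    Each span is (start, end, type) with `end` exclusive. A single-character
    entity receives only a B- tag; this is an assumption of the sketch.
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if not (0 <= start < end <= length):
            raise ValueError(f"span out of range: {(start, end, etype)}")
        tags[start] = f"B-{etype}"            # first character of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{etype}"            # interior characters
        if end - start > 1:
            tags[end - 1] = f"E-{etype}"      # last character of the entity
    return tags


# Made-up dialogue sentence with one person name at positions 3..5.
chars = list("我想听周杰伦的歌")
bioe_tags = spans_to_bioe(len(chars), [(3, 6, "PER")])
for ch, tag in zip(chars, bioe_tags):
    print(ch, tag)
```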

Adversarial Learning Component

The private BiLSTM learns annotator‑specific distributions, while the common BiLSTM learns features shared across annotators. A third BiLSTM, called label, encodes the annotation sequence of the current example. The outputs of the private and common modules are merged and passed to a CRF layer for NER. In addition, the concatenated outputs of the label and common modules are fed to a CNN classifier that predicts the annotator's identity. This classifier is trained adversarially: its gradients are reversed when they reach the common BiLSTM, so the shared features are pushed to carry as little information as possible about which annotator produced the labels.
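The gradient reversal idea can be shown on a deliberately tiny example. The sketch below is not the paper's architecture: it replaces the common BiLSTM with a single scalar weight `w`, the CNN classifier with a logistic unit with weight `v`, and uses two annotators only. The names `adversarial_step` and `classifier_loss`, and the scaling factor `lam`, are assumptions. What it keeps is the update rule: the classifier descends its own loss, while the shared weight receives the same gradient with the sign flipped.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def classifier_loss(w: float, v: float, x: float, y: int) -> float:
    """Cross-entropy of a toy annotator classifier on the shared feature h = w * x."""
    p = sigmoid(v * (w * x))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def adversarial_step(w: float, v: float, x: float, y: int,
                     lr: float = 0.1, lam: float = 1.0):
    """One update with gradient reversal between classifier and shared weight."""
    h = w * x                      # shared feature (stand-in for the common encoder)
    p = sigmoid(v * h)             # predicted probability of annotator 1
    dz = p - y                     # d(loss)/d(logit) for cross-entropy
    grad_v = dz * h                # gradient w.r.t. the classifier weight
    grad_w = dz * v * x            # gradient w.r.t. the shared weight
    v_new = v - lr * grad_v        # classifier: ordinary gradient descent
    w_new = w - lr * (-lam * grad_w)  # shared weight: reversed gradient
    return w_new, v_new


w0, v0, x0, y0 = 0.5, 1.0, 2.0, 1
w1, v1 = adversarial_step(w0, v0, x0, y0)
base = classifier_loss(w0, v0, x0, y0)
print("loss before step:           ", base)
print("after classifier update only:", classifier_loss(w0, v1, x0, y0))
print("after encoder update only:   ", classifier_loss(w1, v0, x0, y0))
```

In this toy setting the classifier update lowers the annotator loss while the reversed update on the shared weight raises it, which is the tug-of-war that adversarial training relies on. In a full model the shared weight would also receive the ordinary (non-reversed) gradient of the NER loss.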

Results

On the dialogue dataset, the adversarial model improves F1 by roughly one point over the baseline. The e‑commerce dataset shows a similar gain of about one F1 point.
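For reference, F1 for NER is commonly computed at the entity level with exact span matching. The summary does not state which variant the paper uses, so the sketch below is an assumption on that point. The helper names `extract_spans` and `entity_f1` and the sample tags are hypothetical, and a `B-` tag that is never closed by a matching `E-` is counted as an entity ending at the last matching character.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) spans from one BIOE tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):       # trailing "O" flushes an open span
        prefix, _, label = tag.partition("-")
        continues = prefix in ("I", "E") and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))         # close a span that lacks its E- tag
            start, etype = None, None
        if prefix == "B":
            start, etype = i, label              # open a new span
        elif prefix == "E" and start is not None:
            spans.add((start, i + 1, etype))     # close the span at its E- tag
            start, etype = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]):
    """Exact-match entity-level precision, recall and F1 over a corpus."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx,) + s for s in extract_spans(g)}
        pred_spans |= {(idx,) + s for s in extract_spans(p)}
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["O", "O", "O", "B-PER", "I-PER", "E-PER", "O", "O"],
        ["B-SONG", "E-SONG", "O"]]
pred = [["O", "O", "O", "B-PER", "I-PER", "E-PER", "O", "O"],
        ["B-SONG", "I-SONG", "E-SONG"]]   # boundary error on the song name
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Under exact matching, a boundary error such as the one in the second sentence counts as both a missed entity and a spurious one, which is why noisy span boundaries in crowd labels can cost F1 on both sides.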

The paper was presented at AAAI 2018.

