
Chinese New Word Discovery: From Traditional Unsupervised Methods to CNN‑Based Deep Learning

The article examines the challenge of out‑of‑vocabulary terms in Chinese e‑commerce NLP, reviews classic unsupervised metrics such as frequency, cohesion and neighbor entropy, and proposes a lightweight fully‑convolutional network inspired by image‑segmentation techniques to automatically detect new words.

Ctrip Technology

Overview

In fast-changing e-commerce platforms, many newly coined terms are absent from existing lexical resources, causing out-of-vocabulary (OOV) problems that degrade tokenization quality, search recall, and highlight accuracy. While word-level embeddings would be ideal, the prevalence of OOV words often forces practitioners to fall back on character-level vectors.

1. Traditional Unsupervised Methods

The classic pipeline splits raw corpora into n-grams, generates candidate fragments, and evaluates them with three statistical indicators: (a) heat, the raw frequency of the fragment; (b) cohesion, measured by pointwise mutual information (PMI); and (c) left/right neighbor entropy, the diversity of characters appearing adjacent to the fragment. High frequency, high cohesion, and high neighbor entropy together suggest a valid new word.
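The three indicators can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: `pmi` handles only the two-character case (real pipelines take the minimum PMI over all split points), and `neighbor_entropy` scans a single string rather than a corpus.

```python
import math
from collections import Counter

def pmi(text, fragment):
    """Cohesion proxy for a two-character fragment: pointwise mutual
    information, i.e. how much more often the two characters co-occur
    than independence would predict."""
    n = len(text)
    p_frag = text.count(fragment) / (n - len(fragment) + 1)
    p_a = text.count(fragment[0]) / n
    p_b = text.count(fragment[1]) / n
    return math.log(p_frag / (p_a * p_b))

def neighbor_entropy(text, fragment, side="left"):
    """Shannon entropy of the characters directly adjacent to the
    fragment; higher entropy means more diverse neighbors."""
    neighbors = Counter()
    start = text.find(fragment)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(fragment)
        if 0 <= idx < len(text):
            neighbors[text[idx]] += 1
        start = text.find(fragment, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())
```

A fragment whose left neighbor is always the same character gets entropy 0, signalling that the fragment is probably an incomplete piece of a longer word.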

2. Limitations of Classic Approaches

These methods require manually tuned thresholds for each indicator. As corpora evolve, the distributions of frequencies and entropies shift, demanding continual retuning and substantial human effort.

3. Deep-Learning-Based New Word Discovery

By visualizing the normalized frequency of every possible substring as a 2-D matrix (rows = start character, columns = end character), a word-frequency probability map is obtained. Human inspection of such maps already reveals bright triangular regions corresponding to meaningful terms (e.g., "浦东" (Pudong), "机场" (airport)). This observation motivates treating new-word detection as an image-segmentation problem.
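A frequency map of this kind can be built as follows. This sketch counts substrings within a single sentence for simplicity; the article's maps are presumably built from corpus-wide statistics, and the 24×24 padding size comes from the architecture described later.

```python
import numpy as np

def freq_map(sentence, max_len=24):
    """Normalized substring-frequency map for one sentence.

    m[i, j] holds the frequency of the substring spanning characters
    i..j, scaled to [0, 1]. Valid entries form the upper triangle
    (j >= i); the rest of the max_len x max_len grid is zero padding.
    Frequent substrings appear as bright triangular regions.
    """
    n = min(len(sentence), max_len)
    m = np.zeros((max_len, max_len))
    for i in range(n):
        for j in range(i, n):
            m[i, j] = sentence.count(sentence[i:j + 1])
    return m / m.max() if m.max() > 0 else m
```

For the toy string "abab", the repeated bigram "ab" receives the maximum brightness 1.0, while the full string "abab", occurring once, gets 0.5.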

Early segmentation algorithms relied on simple thresholding, but modern approaches employ deep convolutional networks, notably U‑Net. The proposed solution adapts this idea to a much simpler fully‑convolutional network (FCN) because the input maps are low‑resolution (e.g., 24×24) and only binary classification (word‑pixel vs. background) is needed.

4. Proposed FCN Architecture

The network operates directly on the frequency map:

- Zero-pad the frequency map to 24×24.
- Apply two consecutive 3×3 convolutional layers, each producing 4 channels.
- Concatenate the two feature maps and pass them through another 3×3 convolution that outputs a single channel.
- Train with a logistic (sigmoid) loss, with no softmax, for binary per-pixel classification.
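The forward pass of this architecture can be sketched in plain NumPy (a real implementation would use a framework such as TensorFlow). The kernel arrays here stand in for trained weights, and the ReLU activations between convolutions are an assumption not stated in the article:

```python
import numpy as np

def conv2d(x, kernels):
    """'Same'-padded 3x3 convolution: x is (H, W, C_in),
    kernels is (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    cout = kernels.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, cout))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + 3, j:j + 3, :]  # 3x3 receptive field
            out[i, j] = np.tensordot(patch, kernels,
                                     axes=([0, 1, 2], [0, 1, 2]))
    return out

def fcn_forward(freq_map, k1, k2, k3):
    """Sketch of the proposed FCN: two 3x3 convs with 4 channels each,
    concatenation of both feature maps, then a final 3x3 conv down to
    one channel, squashed by a sigmoid for per-pixel classification."""
    x = freq_map[..., None]                  # (24, 24, 1)
    f1 = np.maximum(conv2d(x, k1), 0)        # (24, 24, 4), ReLU
    f2 = np.maximum(conv2d(f1, k2), 0)       # (24, 24, 4), ReLU
    f = np.concatenate([f1, f2], axis=-1)    # (24, 24, 8)
    logits = conv2d(f, k3)[..., 0]           # (24, 24)
    return 1 / (1 + np.exp(-logits))         # sigmoid: word vs. background
```

Because there is no down-sampling or up-sampling, each output pixel depends only on a small local window of the frequency map, which matches the design rationale discussed next.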

The model discards the down-sampling and up-sampling stages of U-Net, focusing instead on local differences (first- and second-order gradients) that indicate word boundaries.

5. Model Inspection

After training, the learned convolution kernels can be examined in TensorFlow, e.g. via model.get_layer('Conv2').get_weights(). The first row of a kernel captures the difference between a pixel and the one above it, while the second row captures the opposite direction; larger absolute differences imply a higher likelihood of a word boundary.
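The kernel-as-gradient observation can be illustrated with a hand-set kernel; the values below are hypothetical, not the trained weights:

```python
import numpy as np

# Hypothetical 3x3 kernel acting as a vertical first-order gradient:
# top row minus centre row. A flat patch produces zero response; a
# bright-above-dark edge (a word boundary in the frequency map)
# produces a large response.
kernel = np.array([[ 1.,  1.,  1.],
                   [-1., -1., -1.],
                   [ 0.,  0.,  0.]])

def respond(patch, kernel):
    """Scalar response of one 3x3 patch to the kernel."""
    return float((patch * kernel).sum())

uniform = np.full((3, 3), 0.5)         # no boundary: flat frequencies
edge = np.vstack([np.ones((1, 3)),     # bright row above...
                  np.zeros((2, 3))])   # ...dark rows below
```

The flat patch yields a response of 0, while the edge patch yields a strongly positive response, which is exactly the boundary signal the article describes.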

6. Optimization Opportunities

The current FCN uses only frequency as input. Incorporating additional features such as left/right neighbor entropy could improve precision. Increasing the kernel size or network depth would broaden the receptive field, though at the risk of over-fitting.

7. Practical Impact

Integrating the discovered terms into the lexical dictionary raises tokenization coverage and improves search recall. Although overall search accuracy may not change dramatically for end users, the model reduces erroneous highlights and provides a useful signal for downstream ranking and candidate-generation stages.
