
Detecting Emerging Terms in Web Novels: PMI, Entropy, and TF‑IDF Methods

This article explores how to automatically discover new words in Chinese web novels by combining n‑gram statistics, pointwise mutual information, information entropy, and TF‑IDF filtering, presenting a practical, unsupervised pipeline that improves tokenization and search recall without manual labeling.

Yuewen Technology

Background

Word segmentation is a crucial preprocessing step in text processing, and its quality directly affects downstream tasks such as search recall. Existing segmentation tools work well with generic vocabularies, but many domain‑specific terms—especially in web novels, like character names, place names, and secret techniques—are absent from standard dictionaries, leading to out‑of‑vocabulary (OOV) problems.

The challenge is to recognize consecutive characters as a single term, even when the combination does not follow standard Chinese grammar but is widely accepted online.

New Word Discovery

New‑word detection can be treated as a named entity recognition (NER) problem. While supervised NER models (e.g., LSTM+CRF) require large annotated corpora, we can adopt unsupervised statistical methods. First, we introduce a classic PMI‑and‑entropy algorithm, then propose improvements.

What character combinations can be considered a word?

In Chinese, a word typically consists of 2–5 characters. We generate 2‑gram, 3‑gram, 4‑gram, and 5‑gram sequences from the whole text and count their frequencies. High frequency alone is insufficient: the character sequence “许七安” appears 21,766 times in the novel Da Feng Da Geng Ren, but its sub‑strings also appear frequently, so frequency by itself cannot distinguish a true word from a fragment of one.
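The counting step can be sketched in a few lines of Python (a minimal illustration with our own function names; a real run would stream over an entire novel rather than a toy string):

```python
from collections import Counter

def extract_ngrams(text, min_n=2, max_n=5):
    """Slide a window over the text and count every n-gram of length 2..5."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

counts = extract_ngrams("许七安看了许七安一眼")
print(counts["许七安"])  # → 2
```

Note that every sub‑string of a frequent term is at least as frequent as the term itself, which is exactly why the cohesion and entropy measures below are needed.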

The internal cohesion of a candidate can be measured by Pointwise Mutual Information (PMI). For characters x and y, PMI(x, y) = log( p(xy) / (p(x) · p(y)) ). If the characters are independent, PMI = 0; a larger PMI means they co‑occur more often than chance, suggesting stronger cohesion and a higher likelihood that the bigram is a word.

Cohesion alone is not enough, however. We also compute the information entropy of the left and right contexts of a candidate: a true word appears in varied contexts, so both its left and right entropy tend to be high, whereas a fragment is usually preceded or followed by the same few characters and has low entropy on at least one side.
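Both statistics can be sketched as follows, estimating probabilities from raw counts (the function names and the scan‑based neighbor collection are illustrative, not the production implementation):

```python
import math
from collections import Counter

def bigram_pmi(text, bigram):
    """PMI(x, y) = log2( p(xy) / (p(x) * p(y)) ), estimated from raw counts."""
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n = len(text)
    p_xy = bigrams[bigram] / (n - 1)
    p_x = chars[bigram[0]] / n
    p_y = chars[bigram[1]] / n
    return math.log2(p_xy / (p_x * p_y))

def boundary_entropy(text, term, side="right"):
    """Shannon entropy of the characters adjacent to `term` on one side."""
    neighbors = Counter()
    start = text.find(term)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(term)
        if 0 <= idx < len(text):
            neighbors[text[idx]] += 1
        start = text.find(term, start + 1)
    if not neighbors:
        return 0.0
    total = sum(neighbors.values())
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())
```

A candidate is kept only when its PMI and both boundary entropies clear their respective thresholds.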

Filtering Old Words (Common Words)

After applying PMI and entropy thresholds, many frequent n‑grams are retrieved, but they include both new and existing words. We filter out known words using a dictionary; any n‑gram absent from the dictionary is treated as a candidate new word.
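The dictionary check itself is a simple set‑membership filter (a sketch with hypothetical names):

```python
def filter_known(candidates, dictionary):
    """Keep only candidate n-grams that are absent from the known-word dictionary."""
    return {w for w in candidates if w not in dictionary}

new_words = filter_known({"大奉", "姑娘", "许七安"}, {"姑娘"})
# "姑娘" is dropped as a known word; the other two survive as candidates
```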

Algorithm Introduction

Introducing TF‑IDF

Low‑frequency character combinations often yield high PMI because the denominator in the PMI formula is small, but such combinations may be accidental. TF‑IDF mitigates this bias by down‑weighting n‑grams that are frequent across the entire corpus and up‑weighting those that are frequent in a specific document (novel). We first compute TF‑IDF for each n‑gram and retain only those exceeding a threshold before applying PMI and entropy calculations.

Benefits of TF‑IDF filtering include:

Avoiding PMI’s preference for rare characters.

Ensuring discovered terms have higher TF‑IDF, making them more useful in search scenarios.

Reducing the number of n‑grams that need PMI and entropy computation.

Automatically discarding old words without separate dictionary checks.
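A sketch of the per‑novel score (the exact TF and IDF normalizations below are assumptions for illustration; the pipeline only requires that corpus‑wide common n‑grams score low and novel‑specific ones score high):

```python
import math

def tfidf(term_count, doc_len, docs_with_term, num_docs):
    """TF-IDF of a candidate n-gram for one novel.

    tf  = term count in this novel / novel length (in characters)
    idf = log( total novels / (1 + novels whose text contains the n-gram) )
    """
    tf = term_count / doc_len
    idf = math.log(num_docs / (1 + docs_with_term))
    return tf * idf

# A protagonist's name: very frequent in one novel, present in almost no others.
score_rare = tfidf(21766, 2_000_000, 1, 10_000)
# A common word with the same in-novel frequency, present in most novels.
score_common = tfidf(21766, 2_000_000, 9_000, 10_000)
# score_rare is far larger, so a TF-IDF threshold keeps the name and drops the common word
```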

Subword Mining

Some multi‑character terms consist of a core sub‑word plus modifiers (e.g., “采薇姑娘”, “采薇师妹”). The core sub‑word “采薇” appears infrequently on its own, and its right‑side entropy is low because it is almost always followed by “姑娘” or “师妹”. By analyzing entropy and PMI of sub‑words, we can identify such patterns and decide whether the sub‑word itself should be treated as a new term.

When a discovered term contains four or more characters, we attempt to split it into “other + word” or “word + other”. If the “other” part is not in the dictionary and satisfies entropy, TF‑IDF, and PMI thresholds, it is also considered a new word.
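The “word + other” split can be sketched as follows (hypothetical names; a surviving sub‑word would still need to pass the entropy, TF‑IDF, and PMI thresholds described above):

```python
def split_candidates(term, dictionary, min_len=2):
    """For a term of 4+ characters, propose 'word + other' / 'other + word' splits
    where one side is a known word and the other is a possible new sub-word."""
    if len(term) < 4:
        return []
    subwords = []
    for i in range(min_len, len(term) - min_len + 1):
        left, right = term[:i], term[i:]
        if left in dictionary and right not in dictionary:
            subwords.append(right)
        if right in dictionary and left not in dictionary:
            subwords.append(left)
    return subwords

# "采薇姑娘" with "姑娘" in the dictionary suggests "采薇" as a sub-word candidate
print(split_candidates("采薇姑娘", {"姑娘"}))  # → ['采薇']
```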

Experimental Results

We applied the pipeline to 10,000 web novels. For the novel Da Feng Da Geng Ren, the method extracted numerous new terms, such as:

大奉, 大奉王朝, 大奉朝廷, 许平峰, 许辞旧, 度厄, 度厄罗汉, 桂月楼, 鸾钰, 许七安, 许平志, … (truncated for brevity).

Conclusion

By combining TF‑IDF, PMI, information entropy, and sub‑word mining, we can automatically discover new terms in web novels, improving segmentation and search recall without any manual annotation. The approach is simple, practical, and can complement other methods.

References

K. W. Church and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

M. Huang et al. 2014. New Word Detection for Sentiment Analysis. In Proceedings of ACL 52, pages 531–541.

Matrix67. "Internet Age Sociolinguistics: Text Data Mining Based on SNS".

T. Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

B. Daille. 1994. Approche mixte pour l'extraction automatique de terminologie. PhD thesis, Université Paris 7.

Written by

Yuewen Technology

The Yuewen Group tech team supports and powers services like QQ Reading, Qidian Books, and Hongxiu Reading. This account targets internet developers, sharing high‑quality original technical content. Follow us for the latest Yuewen tech updates.
