Can Machine Learning Reveal the True Author of Red Mansions' Final 40 Chapters?

This article uses machine learning to compare lexical patterns between the first 80 and last 40 chapters of 'Dream of the Red Chamber', demonstrating distinct stylistic differences that support the scholarly view that the final chapters were not authored by Cao Xueqin.

21CTO

Scholars generally hold that the last 40 chapters of "Dream of the Red Chamber" were not written by Cao Xueqin. This article applies machine learning to the author's lexical habits and finds stylistic differences between the first 80 chapters and the last 40, which supports the view that the latter were not written by Cao.

Main Principle

Each author has unique word usage patterns; even deliberate imitation leaves traces.

Function words from classical Chinese are spread throughout the text, yet their frequencies differ from chapter to chapter. These per-chapter frequencies are used as features.

High‑frequency words such as "了", "的", "我", "宝玉", "你", "道", "他", "也", "着", "是", and "说" are also used as features.

Feature Selection

The features are 42 classical function words plus 14 high‑frequency words (adverbs, pronouns, and verbs), listed below in that order. For each chapter, the occurrence rate of every word is calculated.

['之','其','或','亦','方','于','即','皆','因','仍','故','尚','呢','了','的','着','一','不','乃','呀','吗','咧','啊','把','让','向','往','是','在','越','再','更','比','很','偏','别','好','可','便','就','但','儿']
['又','也','都','要']
['这','那','你','我','他']
['来','去','道','笑','说']
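
The feature computation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `feature_vector`, the shortened word list, and the choice to normalize by the chapter's total token count are all assumptions, and the sample tokens are made up.

```python
from collections import Counter

# A small subset of the feature words listed above, for illustration only.
FEATURE_WORDS = ["之", "其", "亦", "了", "的", "着", "也", "都", "这", "那", "道", "说"]


def feature_vector(tokens, feature_words=FEATURE_WORDS):
    """Return the relative frequency of each feature word in one chapter.

    `tokens` is the chapter text already split into words.
    Assumption: "occurrence rate" means count / total tokens in the chapter.
    """
    total = len(tokens)
    if total == 0:
        return [0.0] * len(feature_words)
    counts = Counter(tokens)
    return [counts[word] / total for word in feature_words]


# Made-up token list standing in for a segmented chapter.
sample_tokens = ["宝玉", "笑", "道", "你", "也", "来", "了", "我", "也", "去"]
vector = feature_vector(sample_tokens)
```

Normalizing by chapter length keeps long and short chapters comparable, which matters if the vectors are later fed to a classifier.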

Dataset Construction

Chapters 20‑29, in which the amount of poetry is fairly balanced, serve as the training samples for class 1; chapters 110‑119 serve as the training samples for class 2.

Feature vectors are fed into a Support Vector Machine (SVM) to train a classification model, which then classifies the remaining chapters.
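
A minimal sketch of this training and classification step with scikit-learn is shown here. The data is synthetic (random numbers standing in for real per-chapter frequencies), and the linear kernel, the `StandardScaler` step, and `C=1.0` are assumptions; the source does not state which settings were used.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for real feature vectors: 10 chapters per class,
# 5 features each (the article uses many more features).
class1 = rng.normal(loc=0.02, scale=0.002, size=(10, 5))
class2 = rng.normal(loc=0.04, scale=0.002, size=(10, 5))

X_train = np.vstack([class1, class2])
y_train = np.array([1] * 10 + [2] * 10)

# Assumption: scale the features, then fit a linear-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X_train, y_train)

# "Remaining chapters": three synthetic vectors drawn like each class.
X_rest = np.vstack([
    rng.normal(loc=0.02, scale=0.002, size=(3, 5)),
    rng.normal(loc=0.04, scale=0.002, size=(3, 5)),
])
predicted = clf.predict(X_rest)
```

Word frequencies are small numbers on differing scales, so standardizing them before the SVM is a common precaution; whether the original project did so is not stated.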

For the principles behind SVMs, see Andrew Ng's Machine Learning course; the implementation uses the scikit‑learn library.

Relevant academic papers: Shi Jianjun (2011) and Li Xianping (1978).

Project Structure

README.md
textProcesser.py   # text processing
modelBuilder.py    # model building
decisionMaker.py   # classification
neg_trainset.npy   # negative training set
pos_trainset.npy   # positive training set
trainset.npy       # training set
testset.npy        # test set
text/
    redmansions.txt
    chapter-1 ...
    chapter-n
    chapter-words-1 ...
    chapter-words-n
    chapter-wordcount-1 ...
    chapter-wordcount-n

Usage Steps

Run textProcesser.py to split the novel into chapters, tokenize, and compute word frequencies.

Run modelBuilder.py to extract feature vectors and build the classification model.

Run decisionMaker.py to classify the chapters.
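
As an illustration of the first step, the chapter split might look like the sketch below. It is not the contents of textProcesser.py: the function name `split_chapters` and the heading pattern are assumptions, the sample text is made up, and tokenization (which would need a Chinese word segmenter) is not shown.

```python
import re

# Assumption: every chapter begins on its own line with a heading
# of the form "第…回", written with Chinese numerals.
CHAPTER_HEADING = re.compile(r"^第[一二三四五六七八九十百零〇]+回", re.MULTILINE)


def split_chapters(text):
    """Split the full text into chapter strings, headings included.

    Anything before the first heading (such as a preface) is dropped.
    """
    starts = [match.start() for match in CHAPTER_HEADING.finditer(text)]
    if not starts:
        return []
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


# Made-up miniature text with two chapter headings.
sample = "序言\n第一回 标题甲\n正文一\n第二回 标题乙\n正文二\n"
chapters = split_chapters(sample)
```

A body line that happens to start with the same pattern would be treated as a new chapter, so the result is worth checking against the expected count of 120.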

Conclusion

The classifier assigns most of chapters 1‑80 to one class and most of chapters 81‑120 to the other, with the boundary falling around chapter 80. The last 40 chapters therefore show a distinct style, which supports the view that they were not written by the original author. A few chapters are misclassified, which may stem from the choice of features or from the specific edition of the text used.
