Can Machine Learning Reveal the True Author of Red Mansions' Final 40 Chapters?
This article uses machine learning to compare lexical patterns between the first 80 and last 40 chapters of 'Dream of the Red Chamber', demonstrating distinct stylistic differences that support the scholarly view that the final chapters were not authored by Cao Xueqin.
Scholars generally hold that the last 40 chapters of "Dream of the Red Chamber" were not written by Cao Xueqin. This article applies machine learning to the author's lexical habits and shows that the first 80 chapters and the last 40 differ in style, which supports the view that the latter were not written by Cao.
Main Principle
Each author has unique word usage patterns; even deliberate imitation leaves traces.
Function words in classical Chinese occur throughout the text, but how often each one appears differs from chapter to chapter. These per-chapter frequencies are used as features.
High-frequency words such as "了", "的", "我", "宝玉", "你", "道", "他", "也", "着", "是", "说" and others are also used as features.
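The idea of turning word usage into features can be sketched as follows. This is a minimal illustration, not the project's code: the function name `occurrence_rates` and the sample sentence are hypothetical, and the sketch counts raw substrings per character of text, whereas the project tokenizes each chapter first.

```python
def occurrence_rates(chapter_text, feature_words):
    """Return one rate per feature word: occurrences / chapter length in characters.

    Assumption: rates are normalized by character count; the original project
    may normalize by token count after word segmentation instead.
    """
    total = len(chapter_text)
    if total == 0:
        return [0.0] * len(feature_words)
    # str.count counts non-overlapping occurrences of each word
    return [chapter_text.count(word) / total for word in feature_words]


# Hypothetical mini example: three feature words, one short sentence
features = ["了", "的", "道"]
sample = "宝玉笑道你来了"
rates = occurrence_rates(sample, features)
```

Each chapter then becomes one numeric vector with as many entries as there are feature words, so chapters of different lengths can be compared directly.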
Feature Selection
The features are 42 classical function words plus 14 high-frequency words (4 adverbs, 5 pronouns, 5 verbs), 56 in total. For each chapter, the occurrence rate of every feature word is calculated.
['之','其','或','亦','方','于','即','皆','因','仍','故','尚','呢','了','的','着','一','不','乃','呀','吗','咧','啊','把','让','向','往','是','在','越','再','更','比','很','偏','别','好','可','便','就','但','儿']
['又','也','都','要']
['这','那','你','我','他']
['来','去','道','笑','说']
Dataset Construction
Chapters 20-29 (a stretch in which poetry is relatively balanced) serve as training samples for class 1, and chapters 110-119 as training samples for class 2.
Feature vectors are fed into a Support Vector Machine (SVM) to train a classification model, which then classifies the remaining chapters.
For background on SVMs, see Andrew Ng's Machine Learning lectures and the scikit-learn library.
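The train-then-classify step described above might look like the sketch below, using scikit-learn. Everything specific here is an assumption: the data are synthetic stand-ins for the per-chapter frequency vectors, and the linear kernel and the `StandardScaler` step are choices made for this illustration, since the article does not state which kernel or preprocessing the project uses.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins: 10 training chapters per class, 56 features each.
# Real inputs would be the word occurrence rates of chapters 20-29 and 110-119.
class1 = rng.normal(loc=0.02, scale=0.002, size=(10, 56))
class2 = rng.normal(loc=0.03, scale=0.002, size=(10, 56))
X_train = np.vstack([class1, class2])
y_train = np.array([1] * 10 + [2] * 10)

# Scaling is an assumption of this sketch: raw rates are tiny numbers,
# and standardizing them keeps the SVM margin well conditioned.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

# "Remaining chapters": 5 unseen synthetic vectors drawn from each class
X_rest = np.vstack([
    rng.normal(loc=0.02, scale=0.002, size=(5, 56)),
    rng.normal(loc=0.03, scale=0.002, size=(5, 56)),
])
predicted = model.predict(X_rest)
```

With real data, `X_rest` would hold the vectors of every chapter outside the two training ranges, and `predicted` would give one class label per chapter.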
Relevant academic papers: Shi Jianjun (2011) and Li Xianping (1978).
Project Structure
README.md
textProcesser.py # text processing
modelBuilder.py # model building
decisionMaker.py # classification
neg_trainset.npy # negative training set
pos_trainset.npy # positive training set
trainset.npy # training set
testset.npy # test set
text/
    redmansions.txt
    chapter-1 ... chapter-n
    chapter-words-1 ... chapter-words-n
    chapter-wordcount-1 ... chapter-wordcount-n
Usage Steps
Run textProcesser.py to split the novel into chapters, tokenize, and compute word frequencies.
Run modelBuilder.py to extract feature vectors and build the classification model.
Run decisionMaker.py to classify the chapters.
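As an illustration of the first step, splitting the novel into chapters could be sketched like this. It is not the code in textProcesser.py: the function `split_chapters` is hypothetical, and the sketch assumes each chapter starts on a line beginning with "第", a Chinese numeral, and "回", which may not match the edition in redmansions.txt.

```python
import re

# Assumed heading format: line start, 第, Chinese numerals, 回
HEADING = re.compile(r"^第[零一二三四五六七八九十百]+回", re.MULTILINE)


def split_chapters(novel_text):
    """Split the full text into chapters, each starting at a heading line."""
    starts = [match.start() for match in HEADING.finditer(novel_text)]
    ends = starts[1:] + [len(novel_text)]
    return [novel_text[start:end].strip() for start, end in zip(starts, ends)]


# Tiny sample: "第一回" also occurs mid-line and must not start a new chapter
sample = (
    "第一回 甄士隐梦幻识通灵\n"
    "此开卷第一回也。\n"
    "第二回 贾夫人仙逝扬州城\n"
    "正文略。\n"
)
chapters = split_chapters(sample)
```

Anchoring the pattern to the start of a line keeps in-text mentions of a chapter number from being mistaken for headings.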
Conclusion
The classifier assigns chapters 1-80 to one class and chapters 81-120 to the other, with the switch falling around chapter 80. The last 40 chapters therefore show a distinct style, which supports the view that they were not written by the original author. A few chapters are misclassified, which may stem from the choice of features or from the specific text edition used.
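One way to inspect such a result is to list the chapters whose predicted class disagrees with the 1-80 / 81-120 split. The sketch below is hypothetical: the function `disagreements` is not part of the project, and the labels are placeholders, not the article's actual output.

```python
def disagreements(labels, boundary=80):
    """Return 1-based chapter numbers whose label differs from the expected one.

    Expected: class 1 for chapters up to `boundary`, class 2 after it.
    """
    wrong = []
    for chapter, label in enumerate(labels, start=1):
        expected = 1 if chapter <= boundary else 2
        if label != expected:
            wrong.append(chapter)
    return wrong


# Placeholder labels for 120 chapters, NOT real results
labels = [1] * 80 + [2] * 40
labels[66] = 2   # pretend chapter 67 was assigned to class 2
labels[99] = 1   # pretend chapter 100 was assigned to class 1
wrong = disagreements(labels)
```

A short list concentrated near the boundary would be consistent with a change of style around chapter 80, while scattered disagreements would point toward feature or edition issues.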
21CTO