How to Scrape Chinese Classic Novels with Python: A Step‑by‑Step Guide
This tutorial walks you through planning, extracting, and saving classic Chinese novel content from shicimingju.com using Python, regular expressions, and file storage, providing clear code examples and practical tips for successful web scraping.
Planning the Crawl
Before writing a spider, clarify four steps: where to crawl, what to crawl, how to crawl, and how to store the scraped data.
The concepts of where and what are intertwined; identifying one naturally reveals the other.
In this example we scrape the novel site shicimingju.com, targeting works such as "Romance of the Three Kingdoms" and "Sui‑Tang Legends".
The data to collect includes:
Main page: novel title, chapter titles, chapter URLs
Detail pages: the full text of each chapter
How to Crawl
1. Use regular expressions on the main page to extract the novel name (book_name), chapter titles (chapter), and chapter links (bookurl).
2. Iterate over bookurl to request each chapter page.
3. Apply regular expressions to extract the chapter content.
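Before tackling the real site, the extraction step can be tried on a small fabricated fragment. The HTML below is illustrative only, not the site's actual markup, but the non‑greedy patterns are the same shape as those used later:

```python
import re

# A fabricated fragment mimicking an index page's chapter list (illustrative only)
html = '''
<h1>Sample Novel</h1>
<a href="/book/sample/1.html">Chapter One</a>
<a href="/book/sample/2.html">Chapter Two</a>
'''

# Non-greedy (.*?) stops at the first closing delimiter, so each
# match captures exactly one title or one relative URL
book_name = re.findall(r'<h1>(.*)</h1>', html, re.S)
chapter = re.findall(r'href="/book/.*?\.html">(.*?)</a>', html, re.S)
bookurl = re.findall(r'href="(/book/.*?\.html)"', html, re.S)

print(book_name)  # ['Sample Novel']
print(chapter)    # ['Chapter One', 'Chapter Two']
print(bookurl)    # ['/book/sample/1.html', '/book/sample/2.html']
```

Note the raw-string prefix `r'...'` on every pattern, which keeps backslashes such as `\.` intact.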
Saving the Scraped Data
1. Because the content is plain novel text, storing it in files is more appropriate than a database.
2. Create a file named after the novel (book_name) and write each chapter title and its text sequentially.
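The storage step on its own can be sketched as below; the chapter data is hard‑coded for illustration, and the file is written to the current directory rather than the D:/book path used later:

```python
# Hard-coded sample data standing in for the scraped title and chapters
book_name = 'Sample Novel'
chapters = [('Chapter One', 'Text of chapter one.'),
            ('Chapter Two', 'Text of chapter two.')]

# Append each chapter title and body to a single file named after the novel
with open(book_name + '.txt', 'a', encoding='utf-8') as f:
    for title, text in chapters:
        f.write(title + '\n')
        f.write(text + '\n')
```

Opening the file in append mode ('a') lets each chapter be added as it is scraped, at the cost of duplicating content if the script is re-run; delete the file first or switch to 'w' for a clean start.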
Code Implementation
<code>import urllib.request
import re

indexUrl = "http://www.shicimingju.com/book/sanguoyanyi.html"
html = urllib.request.urlopen(indexUrl).read()
html = html.decode('utf-8')

# Extract book name, chapter titles, and chapter URLs
book_name = re.findall(r'<h1>(.*)</h1>', html, re.S)
chapter = re.findall(r'href="/book/.{0,30}\d\.html">(.*?)</a>', html, re.S)
bookurl = re.findall(r'href="(/book/.{0,30}\d\.html)"', html, re.S)
chapterUrlBegin = re.sub(r'\.html$', '', indexUrl)  # base for chapter URLs

for i in range(len(bookurl)):
    # Get the chapter number from the relative URL
    number = re.findall(r'/(.{1,4})\.html', bookurl[i])
    # Build the full chapter URL
    chapterUrl = chapterUrlBegin + '/' + number[0] + '.html'
    # Fetch the chapter page
    chapterHtml = urllib.request.urlopen(chapterUrl).read()
    chapterHtml = chapterHtml.decode('utf-8', 'ignore')
    # Extract the chapter content
    chapterText = ''.join(re.findall(r'<div id="con2".*?>(.*?)</div>', chapterHtml, re.S))
    # Strip paragraph tags and normalize non-breaking spaces
    chapterText = re.sub(r'</?p>', '', chapterText)
    chapterText = chapterText.replace('&nbsp;', ' ')
    # Append the chapter title and text to the file
    with open('D:/book/' + ''.join(book_name) + '.txt', 'a', encoding='utf-8') as f:
        f.write(chapter[i] + "\n")
        f.write(chapterText + "\n")
</code>Following these steps successfully scrapes the entire novel and saves it as a text file.
The overall process can be summed up as: use regular expressions to capture web information, then store the results in files.
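One practical refinement worth adding (not part of the original script): send a browser-like User-Agent header and pause briefly between requests, so the site is less likely to block the crawl. A minimal sketch, where the fetch helper and the one-second delay are assumptions rather than anything the site requires:

```python
import time
import urllib.request

def fetch(url, delay=1.0):
    """Fetch a URL with a browser-like User-Agent, then pause briefly."""
    # A Request object lets us attach custom headers to the GET
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    time.sleep(delay)  # be polite: space out successive requests
    return html
```

In the main script, each `urllib.request.urlopen(...)` call would then become `fetch(...)`.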