
How to Scrape Chinese Classic Novels with Python: A Step‑by‑Step Guide

This tutorial walks you through planning, extracting, and saving classic Chinese novel content from shicimingju.com using Python, regular expressions, and file storage, providing clear code examples and practical tips for successful web scraping.

Python Programming Learning Circle

Planning the Crawl

Before writing a spider, answer four questions: where to crawl, what to crawl, how to crawl, and how to store the scraped data.

The concepts of where and what are intertwined; identifying one naturally reveals the other.

In this example we scrape the novel site shicimingju.com, targeting works such as "Three Kingdoms" and "Sui‑Tang Legends".

The data to collect includes:

Main page: novel title, chapter titles, chapter URLs

Detail pages: the full text of each chapter
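The two-level plan above can be sketched as a small data structure. The field names follow the article's variables; the structure itself (and the `url_template` key) is purely illustrative:

```python
# A minimal sketch of the crawl plan: one index page yielding metadata,
# many detail pages yielding chapter text. Layout is illustrative only.
crawl_plan = {
    "index_page": {
        "url": "http://www.shicimingju.com/book/sanguoyanyi.html",
        "fields": ["book_name", "chapter", "bookurl"],
    },
    "detail_pages": {
        # Each chapter lives at a numbered URL under the book's path.
        "url_template": "http://www.shicimingju.com/book/sanguoyanyi/{n}.html",
        "fields": ["chapter_text"],
    },
}
```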

How to Crawl

1. Use regular expressions on the main page to extract the novel name (book_name), chapter titles (chapter), and chapter links (bookurl).

2. Iterate over bookurl to request each chapter page.

3. Apply regular expressions to extract the chapter content.
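Step 1 can be tried offline on a toy HTML fragment before touching the network. The markup below is hand-made for illustration, not copied from shicimingju.com, but the three patterns are the ones the full script uses:

```python
import re

# Invented HTML fragment mimicking the index page's structure.
html = '''
<h1>Sample Novel</h1>
<a href="/book/sample/1.html">Chapter One</a>
<a href="/book/sample/2.html">Chapter Two</a>
'''

# Same three extractions as the full script below.
book_name = re.findall(r'<h1>(.*)</h1>', html, re.S)
chapter = re.findall(r'href="/book/.{0,30}\d\.html">(.*?)</a>', html, re.S)
bookurl = re.findall(r'href="(/book/.{0,30}\d\.html)"', html, re.S)

print(book_name)  # ['Sample Novel']
print(chapter)    # ['Chapter One', 'Chapter Two']
print(bookurl)    # ['/book/sample/1.html', '/book/sample/2.html']
```

Because `re.findall` returns every non-overlapping match, one call per pattern collects all chapters at once.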

Saving the Scraped Data

1. Because the content is novel text, plain files are a better fit than a database.

2. Create a file named after the novel (book_name) and write each chapter title and its text sequentially.
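The storage step in isolation looks like the sketch below, assuming the scraped pieces are already in memory. The sample data and the temporary output directory are invented for illustration:

```python
import os
import tempfile

# Invented sample data standing in for the scraped results.
book_name = "Sample Novel"
chapters = [("Chapter One", "Text of chapter one."),
            ("Chapter Two", "Text of chapter two.")]

# Write to a temporary directory here; the full script uses a fixed path.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, book_name + ".txt")

# Append each chapter title and body in order: one file per novel.
with open(path, "a", encoding="utf-8") as f:
    for title, text in chapters:
        f.write(title + "\n")
        f.write(text + "\n")
```

Opening in append mode (`'a'`) lets each chapter be added as it is fetched, so a partial crawl still leaves a usable file.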

Code Implementation

<code>import urllib.request
import re

indexUrl = "http://www.shicimingju.com/book/sanguoyanyi.html"
html = urllib.request.urlopen(indexUrl).read()
html = html.decode('utf-8')

# Extract book name, chapter titles, and chapter URLs
book_name = re.findall(r'<h1>(.*)</h1>', html, re.S)
chapter = re.findall(r'href="/book/.{0,30}\d\.html">(.*?)</a>', html, re.S)
bookurl = re.findall(r'href="(/book/.{0,30}\d\.html)"', html, re.S)
chapterUrlBegin = re.sub(r'\.html$', '', indexUrl)  # base for chapter URLs

for i in range(len(bookurl)):
    # Get the chapter number from the relative URL
    number = re.findall(r'/(.{1,4})\.html', bookurl[i])
    # Build the full chapter URL from the base plus the chapter number
    chapterUrl = chapterUrlBegin + '/' + number[0] + '.html'
    # Fetch the chapter page
    chapterHtml = urllib.request.urlopen(chapterUrl).read()
    chapterHtml = chapterHtml.decode('utf-8', 'ignore')
    # Extract the chapter content
    chapterText = ''.join(re.findall(r'<div id="con2".*?>(.*?)</div>', chapterHtml, re.S))
    # Strip paragraph tags and non-breaking spaces
    chapterText = re.sub(r'</?p>', '', chapterText)
    chapterText = re.sub(r'&nbsp;', ' ', chapterText)
    # Append the chapter title and text to one file per novel
    with open('D:/book/' + ''.join(book_name) + '.txt', 'a', encoding='utf-8') as f:
        f.write(chapter[i] + "\n")
        f.write(chapterText + "\n")
</code>

Following these steps successfully scrapes the entire novel and saves it as a text file.

The overall process can be summed up as: use regular expressions to capture web information, then store the results in files.

Tags: python, regular expressions, web scraping, file storage, novel data
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
