
Python Web Crawling Tutorial: From Basics to a Full‑Scale Novel Scraper

This article introduces web crawling fundamentals, demonstrates how to inspect HTML elements, walks through simple examples using urllib, requests, and BeautifulSoup, and culminates in a complete Python script that extracts chapter links and contents from a novel website, saving them to a text file.


Web crawlers, also known as web spiders, retrieve page content by requesting URLs such as https://www.baidu.com/. Before writing a crawler, you should be familiar with the browser's element inspection tools (right‑click → Inspect) to locate the HTML structure you need to parse.

The first practical step is to fetch a page’s HTML. In Python 3 you can use the built‑in urllib.request module or the third‑party requests library. Install requests with:

<code>pip install requests</code>
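For comparison, the standard-library route mentioned above can be sketched with urllib.request. The fetch_html helper below is a hypothetical name introduced here for illustration, not part of the original script:

```python
from urllib import request

def fetch_html(url, encoding='utf-8'):
    """Fetch a URL with the standard library and return its decoded HTML."""
    with request.urlopen(url) as resp:
        return resp.read().decode(encoding)

# Example (same page as the requests demo below):
# print(fetch_html('http://gitbook.cn/'))
```

requests is usually preferred for its friendlier API (automatic encoding detection, sessions, timeouts), but urllib.request needs no installation.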

Example using requests.get() to obtain the HTML of http://gitbook.cn/:

<code># -*- coding:UTF-8 -*-
import requests
if __name__ == '__main__':
    target = 'http://gitbook.cn/'
    req = requests.get(url=target)
    print(req.text)
</code>

After retrieving the raw HTML, you typically need to extract the main content while discarding markup such as &lt;div&gt; and &lt;br&gt; tags. This is where BeautifulSoup shines.

Install BeautifulSoup 4 (package name beautifulsoup4 ) with:

<code>pip install beautifulsoup4</code>

or, with the older (now deprecated) setuptools installer:

<code>easy_install beautifulsoup4</code>

Use it to parse the HTML and locate the div whose class attribute is showtxt , which contains the article body:

<code># -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests
if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    html = req.text
    # Pass an explicit parser to avoid BeautifulSoup's "no parser specified" warning.
    bf = BeautifulSoup(html, 'html.parser')
    texts = bf.find_all('div', class_='showtxt')
    # The site pads paragraphs with runs of eight non-breaking spaces (\xa0);
    # replace each run with a blank line.
    print(texts[0].text.replace('\xa0'*8, '\n\n'))
</code>

For a full‑scale novel scraper, first obtain the list of chapter links. The chapter list resides in a div with class="listmain" . Each chapter link is an &lt;a&gt; tag with an href attribute like /1_1094/5403177.html . Combine this with the base URL to form the complete chapter URL.
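Note that naively concatenating the base URL with an href that starts with a slash produces a double slash ( http://www.biqukan.com//1_1094/... ). The standard library's urllib.parse.urljoin normalizes the join. A minimal sketch, with illustrative hrefs taken from the pattern described above:

```python
from urllib.parse import urljoin

server = 'http://www.biqukan.com/'
hrefs = ['/1_1094/5403177.html', '/1_1094/5403178.html']  # as found in the listmain <a> tags

# urljoin handles the joining slash correctly, unlike plain string concatenation.
urls = [urljoin(server, h) for h in hrefs]
print(urls[0])  # http://www.biqukan.com/1_1094/5403177.html
```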

Below is a compact downloader class that gathers all chapter URLs, fetches each chapter’s content, and writes it to 一念永恒.txt (the novel 《一念永恒》, "A Will Eternal"):

<code># -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests, sys

class downloader(object):
    def __init__(self):
        self.server = 'http://www.biqukan.com/'
        self.target = 'http://www.biqukan.com/1_1094/'
        self.names = []  # chapter titles
        self.urls = []   # chapter URLs
        self.nums = 0    # chapter count

    def get_download_url(self):
        req = requests.get(url=self.target)
        html = req.text
        div_bf = BeautifulSoup(html, 'html.parser')
        div = div_bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(div[0]), 'html.parser')
        a = a_bf.find_all('a')
        self.nums = len(a[15:])  # skip ads and extra entries
        for each in a[15:]:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))

    def get_contents(self, target):
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, 'html.parser')
        texts = bf.find_all('div', class_='showtxt')
        texts = texts[0].text.replace('\xa0'*8, '\n\n')
        return texts

    def writer(self, name, path, text):
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.write('\n\n')

if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('《一念永恒》 download started:')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i]))
        # (i + 1) so the counter reaches 100% on the last chapter
        sys.stdout.write("  Downloaded: %.3f%%" % ((i + 1) / dl.nums * 100) + '\r')
        sys.stdout.flush()
    print('《一念永恒》 download finished')
</code>

The script runs in a single process, sequentially downloading each chapter and displaying progress. After completion, you will have a plain‑text file containing the entire novel.
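Real sites occasionally time out or rate-limit sequential crawlers, so a production version would benefit from retries and a polite pause between requests. A minimal sketch; fetch_with_retry is a hypothetical helper, not part of the original script:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on failure, wait `delay` seconds and retry, up to `retries` attempts."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise          # all attempts exhausted: re-raise the last error
            time.sleep(delay)  # back off before the next attempt

# Hypothetical wiring into the downloader, using requests with a timeout:
# content = fetch_with_retry(lambda u: requests.get(u, timeout=10).text, chapter_url)
```

Passing the fetch function in as a parameter keeps the retry logic independent of requests, so it can be tested without network access.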

Overall, this guide walks you through inspecting HTML, using requests to fetch pages, parsing with BeautifulSoup , extracting links and content, and assembling a functional web crawler in Python.

Tags: tutorial, web scraping, Requests, BeautifulSoup, crawler
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
