Python Web Crawling Tutorial: From Basics to a Full‑Scale Novel Scraper
This article introduces web crawling fundamentals, demonstrates how to inspect HTML elements, walks through simple examples using urllib, requests, and BeautifulSoup, and culminates in a complete Python script that extracts chapter links and contents from a novel website, saving them to a text file.
Web crawlers, also known as web spiders, retrieve page content by requesting URLs such as https://www.baidu.com/. Before writing a crawler, you should be familiar with the browser's element inspection tools (right‑click → Inspect) to locate the HTML structure you need to parse.
The first practical step is to fetch a page’s HTML. In Python 3 you can use the built‑in urllib.request module or the third‑party requests library. Install requests with:
<code>pip install requests</code>

Example using requests.get() to obtain the HTML of http://gitbook.cn/:
<code># -*- coding:UTF-8 -*-
import requests
if __name__ == '__main__':
    target = 'http://gitbook.cn/'
    req = requests.get(url=target)   # fetch the page
    print(req.text)                  # decoded response body
</code>

After retrieving the raw HTML, you typically need to extract the main content while discarding markup such as <div> and <br> tags. This is where BeautifulSoup shines.
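For completeness, the built‑in urllib.request module mentioned earlier can fetch the same page without installing anything. A minimal sketch, with the fetch wrapped in a helper function and no error handling:

```python
from urllib import request

def fetch(url, encoding='utf-8'):
    # Standard-library equivalent of requests.get(url).text
    with request.urlopen(url) as resp:
        return resp.read().decode(encoding)

# Usage (network required):
# print(fetch('http://gitbook.cn/'))
```

requests is usually preferred because it handles encodings, redirects, and sessions with less ceremony, but urllib.request is always available.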
Install BeautifulSoup (the beautifulsoup4 package) with:

pip install beautifulsoup4

(The older easy_install beautifulsoup4 also works, but easy_install is deprecated.)
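As a quick check that the install works, note that .text on a parsed element strips all nested tags and returns only the text nodes. A toy snippet (hypothetical HTML, not fetched from the real site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a chapter page
html = '<div class="showtxt">First line.<br/>Second line.</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='showtxt')
print(div.text)  # → First line.Second line.  (the <br/> tag is stripped)
```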
Use it to parse the HTML and locate the div whose class attribute is showtxt, which contains the article body:
<code># -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests
if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')       # specify the parser explicitly
    texts = bf.find_all('div', class_='showtxt')
    # The site indents paragraphs with eight &nbsp; characters ('\xa0');
    # turn each run of them into a blank line.
    print(texts[0].text.replace('\xa0'*8, '\n\n'))
</code>

For a full‑scale novel scraper, first obtain the list of chapter links. The chapter list resides in a div with class="listmain". Each chapter link is an <a> tag with an href attribute like /1_1094/5403177.html. Combine this with the base URL to form the complete chapter URL.
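That join step is safer with urllib.parse.urljoin than with plain string concatenation, since urljoin handles the leading slash of the relative href correctly. A small sketch using the base URL and an href value from above:

```python
from urllib.parse import urljoin

server = 'http://www.biqukan.com/'
href = '/1_1094/5403177.html'   # relative link taken from an <a> tag
url = urljoin(server, href)     # avoids the double slash that server + href would produce
print(url)  # → http://www.biqukan.com/1_1094/5403177.html
```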
Below is a compact downloader class that gathers all chapter URLs, fetches each chapter’s content, and writes it to 一念永恒.txt:
<code># -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests, sys
class downloader(object):
    def __init__(self):
        self.server = 'http://www.biqukan.com/'
        self.target = 'http://www.biqukan.com/1_1094/'
        self.names = []   # chapter titles
        self.urls = []    # chapter URLs
        self.nums = 0     # chapter count

    def get_download_url(self):
        req = requests.get(url=self.target)
        html = req.text
        div_bf = BeautifulSoup(html, 'html.parser')
        div = div_bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(div[0]), 'html.parser')
        a = a_bf.find_all('a')
        self.nums = len(a[15:])   # skip the first 15 entries (ads and "latest chapter" links)
        for each in a[15:]:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))

    def get_contents(self, target):
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, 'html.parser')
        texts = bf.find_all('div', class_='showtxt')
        # Replace the eight-&nbsp; paragraph indent with a blank line
        texts = texts[0].text.replace('\xa0'*8, '\n\n')
        return texts

    def writer(self, name, path, text):
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.write('\n\n')

if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('《一念永恒》 download started:')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i]))
        sys.stdout.write("  Downloaded: %.3f%%" % ((i + 1) / dl.nums * 100) + '\r')
        sys.stdout.flush()
    print('《一念永恒》 download finished')
</code>

The script runs in a single process, sequentially downloading each chapter and displaying progress. After completion, you will have a plain‑text file containing the entire novel.
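The append‑mode writer step can be exercised in isolation. This sketch (with hypothetical chapter names and a temporary file) shows the exact file layout the writer produces — a title line, the body, then a blank line per chapter:

```python
import os
import tempfile

def write_chapter(name, path, text):
    # Mirrors downloader.writer(): append a title line, the body, then a blank line
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(text)
        f.write('\n\n')

path = os.path.join(tempfile.mkdtemp(), 'novel.txt')
write_chapter('Chapter 1', path, 'First chapter body.')
write_chapter('Chapter 2', path, 'Second chapter body.')
with open(path, encoding='utf-8') as f:
    content = f.read()
print(content)
```

Because the file is opened in append mode, re‑running the script adds duplicate chapters rather than overwriting; delete the output file before a fresh run.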
Overall, this guide walks you through inspecting HTML, using requests to fetch pages, parsing with BeautifulSoup, extracting links and content, and assembling a functional web crawler in Python.