Build a Robust Python Web Crawler: Modular Architecture & Full Code Walkthrough

This article explains how to design a modular Python web crawler by breaking the system into five core components—scheduler, URL manager, downloader, parser, and data storage—provides detailed code examples for each module, and demonstrates a complete end‑to‑end crawling workflow on a sample website.

MaGe Linux Operations

Apr 22, 2019

1. Introduction

We start with a question: do you really know how to write a crawler? A simple script often fits in a single .py file with a few requests calls, but a production-grade crawler also has to cope with failed downloads, duplicate URLs, and changes to where the data is stored. Splitting the functionality into modules keeps the crawler robust and maintainable.

2. Basic Crawler Architecture and Workflow

The basic crawler architecture is divided into five major components:

Crawler Scheduler – orchestrates the other four modules.

URL Manager – manages crawled and uncrawled URLs and provides an interface for new URLs.

HTML Downloader – fetches the HTML of target pages.

HTML Parser – extracts data from the HTML source, sends new URLs back to the URL manager, and passes processed data to the storage module.

Data Storage – stores the data extracted by the parser locally.
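The way these five pieces hand work to each other can be sketched in a few lines. This is a minimal illustration, not the article's implementation: FAKE_SITE, fetch, parse, and crawl are hypothetical names, and an in-memory dict stands in for the network so the flow is visible without any HTTP calls.

```python
# Hypothetical in-memory "site": URL -> (title, outgoing links).
FAKE_SITE = {
    "/page/1": ("First", ["/page/2", "/page/3"]),
    "/page/2": ("Second", ["/page/1", "/page/3"]),
    "/page/3": ("Third", []),
}


def fetch(url):
    """Downloader stand-in: return the raw page, or None if it does not exist."""
    return FAKE_SITE.get(url)


def parse(url, page):
    """Parser stand-in: split a page into new URLs and one data record."""
    title, links = page
    return set(links), {"url": url, "title": title}


def crawl(root_url, limit=100):
    """Scheduler: drive the URL manager, downloader, parser, and storage."""
    pending, done = {root_url}, set()   # URL manager state
    records = []                        # data storage
    while pending and len(done) < limit:
        url = pending.pop()
        done.add(url)
        page = fetch(url)
        if page is None:
            continue
        new_urls, record = parse(url, page)
        pending |= new_urls - done      # only queue URLs not yet crawled
        records.append(record)
    return records


if __name__ == "__main__":
    for record in crawl("/page/1"):
        print(record)
```

Each page is visited once even though the pages link back to each other, because URLs that have already been crawled are never queued again.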

3. Practical Example: Crawling a Sample Site

Below is a screenshot of the target site we will crawl: the paginated listing at http://www.runoob.com/w3cnote/page/N, starting from page 1.

4. URL Manager (URLManager.py)

class URLManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def has_new_url(self):
        # Check whether any uncrawled URLs remain
        return self.new_url_size() != 0

    def get_new_url(self):
        # Take one uncrawled URL (set.pop() returns an arbitrary element)
        new_url = self.new_urls.pop()
        # Once handed out, record it as crawled
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        # Add a single URL to the uncrawled set
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Add a collection of URLs to the uncrawled set
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def new_url_size(self):
        # Number of uncrawled URLs
        return len(self.new_urls)

    def old_url_size(self):
        # Number of crawled URLs
        return len(self.old_urls)

The manager maintains two sets: one for URLs that have been crawled and one for those still pending. Python's set type deduplicates within each set, and the check in add_new_url keeps an already crawled URL from being queued again.
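One consequence of keeping the pending URLs in a set is that set.pop() removes an arbitrary element, so pages are not necessarily crawled in the order they were discovered. If discovery order matters, a queue can hold the pending URLs while a set still handles deduplication. The sketch below is one possible variant, not part of the original module; the class name OrderedURLManager and the seen_urls attribute are hypothetical.

```python
from collections import deque


class OrderedURLManager(object):
    """Hypothetical variant of URLManager that hands out URLs in FIFO order."""

    def __init__(self):
        self.new_urls = deque()   # pending URLs, in discovery order
        self.seen_urls = set()    # every URL ever added (pending or crawled)
        self.old_urls = set()     # URLs already handed out

    def has_new_url(self):
        return len(self.new_urls) != 0

    def add_new_url(self, url):
        # Skip empty values and anything already queued or crawled.
        if not url or url in self.seen_urls:
            return
        self.seen_urls.add(url)
        self.new_urls.append(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def get_new_url(self):
        # Oldest pending URL first.
        url = self.new_urls.popleft()
        self.old_urls.add(url)
        return url


manager = OrderedURLManager()
manager.add_new_urls(["/page/1", "/page/2", "/page/1"])  # duplicate is dropped
print(manager.get_new_url())  # /page/1
print(manager.get_new_url())  # /page/2
print(manager.has_new_url())  # False
```

The method names match the original class, so the scheduler could use either manager without changes.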

5. HTML Downloader (HTMLDownload.py)

import requests


class HTMLDownload(object):
    def download(self, url):
        if url is None:
            return None
        s = requests.Session()
        # Send a browser-like User-Agent
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        # The timeout is an added safeguard (10 s is an arbitrary choice)
        # so that a stalled server cannot hang the crawl
        res = s.get(url, timeout=10)
        # Only accept a normal 200 response
        if res.status_code == 200:
            res.encoding = 'utf-8'
            return res.text
        return None

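This module retrieves the raw HTML of a given URL using the requests library. It sends a browser-like User-Agent header, returns the page text for a 200 response, and returns None otherwise.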
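The downloader makes a single attempt per URL. A crawler that runs for a while may want to retry transient failures. The sketch below wraps any download function in a simple retry loop; download_with_retry, its parameters, and flaky_download are hypothetical names introduced for illustration, and the fake downloader stands in for HTMLDownload.download so the example runs without network access.

```python
import time


def download_with_retry(download, url, attempts=3, delay=0.0):
    """Call download(url) up to `attempts` times; return the first non-None result.

    `download` is any callable that returns the page text, or None on failure,
    and may raise an exception (for example a network error).
    """
    for attempt in range(1, attempts + 1):
        try:
            html = download(url)
            if html is not None:
                return html
        except Exception as exc:  # broad on purpose: this is only a sketch
            print("attempt %d for %s raised %r" % (attempt, url, exc))
        if attempt < attempts:
            time.sleep(delay)     # pause before the next try
    return None


# Fake downloader that fails twice, then succeeds.
calls = {"count": 0}


def flaky_download(url):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return "<html>ok</html>"


print(download_with_retry(flaky_download, "http://example.com/"))
```

With the real downloader, the call would look like download_with_retry(HTMLDownload().download, url, delay=1.0), where the one-second delay is again only an illustrative value.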

6. HTML Parser (HTMLParser.py)

from bs4 import BeautifulSoup


class HTMLParser(object):
    def parser(self, page_url, html_cont):
        '''
        Parse a page and extract new URLs and data.
        :param page_url: URL of the downloaded page
        :param html_cont: content of the downloaded page
        :return: (new_urls, new_data); an empty set and None if there is nothing to parse
        '''
        if page_url is None or html_cont is None:
            # Keep the return shape so the caller can always unpack two values
            return set(), None
        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        '''
        Build the set of new URLs.
        The listing pages follow a fixed pattern, so they are generated
        here rather than read from the page.
        '''
        new_urls = set()
        for page in range(1, 100):
            new_urls.add("http://www.runoob.com/w3cnote/page/" + str(page))
        return new_urls

    def _get_new_data(self, page_url, soup):
        '''
        Extract the title and summary.
        '''
        intro = soup.find('div', class_='post-intro')
        if intro is None:
            return None  # page does not have the expected layout
        title = intro.find('h2')
        summary = intro.find('p')
        if title is None or summary is None:
            return None
        return {
            'url': page_url,
            'title': title.get_text(),
            'summary': summary.get_text(),
        }

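The parser uses BeautifulSoup to pull the title and summary from the first post-intro block on each page. New URLs are not read from the HTML: _get_new_urls generates listing pages 1 through 99 from a fixed URL pattern.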
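Because _get_new_urls builds its URLs from a fixed pattern, it never looks at the links that are actually on the page. Reading them from the page usually means collecting every href and resolving it against the page URL. The sketch below shows that idea with the standard library's html.parser and urllib.parse.urljoin instead of BeautifulSoup, purely so that it runs without third-party packages; LinkCollector, extract_links, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from urllib.parse import urljoin


class LinkCollector(StdlibHTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.add(urljoin(self.page_url, value))


def extract_links(page_url, html_cont):
    collector = LinkCollector(page_url)
    collector.feed(html_cont)
    return collector.links


sample = '<a href="/w3cnote/page/2">next</a> <a href="http://www.runoob.com/">home</a>'
print(sorted(extract_links("http://www.runoob.com/w3cnote/page/1", sample)))
```

A real crawler would normally also filter the result, for example keeping only URLs on the target site, before handing it to the URL manager.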

7. Data Storage (DataOutput.py)

import codecs


class DataOutput(object):
    def __init__(self):
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Rewrite the whole file from everything collected so far. Mode 'w'
        # keeps the result a single valid document even when this method is
        # called more than once, and the list is no longer modified while
        # it is being iterated (which skipped every other record).
        with codecs.open('baike.html', 'w', encoding='utf-8') as fout:
            fout.write("<html>")
            fout.write("<head><meta charset='utf-8'/></head>")
            fout.write("<body>")
            fout.write("<table>")
            for data in self.datas:
                fout.write("<tr>")
                fout.write("<td>%s</td>" % data['url'])
                fout.write("<td>《%s》</td>" % data['title'])
                fout.write("<td>[%s]</td>" % data['summary'])
                fout.write("</tr>")
            fout.write("</table>")
            fout.write("</body>")
            fout.write("</html>")

This component writes the extracted data into an HTML table; in practice you could store it in MySQL, CSV, etc.
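As one example of an alternative backend, the same records could go to CSV with the standard library's csv module. The sketch below is an assumption about how such a writer might look; CSVOutput, output_csv, and the sample record are hypothetical, and the class expects the url/title/summary dictionaries produced by the parser.

```python
import csv
import io


class CSVOutput(object):
    """Hypothetical alternative to DataOutput that writes CSV rows."""

    FIELDS = ["url", "title", "summary"]

    def __init__(self):
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_csv(self, fout):
        # `fout` is any text file object opened for writing.
        writer = csv.DictWriter(fout, fieldnames=self.FIELDS)
        writer.writeheader()
        writer.writerows(self.datas)


output = CSVOutput()
output.store_data({"url": "http://example.com/1", "title": "Hello", "summary": "A, B"})
buffer = io.StringIO()  # stands in for a file opened with newline=''
output.output_csv(buffer)
print(buffer.getvalue())
```

Passing the file object in, rather than hard-coding a file name, makes the writer easy to exercise with an in-memory buffer as shown here.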

8. Crawler Scheduler (SpiderMan.py)

from base.DataOutput import DataOutput
from base.HTMLParser import HTMLParser
from base.HTMLDownload import HTMLDownload
from base.URLManager import URLManager


class SpiderMan(object):
    def __init__(self):
        self.manager = URLManager()
        self.downloader = HTMLDownload()
        self.parser = HTMLParser()
        self.output = DataOutput()

    def crawl(self, root_url):
        # Seed the URL manager with the entry URL
        self.manager.add_new_url(root_url)
        # Keep crawling until no URLs remain or 100 have been crawled
        while self.manager.has_new_url() and self.manager.old_url_size() < 100:
            try:
                # Get the next URL from the URL manager
                new_url = self.manager.get_new_url()
                print(new_url)
                # Download the page
                html = self.downloader.download(new_url)
                if html is None:
                    print("download failed: %s" % new_url)
                    continue
                # Extract new URLs and data from the page
                new_urls, data = self.parser.parser(new_url, html)
                # Hand the extracted URLs back to the URL manager
                self.manager.add_new_urls(new_urls)
                # Queue the extracted data for output
                self.output.store_data(data)
                print("Crawled %s links so far" % self.manager.old_url_size())
            except Exception as e:
                print("failed")
                print(e)
            # Write the collected data out as HTML
            self.output.output_html()

if __name__ == '__main__':
    spider_man = SpiderMan()
    spider_man.crawl("http://www.runoob.com/w3cnote/page/1")

The scheduler ties all modules together, repeatedly fetching new URLs, downloading pages, parsing data, storing results, and writing them to an HTML file. The final output screenshot is shown below.

9. Conclusion

We have introduced the five essential modules of a crawler architecture. Whether you are building a large‑scale spider or a small script, organizing your code into a scheduler, URL manager, downloader, parser, and storage component leads to cleaner, more maintainable, and more robust crawling projects.
