Master Web Crawling in Python: From Data Fetching to Image Download

This article explains how to build a Python web crawler that fetches HTML pages with urllib, parses them using BeautifulSoup to extract image URLs, and downloads the images, covering all three stages with complete code examples and parser options.

Python Crawling & Data Mining

The Internet is a massive repository of data. When data must be collected and analyzed at scale, the work is done by a program called a web crawler (or spider). The crawler built here works in three stages: fetching data, parsing data, and downloading images.

Stage 1 – Fetching Data

Fetching data means downloading the HTML of a target URL. The core of this step is network communication, which can be performed with Python’s built‑in urllib.request module.

# coding=utf-8
# Code file: code/chapter6/6.1.1.py
# Fetch data
import urllib.request

url = 'http://p.weather.com.cn/'

def getHtmlString():
    """Send a network request and return the HTML as a string."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        data = response.read()
        # Decode the raw bytes as UTF-8, dropping any bytes that cannot be decoded
        htmlstr = data.decode(encoding='utf-8', errors='ignore')
        return htmlstr

if __name__ == '__main__':
    html = getHtmlString()
    print(html)
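
The function above sends a bare request, assumes UTF-8, and lets network errors propagate to the caller. A slightly more defensive variant might look like the sketch below. The names build_request, decode_body, and fetch_html, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original code.

```python
import urllib.request

# Hypothetical User-Agent string; some sites reject urllib's default one.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(data, charset=None):
    """Decode bytes with the declared charset, falling back to UTF-8."""
    try:
        return data.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The server declared a charset name that Python does not know.
        return data.decode("utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Return the page as a string, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            # Use the charset from the Content-Type header when one is given.
            charset = response.headers.get_content_charset()
            return decode_body(response.read(), charset)
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        print("Request failed: {}".format(exc))
        return None
```

Calling fetch_html('http://p.weather.com.cn/') would then stand in for getHtmlString(), with the caller checking for None before parsing.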

Stage 2 – Parsing Data

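BeautifulSoup is a Python library for extracting data from HTML or XML documents. Install it with:

pip install beautifulsoup4

Commonly used BeautifulSoup methods include find_all (return every matching tag), find (return the first matching tag), select (match tags with a CSS selector), and get (read an attribute value from a tag). Commonly used attributes are title and text.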
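
To make these methods and attributes concrete, here is a small sketch that runs them against an inline HTML string. The markup in SAMPLE_HTML is invented for illustration and does not come from the target site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate the API.
SAMPLE_HTML = """
<html><head><title>Weather Photos</title></head>
<body>
  <div class="gallery">
    <a href="/page1.html"><img src="/img/a.jpg" alt="first"></a>
    <img src="/img/b.png">
    <img alt="no source">
  </div>
  <p id="note">Updated daily.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# title gives the <title> tag; text gives the string inside a tag.
page_title = soup.title.text

# find_all returns every matching tag, find returns only the first.
all_imgs = soup.find_all("img")
first_link = soup.find("a")

# get reads an attribute and returns None when the attribute is missing.
link_href = first_link.get("href")
srcs = [img.get("src") for img in all_imgs]

# select takes a CSS selector; this one keeps only <img> tags that have src.
gallery_imgs = soup.select("div.gallery img[src]")

note_text = soup.find(id="note").text
```

Note that the third img tag has no src attribute, so get('src') returns None for it. Any code that calls string methods on the result has to account for that.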

Example code that extracts image URLs from the fetched HTML:

# coding=utf-8
# Code file: code/chapter6/6.1.2.py
# Parse data
import urllib.request
from bs4 import BeautifulSoup

url = 'http://p.weather.com.cn/'

def getHtmlString():
    """Send a network request and return the HTML as a string."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        data = response.read()
        htmlstr = data.decode(encoding='utf-8', errors='ignore')
        return htmlstr

def find_imageurls(htmlstr):
    """Return the src values of all <img> tags that end in .png or .jpg."""
    sp = BeautifulSoup(htmlstr, 'html.parser')
    imgtaglist = sp.find_all('img')
    # get('src') returns None for an <img> without a src attribute; skip those
    srclist = [u.get('src') for u in imgtaglist if u.get('src')]
    # Keep only PNG and JPG images, ignoring case
    filtered_srclist = [u for u in srclist if u.lower().endswith(('.png', '.jpg'))]
    return filtered_srclist

if __name__ == '__main__':
    html = getHtmlString()
    url_list = find_imageurls(html)
    for img_url in url_list:
        print(img_url)
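
The src values extracted this way are used as-is, which works only when the page writes them as absolute URLs. If a page uses relative or protocol-relative paths, they would need to be resolved against the page URL first. A minimal sketch using the standard library's urljoin follows; the helper name absolutize and the sample paths are assumptions made for illustration.

```python
from urllib.parse import urljoin


def absolutize(base_url, srcs):
    """Resolve each src against base_url, skipping empty or missing values."""
    return [urljoin(base_url, src) for src in srcs if src]


# Hypothetical src values of the kinds a page might contain.
sample_srcs = [
    "/img/a.jpg",                 # root-relative path
    None,                         # <img> without a src attribute
    "//img.example.com/b.png",    # protocol-relative URL
    "http://example.com/c.jpg",   # already absolute, left unchanged
]

absolute_urls = absolutize("http://p.weather.com.cn/", sample_srcs)
for u in absolute_urls:
    print(u)
```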

Stage 3 – Downloading Images

The final step is to download each image file using its URL.

# coding=utf-8
# Code file: code/chapter6/6.1.3.py
import os
import urllib.request
from bs4 import BeautifulSoup

url = 'http://p.weather.com.cn/'

# (functions getHtmlString and find_imageurls omitted for brevity)

if __name__ == '__main__':
    html = getHtmlString()
    url_list = find_imageurls(html)
    # Create the output directory once, before the loop
    os.makedirs('download', exist_ok=True)
    for img_url in url_list:
        req = urllib.request.Request(img_url)
        with urllib.request.urlopen(req) as response:
            data = response.read()
        # Use the text after the last '/' as the file name
        filename = img_url[img_url.rfind('/')+1:]
        filepath = os.path.join('download', filename)
        with open(filepath, 'wb') as f:
            f.write(data)
        print('Downloaded image: {}.'.format(filename))
    print('Done.')
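
Taking everything after the last '/' as the file name works for plain image URLs, but it would also keep a query string such as ?v=2 and yields an empty name when the URL ends in a slash. One possible alternative, sketched below, parses the URL first. The function name filename_from_url, the fallback name 'image', and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(img_url, default="image"):
    """Derive a file name from the path part of a URL."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(img_url).path
    # Decode percent-escapes, then keep only the last path component.
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the path ends in '/'.
    return name or default


print(filename_from_url("http://example.com/img/a.jpg?v=2"))
print(filename_from_url("http://example.com/img/"))
```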

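Common parsers for BeautifulSoup include html.parser, lxml, lxml-xml, and html5lib. html.parser ships with Python's standard library and needs no extra installation. lxml is considerably faster but requires the third-party lxml package. lxml-xml, also provided by the lxml package, is the option for parsing XML rather than HTML. html5lib parses a page the way a web browser does and is the most tolerant of malformed markup, but it is the slowest of the four and must be installed separately.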
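
Because lxml and html5lib are optional installs, a script can try a faster parser first and fall back to the built-in one. The sketch below shows one way to do that; the helper name make_soup and the preference order are assumptions, and FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) tuple.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup('<p>Hi<img src="a.png">')
print(parser_used)
print(soup.find("img").get("src"))
```

Since html.parser is part of the standard library, keeping it last in the tuple means the function should always find a usable parser.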
