Master Web Crawling in Python: From Data Fetching to Image Download
This article explains how to build a Python web crawler that fetches HTML pages with urllib, parses them using BeautifulSoup to extract image URLs, and downloads the images, covering all three stages with complete code examples and parser options.
The Internet is a massive repository of data. When that data has to be collected and analyzed at scale, the work is done by a program called a web crawler (or spider).
Stage 1 – Fetching Data
Fetching data means downloading the HTML of a target URL. The core of this step is network communication, which can be performed with Python’s built‑in urllib.request module.
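Two details of this step are worth isolating before the full listing: urlopen sets no timeout unless one is passed, and not every page is encoded as UTF-8. The sketch below is one possible way to handle both, not the article's method. The helper names decode_body and fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
import urllib.request


def decode_body(data, declared_charset=None, fallback='utf-8'):
    """Decode raw response bytes, preferring the charset the server declared."""
    for charset in (declared_charset, fallback):
        if not charset:
            continue
        try:
            return data.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes that do not match it: try the next one.
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return data.decode(fallback, errors='replace')


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        data = response.read()
        # Charset taken from the Content-Type header, or None if absent.
        declared = response.headers.get_content_charset()
    return decode_body(data, declared)
```

Calling fetch_html('http://p.weather.com.cn/') would perform the same request as the listing that follows, but with a bounded wait and a charset chosen from the response headers when the server provides one.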
# coding=utf-8
# Code file: code/chapter6/6.1.1.py
# Fetch data
import urllib.request

url = 'http://p.weather.com.cn/'


def getHtmlString():
    """Send a network request and return the HTML as a string."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        data = response.read()
        htmlstr = data.decode(encoding='utf-8', errors='ignore')
        return htmlstr


if __name__ == '__main__':
    html = getHtmlString()
    print(html)

Stage 2 – Parsing Data
BeautifulSoup is a Python library for extracting data from HTML or XML documents. Install it with:
pip install beautifulsoup4
Commonly used BeautifulSoup methods include find_all, select, find, and get. Commonly used attributes are title and text.
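To make those methods and attributes concrete, here is a small sketch that applies each of them to an inline HTML fragment. The fragment, its URLs, and the variable names are invented for illustration and do not come from the target site.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment used only for illustration.
SAMPLE_HTML = """
<html><head><title>Weather Photos</title></head>
<body>
  <div class="gallery">
    <a href="/a.html"><img src="/img/sunrise.jpg" alt="Sunrise"></a>
    <a href="/b.html"><img src="/img/storm.png" alt="Storm"></a>
  </div>
  <p id="note">Two pictures today.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

page_title = soup.title.string                 # text of the <title> element
first_img = soup.find('img')                   # first matching tag, or None
all_imgs = soup.find_all('img')                # list of every matching tag
gallery_links = soup.select('div.gallery a')   # CSS selector query
first_src = first_img.get('src')               # attribute value, or None if missing
note_text = soup.find(id='note').text          # text content of a tag

print(page_title, first_src, len(all_imgs), note_text)
```

find returns a single tag and find_all a list, while select takes a CSS selector instead of a tag name. get reads an attribute and returns None when the attribute is absent, which matters when filtering src values.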
Example code that extracts image URLs from the fetched HTML:
# coding=utf-8
# Code file: code/chapter6/6.1.2.py
# Parse data
import urllib.request

from bs4 import BeautifulSoup

url = 'http://p.weather.com.cn/'


def getHtmlString():
    """Send a network request and return the HTML as a string."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        data = response.read()
        htmlstr = data.decode(encoding='utf-8', errors='ignore')
        return htmlstr


def find_imageurls(htmlstr):
    """Find the matching strings (image URLs) in the HTML code."""
    sp = BeautifulSoup(htmlstr, 'html.parser')
    imgtaglist = sp.find_all('img')
    # Skip <img> tags without a src attribute; get() returns None for those.
    srclist = [tag.get('src') for tag in imgtaglist if tag.get('src')]
    # str.endswith accepts a tuple of suffixes.
    filtered_srclist = [u for u in srclist if u.lower().endswith(('.png', '.jpg'))]
    return filtered_srclist


if __name__ == '__main__':
    html = getHtmlString()
    url_list = find_imageurls(html)
    for img_url in url_list:
        print(img_url)

Stage 3 – Downloading Images
The final step is to download each image file using its URL.
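The listing that follows passes each src value straight to urllib.request.Request and takes everything after the last '/' as the filename. That works for absolute URLs without query strings. The article does not say whether the target page also uses relative or protocol-relative src values, so the sketch below shows one way to normalize them first. The helper names absolute_url and filename_from_url, the 'image' default, and the example.com URLs are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_url(page_url, src):
    """Resolve an <img> src value against the URL of the page it came from."""
    return urljoin(page_url, src)


def filename_from_url(img_url, default='image'):
    """Use the last path segment as the filename, ignoring any query string."""
    path = urlsplit(img_url).path
    name = posixpath.basename(path)
    # A URL ending in '/' has no last segment, so fall back to the default.
    return name or default


page = 'http://p.weather.com.cn/'
for src in ('/images/a.jpg', '//i.example.com/b.png?v=3', 'http://other.example.com/c.jpg'):
    full = absolute_url(page, src)
    print(full, '->', filename_from_url(full))
```

urljoin leaves absolute URLs unchanged, so it is safe to apply to every src value before downloading.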
# coding=utf-8
# Code file: code/chapter6/6.1.3.py
import os
import urllib.request

from bs4 import BeautifulSoup

url = 'http://p.weather.com.cn/'

# (functions getHtmlString and find_imageurls omitted for brevity)

if __name__ == '__main__':
    html = getHtmlString()
    url_list = find_imageurls(html)
    for img_url in url_list:
        req = urllib.request.Request(img_url)
        with urllib.request.urlopen(req) as response:
            data = response.read()
            filename = img_url[img_url.rfind('/')+1:]
            filepath = 'download/' + filename
            # Create the download directory on first use.
            if not os.path.exists('download'):
                os.mkdir('download')
            with open(filepath, 'wb') as f:
                f.write(data)
            print('Downloaded image: {}.'.format(filename))
    print('Done.')

Common parsers for BeautifulSoup include html.parser, lxml, lxml-xml, and html5lib, each with different speed and compatibility characteristics. html.parser ships with Python's standard library, while lxml and html5lib are separate installs. lxml is generally the fastest, lxml-xml is its XML mode, and html5lib is the slowest but the most lenient, since it parses pages the way a browser does.
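Because lxml and html5lib are optional installs, a script can fall back to whichever parser is available. The sketch below is one way to do that, not something the article prescribes. The helper name make_soup and the order of preference are assumptions. FeatureNotFound is the exception BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers named in the article, in an assumed order of preference.
PREFERRED_PARSERS = ('lxml', 'html5lib', 'html.parser')


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Parse markup with the first installed parser (hypothetical helper)."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError('no usable parser among: {}'.format(', '.join(parsers)))


# Deliberately sloppy markup: the <p> tag is never closed.
soup, used = make_soup('<p>Unclosed paragraph<img src="a.jpg">')
print(used, [img.get('src') for img in soup.find_all('img')])
```

The parsers can build different trees from invalid markup, so it helps to record which one was used (the second return value here) when results need to be reproducible.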
Python Crawling & Data Mining
