
Python Multithreading for Web Scraping: Concepts, Code Samples, and Performance Comparison

This tutorial explains process and thread fundamentals, compares single‑threaded and multithreaded Python crawlers, provides complete code examples for both approaches, and demonstrates how converting a single‑threaded scraper to multithreading can significantly reduce execution time when handling large data volumes.


When crawling large amounts of data quickly, converting a single‑threaded crawler to a multithreaded one can greatly improve speed. This article introduces the basic concepts of processes and threads, explains why threads are lighter than processes, and shows how multithreading reduces overhead in concurrent execution.

1. Processes and Threads – A process is an independent running instance of a program with its own resources, while a thread shares the process’s resources and has much lower context‑switch cost. Multiple threads within a process enable higher concurrency and better resource utilization.
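The key difference can be seen directly in code. In the minimal sketch below, two threads append to the same list with no inter-process communication, because threads within one process share its memory (separate processes would each get their own copy):

```python
import threading

# Threads in one process share memory: both workers append to the
# same list object, with no pipes or queues needed.
shared = []

def worker(tag):
    for i in range(3):
        shared.append((tag, i))  # same object, visible to both threads

t1 = threading.Thread(target=worker, args=('a',))
t2 = threading.Thread(target=worker, args=('b',))
t1.start()
t2.start()
t1.join()
t2.join()

print(len(shared))  # all six appends landed in the one shared list
```

This shared address space is also why starting a thread and switching between threads is far cheaper than doing the same with processes.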

2. Python Single‑Threaded vs. Multithreaded Crawlers – By default Python runs code sequentially on a single thread. For I/O‑bound tasks such as downloading images, that single thread spends most of its time waiting and becomes a bottleneck. Although CPython's GIL prevents threads from executing bytecode in parallel, blocking calls such as time.sleep and network requests release the GIL, so threads can overlap their waits. Using the threading module, two functions (coding and playing) can run concurrently, demonstrating the speed advantage.

import threading
import time

def coding():
    for x in range(3):
        print('%s is writing code\n' % x)
        time.sleep(1)

def playing():
    for x in range(3):
        print('%s is playing games\n' % x)
        time.sleep(1)

def multi_thread():
    start = time.time()
    t1 = threading.Thread(target=coding)
    t1.start()
    t2 = threading.Thread(target=playing)
    t2.start()
    t1.join()
    t2.join()
    end = time.time()
    print('Total running time: %.5f seconds' % (end - start))

if __name__ == '__main__':
    multi_thread()

Because the two loops of one‑second sleeps overlap, the multithreaded version interleaves the two functions' output and reports a total running time of roughly 3 seconds.

For comparison, the single‑threaded version runs the two functions sequentially:

import time

def coding():
    for x in range(3):
        print('%s is writing code\n' % x)
        time.sleep(1)

def playing():
    for x in range(3):
        print('%s is playing games\n' % x)
        time.sleep(1)

def single_thread():
    start = time.time()  # time the whole run, not just playing()
    coding()
    playing()
    end = time.time()
    print('Total running time: %.5f seconds' % (end - start))

if __name__ == '__main__':
    single_thread()

The single‑threaded version reports roughly 6 seconds, since the two functions' sleeps run back to back instead of overlapping.

From the results, multithreading runs the two tasks concurrently, while single‑threading runs them one after the other. For small workloads the time difference is minor, but with larger tasks multithreading can noticeably reduce total execution time.
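The same contrast can be reproduced with the standard library's concurrent.futures.ThreadPoolExecutor, a higher‑level alternative to managing threading.Thread objects by hand. In this minimal sketch, the 0.2‑second sleeps stand in for network waits:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(i):
    time.sleep(0.2)  # stand-in for an I/O wait such as an HTTP request
    return i

# Sequential: five 0.2 s waits run back to back (~1.0 s total).
start = time.time()
sequential = [fake_download(i) for i in range(5)]
seq_elapsed = time.time() - start

# Concurrent: the five waits overlap in a pool of five threads (~0.2 s).
start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    concurrent = list(pool.map(fake_download, range(5)))
conc_elapsed = time.time() - start

print('sequential %.2f s, threaded %.2f s' % (seq_elapsed, conc_elapsed))
```

Both versions produce the same results; only the elapsed time differs, because the pooled threads sleep simultaneously.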

3. Converting a Single‑Threaded Scraper to Multithreading – The example below shows a simple image‑scraping script that downloads pictures from a live‑streaming page using a single thread. By creating multiple threads in the main block, the same task can be parallelized.

import requests
from lxml import etree
import time
import os
import threading

dirpath = 'images/'
if not os.path.exists(dirpath):
    os.mkdir(dirpath)  # create the download folder

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}

def get_photo():
    url = 'https://www.huya.com/g/4079/'  # target site
    response = requests.get(url=url, headers=header)  # send the request
    data = etree.HTML(response.text)  # parse the response into an HTML tree
    return data

def jiexi():  # jiexi: parse the page and download each image
    data = get_photo()
    image_url = data.xpath('//a//img//@data-original')
    image_name = data.xpath('//a//img[@class="pic"]//@alt')
    for ur, name in zip(image_url, image_name):
        url = ur.replace('?imageview/4/0/w/338/h/190/blur/1', '')
        title = name + '.jpg'
        response = requests.get(url=url, headers=header)  # new request per image
        with open(dirpath + title, 'wb') as f:
            f.write(response.content)
        print('Downloaded ' + name)
        time.sleep(2)

if __name__ == '__main__':
    jiexi()

To run this scraper with four concurrent threads, replace the main block as follows:

if __name__ == "__main__":
    threads = []
    start = time.time()
    for i in range(1, 5):
        thread = threading.Thread(target=jiexi)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    end = time.time()
    print('Total time consumed: %.5f seconds' % (end - start))
    print('All done!')
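Note that the block above starts four threads that each run jiexi in full, so every image would be fetched four times. A faithful conversion instead partitions the URL list across the threads. The sketch below shows one way to do that; download_batch is a hypothetical helper standing in for the per‑image body of jiexi (the real version would call requests.get and write each file to disk):

```python
import threading

results = []  # list.append is atomic in CPython, so no lock is needed here

def download_batch(urls):
    # Hypothetical stand-in for the per-image body of jiexi():
    # the real scraper would fetch each URL and save it to disk.
    for url in urls:
        results.append(url)

def run_partitioned(urls, num_threads=4):
    # Interleaved slices give each thread its own share of the URL list,
    # so no image is downloaded more than once.
    threads = []
    for i in range(num_threads):
        t = threading.Thread(target=download_batch,
                             args=(urls[i::num_threads],))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

run_partitioned(['img%d.jpg' % n for n in range(8)])
print(sorted(results))  # every URL appears exactly once
```

Splitting the work this way preserves the speedup from overlapping network waits while keeping each download unique.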

The article concludes that multithreading is advantageous for I/O‑bound crawling tasks with large workloads, while single‑threaded code remains sufficient for small, simple jobs.

Tags: Concurrency, Multithreading, Tutorial, threading, Web Scraping
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
