Boost Your Web Scraping Speed: Python Multithreading Explained with Real Code

Learn how converting single‑threaded Python crawlers into multithreaded ones can dramatically reduce data‑fetching time, covering process vs thread concepts, practical threading code examples, performance comparisons, and a step‑by‑step guide to building a multithreaded image scraper.

Python Crawling & Data Mining

Aug 12, 2023

Preface

When crawling large amounts of data quickly, converting a single‑threaded spider to a multithreaded one can greatly improve efficiency. This article introduces the basic concepts and coding methods.

1. Processes and Threads

A process is an instance of a running program with its own memory and other resources. A thread is a unit of execution inside a process: it shares that process's resources, and switching between threads costs less than switching between processes. Threads exist to reduce the time and memory cost of concurrent execution, which improves system throughput.
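The point that threads share their process's resources can be checked directly. The following is a minimal sketch; the names `record`, `results`, and `lock` are illustrative and do not come from the original article. Every thread reports the same process ID and appends to the same list object.

```python
import os
import threading

results = []                 # one list, owned by the process, visible to every thread
lock = threading.Lock()      # guards the shared list while a thread appends to it


def record(worker_id):
    """Append this thread's worker id and the process id to the shared list."""
    with lock:
        results.append((worker_id, os.getpid()))


threads = [threading.Thread(target=record, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three entries carry the same process id: the threads live in one process.
pids = {pid for _, pid in results}
print(sorted(w for w, _ in results), len(pids))  # prints: [0, 1, 2] 1
```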

For example, the Task Manager (see image) lists each running program as a process. If a process is a job, then assigning ten people to that job is like running ten threads within the process, which is why multithreading is often more efficient than a single thread.

2. Multithreading vs Single Thread in Python

Python runs code sequentially by default (single thread). For small tasks, a single‑threaded spider is sufficient, but downloading many images sequentially is slow. Using multiple threads can download images concurrently, saving time.

The threading module simplifies multithreaded programming. Below is a simple example that runs a coding task and a gaming task concurrently.

import threading
import time

def coding():
    # Simulate a task that takes about three seconds
    for x in range(3):
        print('%s is writing code\n' % x)
        time.sleep(1)

def playing():
    # A second task of the same length
    for x in range(3):
        print('%s is playing games\n' % x)
        time.sleep(1)

def multi_thread():
    start = time.time()
    # Start both tasks, each in its own thread
    t1 = threading.Thread(target=coding)
    t1.start()
    t2 = threading.Thread(target=playing)
    t2.start()
    # Wait for both threads to finish before stopping the clock
    t1.join()
    t2.join()
    end = time.time()
    print('Total run time: %.5f s' % (end - start))

if __name__ == '__main__':
    multi_thread()

The resulting output (see image) shows the two tasks executing simultaneously.

Running the same tasks in a single thread yields sequential execution, as shown in the next image.
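For comparison, a single-threaded version of the same two tasks might look like the sketch below. The function name `single_thread` and the `delay` parameter are assumptions added for illustration; the original article shows only a screenshot of this run.

```python
import time


def coding(delay=1):
    for x in range(3):
        print('%s is writing code' % x)
        time.sleep(delay)


def playing(delay=1):
    for x in range(3):
        print('%s is playing games' % x)
        time.sleep(delay)


def single_thread(delay=1):
    """Run both tasks one after the other and return the elapsed seconds."""
    start = time.perf_counter()
    coding(delay)    # all three iterations finish first
    playing(delay)   # only then does the second task begin
    elapsed = time.perf_counter() - start
    print('Total run time: %.5f s' % elapsed)
    return elapsed


if __name__ == '__main__':
    single_thread()
```

With one-second sleeps, this run takes roughly six seconds, whereas the threaded version takes roughly three, because there the two tasks wait at the same time.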

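These results show that multithreading runs the coding and gaming tasks concurrently, while a single thread runs them one after the other. For small workloads the time difference is minor, but for larger I/O-bound workloads, such as downloading many images, threads overlap the waiting time and reduce total execution time. CPU-bound work benefits far less in CPython because of the global interpreter lock (GIL).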
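To see how the gap grows with the workload, the waiting can be simulated instead of performed over the network. In this sketch, `fake_download` is a hypothetical stand-in that only sleeps, and the standard library's `concurrent.futures.ThreadPoolExecutor` is used in place of hand-managed `threading.Thread` objects; neither appears in the original article.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(i, delay=0.05):
    """Hypothetical stand-in for a network request: wait, then return i."""
    time.sleep(delay)
    return i


def run_sequential(n):
    """Run n simulated downloads one after another."""
    start = time.perf_counter()
    results = [fake_download(i) for i in range(n)]
    return results, time.perf_counter() - start


def run_threaded(n, workers=10):
    """Run n simulated downloads on a pool of worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whatever order threads finish in
        results = list(pool.map(fake_download, range(n)))
    return results, time.perf_counter() - start


seq_results, seq_time = run_sequential(20)
thr_results, thr_time = run_threaded(20)
print('sequential: %.2f s, threaded: %.2f s' % (seq_time, thr_time))
```

The sequential time grows linearly with the number of tasks (about 20 × 0.05 s here), while the threaded time is closer to two rounds of ten overlapping waits. Exact figures vary from machine to machine.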

3. Converting a Single‑Threaded Crawler to Multithreaded

Below is a simple single‑threaded image‑scraping script that fetches pictures from a live‑stream site.

import requests
from lxml import etree
import time
import os

# Directory where downloaded images are saved
dirpath = 'images/'
if not os.path.exists(dirpath):
    os.mkdir(dirpath)

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}

def get_photo():
    # Fetch the listing page and parse it into an element tree
    url = 'https://www.huya.com/g/4079/'
    response = requests.get(url=url, headers=header)
    data = etree.HTML(response.text)
    return data

def jiexi():
    # Extract the image URLs and titles, then download the images one by one
    data = get_photo()
    image_url = data.xpath('//a//img//@data-original')
    image_name = data.xpath('//a//img[@class="pic"]//@alt')
    for ur, name in zip(image_url, image_name):
        # Strip the image-processing query string from the URL
        url = ur.replace('?imageview/4/0/w/338/h/190/blur/1', '')
        title = name + '.jpg'
        response = requests.get(url=url, headers=header)
        with open(dirpath + title, 'wb') as f:
            f.write(response.content)
        print('Downloaded: ' + name)
        time.sleep(2)  # pause between downloads

if __name__ == '__main__':
    jiexi()

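To make it multithreaded, import the threading module and change the main block. Because jiexi() takes no parameters, the threads are created without args. The example creates four threads to run the scraper concurrently. As written, every thread repeats the full download, so this demonstrates the threading mechanics only; an actual speedup requires giving each thread a different share of the image URLs.

import threading

if __name__ == "__main__":
    threads = []
    start = time.time()
    # Start four threads, each running jiexi()
    for _ in range(4):
        thread = threading.Thread(target=jiexi)
        threads.append(thread)
        thread.start()
    # Wait for every thread to finish before stopping the clock
    for thread in threads:
        thread.join()
    end = time.time()
    print('Total time elapsed: %.5f s' % (end - start))
    print('All done!')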
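One way to give each thread its own share of the work is to hand every (URL, name) pair to a thread pool, so that each image is downloaded exactly once. The sketch below rests on several assumptions: `download_all`, `save_one`, and the injectable `fetch` callable are hypothetical names, and `ThreadPoolExecutor` stands in for the article's manual thread list. The demo uses a fake fetcher so that it runs without network access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_one(url, name, dirpath, fetch):
    """Fetch one image and write it into dirpath; return the file path."""
    path = os.path.join(dirpath, name + '.jpg')
    with open(path, 'wb') as f:
        f.write(fetch(url))
    return path


def download_all(pairs, dirpath, fetch, workers=4):
    """Download every (url, name) pair exactly once on a pool of threads."""
    os.makedirs(dirpath, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(save_one, url, name, dirpath, fetch)
                   for url, name in pairs]
        # result() re-raises any exception raised inside a worker thread
        return [f.result() for f in futures]


def fake_fetch(url):
    """Stand-in for a real HTTP request; returns the URL itself as bytes."""
    return url.encode('utf-8')


# Demo with made-up URLs and a temporary directory (no network access needed)
demo_dir = tempfile.mkdtemp()
demo_pairs = [('http://example.invalid/%d.jpg' % i, 'img%d' % i) for i in range(8)]
saved = download_all(demo_pairs, demo_dir, fake_fetch)
print(len(saved))  # prints: 8
```

In the scraper above, `pairs` could be built from `zip(image_url, image_name)` after the query string has been stripped, and `fetch` could be `lambda u: requests.get(u, headers=header).content`.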

4. Book Recommendation

The recommended book covers common Python 3 web‑scraping techniques, including basic web concepts, urllib, Requests, XPath, Beautiful Soup, Selenium for dynamic sites, the Scrapy framework, and Linux basics for deploying crawler scripts. It is aimed at beginners interested in web crawling.
