Mastering Python’s requests stream=True: Fast, Efficient Web Crawling

This article walks through using Python’s requests library with the stream=True parameter to efficiently filter valid URLs during web crawling, presenting two methods, code examples, execution time comparisons, and a clear explanation of the stream option’s role.

Python Crawling & Data Mining

Nov 19, 2024

1. Introduction

Hello everyone, I am PiPi. A few days ago I shared a Python web‑crawling question in a group, and now I’m presenting the solution for everyone to learn.

2. Solution Process

PI suggested a feasible approach. Later, MoonGod provided a working code snippet:

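for url in all_url:
    # stream=True returns once the headers arrive; the body is not downloaded yet
    with requests.get(url, headers=header, stream=True) as resp:
        content_length = resp.headers.get('content-length')
        # keep only URLs whose reported size exceeds 10240 bytes (10 KB)
        if content_length and int(content_length) > 10240:
            print(url)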

The program produced results in less than a second. Jupyter Notebook automatically displayed the execution time, which is not shown in PyCharm without extra configuration.
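Outside Jupyter, the elapsed time has to be measured by hand. The following is a minimal sketch of one way to do that with time.perf_counter; the timed helper and the placeholder workload are hypothetical and only stand in for the crawling loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, using a monotonic clock."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} seconds")


with timed("placeholder workload") as timing:
    total = sum(range(100_000))  # stands in for the crawling loop
```

The same with block can wrap any loop over URLs, so the timing code does not have to be repeated for each method.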

MoonGod’s method meets the requirement, though file parsing is a bit slow.
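The size check in MoonGod's snippet relies on the server sending a Content-Length header. That is not guaranteed: responses using chunked transfer encoding omit it, and a malformed value would make int() raise. A minimal sketch of a more defensive check follows; the helper name content_length_exceeds and its min_bytes parameter are hypothetical and not part of the original snippet.

```python
def content_length_exceeds(headers, min_bytes=10240):
    """Return True only if headers report a Content-Length above min_bytes."""
    # Header names are case-insensitive, so normalise the keys before lookup.
    # (requests exposes a case-insensitive mapping; a plain dict does not.)
    normalised = {str(key).lower(): value for key, value in headers.items()}
    raw_value = normalised.get("content-length")
    if raw_value is None:
        return False  # header absent, e.g. chunked transfer encoding
    try:
        return int(raw_value) > min_bytes
    except (TypeError, ValueError):
        return False  # header present but not a valid integer
```

With requests, this could be called as content_length_exceeds(resp.headers) inside the loop.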

The core knowledge point is the stream=True parameter. The full example code is:

import requests
import time

url = [
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.joypac.dragonhero.cn.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.yiwan.longtengtianxia.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.netease.mrzh.mi&channel=meng_4001_2_android'
]
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

start = time.time()
# Method 1: check response.headers
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    if 'Content-Length' in resp.headers:
        print(f'Valid URL:\n {i}')
end = time.time()
print(f'Test completed! Total time: {end - start:.2f} seconds')

# Method 2: check size of streamed content
start2 = time.time()
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    chunk_size = 1024
    for data in resp.iter_content(chunk_size=chunk_size):
        if len(data) > 800:
            print(f'Valid URL:\n {i}')
            break
end2 = time.time()
print(f'Test completed! Total time: {end2 - start2:.2f} seconds')
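Method 2 compares the length of each individual chunk with 800, but chunks returned by iter_content are not guaranteed to be exactly chunk_size bytes long (the last one usually is not). One possible variation, sketched here under that assumption, counts the bytes received so far and stops as soon as the threshold is passed. The names body_exceeds and FakeResponse are hypothetical; FakeResponse only imitates the two methods the helper needs so the example runs without a network.

```python
def body_exceeds(resp, min_bytes=800, chunk_size=1024):
    """Return True once more than min_bytes of the body have been streamed."""
    received = 0
    try:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            received += len(chunk)
            if received > min_bytes:
                return True  # stop early; the rest of the body is never read
        return False
    finally:
        resp.close()  # release the connection in every case


class FakeResponse:
    """Offline stand-in for a streamed response (illustration only)."""

    def __init__(self, chunks):
        self._chunks = chunks
        self.closed = False
        self.chunks_read = 0

    def iter_content(self, chunk_size=1):
        for chunk in self._chunks:
            self.chunks_read += 1
            yield chunk

    def close(self):
        self.closed = True


large = FakeResponse([b"a" * 500, b"b" * 500, b"c" * 500])
small = FakeResponse([b"a" * 100])
print(body_exceeds(large), body_exceeds(small))
```

With requests, the helper would receive a real streamed response, for example body_exceeds(requests.get(i, headers=header, stream=True)).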

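The stream parameter controls when the response body is downloaded. By default, requests.get downloads the whole body before it returns. With stream=True, it returns as soon as the response headers have been received, and the body is fetched only when it is read, for example through resp.iter_content() or resp.content. This is why Method 1 can inspect the headers without downloading the file.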
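The effect of stream=True can be tried without touching a real site. The sketch below assumes a throwaway local HTTP server built with the standard library; DummyHandler, BODY and content_length_via_stream are hypothetical names used only for this demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

BODY = b"x" * 20480  # 20 KiB dummy payload served by the local test server


class DummyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def content_length_via_stream(url, session):
    # stream=True returns once the headers arrive; the body is left unread,
    # and leaving the with block closes the response.
    with session.get(url, stream=True, timeout=5) as resp:
        return int(resp.headers["Content-Length"])


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DummyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local demo
url = f"http://127.0.0.1:{server.server_port}/"
reported_size = content_length_via_stream(url, session)
print(reported_size)

server.shutdown()
server.server_close()
```

The function reads only the headers of the response; the 20 KiB body is never requested from the response object by the client code.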

3. Conclusion

This article demonstrates how to use the stream=True parameter in Python’s requests library during web crawling, providing concrete examples, performance comparisons, and a clear explanation of its purpose.
