Mastering Python’s stream=True: Efficient Web Scraping Techniques

This article walks through using the requests library’s stream=True parameter in Python to efficiently filter valid URLs during web scraping, presenting two practical methods, code examples, performance insights, and a clear explanation of how stream handling works.

Python Crawling & Data Mining

1. Introduction

Hello everyone, I’m PiPi. Recently I shared a Python web‑scraping question in an active Python community, and I’m presenting it here so we can all learn from it.

2. Solution Process

PI suggested a feasible idea, and later a member (YueShen) provided a working code snippet:

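for url in all_url:
    # stream=True fetches only the headers; the body is not downloaded yet
    resp = requests.get(url, headers=header, stream=True)
    content_length = resp.headers.get('content-length')
    if content_length and int(content_length) > 10240:
        print(url)
    resp.close()  # the body is never read, so release the connection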

The script finished in under a second. (The timing was shown by the Jupyter environment after the cell ran; PyCharm does not display execution time by default.)

YueShen’s method meets the requirement, though file parsing can be a bit slow.
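
The same header check can be sketched as a reusable function. This is only an illustration under a few assumptions: the names parse_content_length and filter_large_urls are hypothetical, the session argument is assumed to behave like a requests.Session, and the 10-second timeout is an arbitrary choice that is not in the original snippet. A missing or malformed Content-Length is treated as "not large" instead of raising.

```python
def parse_content_length(headers):
    """Return Content-Length as an int, or None if it is missing or malformed."""
    value = headers.get('Content-Length')
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def filter_large_urls(urls, session, min_bytes=10240, request_headers=None, timeout=10):
    """Return the URLs whose declared Content-Length exceeds min_bytes.

    `session` is assumed to behave like requests.Session (hypothetical helper).
    """
    matches = []
    for url in urls:
        # stream=True: only the status line and headers are fetched here.
        resp = session.get(url, headers=request_headers, stream=True, timeout=timeout)
        try:
            size = parse_content_length(resp.headers)
        finally:
            resp.close()  # the body is never read, so release the connection
        if size is not None and size > min_bytes:
            matches.append(url)
    return matches
```

With requests installed, a call might look like filter_large_urls(all_url, requests.Session(), request_headers=header). Passing the session in keeps the function easy to exercise with a stand-in object and lets one connection pool be reused across URLs.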

The key concept behind this question is the stream=True parameter. Below is a more complete example that demonstrates two approaches to validating URLs:

import requests
import time

url = [
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.joypac.dragonhero.cn.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.yiwan.longtengtianxia.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.netease.mrzh.mi&channel=meng_4001_2_android'
]
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

start = time.time()
# Method 1: Check response.headers for Content‑Length
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    if 'Content-Length' in resp.headers:
        print(f'Valid URL:\n {i}')
end = time.time()
print(f'Test finished! Total time: {end - start:.2f} seconds')

# Method 2: Check size of streamed content
start2 = time.time()
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    chunk_size = 1024
    for data in resp.iter_content(chunk_size=chunk_size):
        if len(data) > 800:
            print(f'Valid URL:\n {i}')
            break
end2 = time.time()
print(f'Test finished! Total time: {end2 - start2:.2f} seconds')
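
Method 2 compares a single chunk against 800 bytes. Because a chunk yielded by iter_content may be shorter than chunk_size, a slightly more defensive variant is to add up chunk sizes and stop once the threshold is crossed. The helper below is a minimal sketch of that idea; the name body_exceeds is hypothetical, and it accepts any iterable of bytes objects so it does not depend on requests itself.

```python
def body_exceeds(chunks, threshold):
    """Return True as soon as the chunks seen so far total more than `threshold` bytes."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > threshold:
            return True  # stop early; the rest of the body is never pulled
    return False
```

Inside the Method 2 loop this would be used as body_exceeds(resp.iter_content(chunk_size=1024), 800). Since it returns on the first chunk that pushes the total past the threshold, a large download is abandoned after roughly the first kilobyte.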

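The original post illustrates what the stream=True argument does with an image. In short: by default (stream=False), requests downloads the entire response body as soon as the request is made. With stream=True, only the status line and headers are fetched up front and the connection stays open; the body is downloaded only when you read it, for example through resp.content or resp.iter_content(). That is why Method 1 can inspect the headers without downloading any file, and why a streamed response should be either fully read or closed so that the connection is released.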

3. Conclusion

This article demonstrates how to use the stream=True parameter in Python’s requests library during web crawling, with concrete code examples and timing code for comparing the two methods. Understanding this parameter lets a scraper decide from the headers, or from the first chunk of the body, whether a response is worth downloading, which makes it practical to filter large responses efficiently.
