Mastering Python’s stream=True: Efficient Web Scraping Techniques
This article walks through using the requests library’s stream=True parameter in Python to efficiently filter valid URLs during web scraping, presenting two practical methods, code examples, performance insights, and a clear explanation of how stream handling works.
1. Introduction
Hello everyone, I’m PiPi. Recently I shared a Python web‑scraping question in an active Python community, and I’m presenting it here so everyone can learn from it.
2. Solution Process
PI suggested a feasible idea, and later a member (YueShen) provided a working code snippet:
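import requests

# Assumes all_url (a list of URLs) and header (a dict of request headers) are already defined
for url in all_url:
    # With stream=True the call returns once the headers arrive; the body is not downloaded yet
    resp = requests.get(url, headers=header, stream=True)
    content_length = resp.headers.get('content-length')
    if content_length and int(content_length) > 10240:  # declared size above 10 KB
        print(url)
    resp.close()  # release the connection, since the body is never read

The script runs in under a second. Jupyter displays the execution time automatically, whereas PyCharm does not show it by default.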
YueShen’s method meets the requirement, though file parsing can be a bit slow.
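The header check in the snippet above can be wrapped in a small helper. The following is only a sketch, not the code from the original discussion: the names declared_size and filter_large_urls, the 10-second timeout, and the choice to skip URLs that raise a request error are all assumptions made for illustration. It also allows for a Content-Length header that is missing (for example with chunked transfer encoding) or malformed.

```python
import requests


def declared_size(headers):
    """Return the Content-Length value as an int, or None if absent or malformed."""
    value = headers.get('Content-Length')
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def filter_large_urls(urls, min_bytes=10240, request_headers=None, timeout=10):
    """Yield URLs whose declared Content-Length exceeds min_bytes (hypothetical helper)."""
    for url in urls:
        try:
            # stream=True fetches only the headers; the with-block closes the
            # connection without downloading the body.
            with requests.get(url, headers=request_headers,
                              stream=True, timeout=timeout) as resp:
                size = declared_size(resp.headers)
        except requests.RequestException:
            continue  # assumption: unreachable URLs are simply skipped
        if size is not None and size > min_bytes:
            yield url
```

A call such as list(filter_large_urls(all_url, request_headers=header)) would then collect the matching URLs instead of printing them.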
The key concept here is the stream=True parameter. Below is a more complete example that demonstrates two approaches to validating URLs:
import requests
import time

url = [
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.joypac.dragonhero.cn.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.yiwan.longtengtianxia.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.netease.mrzh.mi&channel=meng_4001_2_android'
]
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

# Method 1: Check response.headers for Content-Length
start = time.time()
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    if 'Content-Length' in resp.headers:
        print(f'Valid URL:\n{i}')
    resp.close()  # the body is never read, so release the connection
end = time.time()
print(f'Test finished! Total time: {end - start:.2f} seconds')

# Method 2: Check size of streamed content
start2 = time.time()
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    chunk_size = 1024
    # Read the body in chunks and stop at the first chunk larger than 800 bytes
    for data in resp.iter_content(chunk_size=chunk_size):
        if len(data) > 800:
            print(f'Valid URL:\n{i}')
            break
    resp.close()  # stop downloading the rest of the body
end2 = time.time()
print(f'Test finished! Total time: {end2 - start2:.2f} seconds')

The original article showed an image here explaining what the stream=True argument does. In short, with stream=True, requests fetches only the response headers up front and defers downloading the body until it is read, for example through resp.content or resp.iter_content().
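Method 2 above tests the length of a single chunk. Since iter_content may return chunks shorter than chunk_size, one possible variation is to add up the chunk lengths and stop as soon as a threshold is reached. This is a sketch of that variation, not the article's method: stream_reaches and body_reaches are hypothetical names, and the 800-byte default and 10-second timeout are assumptions carried over or added for illustration.

```python
import requests


def stream_reaches(chunks, min_bytes):
    """Return True as soon as the chunks add up to at least min_bytes."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total >= min_bytes:
            return True  # stop early; later chunks are never requested
    return False


def body_reaches(url, min_bytes=800, request_headers=None, timeout=10):
    """Stream a URL and report whether its body holds at least min_bytes (hypothetical helper)."""
    # The with-block closes the connection, so the rest of the body is not downloaded.
    with requests.get(url, headers=request_headers,
                      stream=True, timeout=timeout) as resp:
        return stream_reaches(resp.iter_content(chunk_size=1024), min_bytes)
```

Because stream_reaches accepts any iterable of bytes objects, it can be checked without network access, for instance with a plain list of byte strings.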
3. Conclusion
This article demonstrates how to use the stream=True parameter in Python’s requests library during web crawling, providing concrete code examples and performance comparisons. Understanding this parameter helps filter large responses efficiently and improves scraping scripts.
Python Crawling & Data Mining