Mastering Python’s requests stream=True: Fast, Efficient Web Crawling
This article walks through using Python’s requests library with the stream=True parameter to efficiently filter valid URLs during web crawling, presenting two methods, code examples, execution time comparisons, and a clear explanation of the stream option’s role.
1. Introduction
Hello everyone, I am PiPi. A few days ago I shared a Python web‑crawling question in a group, and now I’m presenting the solution for everyone to learn.
2. Solution Process
PI suggested a feasible approach. Later, MoonGod provided a working code snippet:
for url in all_url:
    resp = requests.get(url, headers=header, stream=True)
    content_length = resp.headers.get('content-length')
    if content_length and int(content_length) > 10240:
        print(url)

The program produced results in less than a second. Jupyter Notebook automatically displayed the execution time, which is not shown in PyCharm without extra configuration.
MoonGod’s method meets the requirement, though file parsing is a bit slow.
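The size filter above calls int() directly on the header value, which raises ValueError if a server sends a malformed Content-Length. A minimal sketch of a more defensive check follows. The helper names declared_size and is_large_download are hypothetical, and the 10240-byte default simply mirrors the threshold used above. The helpers accept any mapping of header names to values, so resp.headers from requests can be passed in directly.

```python
from typing import Mapping, Optional


def declared_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None if it is missing or malformed."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == 'content-length':
            try:
                return int(value)
            except (TypeError, ValueError):
                return None
    return None


def is_large_download(headers: Mapping[str, str], min_bytes: int = 10240) -> bool:
    """True when the server declares a body larger than min_bytes."""
    size = declared_size(headers)
    return size is not None and size > min_bytes
```

Inside the crawl loop, the condition would then read: if is_large_download(resp.headers): print(url).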
The core knowledge point is the stream=True parameter. The full example code is:
import time

import requests

url = [
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.joypac.dragonhero.cn.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.yiwan.longtengtianxia.mi&channel=meng_4001_2_android',
    'https://wap.game.xiaomi.com/index.php?c=app&v=download&package=com.netease.mrzh.mi&channel=meng_4001_2_android'
]
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

start = time.time()
# Method 1: check response.headers
for i in url:
    # With stream=True the call returns once the headers arrive; the body is not downloaded yet.
    resp = requests.get(i, headers=header, stream=True)
    if 'Content-Length' in resp.headers:
        print(f'Valid URL: {i}')
    resp.close()  # the body was never read, so release the connection explicitly
end = time.time()
print(f'Test completed! Total time: {end - start:.2f} seconds')

# Method 2: check size of streamed content
start2 = time.time()
for i in url:
    resp = requests.get(i, headers=header, stream=True)
    chunk_size = 1024
    # Read the body chunk by chunk and stop at the first chunk that is large enough.
    for data in resp.iter_content(chunk_size=chunk_size):
        if len(data) > 800:
            print(f'Valid URL: {i}')
            break
    resp.close()
end2 = time.time()
print(f'Test completed! Total time: {end2 - start2:.2f} seconds')

[Screenshots: the code and an explanation of the stream parameter]
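With stream=True, requests.get returns as soon as the response headers have arrived, and the body is only downloaded when it is read, for example through resp.iter_content() or resp.content. That is why Method 1 can inspect Content-Length without fetching the file and Method 2 can stop after an early chunk. As a minimal sketch of that early-stop idea, the hypothetical helper body_exceeds below sums chunk sizes from any iterable of bytes (such as resp.iter_content(chunk_size=1024)) and stops reading once a threshold is passed. It is an illustrative variant that tracks the running total, not the code from the original post.

```python
from typing import Iterable


def body_exceeds(chunks: Iterable[bytes], threshold: int) -> bool:
    """Return True as soon as the cumulative size of chunks passes threshold.

    Iteration stops at that point, so a lazy source (for example a streamed
    HTTP body) is not read any further than necessary.
    """
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > threshold:
            return True
    # The source ended before the threshold was passed.
    return False
```

Called as body_exceeds(resp.iter_content(chunk_size=1024), 800), it stops at the first chunk that pushes the running total past 800 bytes, so the rest of a large download is never read.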
3. Conclusion
This article demonstrates how to use the stream=True parameter in Python’s requests library during web crawling, providing concrete examples, performance comparisons, and a clear explanation of its purpose.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
