Mastering Python urllib2: GET, POST, Proxies, Cookies, Headers, GZIP, and Multithreaded Crawling
This guide walks through using Python's urllib2 library for web crawling, covering basic GET/POST requests, handling proxy IPs, managing cookies, spoofing browser headers, processing gzip-compressed responses, and implementing multithreaded fetching with a simple thread‑pool template.
This article provides a step‑by‑step tutorial on building a web crawler with Python's urllib2 module.
1. Basic Page Retrieval
import urllib2 url = "http://www.baidu.com" response = urllib2.urlopen(url) print response.read()
2. POST Requests
import urllib, urllib2 url = "http://abcde.com" form = {'name':'abc','password':'1234'} form_data = urllib.urlencode(form) request = urllib2.Request(url, form_data) response = urllib2.urlopen(request) print response.read()
3. Using Proxy IPs
When an IP is blocked, a ProxyHandler can route requests through a proxy.
import urllib2 proxy = urllib2.ProxyHandler({'http':'127.0.0.1:8087'}) opener = urllib2.build_opener(proxy) urllib2.install_opener(opener) response = urllib2.urlopen('http://www.baidu.com') print response.read()
4. Cookie Handling
The cookielib module creates a CookieJar that stores cookies for subsequent requests.
import urllib2, cookielib cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar()) opener = urllib2.build_opener(cookie_support) urllib2.install_opener(opener) content = urllib2.urlopen('http://XXXX').read() print content
Manually adding a cookie header:
cookie = "PHPSESSID=...; kmsign=...; KMUID=..." request.add_header('Cookie', cookie)
5. Spoofing Browser Headers
Some sites reject non‑browser requests (HTTP 403). Adding a realistic User-Agent and other headers solves this.
import urllib2 headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'} request = urllib2.Request('http://my.oschina.net/jhao104/blog?catalog=3463517', headers=headers) print urllib2.urlopen(request).read()
6. GZIP Compression
To receive compressed data, add an Accept-encoding: gzip header and decompress the response.
import urllib2, httplib, StringIO, gzip request = urllib2.Request(url) request.add_header('Accept-encoding', 'gzip') response = urllib2.urlopen(request) compressed = response.read() compressed_stream = StringIO.StringIO(compressed) gzipper = gzip.GzipFile(fileobj=compressed_stream) print gzipper.read()
7. Multithreaded Crawling
For faster crawling, a simple thread‑pool using threading and Queue can process tasks concurrently.
from threading import Thread from Queue import Queue from time import sleep q = Queue() NUM = 2 # number of worker threads JOBS = 10 # total tasks def do_something_using(arguments): print arguments def working(): while True: arguments = q.get() do_something_using(arguments) sleep(1) q.task_done() for i in range(NUM): t = Thread(target=working) t.setDaemon(True) t.start() for i in range(JOBS): q.put(i) q.join()
The article also mentions page parsing with regular expressions, lxml, and BeautifulSoup, and briefly discusses handling simple captchas (manual or via paid services).
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
