Mastering Python urllib2: GET, POST, Proxies, Cookies, Headers, GZIP, and Multithreaded Crawling

This guide walks through using Python's urllib2 library for web crawling, covering basic GET/POST requests, handling proxy IPs, managing cookies, spoofing browser headers, processing gzip-compressed responses, and implementing multithreaded fetching with a simple thread‑pool template.

ITPUB

This article provides a step‑by‑step tutorial on building a web crawler with Python's urllib2 module.

1. Basic Page Retrieval

    import urllib2

    url = "http://www.baidu.com"
    response = urllib2.urlopen(url)  # sends a GET request
    print response.read()            # raw body of the page
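
A crawler rarely gets a clean response every time, so the basic call is usually wrapped with a timeout and error handling. The following is a minimal sketch, not part of the original article: the helper name `fetch` and the 10-second default are assumptions, and the `try`/`except` import is added only so the same code also runs on Python 3, where the contents of urllib2 live in `urllib.request`.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback (assumption, see above)


def fetch(url, timeout=10):
    """Return the response body, or None if the request fails.

    `fetch` is a hypothetical helper name used for illustration.
    """
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except urllib2.HTTPError as error:
        # The server answered, but with an error status such as 403 or 404.
        print("HTTP error %d for %s" % (error.code, url))
    except urllib2.URLError as error:
        # No usable answer: DNS failure, refused connection, unsupported scheme.
        print("Failed to reach %s: %s" % (url, error.reason))
    return None
```

Calling `fetch("http://www.baidu.com")` returns the page body on success. `HTTPError` is caught first because it is a subclass of `URLError`.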

2. POST Requests

    import urllib, urllib2

    url = "http://abcde.com"
    form = {'name': 'abc', 'password': '1234'}
    form_data = urllib.urlencode(form)         # name=abc&password=1234
    request = urllib2.Request(url, form_data)  # passing data makes this a POST
    response = urllib2.urlopen(request)
    print response.read()
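
What turns the request into a POST is simply the presence of a data argument. The sketch below checks this without touching the network; `build_post_request` is a hypothetical helper, and the import fallback is an assumption added so the code also runs on Python 3, where the form body has to be bytes (hence the `encode` call).

```python
try:
    import urllib2  # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback (assumption)
    from urllib.parse import urlencode


def build_post_request(url, form):
    """Encode a dict of form fields and attach it as the request body."""
    body = urlencode(form).encode("ascii")  # bytes are required on Python 3
    return urllib2.Request(url, body)


post_request = build_post_request("http://abcde.com",
                                  {"name": "abc", "password": "1234"})
get_request = urllib2.Request("http://abcde.com")

# get_method() reports which HTTP verb urlopen() would use for each request.
print(post_request.get_method())
print(get_request.get_method())
```

Nothing is sent until the request object is passed to `urllib2.urlopen`.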

3. Using Proxy IPs

When a site blocks your IP address, urllib2.ProxyHandler can route requests through a proxy server instead.

    import urllib2

    # Map each URL scheme to the proxy that should handle it
    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)  # urlopen() now uses this opener globally
    response = urllib2.urlopen('http://www.baidu.com')
    print response.read()
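
`install_opener` changes the behaviour of every later `urlopen` call in the process. When only some requests should go through a proxy, one option is to keep the opener local and call its `open` method directly. This is a sketch under that assumption; `make_proxy_opener` is a hypothetical helper name, and the import fallback is there only for Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback (assumption)


def make_proxy_opener(proxy_address):
    """Build an opener that sends http:// requests through one proxy."""
    handler = urllib2.ProxyHandler({"http": proxy_address})
    return urllib2.build_opener(handler)


# Same proxy address as in the example above; nothing is installed globally.
proxy_opener = make_proxy_opener("127.0.0.1:8087")
```

A request is then made with `proxy_opener.open('http://www.baidu.com')`, while plain `urllib2.urlopen` calls remain unaffected.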

4. Cookie Handling

A cookielib.CookieJar stores the cookies a server sets. Wrapping it in urllib2.HTTPCookieProcessor makes the opener send those cookies back on subsequent requests.

    import urllib2, cookielib

    # The processor saves cookies from responses and replays them on requests
    cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
    opener = urllib2.build_opener(cookie_support)
    urllib2.install_opener(opener)
    content = urllib2.urlopen('http://XXXX').read()
    print content
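
Keeping a reference to the jar, rather than creating it inline, makes it possible to inspect which cookies have been collected. The following is a small sketch of that idea; `cookie_names` is a hypothetical helper, and the import fallback is an assumption for Python 3, where cookielib is named `http.cookiejar`.

```python
try:
    import urllib2, cookielib  # Python 2
except ImportError:
    import urllib.request as urllib2     # Python 3 fallback (assumption)
    import http.cookiejar as cookielib


jar = cookielib.CookieJar()
cookie_opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))


def cookie_names(cookie_jar):
    """List the names of all cookies currently held in the jar."""
    return sorted(cookie.name for cookie in cookie_jar)


# The jar stays empty until cookie_opener.open(...) receives Set-Cookie headers.
print(cookie_names(jar))
```

After a login request made with `cookie_opener.open(...)`, calling `cookie_names(jar)` again shows what the server stored.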

Manually adding a cookie header:

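    import urllib2

    # `request` is an ordinary urllib2.Request for the page being fetched
    request = urllib2.Request('http://XXXX')
    cookie = "PHPSESSID=...; kmsign=...; KMUID=..."
    request.add_header('Cookie', cookie)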
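
The Cookie header value is just `name=value` pairs joined by `"; "`, so it can be assembled from a dict. A minimal sketch follows: `cookie_header` is a hypothetical helper, and the cookie names and values are made-up placeholders, not real session data. The import fallback is only there for Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback (assumption)


def cookie_header(cookies):
    """Join name/value pairs into a single Cookie header value."""
    pairs = ("%s=%s" % (name, value) for name, value in sorted(cookies.items()))
    return "; ".join(pairs)


cookie_request = urllib2.Request("http://abcde.com")
# Placeholder values for illustration only.
cookie_request.add_header("Cookie",
                          cookie_header({"PHPSESSID": "abc123", "theme": "dark"}))
print(cookie_request.get_header("Cookie"))
```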

5. Spoofing Browser Headers

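Some sites reject requests that do not look like they come from a browser, typically with HTTP 403 Forbidden. Sending a realistic User-Agent header, and other browser headers where needed, often gets past this check.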

    import urllib2

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
                      'Gecko/20091201 Firefox/3.5.6'
    }
    request = urllib2.Request('http://my.oschina.net/jhao104/blog?catalog=3463517',
                              headers=headers)
    print urllib2.urlopen(request).read()
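
Several headers can be set at once, and the request object can be inspected before anything is sent. In the sketch below, `browser_request` is a hypothetical helper and the Accept and Referer values are illustrative assumptions. One detail worth knowing: `Request` stores header names via `str.capitalize()`, so `User-Agent` is looked up as `User-agent`.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback (assumption)

USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def browser_request(url, referer=None):
    """Build a Request carrying browser-like headers."""
    headers = {"User-Agent": USER_AGENT, "Accept": "text/html"}
    if referer:
        headers["Referer"] = referer
    return urllib2.Request(url, headers=headers)


header_request = browser_request(
    "http://my.oschina.net/jhao104/blog?catalog=3463517",
    referer="http://my.oschina.net/")

# Header names are stored capitalized, hence "User-agent" here.
print(header_request.get_header("User-agent"))
```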

6. GZIP Compression

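To receive compressed data, add an Accept-encoding: gzip header to the request. The server may still reply uncompressed, so decompress the body only when the response carries Content-Encoding: gzip.

    import urllib2, StringIO, gzip

    url = "http://www.baidu.com"  # placeholder; any page URL works here
    request = urllib2.Request(url)
    request.add_header('Accept-encoding', 'gzip')
    response = urllib2.urlopen(request)
    data = response.read()
    # Decompress only if the server actually sent gzip data
    if response.info().get('Content-Encoding') == 'gzip':
        data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
    print data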
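
The decompression step can be separated from the network call and checked offline. The following is a minimal sketch assuming the body arrives as a byte string; `decode_body` is a hypothetical helper, and `io.BytesIO` is used in place of `StringIO` because it behaves the same on Python 2.7 and Python 3.

```python
import gzip
import io


def decode_body(raw_bytes, content_encoding):
    """Return the body, gunzipped if the Content-Encoding header says gzip."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)).read()
    return raw_bytes


# Offline check: compress a sample body in memory, then decode it again.
buffer = io.BytesIO()
archive = gzip.GzipFile(fileobj=buffer, mode="wb")
archive.write(b"hello crawler")
archive.close()
compressed_sample = buffer.getvalue()

print(decode_body(compressed_sample, "gzip") == b"hello crawler")
```

With a live response, the call would be `decode_body(response.read(), response.info().get('Content-Encoding'))`.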

7. Multithreaded Crawling

For faster crawling, a simple thread‑pool using threading and Queue can process tasks concurrently.

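    from threading import Thread
    from Queue import Queue
    from time import sleep

    q = Queue()
    NUM = 2    # number of worker threads
    JOBS = 10  # total tasks

    def do_something_using(arguments):
        # Placeholder for the real work, e.g. fetching a URL
        print arguments

    def working():
        # Each worker pulls tasks from the queue forever
        while True:
            arguments = q.get()
            do_something_using(arguments)
            sleep(1)
            q.task_done()

    # Start the workers as daemon threads so they exit with the main thread
    for i in range(NUM):
        t = Thread(target=working)
        t.setDaemon(True)
        t.start()

    # Queue the tasks, then block until every task has been marked done
    for i in range(JOBS):
        q.put(i)
    q.join()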
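
In the template above, an exception inside the task function kills the worker before `task_done()` is called, so `q.join()` can block forever. The sketch below is one possible variation, not the article's own code: `run_pool` is a hypothetical helper that collects results and always marks the task done. The import fallback is an assumption for Python 3, where the module is named `queue`.

```python
try:
    from Queue import Queue  # Python 2
except ImportError:
    from queue import Queue  # Python 3 fallback (assumption)
from threading import Lock, Thread


def run_pool(handler, jobs, num_workers=2):
    """Run handler(job) for every job and return (job, outcome) pairs.

    If handler raises, the exception object is stored as the outcome.
    Results arrive in completion order, not submission order.
    """
    tasks = Queue()
    results = []
    lock = Lock()

    def worker():
        while True:
            job = tasks.get()
            try:
                outcome = handler(job)
            except Exception as error:
                outcome = error
            finally:
                # Always mark the task done so tasks.join() cannot hang.
                tasks.task_done()
            with lock:
                results.append((job, outcome))

    for _ in range(num_workers):
        thread = Thread(target=worker)
        thread.daemon = True
        thread.start()
    for job in jobs:
        tasks.put(job)
    tasks.join()
    with lock:
        return list(results)


squares = run_pool(lambda n: n * n, range(5))
print(sorted(squares))
```

In a crawler, `handler` would be the function that downloads and processes one URL, and `jobs` the list of URLs.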

The article also mentions page parsing with regular expressions, lxml, and BeautifulSoup, and briefly discusses handling simple captchas (manual or via paid services).
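
As a small illustration of the regular-expression approach to parsing, the sketch below pulls link targets out of an HTML string. The pattern, the `extract_links` helper, and the sample markup are all assumptions made for this example. A regex like this is fragile on real-world HTML, which is where lxml or BeautifulSoup become the better choice.

```python
import re

# Matches the href value of <a ...> tags, with single or double quotes.
LINK_PATTERN = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return every href found in anchor tags, in document order."""
    return LINK_PATTERN.findall(html)


sample_html = ('<p><a href="/a.html">A</a> '
               '<A class="x" HREF=\'http://abcde.com/b\'>B</A></p>')
print(extract_links(sample_html))
```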
