
Eight Essential Techniques for Python Web Scraping with urllib2

This article presents a concise guide to Python web scraping, covering eight practical techniques—including basic GET/POST requests, proxy usage, cookie management, header spoofing, page parsing, captcha handling, gzip compression, and multithreaded crawling—each illustrated with clear code examples.

Python Programming Learning Circle

Python is an excellent language for quickly building web crawlers thanks to its rich ecosystem and its fit for tasks such as rapid web development, data extraction, and automation. The following techniques help you write efficient and maintainable crawlers.

1. Basic page fetching (GET)

import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

2. POST request

import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()

3. Using a proxy IP

When the target site blocks your IP, you can route requests through a proxy using urllib2.ProxyHandler:

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()

4. Cookie handling

Cookies are used by many sites to maintain sessions. The cookielib module together with urllib2 provides transparent cookie support:

import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

You can also add cookies manually to an existing request:

# `request` is assumed to be a previously created urllib2.Request object
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)

5. Spoofing a browser (custom headers)

Some servers reject non‑browser requests, returning HTTP 403. Setting a realistic User‑Agent header (and, for POSTs, a suitable Content‑Type) can avoid this:

import urllib2
url = "http://www.baidu.com"  # any page that rejects non-browser clients
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(url, headers=headers)
print urllib2.urlopen(request).read()

6. Page parsing

After fetching the HTML, you can extract data with regular expressions or with a dedicated parser such as lxml (fast, with XPath support) or BeautifulSoup (pure Python, easy to use); each suits different scenarios.
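As a minimal sketch of the regular-expression approach (the HTML fragment and pattern below are invented for illustration), here is how you might pull link targets and link text out of a fetched page:

```python
import re

# A small HTML fragment standing in for a fetched page
html = '<a href="/page1">First</a><a href="/page2">Second</a>'

# Extract (href, link text) pairs from simple anchor tags.
# Regexes work for regular, predictable markup like this;
# for messy real-world HTML, prefer lxml or BeautifulSoup.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # [('/page1', 'First'), ('/page2', 'Second')]
```

A regex is the lightest-weight option and needs no third-party packages, but it breaks easily on nested or malformed markup, which is where the dedicated parsers earn their keep.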

7. Captcha handling

Simple captchas can be solved with basic image processing; complex ones (such as those on 12306, China's rail-ticketing site) usually require third‑party services that return human‑solved answers, typically for a fee.

8. Gzip compression

Many web services send compressed responses. Inform the server you accept gzip and then decompress the payload:

import urllib2
import gzip
import StringIO

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')  # tell the server we accept gzip
opener = urllib2.build_opener()
f = opener.open(request)

# Decompress the gzip-encoded response body
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()

9. Multithreaded concurrent crawling

To speed up crawling, a simple thread pool can be built with threading and Queue:

from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs to enqueue

def do_something(arg):
    print arg

def worker():
    while True:
        arg = q.get()
        do_something(arg)
        sleep(1)
        q.task_done()

# Start daemon workers so they exit with the main thread
for i in range(NUM):
    t = Thread(target=worker)
    t.setDaemon(True)
    t.start()

# Enqueue jobs, then block until every job has been processed
for i in range(JOBS):
    q.put(i)

q.join()

Together, these techniques form a solid foundation for building reliable Python web crawlers.

Tags: proxy, multithreading, gzip, web scraping, cookies, urllib2
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
