Eight Essential Techniques for Python Web Scraping with urllib2
This article presents a concise guide to Python web scraping, covering eight practical techniques—including basic GET/POST requests, proxy usage, cookie management, header spoofing, page parsing, captcha handling, gzip compression, and multithreaded crawling—each illustrated with clear code examples.
Python is an excellent language for quickly building web crawlers thanks to its rich ecosystem, and it is equally at home in rapid web development, data extraction, and automation tasks. The examples below use Python 2's urllib2; in Python 3 the equivalent functionality lives in urllib.request. The following eight techniques help you write efficient and maintainable crawlers.
1. Basic page fetching (GET)
import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
2. POST request
import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
3. Using a proxy IP
When the target site blocks your IP, you can route requests through a proxy using urllib2.ProxyHandler:
import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
4. Cookie handling
Cookies are used by many sites to maintain sessions. The cookielib module together with urllib2 provides transparent cookie support:
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
You can also add cookies manually:
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request = urllib2.Request('http://XXXX')
request.add_header("Cookie", cookie)
5. Spoofing a browser (custom headers)
Some servers reject non-browser requests, returning HTTP 403. Setting a realistic User-Agent and Content-Type can avoid this:
import urllib2
url = 'http://xxxx.com'
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(url, headers=headers)
print urllib2.urlopen(request).read()
6. Page parsing
After fetching HTML, you can extract data with regular expressions or with dedicated parsers such as lxml (fast, with XPath support) and BeautifulSoup (pure Python, easy to use); each suits different scenarios.
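As a minimal sketch of the regular-expression approach (the HTML snippet and the pattern here are invented for illustration), the standard-library re module can pull links out of a fetched page:

```python
import re

# A stand-in for a real response body returned by urlopen().read().
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# Capture the href value and the link text of each anchor tag.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # [('/page1', 'First'), ('/page2', 'Second')]
```

For anything beyond simple, regular markup, a real parser (lxml or BeautifulSoup) is more robust than regular expressions.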
7. Captcha handling
Simple captchas can be solved with basic image processing; for complex ones (e.g., 12306) you may need third‑party services that provide human‑solved answers, usually at a cost.
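As a hedged sketch of what "basic image processing" means here (the pixel grid is invented; real code would load pixel values with a library such as Pillow), the usual first step is binarization, mapping each grayscale pixel to ink or background around a threshold before segmenting characters:

```python
THRESHOLD = 128  # illustrative cutoff between ink and background

def binarize(gray_pixels, threshold=THRESHOLD):
    # Pixels darker than the threshold become 1 (ink), the rest 0.
    return [[1 if p < threshold else 0 for p in row] for row in gray_pixels]

# A tiny invented 3x4 grayscale grid standing in for a captcha image.
image = [
    [250, 30, 40, 245],
    [240, 20, 25, 250],
    [255, 35, 45, 240],
]
print(binarize(image))  # [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
```

After binarization, a simple recognizer segments the grid into characters and compares them against known glyphs; complex captchas resist this, which is why paid solving services exist.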
8. Gzip compression
Many web services send compressed responses. Inform the server you accept gzip and then decompress the payload:
import urllib2, StringIO, gzip
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
9. Multithreaded concurrent crawling
To speed up crawling, a simple thread pool can be built with threading and Queue:
from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs to process

def do_something(arg):
    print arg

def worker():
    while True:
        arg = q.get()
        do_something(arg)
        sleep(1)
        q.task_done()

for i in range(NUM):
    t = Thread(target=worker)
    t.setDaemon(True)  # daemon threads exit when the main thread does
    t.start()

for i in range(JOBS):
    q.put(i)
q.join()  # block until every queued job has been marked done
These eight techniques form a solid foundation for building reliable Python web crawlers.