Eight Essential Techniques for Python Web Scraping with urllib2
This article presents a concise guide to Python web scraping, covering eight practical techniques—including basic GET/POST requests, proxy usage, cookie management, header spoofing, page parsing, captcha handling, gzip compression, and multithreaded crawling—each illustrated with clear code examples.
Python is an excellent language for quickly building web crawlers thanks to its rich ecosystem, and it is equally at home in rapid web development, data extraction, and automation tasks. The examples below use Python 2's urllib2; in Python 3 the equivalent functionality lives in urllib.request. The following eight techniques help you write efficient and maintainable crawlers.
1. Basic page fetching (GET)
import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
2. POST request
import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
3. Using a proxy IP
When the target site blocks your IP, you can route requests through a proxy using urllib2.ProxyHandler:
import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
4. Cookie handling
Cookies are used by many sites to maintain sessions. The cookielib module together with urllib2 provides transparent cookie support:
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
You can also add cookies manually:
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request = urllib2.Request('http://XXXX')
request.add_header("Cookie", cookie)
5. Spoofing a browser (custom headers)
Some servers reject non-browser requests, returning HTTP 403. Setting a realistic User-Agent and Content-Type can avoid this:
import urllib2
url = 'http://xxxx.com'
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(url, headers=headers)
print urllib2.urlopen(request).read()
6. Page parsing
After fetching HTML, you can extract data with regular expressions or with dedicated parsers such as lxml (fast, with XPath support) and BeautifulSoup (pure Python, easy to use); each suits different scenarios.
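As a minimal sketch of the regular-expression approach (the HTML snippet and the pattern here are invented for illustration), the standard-library re module can pull links out of a fetched page:

```python
import re

# A stand-in for a real response body returned by urlopen().read().
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# Capture the href value and the link text of each anchor tag.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # [('/page1', 'First'), ('/page2', 'Second')]
```

For anything beyond simple, regular markup, a real parser (lxml or BeautifulSoup) is more robust than regular expressions.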
7. Captcha handling
Simple captchas can be solved with basic image processing; for complex ones (e.g., 12306) you may need third‑party services that provide human‑solved answers, usually at a cost.
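As a hedged sketch of what "basic image processing" means here (the pixel grid is invented; real code would load pixel values with a library such as Pillow), the usual first step is binarization, mapping each grayscale pixel to ink or background around a threshold before segmenting characters:

```python
THRESHOLD = 128  # illustrative cutoff between ink and background

def binarize(gray_pixels, threshold=THRESHOLD):
    # Pixels darker than the threshold become 1 (ink), the rest 0.
    return [[1 if p < threshold else 0 for p in row] for row in gray_pixels]

# A tiny invented 3x4 grayscale grid standing in for a captcha image.
image = [
    [250, 30, 40, 245],
    [240, 20, 25, 250],
    [255, 35, 45, 240],
]
print(binarize(image))  # [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
```

After binarization, a simple recognizer segments the grid into characters and compares them against known glyphs; complex captchas resist this, which is why paid solving services exist.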
8. Gzip compression
Many web services send compressed responses. Inform the server you accept gzip and then decompress the payload:
import urllib2, StringIO, gzip
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
9. Multithreaded concurrent crawling
To speed up crawling, a simple thread pool can be built with threading and Queue:
from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs to process

def do_something(arg):
    print arg

def worker():
    while True:
        arg = q.get()
        do_something(arg)
        sleep(1)
        q.task_done()

for i in range(NUM):
    t = Thread(target=worker)
    t.setDaemon(True)  # daemon threads exit when the main thread does
    t.start()

for i in range(JOBS):
    q.put(i)
q.join()  # block until every queued job has been marked done
These eight techniques form a solid foundation for building reliable Python web crawlers.