Python Web Scraping Techniques: GET/POST Requests, Proxy IP, Cookies, Header Spoofing, Gzip Compression, and Multithreading
This article provides a comprehensive Python web‑scraping guide covering basic GET/POST requests with urllib2, proxy handling, cookie management, header manipulation to mimic browsers, gzip compression handling, regular‑expression and library parsing, simple captcha strategies, and a multithreaded thread‑pool example.
Python is widely used for rapid web development, crawling, and automation; this guide summarizes reusable crawling techniques.
Basic page fetching
GET request example:
import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

POST request example:
import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()

Using proxy IP
When an IP is blocked by the target site, requests can be routed through a proxy via ProxyHandler:
import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()

Cookie handling
Cookies store session data; cookielib works with urllib2 to manage them:
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

Manual cookie addition:
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)

Impersonating a browser
Some servers reject crawlers with HTTP 403; setting a realistic User‑Agent (and, for POST requests, a matching Content‑Type) header can get past this check:
import urllib2
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(url='http://my.oschina.net/jhao104/blog?catalog=3463517', headers=headers)
print urllib2.urlopen(request).read()

Gzip compression
To receive compressed data, send an Accept-Encoding: gzip request header and decompress the response body:
import urllib2
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
import StringIO
import gzip
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()

Page parsing
Regular expressions are powerful for extracting data; useful resources include:
Regex tutorial: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Online regex tester: http://tool.oschina.net/regex/
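As a small illustration of regex-based extraction (a Python 3 sketch; the HTML fragment and pattern are hypothetical example data, not from the article):

```python
import re

# A fragment of HTML to scrape (hypothetical example data)
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# Capture the href value and link text of every anchor tag
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
links = pattern.findall(html)

for href, text in links:
    print(href, text)
```

With two capturing groups, findall returns a list of (href, text) tuples, one per matched anchor.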
Parsing libraries such as lxml (fast, XPath support) and BeautifulSoup (pure Python, easy to use) are recommended:
lxml guide: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup guide: http://cuiqingcai.com/1319.html
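lxml and BeautifulSoup are third-party packages; as a dependency-free illustration of the same idea, the standard library's html.parser can also walk a page's tags (a Python 3 sketch with hypothetical example data):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/blog/1">Post</a> <a href="/blog/2">Other</a></p>')
print(parser.links)
```

For real pages, lxml's XPath support or BeautifulSoup's searching API is far more convenient than hand-rolling a parser subclass like this.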
Captcha handling
Simple captchas can be recognized programmatically; for complex ones (e.g., 12306) a paid third‑party solving service may be required.
Multithreaded concurrent crawling
When single‑threaded crawling is too slow, a thread pool can improve throughput. Below is a basic thread‑pool template that prints numbers 0‑9 concurrently:
from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs

def do_something_using(arguments):
    print arguments

def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for i in range(JOBS):
    q.put(i)
q.join()

Although Python's GIL limits CPU‑bound threading, multithreading is effective for I/O‑bound crawling tasks.
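In Python 3, the same worker-pool pattern can be written more compactly with concurrent.futures (a sketch; fetch_page is a hypothetical stand-in for the real download function):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(job_id):
    # Placeholder for the real I/O-bound download; just returns its argument
    return job_id

# Two worker threads process ten jobs, mirroring NUM = 2 and JOBS = 10 above
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch_page, range(10)))

print(results)  # the ten job ids, in submission order
```

Unlike the hand-rolled daemon-thread loop, the executor joins its workers automatically when the with block exits, and map preserves submission order in the results.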