Essential Python Web Scraping Techniques: GET/POST Requests, Proxy IPs, Cookie Handling, Header Spoofing, Gzip Compression, and Multithreading
This article presents a comprehensive guide to Python web scraping, covering basic GET and POST requests with urllib2, using proxy IPs, managing cookies, disguising as a browser via custom headers, handling gzip-compressed responses, and accelerating crawls with a simple multithreaded worker pool.
Python is a versatile language for rapid web development, crawling, and automation; it can be used to build simple websites, posting scripts, email bots, and basic captcha solvers.
The article outlines eight essential techniques for efficient web crawling.
1. Basic Page Retrieval
Using urllib2 to perform a simple GET request:
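For readers on Python 3, where urllib2 was folded into urllib.request, a hedged equivalent of the same GET (shown here only constructing the request, so nothing is fetched):

```python
import urllib.request

url = "http://www.baidu.com"
request = urllib.request.Request(url)  # with no data, this defaults to a GET request
print(request.get_method())            # -> GET
# response = urllib.request.urlopen(request)  # actually performs the request
# print(response.read())
```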
<code>import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()</code>
2. POST Requests
Sending form data with urllib and urllib2:
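In Python 3 the same form POST uses urllib.parse.urlencode and urllib.request; a sketch (note the byte-encoding step Python 3 requires, and that attaching data turns the request into a POST):

```python
import urllib.parse
import urllib.request

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.parse.urlencode(form).encode('utf-8')  # must be bytes in Python 3
request = urllib.request.Request(url, data=form_data)     # data present -> POST
print(request.get_method())                               # -> POST
# response = urllib.request.urlopen(request)
```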
<code>import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()</code>
3. Using Proxy IPs
When an IP is blocked, a proxy can be set via urllib2.ProxyHandler:
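The Python 3 counterpart of the proxy setup, again only building and installing the opener (no request is actually sent here):

```python
import urllib.request

proxy = urllib.request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # subsequent urlopen calls go through the proxy
# response = urllib.request.urlopen('http://www.baidu.com')
# print(response.read())
```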
<code>import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()</code>
4. Cookie Management
The cookielib module creates a CookieJar that works with urllib2 to store and send cookies automatically:
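In Python 3, cookielib became http.cookiejar; a sketch of the same wiring (the jar starts empty and fills as responses set cookies):

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(cookie_support)
urllib.request.install_opener(opener)
# content = urllib.request.urlopen('http://XXXX').read()  # cookies land in cookie_jar
```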
<code>import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()</code>
Manual cookie addition example:
<code>cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)</code>
5. Spoofing Browser Headers
Some servers reject non‑browser requests; adding a realistic User‑Agent (and other headers) helps avoid HTTP 403 rejections:
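The same header spoofing in Python 3, constructing the request only (note that urllib normalizes stored header names to capitalized form, e.g. 'User-agent'):

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib.request.Request('http://my.oschina.net/jhao104/blog?catalog=3463517',
                                 headers=headers)
print(request.get_header('User-agent'))  # the header as urllib stores it
# print(urllib.request.urlopen(request).read())
```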
<code>import urllib2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request('http://my.oschina.net/jhao104/blog?catalog=3463517', headers=headers)
print urllib2.urlopen(request).read()</code>
6. Handling Gzip Compression
Tell the server you accept gzip, then decompress the response:
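In Python 3 the decompression side is a one-liner; a self-contained round-trip illustrating what gzip.decompress does to such a response body (the sample HTML is made up):

```python
import gzip

# Pretend this is a gzip-compressed HTTP body received after sending
# an 'Accept-Encoding: gzip' request header.
compressed_data = gzip.compress(b'<html>hello</html>')
html = gzip.decompress(compressed_data)
print(html)  # -> b'<html>hello</html>'
```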
<code>import urllib2, httplib
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)</code>
<code>import StringIO
import gzip
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()</code>
7. Page Parsing
Regular expressions, lxml, and BeautifulSoup are the primary tools for extracting data from HTML/XML; lxml offers fast XPath support, while BeautifulSoup provides a pure‑Python, easy‑to‑use API.
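A minimal illustration of the regex approach on a hypothetical snippet (for anything non-trivial, prefer lxml or BeautifulSoup — regexes are brittle against real-world HTML):

```python
import re

# Hypothetical HTML fragment to parse
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # -> [('/page1', 'First'), ('/page2', 'Second')]
```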
8. Multithreaded Concurrent Crawling
To speed up crawling, a simple thread‑pool can be used. The example below creates a queue of jobs and processes them with a configurable number of worker threads:
<code>from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs to enqueue

def do_something_using(arguments):
    print arguments

def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for i in range(JOBS):
    q.put(i)

q.join()</code>
Although Python's GIL limits CPU‑bound threading, network‑bound crawling benefits noticeably from this approach.
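For Python 3, where the Queue module became queue and print is a function, a direct port of the worker pool above — collecting results into a list so the effect is observable (the shortened sleep stands in for per-request network latency):

```python
from threading import Thread, Lock
from queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs to enqueue
results = []
lock = Lock()

def do_something_using(arguments):
    with lock:                    # guard the shared list across workers
        results.append(arguments)

def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(0.01)               # simulate network latency
        q.task_done()

for i in range(NUM):
    t = Thread(target=working, daemon=True)  # daemons exit with the main thread
    t.start()

for i in range(JOBS):
    q.put(i)

q.join()  # block until every job has been marked done
print(sorted(results))  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```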