
Essential Python Web Scraping Techniques: GET/POST Requests, Proxy IPs, Cookie Handling, Header Spoofing, Gzip Compression, and Multithreading

This article presents a comprehensive guide to Python web scraping, covering basic GET and POST requests with urllib2, using proxy IPs, managing cookies, disguising as a browser via custom headers, handling gzip-compressed responses, and accelerating crawls with a simple multithreaded worker pool.

Python Programming Learning Circle

Python is a versatile language for rapid web development, crawling, and automation; it can be used to build simple websites, posting scripts, email bots, and basic captcha solvers.

The article outlines eight essential techniques for efficient web crawling.

1. Basic Page Retrieval

Using urllib2 (the Python 2 module; Python 3 merged it into urllib.request) to perform a simple GET request:

<code>import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()</code>
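For readers on Python 3, where `urllib2.urlopen` lives at `urllib.request.urlopen`, a rough equivalent looks like the sketch below. A `data:` URL is used here purely so the snippet runs without network access; in practice you would pass a real `http://` URL such as the one above.

```python
import urllib.request

# urllib2.urlopen is urllib.request.urlopen in Python 3; a data: URL is used
# here so the sketch runs without network access.
url = "data:text/plain,hello"  # swap in e.g. "http://www.baidu.com" in practice
response = urllib.request.urlopen(url)
body = response.read()  # Python 3 returns bytes, not str
print(body)
```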

2. POST Requests

Sending form data with urllib and urllib2:

<code>import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()</code>
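In Python 3, `urllib.urlencode` moved to `urllib.parse.urlencode`, and the request body must be bytes rather than a string. A minimal sketch (the article's placeholder host is kept, so the request is built but deliberately not sent):

```python
import urllib.parse
import urllib.request

# Python 3: urlencode lives in urllib.parse, and Request data must be bytes.
url = "http://abcde.com"  # placeholder host from the article
form = {"name": "abc", "password": "1234"}
form_data = urllib.parse.urlencode(form).encode("ascii")
request = urllib.request.Request(url, data=form_data)
# urllib.request.urlopen(request)  # not sent here: the host is a placeholder
print(request.get_method())  # "POST": attaching data switches the method
```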

3. Using Proxy IPs

When an IP is blocked, a proxy can be set via urllib2.ProxyHandler:

<code>import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()</code>
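The same dict-of-schemes API exists under `urllib.request` in Python 3; a sketch that configures the proxy without fetching anything:

```python
import urllib.request

# Same ProxyHandler API, now under urllib.request; nothing is fetched here.
proxy = urllib.request.ProxyHandler({"http": "127.0.0.1:8087"})
opener = urllib.request.build_opener(proxy)
# urllib.request.install_opener(opener)  # optional: make the proxy global
print(proxy.proxies)
```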

4. Cookie Management

The cookielib module creates a CookieJar that works with urllib2 to store and send cookies automatically:

<code>import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()</code>

Alternatively, a cookie string can be attached manually to a request:

<code>import urllib2
request = urllib2.Request('http://XXXX')  # same placeholder URL as above
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)</code>
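Python 3 renames cookielib to http.cookiejar; the pattern is otherwise identical. A minimal sketch (the placeholder URL from above is not actually opened):

```python
import http.cookiejar
import urllib.request

# cookielib became http.cookiejar in Python 3. Cookies set by any response
# opened through `opener` are stored in `jar` and sent back automatically.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open("http://XXXX")  # not called here: the URL is a placeholder
print(len(jar))  # no cookies until a response sets one
```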

5. Spoofing Browser Headers

Some servers reject non‑browser requests; adding a realistic User‑Agent (and other headers) avoids HTTP 403 errors:

<code>import urllib2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request('http://my.oschina.net/jhao104/blog?catalog=3463517', headers=headers)
print urllib2.urlopen(request).read()</code>
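A Python 3 version of the same header spoof, using the article's User-Agent string; the request is built but not sent here:

```python
import urllib.request

# Python 3 version of the header spoof; the request is built but not sent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"
}
request = urllib.request.Request(
    "http://my.oschina.net/jhao104/blog?catalog=3463517", headers=headers
)
# urllib.request.urlopen(request).read()  # would send the spoofed User-Agent
print(request.get_header("User-agent"))
```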

6. Handling Gzip Compression

Tell the server you accept gzip, then decompress the response:

<code>import urllib2
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)</code>
<code>import StringIO
import gzip
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()</code>
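In Python 3, response bodies are bytes, so StringIO.StringIO becomes io.BytesIO, and gzip.decompress() covers the one-shot case directly. The sketch below uses a locally compressed payload as a stand-in for the body of a gzipped HTTP response:

```python
import gzip
import io

# A locally compressed payload stands in for the body of a gzipped response.
compressed = gzip.compress(b"hello gzip")
print(gzip.decompress(compressed))  # one-shot decompression in Python 3
gzipper = gzip.GzipFile(fileobj=io.BytesIO(compressed))  # stream style
print(gzipper.read())
```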

7. Page Parsing

Regular expressions, lxml, and BeautifulSoup are the primary tools for extracting data from HTML/XML; lxml offers fast XPath support, while BeautifulSoup provides a pure‑Python, easy‑to‑use API.
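Since lxml and BeautifulSoup are third-party packages, a minimal sketch of the same idea using only the standard library's html.parser, collecting every href from `<a>` tags:

```python
from html.parser import HTMLParser

# Minimal stdlib sketch of HTML extraction: collect every href in <a> tags.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="/a">A</a><a href="/b">B</a></body></html>')
print(parser.links)  # ['/a', '/b']
```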

8. Multithreaded Concurrent Crawling

To speed up crawling, a simple thread‑pool can be used. The example below creates a queue of jobs and processes them with a configurable number of worker threads:

<code>from threading import Thread
from Queue import Queue
from time import sleep
q = Queue()
NUM = 2
JOBS = 10

def do_something_using(arguments):
    print arguments

def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for i in range(JOBS):
    q.put(i)

q.join()</code>

Although Python's GIL limits CPU‑bound threading, network‑bound crawling benefits noticeably from this approach.
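For Python 3, the Queue module becomes queue and print becomes a function; for a fixed batch of I/O-bound jobs, concurrent.futures.ThreadPoolExecutor is the modern idiom. A sketch with a placeholder job standing in for a real fetch-and-parse function:

```python
from concurrent.futures import ThreadPoolExecutor

def do_something_using(arguments):
    return arguments * 2  # placeholder for a real fetch-and-parse job

NUM = 2    # worker threads
JOBS = 10  # tasks to process

# map() distributes the jobs across NUM threads and preserves input order.
with ThreadPoolExecutor(max_workers=NUM) as pool:
    results = list(pool.map(do_something_using, range(JOBS)))

print(results)
```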

Tags: Proxy, Python, Multithreading, gzip, web scraping, cookies, urllib2
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
