Python Web Scraping Techniques: GET/POST Requests, Proxy IP, Cookies, Header Spoofing, Gzip Compression, and Multithreading
This article provides a comprehensive Python web‑scraping guide covering basic GET/POST requests with urllib2, proxy handling, cookie management, header manipulation to mimic browsers, gzip compression handling, regular‑expression and library parsing, simple captcha strategies, and a multithreaded thread‑pool example.
Python is widely used for rapid web development, crawling, and automation; this guide summarizes reusable crawling techniques.
Basic page fetching
GET request example:
import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

POST request example:
import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()

Using proxy IP
When an IP is blocked by the target site, requests can be routed through a proxy via ProxyHandler:
import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()

Cookie handling
Cookies store session data; cookielib works with urllib2 to manage them:
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

Manual cookie addition:
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)

Impersonating a browser
Some servers reject crawlers with HTTP 403; setting a realistic User‑Agent (and, for POST requests, a matching Content‑Type) header can get past this check:
import urllib2
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(url='http://my.oschina.net/jhao104/blog?catalog=3463517', headers=headers)
print urllib2.urlopen(request).read()

Gzip compression
To receive compressed data, send an Accept-Encoding: gzip request header and decompress the response body:
import urllib2
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
import StringIO
import gzip
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()

Page parsing
Regular expressions are powerful for extracting data; useful resources include:
Regex tutorial: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Online regex tester: http://tool.oschina.net/regex/
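As a small illustration of regex-based extraction (a Python 3 sketch; the HTML fragment and pattern are hypothetical example data, not from the article):

```python
import re

# A fragment of HTML to scrape (hypothetical example data)
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# Capture the href value and link text of every anchor tag
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
links = pattern.findall(html)

for href, text in links:
    print(href, text)
```

With two capturing groups, findall returns a list of (href, text) tuples, one per matched anchor.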
Parsing libraries such as lxml (fast, XPath support) and BeautifulSoup (pure Python, easy to use) are recommended:
lxml guide: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup guide: http://cuiqingcai.com/1319.html
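lxml and BeautifulSoup are third-party packages; as a dependency-free illustration of the same idea, the standard library's html.parser can also walk a page's tags (a Python 3 sketch with hypothetical example data):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/blog/1">Post</a> <a href="/blog/2">Other</a></p>')
print(parser.links)
```

For real pages, lxml's XPath support or BeautifulSoup's searching API is far more convenient than hand-rolling a parser subclass like this.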
Captcha handling
Simple captchas can be recognized programmatically; for complex ones (e.g., 12306) a paid third‑party solving service may be required.
Multithreaded concurrent crawling
When single‑threaded crawling is too slow, a thread pool can improve throughput. Below is a basic thread‑pool template that prints numbers 0‑9 concurrently:
from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()
NUM = 2    # number of worker threads
JOBS = 10  # number of jobs

def do_something_using(arguments):
    print arguments

def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for i in range(JOBS):
    q.put(i)
q.join()

Although Python's GIL limits CPU‑bound threading, multithreading is effective for I/O‑bound crawling tasks.
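In Python 3, the same worker-pool pattern can be written more compactly with concurrent.futures (a sketch; fetch_page is a hypothetical stand-in for the real download function):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(job_id):
    # Placeholder for the real I/O-bound download; just returns its argument
    return job_id

# Two worker threads process ten jobs, mirroring NUM = 2 and JOBS = 10 above
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch_page, range(10)))

print(results)  # the ten job ids, in submission order
```

Unlike the hand-rolled daemon-thread loop, the executor joins its workers automatically when the with block exits, and map preserves submission order in the results.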