Python urllib2 Web Scraping: Basic Requests, Proxy, Cookies, Form Submission, and Header Spoofing

This article explains how to use Python's urllib2 library for web scraping, covering basic page fetching, proxy configuration, cookie handling, form submission, header manipulation, anti‑hotlink techniques, and advanced methods like Selenium to bypass restrictions.

Python Programming Learning Circle

1. Basic Fetch

import urllib2
content = urllib2.urlopen('http://XXXX').read()
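
A bare urlopen call can block for a long time or raise on a bad URL. The sketch below wraps it in a hypothetical helper, fetch (not part of urllib2), that adds a timeout and returns None on failure. The try/except import is an assumption added so the same code also runs on Python 3, where urllib2's functions live in urllib.request.

```python
try:
    import urllib2                        # Python 2, as used in this article
    from urllib2 import URLError
except ImportError:
    import urllib.request as urllib2      # Python 3 home of the same functions
    from urllib.error import URLError


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
    except URLError as exc:
        # HTTPError (4xx/5xx responses) is a subclass of URLError,
        # so it is caught here as well.
        print('fetch failed for %s: %s' % (url, exc))
        return None
    try:
        return response.read()
    finally:
        response.close()
```

A call such as fetch('http://XXXX') then yields either the page bytes or None, which lets a crawl loop skip a failed page instead of crashing.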

2. Using a Proxy Server

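import urllib2
# Route every http:// request through the proxy at XX.XX.XX.XX:XXXX
proxy_support = urllib2.ProxyHandler({'http':'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
# Make this opener the default used by urllib2.urlopen()
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()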
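
install_opener changes the behavior of every later urllib2.urlopen call in the process. When only some requests should go through the proxy, one option is to keep the opener in a variable and call its open method directly. The following is a minimal sketch of that idea; make_proxy_opener and the proxy address are hypothetical, and the try/except import is an assumption added for Python 3 compatibility.

```python
try:
    import urllib2                        # Python 2
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalent

# Hypothetical proxy address, for illustration only.
PROXY = 'http://127.0.0.1:8080'


def make_proxy_opener(proxy_url):
    """Build an opener that sends http:// requests through proxy_url."""
    handler = urllib2.ProxyHandler({'http': proxy_url})
    return urllib2.build_opener(handler)


opener = make_proxy_opener(PROXY)
# Requests made with opener.open(...) use the proxy, while plain
# urllib2.urlopen(...) calls are left unchanged:
#     content = opener.open('http://XXXX', timeout=10).read()
```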

3. Handling Login‑Required Sites

3.1 Cookie handling

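import urllib2, cookielib
# HTTPCookieProcessor stores cookies from responses in the CookieJar
# and sends them back on later requests
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
# Make this opener the default used by urllib2.urlopen()
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()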
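
The snippet above creates the CookieJar inline, so the script has no way to look at the cookies afterwards. A small variation, sketched here with a hypothetical helper make_cookie_opener, keeps a reference to the jar so its contents can be inspected after a request. The try/except import is an assumption added for Python 3, where cookielib is named http.cookiejar.

```python
try:
    import urllib2, cookielib             # Python 2
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalents
    import http.cookiejar as cookielib


def make_cookie_opener():
    """Return (opener, jar); cookies the opener receives are stored in jar."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# After a request such as opener.open('http://XXXX'), the cookies set by
# the server can be listed with:
#     for cookie in jar:
#         print('%s=%s' % (cookie.name, cookie.value))
```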

To use a proxy and cookies together, pass both handlers to build_opener:

opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)

3.2 Form submission

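# -*- coding: utf-8 -*-  (needed on line 1 or 2 of the file for the non-ASCII string below)
import urllib, urllib2
# fk is not defined in this snippet and must be set beforehand
# (presumably a form token taken from the login page)
postdata = urllib.urlencode({
    'username':'XXXXX',
    'password':'XXXXX',
    'continueURI':'http://www.verycd.com/',
    'fk':fk,
    'login_submit':'登录'  # the submit button's value ("Log in"), sent as-is
})
# Passing data makes urllib2 send a POST instead of a GET
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata
)
result = urllib2.urlopen(req).read()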
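
The same pattern can be checked without contacting any server, because a Request object can be built and inspected before it is sent. The sketch below uses a hypothetical helper, build_login_request, together with a made-up URL and credentials. The try/except import and the bytes conversion are assumptions added so the code also runs on Python 3, which requires the POST body as bytes.

```python
try:
    import urllib2                        # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalents
    from urllib.parse import urlencode


def build_login_request(url, fields):
    """Encode fields as a form body and return a POST Request for url."""
    body = urlencode(fields)
    if not isinstance(body, bytes):
        body = body.encode('ascii')       # urlencode output is plain ASCII
    return urllib2.Request(url, data=body)


# Hypothetical URL and field values, for illustration only.
req = build_login_request('http://example.com/login',
                          [('username', 'alice'), ('password', 's3cret')])
```

Passing the fields as a list of pairs rather than a dict keeps their order fixed, which makes the encoded body easy to compare against what a browser sends.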

3.3 Spoofing as a Browser

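# Some sites reject the default "Python-urllib/x.y" User-Agent,
# so send a browser's User-Agent string instead
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,  # the encoded form from 3.2
    headers=headers
)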
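
Passing headers to each Request works, but it has to be repeated for every request. One alternative, sketched below, is to set the opener's addheaders list once so that every request made through that opener carries the browser User-Agent. The User-Agent string is the one from the article; the try/except import is an assumption added for Python 3 compatibility.

```python
try:
    import urllib2                        # Python 2
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalent

USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
              'Gecko/20091201 Firefox/3.5.6')

opener = urllib2.build_opener()
# addheaders replaces the opener's default "Python-urllib/x.y" User-Agent
# for every request made through this opener.
opener.addheaders = [('User-Agent', USER_AGENT)]
#     content = opener.open('http://XXXX').read()
```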

3.4 Bypassing Anti‑Hotlink Checks

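# Some sites serve a resource only if the Referer header points to one of their own pages
headers = {
    'Referer':'http://www.cnbeta.com/articles'
}
req = urllib2.Request(url='http://XXXX', headers=headers)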
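
When downloading many resources, the Referer usually has to match the page each resource was found on. A minimal sketch of that pairing follows; build_hotlink_request and the image URL are hypothetical, the Referer value is the one from the article, and the try/except import is an assumption added for Python 3 compatibility.

```python
try:
    import urllib2                        # Python 2
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalent


def build_hotlink_request(resource_url, page_url):
    """Return a Request for resource_url that claims page_url as its Referer."""
    return urllib2.Request(resource_url, headers={'Referer': page_url})


# Hypothetical image URL, for illustration only.
req = build_hotlink_request('http://example.com/images/photo.jpg',
                            'http://www.cnbeta.com/articles')
#     data = urllib2.urlopen(req).read()
```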

3.5 Ultimate Method

If the tricks above still fail, capture a real browser request with a developer tool and send every header the browser sends, or use Selenium (or a similar framework such as PAMIE or Watir) to drive a real browser and let it make the request.
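
The header-copying approach can be sketched as follows: paste the "Name: value" lines from the browser tool into a string, turn them into a dict, and pass that dict to Request. The helper parse_raw_headers, the example URL, and the sample header text are all hypothetical, and the try/except import is an assumption added for Python 3 compatibility.

```python
try:
    import urllib2                        # Python 2
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalent


def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from a browser tool into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(':')
        if sep and name.strip():          # skip lines with no header name
            headers[name.strip()] = value.strip()
    return headers


# Sample header text; the values are illustrative, not captured from a real site.
RAW = """
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6
Accept: text/html,application/xhtml+xml
Accept-Language: en-US,en;q=0.5
Referer: http://www.cnbeta.com/articles
"""

req = urllib2.Request('http://example.com/', headers=parse_raw_headers(RAW))
```

If the copied headers include Accept-Encoding, it is safer to leave that one out, since urllib2 does not decompress gzip responses on its own.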
