Master Python Web Scraping: Proxies, Login, Multithreading, and Captcha Hacks
This guide walks through practical Python web‑scraping techniques using urllib2, covering basic page fetching, proxy usage, cookie handling for logins, form submission, header spoofing, anti‑hotlink tricks, multithreaded crawling, and strategies for bypassing simple captchas, all illustrated with code snippets.
The techniques below come from scripts that share a common theme: web-related tasks that require fetching URLs. Together with the simple crawler project simplecd, they have yielded a good deal of crawling experience.
1. Basic Site Fetching
Use urllib2 to read a page:
import urllib2
content = urllib2.urlopen('http://XXXX').read()
2. Using a Proxy Server
Useful when the IP is blocked or request limits are reached.
import urllib2
proxy_support = urllib2.ProxyHandler({'http':'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
3. Handling Login‑Required Sites
3.1 Cookie Handling
Manage cookies with cookielib:
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
If both a proxy and cookies are needed, combine the two handlers:
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
3.2 Form Submission
Capture the POST fields (e.g., from verycd) and build the data:
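import urllib, urllib2
# fk is not defined in this snippet; it must be set before the form data is built
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    # '登录' ("Log in") is the literal value of the submit button
    'login_submit': '登录'
})
# Passing data makes urllib2 send a POST instead of a GET
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata)
result = urllib2.urlopen(req).read()
3.3 Spoofing a Browser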
Some sites reject crawlers; set a realistic User‑Agent header:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers)
3.4 Anti‑Hotlink Bypass
Anti-hotlink checks look at whether the Referer header points to the site's own pages, so set it to a URL on the target site (example: cnbeta):
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
3.5 Ultimate Trick
If the previous tricks fail, copy every header the browser sends, as observed with a tool such as httpfox, or fall back to Selenium (or similar tools such as pamie or watir) to drive a real browser.
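For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the proxy, cookie, form, and header tricks from sections 2 and 3 can be combined roughly as follows. This is only a sketch under stated assumptions: the helper names make_opener and make_post_request, the proxy address, the example.com URLs, and the Accept header values are all hypothetical and not taken from the original scripts. The network call is left commented out so the sketch runs offline.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical header set; in practice, copy the values your own browser sends.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
                  'Gecko/20091201 Firefox/3.5.6',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}


def make_opener(proxy_url=None, headers=None):
    """Return (opener, cookie_jar) with cookie support and an optional proxy."""
    cookie_jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler({'http': proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    # addheaders is applied to every request sent through this opener,
    # unless the request already sets the same header itself.
    opener.addheaders = list((headers or BROWSER_HEADERS).items())
    return opener, cookie_jar


def make_post_request(url, fields, referer=None):
    """Build a POST request; in Python 3 the encoded form data must be bytes."""
    data = urllib.parse.urlencode(fields).encode('utf-8')
    request = urllib.request.Request(url, data=data)
    if referer:
        request.add_header('Referer', referer)
    return request


# Placeholder proxy address and URLs, for illustration only.
opener, jar = make_opener(proxy_url='http://127.0.0.1:8080')
request = make_post_request(
    'http://example.com/signin',
    {'username': 'XXXXX', 'password': 'XXXXX'},
    referer='http://example.com/',
)
# content = opener.open(request).read()  # uncomment to actually send it
```

Returning the opener instead of calling install_opener keeps the proxy and cookie settings local to the code that asked for them, which may be easier to reason about once several threads or sites are involved.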
4. Multithreaded Crawling
Use a thread pool to fetch pages concurrently. The following is a simple template:
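from threading import Thread
from Queue import Queue
from time import sleep
# q is the job queue, NUM the number of worker threads, JOBS the number of jobs
q = Queue()
NUM = 2
JOBS = 10
# Handle a single job; replace the body with the real work
def do_something_using(arguments):
    print arguments
# Worker loop: keep taking jobs off the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()
# Start NUM daemon threads that wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()
# Put the jobs on the queue
for i in range(JOBS):
    q.put(i)
# Block until every job has been marked done
q.join()
5. Captcha Handling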
Two typical cases:
Google‑style captchas are generally unsolvable without external services.
Simple captchas (limited characters, basic rotation/noise) can be tackled by rotating back, denoising, segmenting characters, extracting features (e.g., PCA), building a feature library, and matching against it.
Some weak captchas can be broken with the second method, achieving high accuracy.
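The pipeline for simple captchas can be illustrated with a toy example. The sketch below is not the author's implementation: it skips rotation correction, and it swaps the PCA features for raw pixel vectors compared by Hamming distance, which is enough to show denoising, segmentation, a feature library, and matching. The glyph shapes, the CAPTCHA grid, and all function names are made up for illustration.

```python
def parse(rows):
    """Turn strings of '#' and '.' into a binary grid (1 = ink)."""
    return [[1 if ch == '#' else 0 for ch in row] for row in rows]


def denoise(grid):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                window = sum(
                    grid[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                )
                if window - 1 == 0:  # only the pixel itself is set
                    out[y][x] = 0
    return out


def segment(grid):
    """Split the grid into glyphs at columns that contain no ink."""
    w = len(grid[0])
    glyphs, start = [], None
    for x in range(w + 1):
        has_ink = x < w and any(row[x] for row in grid)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append([row[start:x] for row in grid])
            start = None
    return glyphs


def features(glyph, width=3):
    """Flatten a glyph into a fixed-length vector, padding or cropping columns."""
    return [v for row in glyph for v in (row + [0] * width)[:width]]


def classify(glyph, library):
    """Return the label whose stored vector has the smallest Hamming distance."""
    vec = features(glyph)
    return min(library,
               key=lambda label: sum(a != b for a, b in zip(vec, library[label])))


def solve(rows, library):
    glyphs = segment(denoise(parse(rows)))
    return ''.join(classify(g, library) for g in glyphs)


# Feature library built from clean, made-up 3x5 glyphs.
TEMPLATES = {
    '7': ['###', '..#', '..#', '..#', '..#'],
    'L': ['#..', '#..', '#..', '#..', '###'],
    'T': ['###', '.#.', '.#.', '.#.', '.#.'],
}
LIBRARY = {label: features(segment(parse(rows))[0])
           for label, rows in TEMPLATES.items()}

# "7 L T" with one isolated noise pixel and one missing pixel in the T.
CAPTCHA = [
    '###...#.....###',
    '..#...#......#.',
    '..#.#.#......#.',
    '..#...#......#.',
    '..#...###......',
]

print(solve(CAPTCHA, LIBRARY))  # 7LT
```

Without the denoise step, the stray pixel would be segmented as a fourth glyph, which is why noise removal comes before segmentation in the pipeline described above.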
6. Summary
All the situations the author has encountered were successfully resolved using the methods described above.
MaGe Linux Operations
