Master Python Web Scraping: Proxies, Login, Multithreading, and Captcha Hacks

This guide walks through practical Python web‑scraping techniques using urllib2, covering basic page fetching, proxy usage, cookie handling for logins, form submission, header spoofing, anti‑hotlink tricks, multithreaded crawling, and strategies for bypassing simple captchas, all illustrated with code snippets.

MaGe Linux Operations

Jul 1, 2014

The scripts behind this guide share a common theme: they are web‑related and always need some way of fetching URLs. Together with simplecd, a simple crawler project, they have yielded a good deal of crawling experience, which is summarized here.

1. Basic Site Fetching

Use urllib2 to read a page:

import urllib2
content = urllib2.urlopen('http://XXXX').read()
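
On Python 3, urllib2 was merged into urllib.request, so the same fetch could be sketched as below. The fetch helper, its timeout default, and the error handling are illustrative additions, not part of the original script. The data: URL in the last line is only there so the example runs without contacting a server.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print("fetch failed for %s: %s" % (url, exc))
        return None


# A data: URL is decoded by urllib itself, so no network access is needed here.
content = fetch("data:text/plain,hello")
```

Returning None on failure keeps a long crawl going when a single page is unreachable; a stricter script might prefer to let the exception propagate.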

2. Using a Proxy Server

Useful when the IP is blocked or request limits are reached.

import urllib2
proxy_support = urllib2.ProxyHandler({'http':'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
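
A possible Python 3 version of the proxy setup is sketched below. It assumes urllib.request in place of urllib2, and the proxy address is a placeholder, not a working server. Unlike the snippet above, it keeps the opener local instead of calling install_opener, so only requests made through this opener use the proxy.

```python
import urllib.request

# Placeholder address (assumption); substitute a proxy you are allowed to use.
PROXY_URL = "http://127.0.0.1:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests through proxy_url."""
    proxy_support = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_support)


opener = build_proxy_opener(PROXY_URL)
# opener.open('http://XXXX').read() would now go through the proxy,
# while plain urllib.request.urlopen() calls stay unaffected.
```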

3. Handling Login‑Required Sites

3.1 Cookie Handling

Manage cookies with cookielib:

import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

If both a proxy and cookies are needed, pass both handlers to build_opener:

opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
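
As a rough Python 3 sketch of the same idea, cookielib becomes http.cookiejar and urllib2 becomes urllib.request. The build_session_opener helper and its proxy_url parameter are hypothetical names introduced for illustration. Returning the cookie jar alongside the opener makes it possible to inspect which cookies a login has set.

```python
import http.cookiejar
import urllib.request


def build_session_opener(proxy_url=None):
    """Return (opener, cookie_jar); cookies set by responses are kept in cookie_jar."""
    cookie_jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url is not None:
        # Optional proxy, combined with cookie support in the same opener.
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url}))
    return urllib.request.build_opener(*handlers), cookie_jar


opener, cookie_jar = build_session_opener()
# After opener.open(...) calls, iterate cookie_jar to see what the site has set.
cookie_names = [cookie.name for cookie in cookie_jar]
```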

3.2 Form Submission

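Capture the POST fields that the browser sends when logging in (the example uses verycd), then build the form data from them:

# -*- coding: utf-8 -*-
# The encoding line above is needed because of the non-ASCII form value below.
import urllib
import urllib2

# fk is not defined in this snippet; assign it before building the form data.
postdata = urllib.urlencode({
    'username':'XXXXX',
    'password':'XXXXX',
    'continueURI':'http://www.verycd.com/',
    'fk':fk,
    'login_submit':'登录'
})
# Passing data makes urllib2 send a POST request instead of a GET.
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata)
result = urllib2.urlopen(req).read()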
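
On Python 3 the form data has to be encoded to bytes before it is attached to the request. The sketch below builds the same request without sending it; build_login_request is a hypothetical helper, and the fk value passed in is a made‑up placeholder, since the real value has to come from wherever the POST fields were captured.

```python
import urllib.parse
import urllib.request

LOGIN_URL = "http://secure.verycd.com/signin/*/http://www.verycd.com/"


def build_login_request(username, password, fk):
    """Build (but do not send) the login POST request."""
    fields = {
        "username": username,
        "password": password,
        "continueURI": "http://www.verycd.com/",
        "fk": fk,
        "login_submit": "登录",
    }
    # urlencode percent-encodes everything, so the result is plain ASCII;
    # Python 3 requires the request body to be bytes, not str.
    postdata = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url=LOGIN_URL, data=postdata)


# The fk argument here is a placeholder, not a real token.
req = build_login_request("XXXXX", "XXXXX", fk="hypothetical-fk-value")
# urllib.request.urlopen(req).read() would submit the form.
```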

3.3 Spoofing a Browser

Some sites reject crawlers; set a realistic User‑Agent header:

headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers)
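
By default urllib identifies itself with a User‑Agent of the form Python-urllib/x.y, which is easy for a site to filter. A minimal Python 3 sketch of overriding it on a single request follows; the http://XXXX URL is the same placeholder used earlier, and the request is only constructed, not sent.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
    "Gecko/20091201 Firefox/3.5.6"
)

req = urllib.request.Request("http://XXXX", headers={"User-Agent": USER_AGENT})

# Request stores header names capitalized, so the key becomes "User-agent".
sent_user_agent = req.get_header("User-agent")
```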

3.4 Anti‑Hotlink Bypass

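Some servers check whether the Referer header of a request points to their own site before serving content. To pass this check, set Referer to a URL on the target site (example: cnbeta):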

headers = {
    'Referer':'http://www.cnbeta.com/articles'
}
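
When every request in a crawl needs the same Referer and User‑Agent, one option on Python 3 is to set them once on the opener instead of on each Request. This is a sketch of that alternative, not the approach shown above; build_browser_like_opener is a hypothetical helper name.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
    "Gecko/20091201 Firefox/3.5.6"
)


def build_browser_like_opener(referer, user_agent=USER_AGENT):
    """Return an opener that adds the given headers to every request it sends."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to each request,
    # unless the request already sets that header itself.
    opener.addheaders = [("User-Agent", user_agent), ("Referer", referer)]
    return opener


opener = build_browser_like_opener("http://www.cnbeta.com/articles")
```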

3.5 Ultimate Trick

If previous tricks fail, copy all headers observed with tools like httpfox, or resort to Selenium (or similar tools such as pamie, watir) to control a real browser.

4. Multithreaded Crawling

Use a thread pool to fetch pages concurrently. The following is a simple template:

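from threading import Thread
from Queue import Queue
from time import sleep

# q is the job queue
# NUM is the number of worker threads
# JOBS is the number of jobs
q = Queue()
NUM = 2
JOBS = 10

# Handle a single job; replace the body with the real fetching logic
def do_something_using(arguments):
    print arguments

# Worker loop: keep taking jobs from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# Start NUM daemon threads that wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# Block until every job has been marked done
q.join()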
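
A Python 3 port of this template might look like the sketch below, where the Queue module is renamed queue. The run_jobs helper, the results dictionary, and the exception capture are additions for illustration; jobs are assumed to be hashable so they can serve as dictionary keys.

```python
import queue
import threading


def run_jobs(jobs, handler, num_workers=2):
    """Run handler(job) for every job on num_workers threads; return {job: result}."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            try:
                outcome = handler(job)
            except Exception as exc:
                # Store the exception so one bad job does not kill the worker.
                outcome = exc
            with lock:
                results[job] = outcome
            # Always mark the job done, otherwise tasks.join() would never return.
            tasks.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    for job in jobs:
        tasks.put(job)
    tasks.join()  # Block until every job has been processed.
    return results


squares = run_jobs(range(10), lambda n: n * n)
```

As in the original template, the daemon workers stay blocked on the queue after the jobs finish and end only when the program exits. In a real crawl, handler would be a function that fetches one URL.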

5. Captcha Handling

Two typical cases:

Google‑style captchas are generally unsolvable without external services.

Simple captchas (limited characters, basic rotation/noise) can be tackled by rotating back, denoising, segmenting characters, extracting features (e.g., PCA), building a feature library, and matching against it.

For weak captchas, this second approach can reach high recognition accuracy.
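
The denoise, segment, and match steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "image" is a list of strings where # is ink, the three‑letter LIBRARY stands in for a real feature library, and plain Hamming distance replaces PCA features. A real captcha would also need image loading and rotation correction, which are left out.

```python
def denoise(rows):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height, width = len(rows), len(rows[0])

    def ink(y, x):
        return 0 <= y < height and 0 <= x < width and rows[y][x] == "#"

    cleaned = []
    for y in range(height):
        line = ""
        for x in range(width):
            neighbours = sum(
                ink(y + dy, x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            line += "#" if ink(y, x) and neighbours > 0 else "."
        cleaned.append(line)
    return cleaned


def segment(rows):
    """Split the image into glyphs at columns that contain no ink."""
    width = len(rows[0])
    glyphs, start = [], None
    for x in range(width + 1):
        has_ink = x < width and any(row[x] == "#" for row in rows)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append(tuple(row[start:x] for row in rows))
            start = None
    return glyphs


def distance(a, b):
    """Hamming distance between same-shaped glyphs; infinity otherwise."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return float("inf")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(rows, library):
    """Label each glyph with the closest template in library (label -> glyph)."""
    return "".join(
        min(library, key=lambda label: distance(glyph, library[label]))
        for glyph in segment(denoise(rows))
    )


# Hypothetical three-letter feature library.
LIBRARY = {
    "T": ("###", ".#.", ".#."),
    "I": ("#", "#", "#"),
    "L": ("#.", "#.", "##"),
}
# "T I L" with one isolated noise pixel in the top row between I and L.
CAPTCHA = [
    "###.#.#.#.",
    ".#..#...#.",
    ".#..#...##",
]
text = recognize(CAPTCHA, LIBRARY)
```

Without the denoise step, the stray pixel would be segmented as a fourth glyph, which is why cleaning comes before segmentation.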

6. Summary

All the situations the author has encountered were successfully resolved using the methods described above.
