Master Python Web Scraping: From Basic Requests to Multithreaded Crawlers

This guide walks through Python web-scraping techniques: basic URL fetching, proxy usage, cookie and form handling, browser impersonation, gzip/deflate support, captcha processing, and multithreading with thread pools and Twisted asynchronous I/O. It closes with practical tips on connection pooling, thread stack size, retries, timeouts and login automation. The code samples target Python 2 (urllib2).

MaGe Linux Operations

After more than three months of using Python, the author has written many web‑related scripts such as proxy grabbers, Discuz auto‑login/post, mail fetchers, simple captcha recognizers, etc. This article summarizes common techniques for web crawling.

1. Basic site fetching

<code>
import urllib2

content = urllib2.urlopen('http://XXXX').read()
</code>

2. Using a proxy server

A proxy is useful when your IP address has been blocked or you have reached a request limit.

<code>
import urllib2

proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)  # make urlopen() use this opener from now on
content = urllib2.urlopen('http://XXXX').read()
</code>

3. Situations that require login

3.1 Cookie handling

<code>
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)  # cookies are now stored and resent automatically
content = urllib2.urlopen('http://XXXX').read()
</code>

If both proxy and cookie are needed, add proxy_support to the opener.
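
Combining the two handlers might look like the following sketch. It is written for Python 3, where `urllib2` and `cookielib` became `urllib.request` and `http.cookiejar`. The function name `build_opener_with_proxy_and_cookies` and the proxy address are placeholders chosen for illustration, not part of the original scripts.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_opener_with_proxy_and_cookies(proxy_url):
    """Return (opener, cookie_jar) with both proxy and cookie support."""
    cookie_jar = CookieJar()
    proxy_support = urllib.request.ProxyHandler({'http': proxy_url})
    cookie_support = urllib.request.HTTPCookieProcessor(cookie_jar)
    # Both handlers go into the same opener, as the text suggests.
    opener = urllib.request.build_opener(proxy_support, cookie_support)
    return opener, cookie_jar


# Hypothetical local proxy address; nothing is contacted until opener.open() is called.
opener, jar = build_opener_with_proxy_and_cookies('http://127.0.0.1:8080')
```

Returning the cookie jar alongside the opener makes it possible to inspect or save the cookies later.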

3.2 Form handling

Use a tool such as Firefox + HttpFox to capture the POST data. Example for VeryCD:

VeryCD login form
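<code>
# -*- coding: utf-8 -*-
import urllib
import urllib2

# fk is a form field whose value is not defined in this snippet; it has to be
# obtained before this code runs, presumably from the captured login form.
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': '登录'
})
req = urllib2.Request(
    'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata  # passing data makes this a POST request
)
result = urllib2.urlopen(req).read()
</code>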

3.3 Browser impersonation

Set a realistic User-Agent header.

<code>
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(url, data=postdata, headers=headers)
</code>

3.4 Anti‑hotlink

Some sites check the Referer header to block hotlinking. Set it to a URL on the target site itself, e.g. for cnbeta:

<code>
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
</code>

3.5 Ultimate method

If all else fails, control a real browser with Selenium, Pamie, Watir, etc.

4. Multithreaded fetching

Simple thread‑pool example that prints numbers 1‑10.

<code>
from threading import Thread
from Queue import Queue
from time import sleep

# Example code omitted for brevity: creates a queue, starts worker threads, and processes tasks.
</code>
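
Since the example body is omitted above, here is a minimal sketch of what such a thread pool could look like. It is not the author's code. It is written for Python 3 (where the `Queue` module is named `queue`), and the names `NUM_WORKERS`, `do_job` and `worker` are chosen for illustration.

```python
from queue import Queue
from threading import Lock, Thread

NUM_WORKERS = 3          # assumed pool size
task_queue = Queue()
results = []
results_lock = Lock()    # keeps printing and the results list consistent


def do_job(n):
    """The work done for one task: here, just print the number."""
    with results_lock:
        results.append(n)
        print(n)


def worker():
    """Pull tasks off the queue forever; daemon threads exit with the program."""
    while True:
        n = task_queue.get()
        try:
            do_job(n)
        finally:
            task_queue.task_done()


for _ in range(NUM_WORKERS):
    Thread(target=worker, daemon=True).start()

for n in range(1, 11):
    task_queue.put(n)

task_queue.join()        # block until every queued task has been processed
```

With more than one worker the numbers may be printed in any order. `task_queue.join()` only guarantees that all ten have been handled.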

5. Captcha handling

Google‑type captchas are usually unsolvable.

Simple captchas with limited characters can be rotated back, denoised, segmented, and matched against a feature library (e.g., PCA).

Some weak captchas can be solved with the method above.
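
The matching step can be illustrated with a deliberately tiny sketch. It assumes the glyphs have already been denoised and segmented into fixed-size binary bitmaps, and it swaps in plain Hamming distance for the PCA features mentioned above. The 3x5 templates and all names (`LIBRARY`, `hamming`, `recognize`) are hypothetical.

```python
def hamming(a, b):
    """Count the positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def recognize(glyph, library):
    """Return the label of the library template closest to glyph."""
    return min(library, key=lambda label: hamming(glyph, library[label]))


# Hypothetical 3x5 templates, flattened row by row ('1' = ink, '0' = background).
LIBRARY = {
    '0': '111101101101111',
    '1': '010110010010111',
    '7': '111001010010010',
}

# A segmented '7' with one noisy pixel flipped in the last position.
noisy_seven = '111001010010011'
print(recognize(noisy_seven, LIBRARY))
```

A real feature library would hold many samples per character and a more robust feature than raw pixels, but the nearest-match idea is the same.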

6. gzip/deflate support

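urllib and urllib2 do not handle compression by default. A custom handler can add the Accept-Encoding header and decode the response.

<code>
import urllib2
import zlib
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip/deflate capabilities to urllib2 requests"""

    # Advertise the encodings this handler can decode
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # Decode the response body according to Content-Encoding
    def http_response(self, req, resp):
        old_resp = resp
        encoding = resp.headers.get("content-encoding")
        if encoding == "gzip":
            body = GzipFile(fileobj=StringIO(resp.read()), mode="r")
        elif encoding == "deflate":
            data = resp.read()
            try:
                data = zlib.decompress(data)  # zlib-wrapped deflate
            except zlib.error:
                data = zlib.decompress(data, -zlib.MAX_WBITS)  # raw deflate
            body = StringIO(data)
        else:
            return resp
        resp = urllib2.addinfourl(body, old_resp.headers, old_resp.url, old_resp.code)
        resp.msg = old_resp.msg
        return resp

encoding_support = ContentEncodingProcessor()
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)
content = opener.open(url).read()
</code>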
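
The decoding step can also be tried on its own, without a network connection. The sketch below is written for Python 3 and uses a hypothetical helper name, `decode_body`. It handles gzip as well as both flavors of deflate that servers send in practice (zlib-wrapped and raw).

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        try:
            return zlib.decompress(body)                    # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)   # raw deflate
    return body  # no encoding, or one this sketch does not handle


original = b'<html>' + b'spam ' * 100 + b'</html>'
assert decode_body(gzip.compress(original), 'gzip') == original
assert decode_body(zlib.compress(original), 'deflate') == original
```

In a real fetcher the second argument would come from the response's `Content-Encoding` header.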

7. More convenient multithreading

7.1 Asynchronous I/O with Twisted

Use twisted.web.client.getPage with callbacks.

<code>
from twisted.web.client import getPage
from twisted.internet import reactor

links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]

def parse_page(data, url):
    print len(data), url

def fetch_error(error, url):
    print error.getErrorMessage(), url

# Start all downloads; each one reports through its own callback or errback
for url in links:
    getPage(url, timeout=5) \
        .addCallback(parse_page, url) \
        .addErrback(fetch_error, url)

reactor.callLater(5, reactor.stop)  # stop the reactor after 5 seconds
reactor.run()
</code>

7.2 Simple multithreaded fetcher class

Implementation using urllib2, threading and Queue. Core methods push, pop, threadget are shown.

<code>
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time

class Fetcher:
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.lock = Lock()      # protects self.running
        self.q_req = Queue()    # pending requests
        self.q_ans = Queue()    # finished (request, response body) pairs
        self.threads = threads
        self.running = 0        # initialized before the workers start
        for i in range(threads):
            t = Thread(target=self.threadget)
            t.setDaemon(True)
            t.start()

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:
                self.running += 1
            try:
                ans = self.opener.open(req).read()
            except Exception as what:
                ans = ''
                print what
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()
            time.sleep(0.1)  # brief pause between requests
</code>
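
To show how `push` and `pop` fit together, here is a compact Python 3 sketch of the same idea. It is an illustration, not the author's class: the download function is passed in as a parameter (`fetch`, an assumption of this sketch) so that the example can run offline with a stand-in, and `default_fetch` and `fetch_all` are hypothetical helper names.

```python
from queue import Queue
from threading import Thread
from urllib.request import urlopen


def default_fetch(url):
    """Download url and return the body as bytes (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


class Fetcher:
    def __init__(self, threads, fetch=default_fetch):
        self.fetch = fetch
        self.q_req = Queue()   # pending URLs
        self.q_ans = Queue()   # finished (url, body) pairs
        for _ in range(threads):
            Thread(target=self.threadget, daemon=True).start()

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            try:
                ans = self.fetch(req)
            except Exception as what:
                ans = b''
                print(what, req)
            self.q_ans.put((req, ans))
            self.q_req.task_done()


def fetch_all(fetcher, urls):
    """Push every URL, then pop exactly as many answers as were pushed."""
    urls = list(urls)
    for url in urls:
        fetcher.push(url)
    return dict(fetcher.pop() for _ in urls)


# Offline demonstration with a stand-in fetch function instead of real downloads.
fake_fetch = lambda url: ('body of ' + url).encode()
pages = fetch_all(Fetcher(threads=4, fetch=fake_fetch), ['http://a/', 'http://b/'])
```

Answers arrive in completion order rather than request order, which is why each answer carries its request with it.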

8. Miscellaneous tips

8.1 Connection pool

Reuse HTTP connections or use a forward proxy such as Squid to avoid being blocked when many parallel requests are made.
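
One way to reuse connections is a small per-host pool of keep-alive connections. The sketch below is written for Python 3 with `http.client`. The class name `ConnectionPool`, its parameters and the host name are assumptions for illustration, and a production pool would need more care (redirects, stale connections, HTTPS).

```python
import http.client
from queue import Empty, Full, Queue


class ConnectionPool:
    """Keep up to `size` idle connections to one host and hand them out again."""

    def __init__(self, host, port=80, size=4, timeout=10):
        self._host, self._port, self._timeout = host, port, timeout
        self._idle = Queue(maxsize=size)

    def acquire(self):
        try:
            return self._idle.get_nowait()          # reuse an idle connection
        except Empty:
            # HTTPConnection connects lazily, on the first request
            return http.client.HTTPConnection(
                self._host, self._port, timeout=self._timeout)

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)
        except Full:
            conn.close()                            # pool is full, drop it

    def get(self, path):
        """GET path over a pooled connection and return the body as bytes."""
        conn = self.acquire()
        try:
            conn.request('GET', path)
            body = conn.getresponse().read()
        except Exception:
            conn.close()                            # do not reuse a broken connection
            raise
        self.release(conn)
        return body


pool = ConnectionPool('example.invalid', size=2)    # hypothetical host name
conn = pool.acquire()
pool.release(conn)
assert pool.acquire() is conn                       # the same object comes back
```

Reading the full response body before releasing the connection matters: `http.client` cannot send the next request on a connection whose previous response is still unread.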

8.2 Thread stack size

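Set a smaller thread stack size to reduce memory consumption when running many threads, e.g. threading.stack_size(32768*2), which is 64 KiB per thread. Call it before the threads are created; it only affects threads started afterwards.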
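
A minimal Python 3 sketch of the call order follows. It assumes 256 KiB per thread, a more conservative value than the one in the text, and the worker function `work` is made up for the example. Platform defaults are often several megabytes per thread, so an explicit size like this lowers the per-thread memory reservation.

```python
import threading

results = []


def work(n):
    results.append(n * n)   # list.append is safe to call from several threads


# Assumed size: 256 KiB. It must be set before the threads are created.
previous = threading.stack_size(32768 * 8)

threads = [threading.Thread(target=work, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

threading.stack_size(previous)   # restore the earlier setting for later threads
```

Very small stacks can crash threads that recurse deeply or call into C extensions with large stack frames, so the value is worth testing against the real workload.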

8.3 Automatic retry

Wrap a request in a function that retries a few times before giving up.

<code>
def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception as what:
        print what, req
        if retries > 0:
            # try again with one fewer retry left
            return self.get(req, retries - 1)
        else:
            print 'GET Failed', req
            return ''
    return data
</code>

8.4 Timeout

Set a global socket timeout: socket.setdefaulttimeout(10).
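
In Python 3 the same call exists, and `urlopen` also accepts a per-request `timeout` argument. The sketch below shows both; the function name `fetch` and its error handling are illustrative assumptions.

```python
import socket
import urllib.error
import urllib.request

socket.setdefaulttimeout(10)   # seconds; applies to sockets created from now on


def fetch(url, timeout=None):
    """Fetch url; without an explicit timeout the global default applies."""
    try:
        if timeout is None:
            response = urllib.request.urlopen(url)
        else:
            response = urllib.request.urlopen(url, timeout=timeout)
        return response.read()
    except (socket.timeout, urllib.error.URLError) as what:
        print('request failed:', what, url)
        return b''
```

The two branches are deliberate: passing `timeout=None` straight through to `urlopen` would request a blocking socket with no timeout, instead of falling back to the global default.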

8.5 Login

Build an opener with cookie support and POST the login form as shown in section 3.2.
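
Put together, the login flow could be sketched as follows in Python 3. The login URL, the form field names and the helper names `make_session` and `build_login_request` are hypothetical. A real site's fields have to be captured from its login form, as described in section 3.2.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Encode the form fields; a Request that carries data is sent as a POST."""
    data = urllib.parse.urlencode(fields).encode('utf-8')
    return urllib.request.Request(login_url, data=data)


opener, jar = make_session()
request = build_login_request(
    'http://example.com/signin',                      # hypothetical login URL
    {'username': 'XXXXX', 'password': 'XXXXX'},       # hypothetical field names
)
# opener.open(request) would send the POST. Cookies set by the server end up in
# `jar` and are attached to every later opener.open(...) call.
```

Reusing the same opener for all later requests is what keeps the session logged in.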

9. Conclusion

The techniques above can be combined into a powerful fetcher that supports multithreading, compression, timeout, automatic retry, stack‑size tuning and login automation. The author’s final private version also adds automatic proxy selection.

References

http://obmem.info/?p=476

http://obmem.info/?p=753

