Master Python Web Scraping: From Basic Requests to Multithreaded Crawlers
This guide walks through Python web-scraping techniques: basic URL fetching, proxy usage, cookie and form handling, browser impersonation, gzip/deflate support, captcha processing, and multithreading with thread pools and Twisted asynchronous I/O. It also collects practical tips on connection pooling, thread stack size, retries, timeouts and login automation, which together form a solid foundation for building robust crawlers.
After more than three months of using Python, the author has written many web‑related scripts such as proxy grabbers, Discuz auto‑login/post, mail fetchers, simple captcha recognizers, etc. This article summarizes common techniques for web crawling.
1. Basic site fetching
<code>
import urllib2

content = urllib2.urlopen('http://XXXX').read()
</code>
2. Using a proxy server
Useful when the IP is blocked or request limits are reached.
<code>
import urllib2

# Route all HTTP requests through the given proxy
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
</code>
3. Situations that require login
3.1 Cookie handling
<code>
import urllib2, cookielib

# Keep cookies between requests so a login session survives
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
</code>
If both a proxy and cookies are needed, pass proxy_support to build_opener as well: opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler).
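The combination can be sketched as follows. This is a minimal sketch written for Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar. The helper name build_proxy_cookie_opener and the proxy address are illustrative assumptions, not part of the original scripts.

```python
import http.cookiejar
import urllib.request


def build_proxy_cookie_opener(proxy_url):
    """Return (opener, cookie_jar): HTTP goes through proxy_url and cookies are kept."""
    cookie_jar = http.cookiejar.CookieJar()
    proxy_support = urllib.request.ProxyHandler({'http': proxy_url})
    cookie_support = urllib.request.HTTPCookieProcessor(cookie_jar)
    opener = urllib.request.build_opener(proxy_support, cookie_support)
    return opener, cookie_jar


# 127.0.0.1:8080 is a placeholder address, not a real proxy.
opener, cookie_jar = build_proxy_cookie_opener('http://127.0.0.1:8080')

# List the installed handlers to confirm both are present.
handler_names = [type(h).__name__ for h in opener.handlers]
```

Returning the cookie jar alongside the opener makes it possible to inspect or save the session cookies after logging in.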
3.2 Form handling
Use a tool such as Firefox + HttpFox to capture the POST data. Example for VeryCD:
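<code>
import urllib
import urllib2

# fk is one of the captured form fields and must be defined before this point;
# the source does not show where its value comes from.
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': '登录'
})
req = urllib2.Request(
    'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata
)
result = urllib2.urlopen(req).read()
</code>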
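To see what the encoded form body looks like, here is a small Python 3 sketch (urllib.urlencode lives in urllib.parse there). The field values and the example.com URL are placeholders chosen for illustration; only the login_submit value is taken from the example above.

```python
import urllib.parse
import urllib.request

# Placeholder credentials; a real form would use the captured field names.
fields = {
    'username': 'alice',
    'password': 'secret',
    'login_submit': '登录',
}

# urlencode percent-encodes non-ASCII text as UTF-8; Request wants bytes.
postdata = urllib.parse.urlencode(fields).encode('ascii')

# Supplying data turns the request into a POST. Nothing is sent here.
req = urllib.request.Request('http://example.com/signin', data=postdata)
```

Printing postdata before sending it is a quick way to compare the script's request against what the browser capture tool recorded.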
3.3 Browser impersonation
Set a realistic User-Agent header.
<code>
# url is the target address; postdata is the encoded form from section 3.2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(url, data=postdata, headers=headers)
</code>
3.4 Anti‑hotlink
Some sites reject requests whose Referer does not point to their own pages. Set the Referer header to a URL on the target site, e.g. for cnbeta:
<code>
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
</code>
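Both headers can be set on the same request. The following Python 3 sketch (urllib.request in place of urllib2) builds such a request without sending it. Requesting the cnbeta articles page itself is an assumption made for illustration.

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
                  'Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
}

# Build the request object only; urlopen(req) would actually send it.
req = urllib.request.Request('http://www.cnbeta.com/articles', headers=headers)

# Request normalizes header names with str.capitalize(), hence 'User-agent'.
user_agent = req.get_header('User-agent')
referer = req.get_header('Referer')
```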
3.5 Ultimate method
If all else fails, control a real browser with Selenium, Pamie, Watir, etc.
4. Multithreaded fetching
Simple thread‑pool example that prints numbers 1‑10.
<code>
from threading import Thread
from Queue import Queue
from time import sleep

# Example code omitted for brevity: creates a queue, starts worker threads, and processes tasks.
</code>
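Since the example body is omitted, here is one way such a pool might look. It is a sketch in Python 3 (the Queue module is named queue there), not the author's original code. The pool size of 2 and the results list used to collect the processed numbers are assumptions.

```python
from queue import Queue
from threading import Lock, Thread

NUM_WORKERS = 2   # assumed pool size
JOBS = 10         # the numbers 1..10

tasks = Queue()
results = []
results_lock = Lock()


def handle(number):
    """The 'work' for one task: print the number and record it."""
    print(number)
    with results_lock:
        results.append(number)


def worker():
    """Take tasks from the queue forever; daemon threads end with the program."""
    while True:
        number = tasks.get()
        try:
            handle(number)
        finally:
            tasks.task_done()   # lets tasks.join() know this item is finished


for _ in range(NUM_WORKERS):
    Thread(target=worker, daemon=True).start()

for number in range(1, JOBS + 1):
    tasks.put(number)

tasks.join()   # block until every queued number has been handled
```

The numbers may print out of order, because whichever worker is free takes the next task.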
5. Captcha handling
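Google-style captchas are usually unsolvable with these techniques.
Weak captchas that use a limited character set are more tractable. The usual approach is to undo the rotation, remove the noise, segment the image into single characters, extract features from each character (for example with PCA), and compare them against a feature library built from known samples.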
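The denoise, segment and match steps can be illustrated on a toy text "image". This is a deliberately simplified sketch: it skips rotation correction, and it swaps PCA features for a plain pixel-by-pixel (Hamming) comparison against fixed templates. The glyph templates, function names and sample image are all made up for illustration.

```python
# Hypothetical feature library: '#' is ink, '.' is background.
TEMPLATES = {
    'L': ('#..', '#..', '#..', '#..', '###'),
    'T': ('###', '.#.', '.#.', '.#.', '.#.'),
    'O': ('###', '#.#', '#.#', '#.#', '###'),
}


def denoise(image):
    """Remove ink pixels that have no ink among their 8 neighbours."""
    height, width = len(image), len(image[0])

    def ink(r, c):
        return 0 <= r < height and 0 <= c < width and image[r][c] == '#'

    cleaned = []
    for r in range(height):
        row = ''
        for c in range(width):
            has_neighbour = any(
                ink(r + dr, c + dc)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            row += '#' if ink(r, c) and has_neighbour else '.'
        cleaned.append(row)
    return cleaned


def segment(image):
    """Split the image into glyphs at columns that contain no ink."""
    width = len(image[0])
    glyphs, start = [], None
    for c in range(width + 1):
        has_ink = c < width and any(row[c] == '#' for row in image)
        if has_ink and start is None:
            start = c
        elif not has_ink and start is not None:
            glyphs.append(tuple(row[start:c] for row in image))
            start = None
    return glyphs


def distance(a, b):
    """Count differing pixels; glyphs of different width never match."""
    if len(a[0]) != len(b[0]):
        return float('inf')
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(image, library):
    """Denoise, segment, then pick the closest template for each glyph."""
    text = ''
    for glyph in segment(denoise(image)):
        text += min(library, key=lambda ch: distance(glyph, library[ch]))
    return text


# Three glyphs plus one isolated noise pixel at the right edge of the third row.
captcha = [
    '###.###.#.....',
    '.#..#.#.#.....',
    '.#..#.#.#....#',
    '.#..#.#.#.....',
    '.#..###.###...',
]
decoded = recognize(captcha, TEMPLATES)
```

Real captchas need an image library to load pixels and a more tolerant matcher, but the pipeline keeps the same three stages.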
6. gzip/deflate support
urllib/urllib2 does not handle compression by default. A custom handler can add the Accept-Encoding header and decode the response.
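<code>
import urllib2
import zlib
from gzip import GzipFile
from StringIO import StringIO


class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip/deflate capabilities to urllib2 requests"""

    def http_request(self, req):
        # Advertise the encodings this handler can decode.
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    def http_response(self, req, resp):
        old_resp = resp
        encoding = resp.headers.get("content-encoding")
        if encoding == "gzip":
            body = GzipFile(fileobj=StringIO(resp.read()), mode="r")
        elif encoding == "deflate":
            body = StringIO(deflate(resp.read()))
        else:
            return resp
        resp = urllib2.addinfourl(body, old_resp.headers, old_resp.url, old_resp.code)
        resp.msg = old_resp.msg
        return resp


def deflate(data):
    # Try a zlib-wrapped stream first, then fall back to a raw deflate stream.
    try:
        return zlib.decompress(data)
    except zlib.error:
        return zlib.decompress(data, -zlib.MAX_WBITS)


encoding_support = ContentEncodingProcessor()
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)
content = opener.open(url).read()  # url is defined elsewhere
</code>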
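The decoding step on its own can be tried without any network access. Below is a Python 3 sketch: decode_body is a hypothetical helper name, and the sample page is made-up data compressed locally to stand in for a server response.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        try:
            # A zlib-wrapped stream (with header and checksum).
            return zlib.decompress(body)
        except zlib.error:
            # A raw deflate stream without the zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body   # not compressed, or an encoding this sketch does not handle


page = b'<html>hello</html>' * 10

# Build a raw deflate stream to simulate a server that omits the zlib header.
raw_deflater = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_deflate = raw_deflater.compress(page) + raw_deflater.flush()
```

Handling both deflate variants matters because servers disagree about which of the two "deflate" means.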
7. More convenient multithreading
7.1 Asynchronous I/O with Twisted
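Use twisted.web.client.getPage with callbacks. Note that getPage is deprecated in newer Twisted releases in favor of twisted.web.client.Agent, so this example targets the older API.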
<code>
from twisted.web.client import getPage
from twisted.internet import reactor

links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]


def parse_page(data, url):
    print len(data), url


def fetch_error(error, url):
    print error.getErrorMessage(), url


# Start all fetches; callbacks fire as each page arrives or fails
for url in links:
    getPage(url, timeout=5) \
        .addCallback(parse_page, url) \
        .addErrback(fetch_error, url)

reactor.callLater(5, reactor.stop)  # stop the reactor after 5 seconds
reactor.run()
</code>
7.2 Simple multithreaded fetcher class
Implementation using urllib2, threading and Queue. Core methods push, pop, threadget are shown.
<code>
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time


class Fetcher:
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.lock = Lock()
        self.q_req = Queue()  # pending requests
        self.q_ans = Queue()  # finished (request, content) pairs
        self.threads = threads
        self.running = 0  # initialized before the workers start
        for i in range(threads):
            t = Thread(target=self.threadget)
            t.setDaemon(True)
            t.start()

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:
                self.running += 1
            try:
                ans = self.opener.open(req).read()
            except Exception as what:
                ans = ''
                print what
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()
            time.sleep(0.1)  # brief pause between requests
</code>
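A typical way to drive such a class is to push every URL and then pop the same number of results. The sketch below is a Python 3 port with one deliberate change: the download function is passed in, so a stand-in (fake_fetch, a hypothetical name) can replace the real opener.open(req).read() call and the example runs offline.

```python
from queue import Queue
from threading import Lock, Thread


class Fetcher:
    def __init__(self, fetch, threads):
        self.fetch = fetch       # callable taking a request, returning content
        self.lock = Lock()
        self.q_req = Queue()     # pending requests
        self.q_ans = Queue()     # finished (request, content) pairs
        self.running = 0
        for _ in range(threads):
            Thread(target=self.threadget, daemon=True).start()

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:
                self.running += 1
            try:
                ans = self.fetch(req)
            except Exception as what:
                ans = ''
                print(what)
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()


def fake_fetch(url):
    """Stand-in for a real download; returns a fake page body."""
    return 'body of ' + url


links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]
fetcher = Fetcher(fake_fetch, threads=4)
for url in links:
    fetcher.push(url)

# Pop exactly one answer per pushed request; order of arrival is not fixed.
pages = dict(fetcher.pop() for _ in links)
```

Because results come back as (request, content) pairs, the caller can match each page to its URL even though the threads finish in arbitrary order.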
8. Miscellaneous tips
8.1 Connection pool
Reuse HTTP connections or use a forward proxy such as Squid to avoid being blocked when many parallel requests are made.
8.2 Thread stack size
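Thread stack size has a large effect on memory consumption when many threads are running. Set it explicitly to a small value instead of relying on the platform default, which is usually much larger, e.g. threading.stack_size(32768*2). The value must be at least 32768 bytes.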
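The setting applies to threads created after the call, so it belongs before the pool is started. A minimal Python 3 sketch follows. The 256 KiB value is an assumption picked to be comfortably safe across platforms, and run_workers is a hypothetical helper.

```python
import threading

STACK_BYTES = 256 * 1024   # assumed value; must be at least 32768


def run_workers(count, stack_bytes):
    """Start `count` threads with a custom stack size and wait for them."""
    done = []
    threading.stack_size(stack_bytes)   # affects threads created from now on
    try:
        workers = [
            threading.Thread(target=done.append, args=(i,))
            for i in range(count)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
    finally:
        threading.stack_size(0)   # 0 restores the platform default
    return sorted(done)


finished = run_workers(8, STACK_BYTES)
```

If a worker recurses deeply or calls into C code with large stack frames, a stack that is too small can crash the process, so the value is worth testing against the real workload.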
8.3 Automatic retry
Wrap a request in a function that retries a few times before giving up.
<code>
def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception as what:
        print what, req
        if retries > 0:
            # Try again with one fewer retry remaining
            return self.get(req, retries - 1)
        else:
            print 'GET Failed', req
            return ''
    return data
</code>
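The same retry logic can be written as a standalone loop and exercised without a network. This Python 3 sketch keeps the behaviour of the method above (one initial attempt plus up to `retries` more, then an empty string). FlakyFetch and get_with_retries are hypothetical names used only for the demonstration.

```python
def get_with_retries(fetch, req, retries=3):
    """Call fetch(req); on failure retry up to `retries` more times."""
    for _ in range(retries + 1):
        try:
            return fetch(req)
        except Exception as what:
            print(what, req)
    print('GET Failed', req)
    return ''


class FlakyFetch:
    """Stand-in download function that fails a fixed number of times first."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, req):
        self.calls += 1
        if self.calls <= self.failures:
            raise IOError('temporary failure')
        return 'page for ' + req


flaky = FlakyFetch(failures=2)
data = get_with_retries(flaky, 'http://example.com/')

always_failing = FlakyFetch(failures=10)
empty = get_with_retries(always_failing, 'http://example.com/')
```

A loop avoids growing the call stack on each retry, which is a small advantage when threads run with a reduced stack size.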
8.4 Timeout
Set a global socket timeout: socket.setdefaulttimeout(10).
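A short Python 3 sketch of the global setting is shown here. The socket module call is the same in Python 2 and 3.

```python
import socket

# None means sockets block indefinitely unless told otherwise.
previous = socket.getdefaulttimeout()

# Every socket created after this call times out after 10 seconds.
socket.setdefaulttimeout(10)
current = socket.getdefaulttimeout()
```

The setting only affects sockets created afterwards, so it should run before any opener is used. For a single request, urlopen also accepts a timeout argument that overrides the global value.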
8.5 Login
Build an opener with cookie support and POST the login form as shown in section 3.2.
9. Conclusion
The techniques above can be combined into a powerful fetcher that supports multithreading, compression, timeout, automatic retry, stack‑size tuning and login automation. The author’s final private version also adds automatic proxy selection.
References
http://obmem.info/?p=476
http://obmem.info/?p=753