Comprehensive Guide to Python Libraries for Web Crawling, Parsing, and Web Development
This article provides an extensive overview of Python libraries and frameworks for web crawling, data extraction, HTML/XML parsing, text processing, asynchronous programming, queue management, cloud execution, and popular web development frameworks such as Django, Flask, Web2py, Tornado, and CherryPy.
Learning Python often starts with web crawling because abundant resources and open‑source projects are available. The process of fetching a URL involves four steps: DNS lookup, sending a request to the server, receiving the response, and parsing the page.
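The four steps above can be sketched with the standard library alone. This is a minimal illustration, not how any of the libraries listed below work internally. The names `TitleParser`, `extract_title`, and `fetch_title` are hypothetical helpers introduced for this example, and the sketch assumes a plain HTTP (not HTTPS) URL and a UTF-8 page.

```python
import http.client
import socket
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Step 4: parse the page and return its title text ('' if none)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_title(url, timeout=10):
    """Hypothetical helper that walks through all four steps for an http:// URL."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80  # assumption: plain HTTP only
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    # Step 1: DNS lookup, resolving the host name to an IP address.
    addr_info = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    ip_address = addr_info[0][4][0]

    # Step 2: send the request to the server at that address.
    conn = http.client.HTTPConnection(ip_address, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": host})
        # Step 3: receive the response.
        response = conn.getresponse()
        status = response.status
        body = response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()

    # Step 4: parse the page.
    return ip_address, status, extract_title(body)
```

Calling `fetch_title("http://example.com/")` requires network access. In practice a library such as requests hides steps 1 to 3 behind a single call, and a parser such as lxml or BeautifulSoup replaces the hand-written `TitleParser`.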
General networking libraries include urllib, requests, urllib3, pycurl, httplib2, RoboBrowser, MechanicalSoup, mechanize, socket, Unirest, hyper, and PySocks.
Crawling frameworks such as grab, Scrapy, pyspider, cola, portia, restkit, and demiurge provide higher‑level crawling capabilities.
HTML/XML parsers cover lxml, cssselect, pyquery, BeautifulSoup, html5lib, feedparser, MarkupSafe, xmltodict, xhtml2pdf, and untangle, while cleaning tools include Bleach and sanitize.
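Text processing tools cover sequence comparison and diffs (difflib), edit distance (Levenshtein), fuzzy string matching (fuzzywuzzy), faster regular expression matching (esmre), and repair of broken Unicode text (ftfy).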
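As a small illustration of this kind of text comparison, the sketch below uses difflib, the only library in this group that ships with Python. The helper names `similarity` and `closest` are hypothetical and introduced only for this example. Levenshtein and fuzzywuzzy offer similar scoring with different algorithms and are not shown here.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none reach cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    frameworks = ["scrapy", "pyspider", "grab"]
    # A misspelled framework name is mapped back to the nearest known one.
    print(closest("scrappy", frameworks))
```

The `cutoff` value of 0.6 is difflib's own default. Raising it makes matching stricter, which is usually the safer choice when deduplicating scraped names.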
Natural language processing libraries include NLTK, Pattern, TextBlob, jieba, SnowNLP, and loso.
Browser automation options are selenium, Ghost.py, Spynner, and Splinter.
Multithreading, multiprocessing, and concurrency are supported by threading, multiprocessing, celery, and concurrent.futures, along with async libraries such as asyncio, Twisted, Tornado, pulsar, diesel, gevent, eventlet, and Tomorrow.
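For I/O-bound crawling, a thread pool is often the simplest of these options to start with. The sketch below uses the standard library's concurrent.futures module. `fake_fetch` is a hypothetical stand-in that sleeps instead of performing a real download, so the example runs without network access, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for a download: waits, then returns (url, length)."""
    time.sleep(delay)  # simulates network latency
    return url, len(url)


def fetch_all(urls, max_workers=4):
    """Run fake_fetch over all URLs in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input iterable.
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    pages = ["http://a.example/", "http://bb.example/", "http://ccc.example/"]
    for url, size in fetch_all(pages):
        print(url, size)
```

Because the workers spend their time waiting, the calls overlap rather than run back to back. CPU-bound work such as heavy parsing is usually better served by multiprocessing or `ProcessPoolExecutor`.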
Queue systems include celery, huey, mrq, RQ, simpleq, and python‑gearman.
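The pattern these queue systems share, producers enqueueing jobs and workers consuming them, can be sketched in-process with the standard library. This is not the API of celery, RQ, or any other library named above, which add brokers, persistence, and retries across machines. `run_jobs` and its parameters are hypothetical names for this example.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_jobs(jobs, handler, num_workers=2):
    """Push jobs onto a queue and let worker threads process them.

    Returns a list of (job, outcome) pairs in completion order. If the
    handler raises, the exception object is stored as the outcome.
    """
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            try:
                if job is _STOP:
                    return
                try:
                    outcome = handler(job)
                except Exception as exc:  # keep the worker alive on failure
                    outcome = exc
                with lock:
                    results.append((job, outcome))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:          # producer side: enqueue the work
        tasks.put(job)
    for _ in threads:         # one stop signal per worker
        tasks.put(_STOP)
    tasks.join()              # wait until every queued item is processed
    for thread in threads:
        thread.join()
    return results
```

A call such as `run_jobs(urls, download)` would hand each URL to whichever worker is free. Results arrive in completion order, so callers that need input order should sort or key them by job.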
Cloud computing services like picloud and dominoup.com enable remote code execution.
Web content extraction tools comprise newspaper, html2text, python‑goose, and lassie.
WebSocket libraries are Crossbar, AutobahnPython, and WebSocket‑for‑Python.
DNS utilities include dnsyo and pycares.
Computer vision libraries feature OpenCV, SimpleCV, and mahotas.
Proxy solutions such as shadowsocks and tproxy help bypass firewalls.
Web development frameworks highlighted are Django, Flask, Web2py, Tornado, and CherryPy, each with brief descriptions of their purpose and characteristics.
The article concludes with advice on framework selection, warning against the myth of a "best" framework and over‑emphasis on performance for low‑traffic sites.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
