Bypass Anti‑Scraping Limits with Free Proxy IPs in Python
This tutorial explains how to obtain free proxy IPs, extract their addresses using Python's requests and BeautifulSoup, and continuously validate them to overcome anti‑scraping restrictions when crawling sites such as Baidu Baike for data mining tasks.
1. Preface
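Web crawlers frequently encounter anti‑scraping mechanisms that monitor request frequency per IP address; once an address is flagged for sending too many requests, it may be blocked, preventing further access. Sending requests through a rotating set of free proxy IPs spreads the traffic across many addresses.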
2. Grab IP Addresses
First, locate a free proxy‑IP website. Inspect its static HTML structure and use requests together with BeautifulSoup to pull the IP and port values.
Each row consists of five <td> cells; the first cell holds the IP address and the second cell holds the port. With all cells collected into one flat list, the slice item[::5] yields the IPs and item[1::5] yields the ports, which together give the usable proxies. The parameter n is the page number, and one usable proxy is taken from each page.
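The slicing step can be sketched as follows. This is a minimal illustration rather than the article's original code: the helper names extract_cells and pair_proxies, the columns parameter, and the sample rows are all assumptions, and the sample IPs come from reserved documentation ranges, not from a real proxy list.

```python
def extract_cells(html):
    """Return the text of every <td> cell in the page, in document order."""
    # Imported here so pair_proxies() below also works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]


def pair_proxies(cells, columns=5):
    """Pair each IP with its port, assuming `columns` cells per table row."""
    ips = cells[::columns]       # first cell of every row
    ports = cells[1::columns]    # second cell of every row
    return [f"{ip}:{port}" for ip, port in zip(ips, ports)]


# Made-up cells for two table rows (hypothetical data, not real proxies).
sample_cells = [
    "192.0.2.10", "8080", "high anonymity", "HTTP", "2 s",
    "198.51.100.7", "3128", "transparent", "HTTPS", "1 s",
]
proxies = pair_proxies(sample_cells)
print(proxies)
```

On a live page, the cells would come from extract_cells(requests.get(page_url).text), where page_url is built from the page number n. If the site's table has a different number of columns, the columns argument needs to change accordingly.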
3. Verify IP Effectiveness
Use Baidu Baike as a target site to test the proxies. The site has strict anti‑scraping measures, so many requests fail quickly. The example demonstrates querying the location of train stations on Baidu Baike using the collected proxies.
1) Crawl all train‑station names from 12306 (without location data).
2) Construct Baidu Baike URLs for each station, parse the page, and extract location information by searching for characters like “省” or “市” within elements of class basicInfo-item.
3) Implement a while True loop: if the current proxy can successfully fetch data, break the loop; otherwise, request a new proxy and retry.
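Step 2 might look roughly like the sketch below. The URL pattern https://baike.baidu.com/item/<name>, the helper names, and the sample texts are assumptions made for illustration; only the basicInfo-item class and the “省”/“市” check come from the description above.

```python
from urllib.parse import quote


def baike_url(station):
    """Build a Baidu Baike URL for a station name (assumed URL pattern)."""
    return "https://baike.baidu.com/item/" + quote(station)


def basic_info_texts(html):
    """Return the text of every element with class basicInfo-item."""
    # Imported here so the pure helpers in this sketch work without bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(class_="basicInfo-item")]


def pick_location(texts, markers=("省", "市")):
    """Return the first text containing a province or city marker, else None."""
    for text in texts:
        if any(marker in text for marker in markers):
            return text.strip()
    return None


# Hypothetical texts, standing in for what basic_info_texts() would return.
sample_texts = ["北京南站", "2008", " 北京市丰台区 "]
location = pick_location(sample_texts)
print(baike_url("北京南站"), location)
```

Returning None when no marker is found lets the caller distinguish a page without location data from a failed request, which matters once proxies start being blocked.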
The core loop iterates over all stations; a try block checks proxy usability, and the except block fetches a new proxy when the current one is blocked.
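A minimal sketch of that retry logic is shown below, under a few assumptions: the function names are hypothetical, the fetch and proxy-supplying functions are passed in as arguments so the loop can be exercised without network access, and the max_attempts cap is an addition that keeps the while True loop from spinning forever when every proxy fails.

```python
def fetch_with_rotation(url, fetch, next_proxy, proxy=None, max_attempts=10):
    """Call fetch(url, proxy) until it succeeds, swapping proxies on failure.

    Returns (result, proxy) so the caller can keep reusing the working proxy.
    """
    attempts = 0
    while True:
        if proxy is None:
            proxy = next_proxy()
        try:
            return fetch(url, proxy), proxy
        except Exception:
            # Any failure is treated as "proxy blocked or dead" in this sketch.
            attempts += 1
            if attempts >= max_attempts:
                raise
            proxy = None  # forces a fresh proxy on the next iteration


def fetch_via_requests(url, proxy):
    """One possible fetch function: a GET request routed through the proxy."""
    import requests  # third-party; imported here so the demo below runs offline

    response = requests.get(
        url,
        proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
        timeout=5,
    )
    response.raise_for_status()
    return response.text


# Offline demo with stand-ins: the first proxy is "blocked", the second works.
pool = iter(["192.0.2.10:8080", "198.51.100.7:3128"])


def fake_fetch(url, proxy):
    if proxy == "192.0.2.10:8080":
        raise ConnectionError("blocked")
    return f"page for {url}"


result, working = fetch_with_rotation(
    "https://example.com/a", fake_fetch, lambda: next(pool)
)
print(result, working)
```

In the station loop, the proxy returned from one call can be passed back in as the proxy argument for the next station, so a new proxy is requested only after the current one stops working.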
4. Conclusion
The article demonstrates how to scrape free proxy IPs, validate their availability in real time, and use them to bypass anti‑scraping defenses when performing Python web‑crawling and data‑mining tasks.
