
Bypass Anti‑Scraping Limits with Free Proxy IPs in Python

This tutorial explains how to obtain free proxy IPs, extract their addresses using Python's requests and BeautifulSoup, and continuously validate them to overcome anti‑scraping restrictions when crawling sites such as Baidu Baike for data mining tasks.

Python Crawling & Data Mining

1. Preface

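Web crawlers frequently run into anti‑scraping mechanisms that track request frequency per IP address; once an address exceeds the allowed rate, it may be blocked, preventing further access. Routing requests through proxy IPs, and switching proxies when one is blocked, is one way to keep a crawl running.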

2. Grab IP Addresses

First, locate a website that publishes free proxy IPs. Inspect its static HTML structure, then use requests together with BeautifulSoup to pull out the IP and port values.

Each table row consists of five <td> cells: the first holds the IP address and the second holds the port. If item is the flat list of every <td> on the page, the slices item[::5] and item[1::5] yield the IPs and the ports respectively, and pairing them gives the candidate proxies. The parameter n is the page number; one usable proxy is taken from each page.
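The slicing described above can be sketched as follows. This is a minimal illustration, not the original author's code: the function name parse_proxies and the sample HTML are hypothetical, and the last three cells of each row are placeholders because the article does not say what they contain.

```python
from bs4 import BeautifulSoup


def parse_proxies(html):
    """Return 'ip:port' strings from a table whose rows hold five <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    # Flatten every <td> on the page into one list of cell texts.
    item = [td.get_text(strip=True) for td in soup.find_all("td")]
    ips = item[::5]     # first cell of each row
    ports = item[1::5]  # second cell of each row
    return [f"{ip}:{port}" for ip, port in zip(ips, ports)]


# Hypothetical sample page; a real proxy site will look different.
SAMPLE_HTML = """
<table>
  <tr><td>10.0.0.1</td><td>8080</td><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>10.0.0.2</td><td>3128</td><td>a</td><td>b</td><td>c</td></tr>
</table>
"""

print(parse_proxies(SAMPLE_HTML))  # ['10.0.0.1:8080', '10.0.0.2:3128']
```

In a real crawl the html argument would presumably come from requests.get(...).text for page n of the proxy site. The article does not name the site, so no URL is shown here.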

3. Verify IP Effectiveness

Use Baidu Baike as a target site to test the proxies. The site has strict anti‑scraping measures, so many requests fail quickly. The example demonstrates querying the location of train stations on Baidu Baike using the collected proxies.
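One way to test a single proxy with requests is sketched below. The helper names build_proxies and proxy_works, the five-second timeout, and the "HTTP 200 means usable" rule are all assumptions made for illustration; the article does not specify how success is judged.

```python
import requests


def build_proxies(address):
    """Map an 'ip:port' string to the proxies dict that requests expects."""
    return {"http": f"http://{address}", "https": f"http://{address}"}


def proxy_works(address, test_url, timeout=5):
    """Return True if test_url answers with HTTP 200 through the proxy."""
    try:
        response = requests.get(
            test_url, proxies=build_proxies(address), timeout=timeout
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Covers refused connections, timeouts, and proxy errors.
        return False
```

A call such as proxy_works("10.0.0.1:8080", target_url) would then tell the crawler whether to keep that proxy or ask for another one. Note that a blocked request may still return status 200 with an error page, so a stricter check on the page content may be needed in practice.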

1) Crawl all train‑station names from 12306 (the list contains names only, with no location data).

2) Construct Baidu Baike URLs for each station, parse the page, and extract location information by searching for characters like “省” or “市” within elements of class basicInfo-item.

3) Implement a while True loop: if the current proxy can successfully fetch data, break the loop; otherwise, request a new proxy and retry.
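Step 2 above can be sketched as a small parsing function. This is a hedged example: extract_location is a hypothetical name, the sample markup only imitates elements of class basicInfo-item, and the real Baidu Baike page structure may differ or may have changed.

```python
from bs4 import BeautifulSoup


def extract_location(html):
    """Return the first basicInfo-item text that mentions a province or city."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(class_="basicInfo-item"):
        text = element.get_text(strip=True)
        # "省" means province and "市" means city.
        if "省" in text or "市" in text:
            return text
    return None  # no location-like field found


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<dl>
  <dt class="basicInfo-item name">中文名</dt>
  <dd class="basicInfo-item value">示例站</dd>
  <dt class="basicInfo-item name">所在地</dt>
  <dd class="basicInfo-item value">示例省示例市</dd>
</dl>
"""

print(extract_location(SAMPLE_HTML))  # 示例省示例市
```

Returning None when nothing matches lets the caller distinguish "page fetched but no location listed" from a failed request.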

The core loop iterates over all stations; a try block checks proxy usability, and the except block fetches a new proxy when the current one is blocked.
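The shape of that loop can be sketched without any network access by passing the fetching and proxy-supplying functions in as arguments. Everything named here (fetch_all, demo_fetch, demo_get_new_proxy) is hypothetical, and the demo functions merely simulate blocked and working proxies.

```python
def fetch_all(stations, fetch, get_new_proxy):
    """Fetch every station, switching to a new proxy whenever a request fails."""
    results = {}
    proxy = get_new_proxy()
    for station in stations:
        while True:
            try:
                results[station] = fetch(station, proxy)
                break  # the current proxy worked; move on to the next station
            except Exception:
                # Assume any failure means the proxy is blocked or dead.
                proxy = get_new_proxy()
    return results


# Simulated proxy pool: the first two proxies are "blocked".
_pool = iter(["bad-1", "bad-2", "good-1"])


def demo_get_new_proxy():
    return next(_pool)


def demo_fetch(station, proxy):
    if proxy.startswith("bad"):
        raise ConnectionError("blocked")
    return f"{station} via {proxy}"


results = fetch_all(["A", "B"], demo_fetch, demo_get_new_proxy)
print(results)  # {'A': 'A via good-1', 'B': 'B via good-1'}
```

A working proxy is reused for the following stations until it fails. Because the loop is while True, it never gives up on a station; a real crawler would probably cap the number of retries so that an exhausted proxy supply does not hang the run.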

4. Conclusion

The article demonstrates how to scrape free proxy IPs, validate their availability in real time, and use them to bypass anti‑scraping defenses when performing Python web‑crawling and data‑mining tasks.
