How to Scrape and Extract Proxy Data with Python: Step-by-Step Guide
This tutorial walks through analyzing a proxy‑listing website’s structure, building a Python scraper using requests, Scrapy, regular expressions and BeautifulSoup, extracting IP, port, location and type fields across multiple pages, and saving the collected data to files, illustrating key web‑crawling techniques.
1. Introduction
Following up on a previous article about crawling proxy data with Python, this guide focuses on analyzing the web page structure and extracting the required information.
2. Home Page Analysis and Extraction
The site paginates its listings: the number at the end of the URL is the page number. Each page lists over 100 entries and the site has more than 2,700 pages, for a total of over 270,000 proxy records. To keep the dataset recent, only the first 100 pages are crawled.
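The pagination scheme can be sketched in a few lines of Python. This is a minimal illustration, not the original author's code: the names BASE_URL, page_urls and fetch_page are hypothetical, and the User-Agent header and timeout are assumed values. The fetch helper expects a session object from the requests library (requests.Session).

```python
from typing import Iterator

# Placeholder URL pattern taken from the article; the real site differs.
BASE_URL = "http://example.com/page/{page}"
MAX_PAGES = 100  # only the first 100 pages are targeted


def page_urls(first: int = 1, last: int = MAX_PAGES) -> Iterator[str]:
    """Yield the listing-page URLs for pages first..last (inclusive)."""
    for page in range(first, last + 1):
        yield BASE_URL.format(page=page)


def fetch_page(session, url: str, timeout: float = 10.0) -> str:
    """Download one page and return its HTML.

    `session` is expected to behave like requests.Session. The header
    and timeout values are assumptions, not values from the article.
    """
    response = session.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

With requests installed, calling fetch_page(requests.Session(), url) for each url in page_urls() would download the 100 pages in order.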
URL pattern for the first 100 pages:
http://example.com/page/1
http://example.com/page/2
...
http://example.com/page/100
3. Web Element Analysis and Extraction
The proxy information is stored inside a <table id="ip_list"> element. The required fields are IP address, port, server location, and type. A ProxyBean class is defined to hold these attributes.
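The article names a ProxyBean class but does not show it. One possible shape, assuming the four fields listed above, is a small dataclass. The attribute names and the address helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProxyBean:
    """One proxy record; attribute names are assumptions."""
    ip: str
    port: int
    location: str
    proxy_type: str  # for example "HTTP" or "HTTPS"

    def address(self) -> str:
        """Return the proxy in host:port form."""
        return f"{self.ip}:{self.port}"
```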
Extraction is performed using regular expressions combined with BeautifulSoup. First, the entire table is captured:
<table id="ip_list">([\S\s]*)</table>
Then each row (<tr>) is processed. Rows with class="odd" are distinguished from others.
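A rough sketch of this two-step capture using only the re module follows. Two things here are assumptions rather than the article's code: the table pattern uses the non-greedy form ([\S\s]*?) so that matching stops at the first closing </table> tag, and the sample HTML is invented for illustration.

```python
import re

# Non-greedy variant of the table pattern (an assumption; see lead-in).
TABLE_RE = re.compile(r'<table id="ip_list">([\S\s]*?)</table>')
# Captures the attributes and the inner HTML of each <tr>.
ROW_RE = re.compile(r'<tr([^>]*)>([\S\s]*?)</tr>')

# Made-up sample markup; the real page layout may differ.
SAMPLE_HTML = """
<table id="ip_list">
<tr><th>IP</th><th>Port</th></tr>
<tr class="odd"><td>10.0.0.1</td><td>8080</td></tr>
<tr class=""><td>10.0.0.2</td><td>3128</td></tr>
</table>
"""


def extract_rows(html: str) -> list:
    """Return (is_odd, inner_html) tuples for each row of the table."""
    match = TABLE_RE.search(html)
    if match is None:
        return []
    rows = []
    for attrs, body in ROW_RE.findall(match.group(1)):
        rows.append(('class="odd"' in attrs, body))
    return rows


rows = extract_rows(SAMPLE_HTML)
```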
Field‑specific regular expressions:
IP address: (25[0-5]|2[0-4]\d|[0-1]?\d{1,2})(\.(25[0-5]|2[0-4]\d|[0-1]?\d{1,2})){3}
Port: <td>([0-9]+)</td>
Location: <a href="([^>]+)">([^<]+)</a>
Type: <td>([A-Za-z]+)</td>
BeautifulSoup parses the table rows and extracts the text for each column, populating a ProxyBean instance for every proxy entry.
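The row-parsing step might look roughly like the following. This is a sketch under several assumptions: the column order (IP, port, location, type) and the sample HTML are invented, the ProxyBean attribute names are hypothetical, and the regular expressions are used only to validate cell text after BeautifulSoup has extracted it. The IP pattern used here accepts 0 to 255 in each octet.

```python
import re
from dataclasses import dataclass

from bs4 import BeautifulSoup

OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d{1,2})"
IP_RE = re.compile(OCTET + r"(\." + OCTET + r"){3}")
PORT_RE = re.compile(r"[0-9]+")
TYPE_RE = re.compile(r"[A-Za-z]+")


@dataclass
class ProxyBean:
    ip: str
    port: int
    location: str
    proxy_type: str


# Made-up sample markup; the column order is an assumption.
SAMPLE_HTML = """
<table id="ip_list">
  <tr><th>IP</th><th>Port</th><th>Location</th><th>Type</th></tr>
  <tr class="odd"><td>10.0.0.1</td><td>8080</td><td><a href="/region/a">Beijing</a></td><td>HTTP</td></tr>
  <tr class=""><td>10.0.0.2</td><td>3128</td><td><a href="/region/b">Shanghai</a></td><td>HTTPS</td></tr>
  <tr class="odd"><td>999.0.0.1</td><td>80</td><td><a href="/region/c">Nowhere</a></td><td>HTTP</td></tr>
</table>
"""


def parse_proxies(html: str) -> list:
    """Build a ProxyBean for every valid data row in the ip_list table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="ip_list")
    if table is None:
        return []
    proxies = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 4:
            continue  # header rows use <th>, so they have no <td> cells
        ip, port, location, proxy_type = cells[:4]
        if not (IP_RE.fullmatch(ip) and PORT_RE.fullmatch(port)
                and TYPE_RE.fullmatch(proxy_type)):
            continue  # skip rows whose fields fail validation
        proxies.append(ProxyBean(ip, int(port), location, proxy_type))
    return proxies


proxies = parse_proxies(SAMPLE_HTML)
```

Letting BeautifulSoup locate the cells and keeping the regular expressions for validation avoids depending on the exact whitespace and attribute order of the raw HTML.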
After extraction, the data is written to a file for later use.
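The article does not say which file format is used. As one possibility, the records could be written as CSV with the standard library. The save_proxies function, the ProxyBean attribute names, and the choice of CSV are all assumptions made for this sketch.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class ProxyBean:
    ip: str
    port: int
    location: str
    proxy_type: str


def save_proxies(proxies: list, path: str) -> int:
    """Write the records to a CSV file with a header row; return the count."""
    column_names = [field.name for field in fields(ProxyBean)]
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_names)
        writer.writeheader()
        for proxy in proxies:
            writer.writerow(asdict(proxy))
    return len(proxies)
```

A call such as save_proxies(proxies, "proxies.csv") would produce one line per proxy under the header ip,port,location,proxy_type.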
4. Summary
The project demonstrates how to:
Use the requests library to fetch web pages.
Work around anti‑scraping measures with techniques such as proxy pools.
Write regular expressions for precise element extraction.
Leverage BeautifulSoup to parse HTML tables and retrieve structured data.
Overall, the tutorial provides a practical example of building a Python web crawler for proxy data.
Python Crawling & Data Mining
