
How to Scrape and Extract Proxy Data with Python: Step-by-Step Guide

This tutorial analyzes the structure of a proxy-listing website and builds a Python scraper with requests, Scrapy, regular expressions and BeautifulSoup. It extracts the IP, port, location and type fields across multiple pages and saves the collected data to a file, illustrating key web-crawling techniques.

Python Crawling & Data Mining

Apr 18, 2020


1. Introduction

Following up on a previous article about crawling proxy data with Python, this guide focuses on analyzing the web page structure and extracting the required information.

2. Home Page Analysis and Extraction

The homepage is paginated: the number at the end of the URL is the page number. Each page lists over 100 entries and the site has more than 2,700 pages, for a total of over 270,000 proxy records. To keep the dataset recent, only the first 100 pages are targeted.

URL pattern for the first 100 pages:

http://example.com/page/1
http://example.com/page/2
... 
http://example.com/page/100
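
The URL list above can be generated rather than typed out. The following is a minimal sketch, assuming the placeholder host shown in the pattern; the names BASE_URL, page_urls and fetch are illustrative and do not come from the original project, and the User-Agent header is an assumption about what the target site accepts.

```python
BASE_URL = "http://example.com/page/{page}"  # placeholder pattern from the list above


def page_urls(first=1, last=100):
    """Return the listing URLs for pages first..last, inclusive."""
    return [BASE_URL.format(page=n) for n in range(first, last + 1)]


def fetch(url, timeout=10):
    """Download one page and return its HTML text (needs the requests package)."""
    import requests  # imported lazily so page_urls() works without it installed

    # Assumed header: some sites refuse requests that use the default User-Agent.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling page_urls() with its defaults yields the 100 addresses the article targets; a crawl loop would pass each one to fetch(), ideally with a short pause between requests.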

3. Web Element Analysis and Extraction

The proxy information is stored inside a <table id="ip_list"> element. The required fields are IP address, port, server location, and type. A ProxyBean class is defined to hold these attributes.
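
The article does not show the ProxyBean class itself, so the following is only a guess at its shape: a small dataclass with one attribute per required field. The attribute names and the address() helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProxyBean:
    """One proxy entry; attribute names are assumed, not taken from the source."""

    ip: str
    port: int
    location: str
    proxy_type: str  # for example "HTTP" or "HTTPS"

    def address(self):
        """Return the proxy in host:port form."""
        return f"{self.ip}:{self.port}"
```

A dataclass keeps the container short and gives it a readable repr and equality comparison, which helps when removing duplicate entries later.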

Extraction is performed using regular expressions combined with BeautifulSoup. First, the entire table is captured:

<table id="ip_list">([\S\s]*)</table>
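
Applied with Python's re module, the pattern above might be used as follows. This is a sketch: TABLE_RE and extract_table are illustrative names, and the sample HTML in the usage note is made up.

```python
import re

# The table pattern from the article; [\S\s] matches any character, newlines included.
TABLE_RE = re.compile(r'<table id="ip_list">([\S\s]*)</table>')


def extract_table(html):
    """Return the inner HTML of the ip_list table, or None if it is absent."""
    match = TABLE_RE.search(html)
    return match.group(1) if match else None
```

For a page such as `<table id="ip_list"><tr><td>1</td></tr></table>`, extract_table returns the row markup between the table tags. Because `*` is greedy, the match runs to the last `</table>` on the page; if a page contains several tables, the non-greedy form `[\S\s]*?` stops at the first closing tag instead.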

Then each row (<tr>) is processed. Rows with class="odd" are distinguished from the others.
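
One way to walk the rows with the standard library alone is sketched below. ROW_RE and split_rows are illustrative names, not from the original project; the sketch only records whether a row carries class="odd" and leaves the cell parsing to a later step.

```python
import re

# Non-greedy body so that each match stops at its own closing </tr>.
ROW_RE = re.compile(r'<tr\b(?P<attrs>[^>]*)>(?P<body>[\S\s]*?)</tr>')


def split_rows(table_html):
    """Return a list of (is_odd, row_html) pairs, one per <tr> in the table."""
    rows = []
    for match in ROW_RE.finditer(table_html):
        is_odd = 'class="odd"' in match.group("attrs")
        rows.append((is_odd, match.group("body")))
    return rows
```

Keeping the odd flag next to the row body lets later code treat the two row styles differently if their markup turns out to differ.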

Field‑specific regular expressions:

IP address: (25[0-5]|2[0-4]\d|[01]?\d{1,2})(\.(25[0-5]|2[0-4]\d|[01]?\d{1,2})){3}
Port: <td>([0-9]+)</td>
Location: <a href="([^>]+)">([^<]+)</a>
Type: <td>([A-Za-z]+)</td>

BeautifulSoup parses the table rows and extracts the text for each column, populating a ProxyBean instance for every proxy entry.
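
The sketch below shows how BeautifulSoup and the field patterns could work together. It makes several assumptions: the column order is unknown, so each cell is classified by its content rather than its position; the IP pattern is written so that every octet covers 0 to 255; a plain dict stands in for ProxyBean to keep the block self-contained; and the sample row in the usage note is invented. It requires the beautifulsoup4 package.

```python
import re

from bs4 import BeautifulSoup

# Each octet is 250-255, 200-249, or 0-199 (with an optional leading 0 or 1).
OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d{1,2})"
IP_RE = re.compile(OCTET + r"(\." + OCTET + r"){3}")
PORT_RE = re.compile(r"[0-9]+")
TYPE_RE = re.compile(r"[A-Za-z]+")


def parse_rows(table_html):
    """Return one dict per proxy row with ip, port, location and type keys."""
    soup = BeautifulSoup(table_html, "html.parser")
    proxies = []
    for row in soup.find_all("tr"):
        record = {}
        for cell in row.find_all("td"):
            text = cell.get_text(strip=True)
            if cell.find("a") is not None:
                record.setdefault("location", text)  # location is wrapped in a link
            elif IP_RE.fullmatch(text):
                record.setdefault("ip", text)
            elif PORT_RE.fullmatch(text):
                record.setdefault("port", int(text))
            elif TYPE_RE.fullmatch(text):
                record.setdefault("type", text)
        # Header rows and malformed rows have no IP or port and are skipped.
        if "ip" in record and "port" in record:
            proxies.append(record)
    return proxies
```

Given a row with the cells `10.0.0.1`, `8080`, a linked location and `HTTP`, parse_rows returns a single record holding those four values. setdefault keeps the first match for each field, so a second numeric cell in the same row does not overwrite the port.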

After extraction, the data is written to a file for later use.
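
The article does not name the output format, so the sketch below assumes CSV as one reasonable choice. FIELDS and save_proxies are illustrative names, and the records are assumed to be a list of dicts keyed by the four field names.

```python
import csv

FIELDS = ["ip", "port", "location", "type"]  # assumed column order


def save_proxies(records, path):
    """Write a list of proxy dicts to a CSV file and return the row count."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

A header row makes the file easy to load again with csv.DictReader when the proxies are needed later.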

4. Summary

The project demonstrates how to:

Use the requests library to fetch web pages.

Work around anti-scraping measures with techniques such as proxy pools.

Write regular expressions for precise element extraction.

Leverage BeautifulSoup to parse HTML tables and retrieve structured data.

Overall, the tutorial provides a practical example of building a Python web crawler for proxy data.

