Master Python Web Scraping: Bypass Anti‑Scraping and Export Data to Excel

This guide walks you through building a Python web‑scraping project that overcomes common anti‑scraping defenses, extracts QQ numbers from a target site, and saves the results into structured Excel files, while highlighting required libraries, request headers, and AJAX handling techniques.

Python Crawling & Data Mining

1. Introduction

The CPA Home app promotion platform hosts millions of data entries, and scraping it yields a dataset suitable for analysis.

2. Project Goal

Fetch QQ numbers, import them into an Excel template, and generate separate Excel documents.

3. Anti‑Scraping Measures

Testing revealed two main obstacles: the site returns no data when requests lack proper headers, and the IP is blocked after about 40 consecutive requests.

Solutions:

Use realistic HTTP request headers.

Generate random User‑Agent strings with fake_useragent.
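As a minimal sketch of the two solutions above, the helper below assembles browser-like headers with a random User-Agent. The names random_user_agent and build_headers, the fallback User-Agent list, and the Accept and Accept-Language values are illustrative assumptions rather than part of the original project. Which headers the site actually checks has to be confirmed in the browser's network panel.

```python
import random

# Hypothetical fallback list, used only when fake_useragent cannot be used.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_user_agent():
    """Return a random User-Agent string, preferring fake_useragent."""
    try:
        from fake_useragent import UserAgent  # third-party package
        return UserAgent().random
    except Exception:
        # Package missing or its data unavailable: use the local list.
        return random.choice(FALLBACK_USER_AGENTS)


def build_headers(referer=None):
    """Build a header dict that resembles a normal browser request."""
    headers = {
        "User-Agent": random_user_agent(),
        # Assumed values; copy the real ones from the F12 network panel.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    }
    if referer:
        headers["Referer"] = referer
    return headers
```

Calling build_headers() before each request gives every request a fresh User-Agent, instead of one value fixed when the script starts.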

4. Required Libraries and URL

Target URL:

https://www.cpajia.com/index.php?g=Wap&a=searchua

Required packages: requests, lxml, and fake_useragent (third-party), plus time from the standard library.

5. Implementation

Import the libraries, set the URL and request headers, and define the get_page function.

import os
import re
import requests
from fake_useragent import UserAgent
from lxml import etree

house_dict = {}  # fields of the record currently being parsed

def get_page(url, page_num):
    pass  # the body is built up in the following steps

url = 'https://www.cpajia.com/index.php?g=Wap&a=search'
# verify_ssl is accepted by older fake_useragent releases; if yours rejects it, call UserAgent()
ua = UserAgent(verify_ssl=False)
kv = {'User-Agent': ua.random}  # request headers with a random User-Agent
pageList = get_page(url, 100)  # (url, number_of_pages)

Pages are loaded through AJAX. Open the browser developer tools (F12), inspect the network requests, and reproduce their POST parameters in the script, especially the PageIndex field.

formdata = {'PageIndex': 1}  # page to request; add any other fields shown in the F12 request
response = requests.post(url=url, data=formdata, headers=kv)
html = response.content.decode('utf-8')
parse_html = etree.HTML(html)
# each list-main block holds one listing
page = parse_html.xpath('//div[@class="wrap"]//div[@class="list-main"]')
for li in page:
    # '项目' means "project"
    house_dict['项目'] = li.xpath('.//div[@class="main-top"]//b/text()')[0].strip()
    house_dict['QQ'] = li.xpath('.//div[@class="main-com"]//span//a/text()')[0].strip()
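The POST step can be wrapped in a small helper that takes the page index as a parameter. This is only a sketch: build_formdata and fetch_page are hypothetical names, PageIndex is the only form field the article names, and the 10-second timeout is an assumption. Any other fields that the F12 network panel shows must be added by hand.

```python
def build_formdata(page_index):
    """Form fields for one AJAX page request."""
    # Only PageIndex is known from the article; copy any other
    # fields from the request seen in the browser's network panel.
    return {"PageIndex": page_index}


def fetch_page(url, headers, page_index, post=None):
    """POST the form for one page and return the decoded HTML."""
    if post is None:
        # Imported lazily so the helper can be exercised without network access.
        import requests  # third-party package
        post = requests.post
    response = post(
        url=url,
        data=build_formdata(page_index),
        headers=headers,
        timeout=10,  # assumed value; avoids hanging on a stalled connection
    )
    response.raise_for_status()  # surface HTTP errors such as 403 or 429
    return response.content.decode("utf-8")
```

The returned HTML string can be passed to etree.HTML exactly as in the code above. The post parameter exists only so a stand-in function can replace requests.post during testing.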

Append each record to a CSV file that can later be opened in Excel. Note that the non-ASCII text may appear garbled in Excel until the file is saved as .xlsx.

# append the current record as one line of text ('QQ号' means "QQ number")
with open('QQ号.csv', 'a', encoding='utf-8') as f:
    f.write(str(house_dict))
    print(house_dict)
    f.write("\n")
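Writing str(house_dict) stores a Python dict literal on each line, which Excel does not split into clean columns. One possible alternative, assuming each record is a dict with the '项目' and 'QQ' keys used above, is the standard csv module. Writing a UTF-8 byte-order mark (the utf-8-sig codec) is a common way to help Excel detect the encoding, but whether it removes the garbling on a given Excel version is an assumption. append_records is a hypothetical helper name.

```python
import csv
import os


def append_records(path, records, fieldnames=("项目", "QQ")):
    """Append dict records to a CSV file, writing the header row once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Write the BOM only at the start of a new file, never in the middle.
    encoding = "utf-8-sig" if is_new else "utf-8"
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        if is_new:
            writer.writeheader()
        writer.writerows(records)
```

To use it, collect a fresh dict per listing inside the parsing loop and pass the list for each page, for example append_records('QQ号.csv', rows).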

Invoke the scraper with the desired number of pages:

pageList = get_page(url, 100)  # (url, number_of_pages)
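Since the site was observed to block an IP after about 40 consecutive requests, a page loop may benefit from pausing between requests. The sketch below shows one way to structure such a loop. crawl_pages and fetch_one are hypothetical names, the 1 to 3 second delay range and the assumption that PageIndex starts at 1 are not from the original article, and nothing here guarantees that pausing avoids the block.

```python
import random
import time


def crawl_pages(fetch_one, page_num, delay_range=(1.0, 3.0), sleep=time.sleep):
    """Call fetch_one(page_index) for each page, pausing between calls."""
    results = []
    for page_index in range(1, page_num + 1):  # assumes pages start at 1
        results.append(fetch_one(page_index))
        if page_index < page_num:
            # Random pause so requests are not evenly spaced.
            sleep(random.uniform(*delay_range))
    return results
```

Here fetch_one stands for whatever function requests and parses a single page. The sleep parameter is injectable only so the loop can be tested without real waiting.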

6. Result Demonstration

Running the script prints progress in the console; the generated CSV can be opened in Excel to view the collected QQ numbers.

[Figure: Excel view of the saved data]

7. Summary

Learned to use requests and write a basic web scraper.

Applied anti‑scraping techniques such as custom headers and random User‑Agents.

Exported scraped data to Excel for further analysis.

Avoid excessive crawling to prevent server overload.

Written by

Python Crawling & Data Mining
