Master Python Web Scraping: Bypass Anti‑Scraping and Export Data to Excel
This guide walks you through building a Python web‑scraping project that overcomes common anti‑scraping defenses, extracts QQ numbers from a target site, and saves the results into structured Excel files, while highlighting required libraries, request headers, and AJAX handling techniques.
1. Introduction
CPA Home (cpajia.com) is an app promotion platform that hosts millions of data entries; scraping the site makes that data available for analysis.
2. Project Goal
Fetch QQ numbers, import them into an Excel template, and generate separate Excel documents.
3. Anti‑Scraping Measures
Testing revealed two main obstacles: the site returns no data when requests lack proper headers, and the IP is blocked after about 40 consecutive requests.
Solutions:
Use realistic HTTP request headers.
Generate random User‑Agent strings with fake_useragent.
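The two measures above could be combined roughly as follows. This is a minimal sketch, not the article's exact code: `random_user_agent`, `build_headers`, `polite_pause`, and `FALLBACK_AGENTS` are hypothetical names, the fallback strings and extra header values are illustrative, and the random pause between requests is an added assumption aimed at the roughly 40-request limit noted above.

```python
import random
import time

# Illustrative fallback pool, used only if fake_useragent is unavailable.
FALLBACK_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_user_agent():
    """Return a random User-Agent string."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Library missing or its browser data could not be loaded.
        return random.choice(FALLBACK_AGENTS)


def build_headers(referer):
    """Build browser-like headers with a fresh User-Agent on every call."""
    return {
        "User-Agent": random_user_agent(),
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Referer": referer,
    }


def polite_pause(low=1.0, high=3.0, sleep=time.sleep):
    """Sleep for a random interval so requests are not sent back to back."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Calling `build_headers(...)` before each request and `polite_pause()` after it spreads the traffic out; whether that is enough to stay under the site's limit would need to be checked by testing.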
4. Required Libraries and URL
Target URL:
https://www.cpajia.com/index.php?g=Wap&a=search
Necessary Python packages: requests, time, lxml, fake_useragent.
5. Implementation
Import the libraries, set up the URL and request headers, and define get_page.
import requests
import os
import re
from fake_useragent import UserAgent
from lxml import etree

house_dict = {}  # holds the fields of the record currently being parsed

def get_page(url, page_num):
    pass  # placeholder; the request, parsing, and saving steps go here

url = 'https://www.cpajia.com/index.php?g=Wap&a=search'
# The original passed verify_ssl=False, which only older fake_useragent releases accept.
ua = UserAgent()
kv = {'User-Agent': ua.random}  # request headers with a random User-Agent
Handle AJAX-loaded pages by inspecting the network requests in the browser's developer tools (F12) and reproducing the POST parameters, especially the PageIndex field.
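One way the paging could be driven is sketched below. It assumes, without confirmation from the site, that `PageIndex` is a 1-based page counter; any other form fields copied from the F12 network panel would go in `extra_fields`. `build_formdata` and `fetch_pages` are hypothetical helper names, and `post` stands for any callable with the same keyword arguments as `requests.post`.

```python
def build_formdata(page_index, extra_fields=None):
    """Return the POST payload for one page."""
    formdata = dict(extra_fields or {})
    formdata["PageIndex"] = page_index  # assumed to be a 1-based page counter
    return formdata


def fetch_pages(post, url, headers, num_pages, extra_fields=None):
    """Yield (page_index, html) for pages 1..num_pages."""
    for page_index in range(1, num_pages + 1):
        response = post(
            url=url,
            data=build_formdata(page_index, extra_fields),
            headers=headers,
        )
        response.raise_for_status()  # stop early on HTTP errors
        yield page_index, response.content.decode("utf-8")
```

With the real library this would be called as `fetch_pages(requests.post, url, kv, 100)`. The article's own request and parsing fragment follows.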
# formdata is the POST payload copied from the network panel; it must include PageIndex
response = requests.post(url=url, data=formdata, headers=kv)
html = response.content.decode('utf-8')
parse_html = etree.HTML(html)
# each list-main block holds one entry
page = parse_html.xpath('//div[@class="wrap"]//div[@class="list-main"]')
for li in page:
    house_dict['项目'] = li.xpath('.//div[@class="main-top"]//b/text()')[0].strip()  # 项目 = project name
    house_dict['QQ'] = li.xpath('.//div[@class="main-com"]//span//a/text()')[0].strip()
Write the results to a CSV file, which can later be opened in Excel. Note that the text may appear garbled in Excel until the file is saved as .xlsx.
# 'QQ号.csv' means 'QQ numbers.csv'; records are appended one per line
with open('QQ号.csv', 'a', encoding='utf-8') as f:
    f.write(str(house_dict))
    print(house_dict)
    f.write("\n")
Invoke the scraper with the desired number of pages:
pageList = get_page(url, 100)  # (url, number_of_pages)
6. Result Demonstration
Running the script prints progress in the console; the generated CSV can be opened in Excel to view the collected QQ numbers.
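If the CSV shows garbled Chinese text in Excel, a common cause is that Excel guesses the wrong encoding for a UTF-8 file without a byte-order mark. A possible variant of the export step, sketched here with the standard `csv` module, writes one row per record and starts a new file with the `utf-8-sig` encoding so Excel can detect UTF-8. The function name `save_rows` is hypothetical; the column names `项目` and `QQ` are the keys used in `house_dict`.

```python
import csv
import os


def save_rows(rows, path, fieldnames=("项目", "QQ")):
    """Append dict rows to a CSV file that Excel can open directly."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # utf-8-sig writes a BOM when the file is created; plain utf-8 is used
    # for later appends so no second BOM lands in the middle of the file.
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(path, "a", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        if is_new:
            writer.writeheader()  # header row only once, at the top
        writer.writerows(rows)
```

Unlike writing `str(house_dict)`, this produces real comma-separated columns, so each field lands in its own Excel cell.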
[Image: Excel view of the saved data]
7. Summary
Learned to use requests and write a basic web scraper.
Applied anti‑scraping techniques such as custom headers and random User‑Agents.
Exported scraped data to Excel for further analysis.
Avoid excessive crawling to prevent server overload.
Python Crawling & Data Mining
