Bypass Ant Short‑Term Rental Site Anti‑Scraping with Python Cookies

This tutorial explains how to analyze the Ant Short‑Term Rental website's anti‑scraping mechanisms, extract the required Cookie and User‑Agent headers, and use Python's urllib2 and BeautifulSoup to reliably crawl rental listings, save the data to CSV, and optionally extend the scraper with Selenium.

MaGe Linux Operations

Mar 1, 2021

When crawling the Ant Short‑Term Rental (蚂蚁短租) site, the server often blocks requests that appear to be automated, showing a message like “Current access suspected as a hacker attack, blocked by the administrator.” To overcome this, the article demonstrates how to inspect the website, locate the data nodes, and configure the necessary request headers.

Website analysis and anti‑scraping detection

The rental information is displayed under <dd> elements, with the house name inside a <div class="room-detail clearfloat"> node. By examining the page source, the required fields (name, price, score, link) can be identified.
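As a rough illustration of that structure, the sketch below parses a small hand-written HTML fragment. The fragment is an assumption modeled on the description above and on the class names used in the article's scripts; it is not a copy of the live page. The helper name parse_listings is hypothetical, and the code uses Python 3 syntax rather than the Python 2 of the article's own scripts.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the listing markup described above;
# the real page contains more nodes and may differ in detail.
SAMPLE_HTML = """
<dl>
  <dd>
    <a href="/room/851634765" target="_blank">view</a>
    <div class="room-detail clearfloat">
      <p>大唐东原财富广场--城市简约复式民宿</p>
      <ul><li>5.0分</li><li>5条评论</li></ul>
    </div>
    <div class="moy-b"><p>￥298</p></div>
  </dd>
</dl>
"""

def parse_listings(html):
    """Return (name, price, link) tuples for each <dd> listing in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for dd in soup.find_all("dd"):
        detail = dd.find("div", class_="room-detail")
        price = dd.find("div", class_="moy-b")
        link = dd.find("a", target="_blank")
        if detail is None or price is None or link is None:
            continue  # skip <dd> nodes that are not listings
        name = detail.find("p").get_text(strip=True)
        amount = price.find("p").get_text(strip=True).lstrip("￥")
        rows.append((name, amount, link["href"]))
    return rows

print(parse_listings(SAMPLE_HTML))
```

Checking each lookup for None before reading from it keeps the loop from crashing on <dd> nodes that do not hold a listing.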

Simple BeautifulSoup crawler (fails without headers)

# -*- coding: utf-8 -*-
import urllib
import re
from bs4 import BeautifulSoup
import codecs

url = 'http://www.mayi.com/guiyang/?map=no'
response = urllib.urlopen(url)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")
print soup.title
print soup
# short‑term rental name
for tag in soup.find_all('dd'):
    for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
        fname = name.find('p').get_text()
        print u'[短租房名称]', fname.replace('\n','').strip()

Running this code results in an error because the site’s anti‑scraping measures block the request.

Adding Cookie and User‑Agent headers

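First, capture the Cookie and User‑Agent values from the request shown in the browser’s Network panel (Headers tab), then include them in the request headers. The listing-page script below sends only the User‑Agent; the detail-page script later in the article sends the Cookie as well.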
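A minimal sketch of attaching both headers, assuming Python 3, where urllib.request takes the place of urllib2. The helper name build_request and the header values are placeholders, not values from the article; substitute the strings copied from your own browser session.

```python
import urllib.request

def build_request(url, user_agent, cookie=None):
    """Build a Request carrying browser-like headers; nothing is sent yet."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)

# Placeholder values: paste the strings copied from your own browser session.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
COOKIE = "name1=value1; name2=value2"

request = build_request("http://www.mayi.com/guiyang/?map=no", USER_AGENT, COOKIE)
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
print(request.get_header("Cookie"))
# To actually fetch the page: urllib.request.urlopen(request).read()
```

Building the Request separately from sending it makes it easy to check which headers will go out before any traffic reaches the site.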

# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup

# crawler function
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # short‑term rental name
        for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[短租房名称]', fname.replace('\n','').strip()
        # price
        for price in tag.find_all(attrs={"class":"moy-b"}):
            string = price.find('p').get_text()
            fprice = re.sub("[￥]+".decode("utf8"), "".decode("utf8"), string)
            fprice = fprice[0:5]
            print u'[短租房价格]', fprice.replace('\n','').strip()
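        # score and comments: read the <ul> inside this listing's room-detail node
        detail = tag.find(attrs={"class":"room-detail clearfloat"})
        if detail is not None and detail.find('ul') is not None:
            fscore = detail.find('ul').get_text()
            print u'[短租房评分/评论/居住人数]', fscore.replace('\n','').strip()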
        # page link
        url_dzf = tag.find(attrs={"target":"_blank"})
        urls = url_dzf.attrs['href']
        print u'[网页链接]', urls.replace('\n','').strip()
        urlss = 'http://www.mayi.com' + urls
        print urlss

The script now prints the rental name, price, score, and link for each page. Sample output (truncated) is shown below:

页码 1
[短租房名称] 大唐东原财富广场--城市简约复式民宿
[短租房价格] 298
[短租房评分/评论/居住人数] 5.0分·5条评论·二居·可住3人
[网页链接] /room/851634765
http://www.mayi.com/room/851634765
... 
页码 9
[短租房名称] 【高铁北站公园旁】美式风情+超大舒适安逸
[短租房价格] 366
[短租房评分/评论/居住人数] 3条评论·二居·可住5人
[网页链接] /room/851018852
http://www.mayi.com/room/851018852
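The price cleanup and link construction behind that output can be sketched in isolation. This is a Python 3 variant under stated assumptions: the helper names clean_price and absolute_link are hypothetical, and it swaps the article's re.sub plus slicing for a re.search on the first run of digits and its string concatenation for urljoin.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.mayi.com"

def clean_price(text):
    """Return the first run of digits in text, or '' if there is none."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else ""

def absolute_link(href):
    """Resolve a site-relative href such as /room/851634765."""
    return urljoin(BASE_URL, href)

print(clean_price("￥298\n"))
print(absolute_link("/room/851634765"))
```

Matching digits directly avoids depending on how many characters the price occupies, which the fixed slice fprice[0:5] does.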

Fetching detailed information

To obtain additional fields such as address, occupancy, and per‑person price, the article provides a function that re‑requests each detail page using the same Cookie and User‑Agent headers.

# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup
import codecs
import csv

c = open("ycf.csv","wb")
c.write(codecs.BOM_UTF8)
writer = csv.writer(c)
writer.writerow(["短租房名称","地址","价格","评分","可住人数","人均价格"])

# function to get detailed info
def getInfo(url, fname, fprice, fscore, users):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    cookie = "mediav=%7B%22eid%22%3A%22387123%22eb7; mayi_uuid=1582009990674274976491; sid=42200298656434922.85.130.130"
    headers = {"User-Agent": user_agent, "Cookie": cookie}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    # address
    for tag1 in soup.find_all(attrs={"class":"main"}):
        print u'短租房地址:'
        for tag2 in tag1.find_all(attrs={"class":"desWord"}):
            address = tag2.find('p').get_text()
            print address
        # occupancy
        print u'可住人数:'
        for tag4 in tag1.find_all(attrs={"class":"w258"}):
            yy = tag4.find('span').get_text()
            print yy
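        # calculate per‑person price and write to CSV
        fpeople = yy[2:3]  # the guest count is taken from the third character of yy
        ones = int(float(fprice))/int(float(fpeople))
        # Python 2's csv module expects byte strings, so encode unicode values as UTF-8
        row = [fname, address, fprice, fscore, fpeople, ones]
        writer.writerow([v.encode('utf-8') if isinstance(v, unicode) else v for v in row])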

# main crawler
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
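    for tag in soup.find_all('dd'):
        name = tag.find(attrs={"class":"room-detail clearfloat"})
        price = tag.find(attrs={"class":"moy-b"})
        url_dzf = tag.find(attrs={"target":"_blank"})
        if name is None or price is None or url_dzf is None:
            continue  # skip <dd> nodes that are not listings
        # name, price, and score are extracted as in the previous script
        fname = name.find('p').get_text().replace('\n','').strip()
        fprice = re.sub("[￥]+".decode("utf8"), "".decode("utf8"), price.find('p').get_text())[0:5]
        fscore = name.find('ul').get_text().replace('\n','').strip()
        # build the detail page URL and fetch the extra fields
        urlss = 'http://www.mayi.com' + url_dzf.attrs['href']
        getInfo(urlss, fname, fprice, fscore, user_agent)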

if __name__ == '__main__':
    i = 0
    while i < 33:
        print u'页码', (i+1)
        if i == 0:
            url = 'http://www.mayi.com/guiyang/?map=no'
        else:
            num = i + 2
            url = 'http://www.mayi.com/guiyang/' + str(num) + '/?map=no'
        gydzf(url)
        i += 1
    c.close()

The script writes all collected data into a CSV file for further analysis. The article also notes that the Cookie expires roughly every hour, so it must be refreshed, either by hand or by an automated step.
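The per-person calculation and CSV output can be sketched on their own. The version below assumes Python 3 and that the occupancy text has the form shown in the sample output (for example 可住3人). The helper names occupancy and write_rows are hypothetical, the column set is a subset of the article's, and a regular expression replaces the fixed slice yy[2:3] so that multi-digit guest counts also parse.

```python
import csv
import io
import re

def occupancy(text):
    """Extract the guest count from text such as '可住3人'; None if absent."""
    match = re.search(r"可住(\d+)人", text)
    return int(match.group(1)) if match else None

def write_rows(stream, listings):
    """Write a header plus one row per (name, price, occupancy text) tuple."""
    writer = csv.writer(stream)
    writer.writerow(["短租房名称", "价格", "可住人数", "人均价格"])
    for name, price, occ_text in listings:
        people = occupancy(occ_text)
        if not people:
            continue  # skip rows where the guest count is missing or zero
        writer.writerow([name, price, people, int(price) // people])

buffer = io.StringIO()
write_rows(buffer, [("大唐东原财富广场--城市简约复式民宿", "298", "可住3人")])
print(buffer.getvalue())
```

To write a real file in Python 3, pass open("ycf.csv", "w", newline="", encoding="utf-8-sig") as the stream; the utf-8-sig codec writes the same byte order mark that the article's script adds with codecs.BOM_UTF8.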

Alternative approach

Using Selenium to control a real browser is mentioned as another viable method for bypassing the anti‑scraping checks.
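A minimal sketch of that alternative, assuming Python 3 with the selenium package installed and a Chrome browser with a compatible driver available. The function name fetch_with_browser is hypothetical, and the article does not show its own Selenium code.

```python
def fetch_with_browser(url):
    """Load url in a real Chrome instance and return the rendered HTML."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails

# Usage (opens a browser window):
# html = fetch_with_browser("http://www.mayi.com/guiyang/?map=no")
```

The returned HTML can be passed to BeautifulSoup in the same way as the contents read through urllib2 in the scripts above.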

Overall, the guide provides a step‑by‑step solution for extracting rental listings from Ant Short‑Term Rental, covering header acquisition, cookie handling, data parsing with BeautifulSoup, and data storage.
