
How to Scrape and Analyze China’s Tourist Attractions Data with Python

This article demonstrates how to crawl Qunar’s nationwide tourism listings with Python, extract key fields such as name, level, location and price, handle anti‑scraping measures, and then perform comprehensive data analysis and visualisation—including sales rankings, popularity scores, geographic distribution and geocoding using the Amap API.

Python Crawling & Data Mining

Nov 22, 2018


01 Data Scraping

Qunar provides extensive tourism information covering all provinces of China. Since Qunar's ticket service does not offer an API, the author parses the web pages to collect data such as attraction name, level, location, description, price, sales volume and popularity.

The pages are parsed layer by layer, and requests go through a proxy pool to avoid being blocked. Each field is extracted inside its own try/except block, so a missing element falls back to a default value instead of stopping the crawl. The result is 41,611 attraction records.

# `s` is assumed to be the list of attraction nodes that BeautifulSoup found on one results page
for i in s:
    inf = {}
    try:
        # First character of the level text, e.g. '5' for a 5A attraction
        inf['level'] = i.find('span', class_='level').text[0]
    except Exception:
        inf['level'] = '0'  # no level listed
    try:
        inf['price'] = i.find('span', class_='sight_item_price').find('em').text
    except Exception:
        inf['price'] = ''
    try:
        inf['name'] = i.find('a', class_='name').text
    except Exception:
        inf['name'] = ''
    try:
        inf['num'] = i.find('span', class_='hot_num').text
    except Exception:
        inf['num'] = ''
    try:
        area = i.find('span', class_='area').find('a').text
    except Exception:
        area = ''  # area element missing
    # Addresses look like 'province·city'; without a separator both fields get the full text
    parts = area.split('·')
    inf['add_pro'] = parts[0]
    inf['add_city'] = parts[1] if len(parts) > 1 else parts[0]
    try:
        # The title attribute holds 'label:score'; keep the part after the colon
        inf['hot'] = i.find('span', class_='product_star_level').find('em').get('title').split(':')[1]
    except Exception:
        inf['hot'] = ''
    try:
        inf['descri'] = i.find('div', class_='intro color999').text
    except Exception:
        inf['descri'] = ''
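
The article does not show the proxy pool mentioned above. The following is a minimal sketch of one way to rotate through proxies, not the author's implementation. The helper name `fetch_with_proxies` and the caller-supplied `fetch(url, proxy)` callable are assumptions made for illustration; in practice `fetch` could be a thin wrapper around `requests.get` with its `proxies` and `timeout` arguments.

```python
import itertools


def fetch_with_proxies(url, proxies, fetch, max_tries=5):
    """Try `url` through successive proxies until one attempt succeeds.

    `fetch(url, proxy)` is a caller-supplied callable (hypothetical here)
    that returns the page text or raises an exception on failure.
    """
    if not proxies:
        raise ValueError('proxy pool is empty')
    last_error = None
    # cycle() loops over the pool; islice() caps the total number of attempts
    for proxy in itertools.islice(itertools.cycle(proxies), max_tries):
        try:
            return fetch(url, proxy)
        except Exception as exc:  # blocked, timed out, dead proxy, ...
            last_error = exc
    raise RuntimeError(f'all {max_tries} attempts failed') from last_error
```

Passing the fetch function in as an argument keeps the rotation logic independent of any HTTP library and makes it easy to exercise with a stub.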

02 Data Analysis

Among 5A attractions, the Terracotta Army leads in sales, selling about 5/3 times as many tickets as second-ranked Chimelong Paradise. Six of the top 20 by sales are amusement parks, suggesting that cities without historic sites can develop tourism through large-scale parks.

Jiangsu has the most 5A attractions (41), followed by Zhejiang and Guangdong (21 each). Eastern provinces host more 5A sites than western ones.

4A attractions: Chengdu Panda Base ranks highest in sales; amusement parks again occupy a large share. Shandong has the most 4A sites (167), while Tibet has only six.

3A attractions: Bamboo Forest Longevity Mountain leads with 1,326 sales. Shandong has the most 3A sites (211). Popularity scores are low for 3A and many 4A sites.
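
The analysis code behind these rankings is not included in the article. As a rough sketch using only the standard library, the sales rankings and per-province counts could be derived from the scraped records as follows. It assumes the `num` field holds the sales figure as a string and that `level` is the single character stored by the scraper; the helper names and the sample records are made up for illustration.

```python
from collections import Counter


def to_int(text):
    """Convert a scraped sales string to int; empty or invalid values count as 0."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return 0


def top_by_sales(records, level, n=20):
    """Return the n best-selling (name, sales) pairs for one level."""
    rows = [(r['name'], to_int(r['num'])) for r in records if r['level'] == level]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def count_by_province(records, level):
    """Count attractions of one level per province, most common first."""
    counts = Counter(r['add_pro'] for r in records if r['level'] == level)
    return counts.most_common()


# Made-up sample records, for illustration only
sample = [
    {'name': 'A', 'level': '5', 'num': '300', 'add_pro': 'P1'},
    {'name': 'B', 'level': '5', 'num': '500', 'add_pro': 'P1'},
    {'name': 'C', 'level': '5', 'num': '', 'add_pro': 'P2'},
    {'name': 'D', 'level': '4', 'num': '900', 'add_pro': 'P2'},
]
print(top_by_sales(sample, '5', n=2))   # [('B', 500), ('A', 300)]
print(count_by_province(sample, '5'))   # [('P1', 2), ('P2', 1)]
```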

Box‑plot comparisons show that 5A attractions have significantly higher popularity and sales than 4A and 3A.
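
A box plot draws five numbers per group: minimum, first quartile, median, third quartile and maximum. The sketch below computes them per level with the standard library, assuming the field being compared (for example `hot` or `num`) was scraped as a numeric string. The function names are illustrative, and the article does not say which quartile method its plots used.

```python
import statistics


def box_stats(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    return data[0], q1, median, q3, data[-1]


def stats_by_level(records, field):
    """Group a numeric field by level and summarise each group."""
    groups = {}
    for r in records:
        try:
            value = float(r[field])
        except (TypeError, ValueError):
            continue  # skip records where the field is empty or not numeric
        groups.setdefault(r['level'], []).append(value)
    # Quartiles need at least two data points, so smaller groups are left out
    return {lvl: box_stats(vals) for lvl, vals in groups.items() if len(vals) >= 2}
```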

A word cloud generated from all attraction descriptions highlights terms such as “culture”, “leisure”, “tourism”, “experience”, “park” and “history”.
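
A word cloud sizes each term by how often it occurs, so the underlying step is a frequency count. The sketch below assumes the descriptions have already been split into tokens; Chinese text needs a word segmenter first, and the article does not say which one was used. The function name and the stopword handling are illustrative.

```python
from collections import Counter


def word_frequencies(token_lists, stopwords=(), top=50):
    """Count tokens across all descriptions, ignoring stopwords and single characters."""
    stop = set(stopwords)
    counts = Counter(
        tok
        for tokens in token_lists
        for tok in tokens
        if len(tok) > 1 and tok not in stop
    )
    return dict(counts.most_common(top))
```

The resulting word-to-count dictionary is the kind of input that word cloud libraries generally accept for rendering.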

Geocoding: The Amap API is used to convert scraped address data into latitude and longitude. Example request URL:

https://restapi.amap.com/v3/geocode/geo?address=<address>&output=XML&key=<your key>&city=<city>
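
Chinese addresses have to be percent-encoded before they can appear in a query string. As a small sketch, the request URL can be assembled with the standard library as shown here. The helper name `build_geocode_url` and the `YOUR_KEY` value are placeholders, and the default `output='JSON'` is an assumption chosen because JSON is easier to parse from Python than XML.

```python
from urllib.parse import urlencode

GEOCODE_URL = 'https://restapi.amap.com/v3/geocode/geo'


def build_geocode_url(address, city, key, output='JSON'):
    """Build an Amap geocoding URL with percent-encoded query parameters."""
    query = urlencode({'address': address, 'output': output, 'key': key, 'city': city})
    return f'{GEOCODE_URL}?{query}'


# Non-ASCII characters are encoded as UTF-8 percent escapes
print(build_geocode_url('故宫', '北京', 'YOUR_KEY'))
```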

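The Python code below iterates over the attraction names, sends each one together with its city to the Amap service, extracts the coordinates from the response, and appends one row per attraction to a CSV file. The code omits the output parameter and parses the response as JSON.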

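import pandas
import requests

# name, level, pro and city are assumed to be parallel lists built from the scraped records
for i in range(len(name)):
    t = {}
    # Replace the placeholder with your own Amap Web Service key
    parameters = {'address': name[i], 'key': '<your key>', 'city': city[i]}
    html = requests.get('https://restapi.amap.com/v3/geocode/geo', params=parameters).json()
    try:
        # 'location' is a 'longitude,latitude' string
        t['jingwei'] = html['geocodes'][0]['location']
    except (KeyError, IndexError):
        # No match, or an error response without a 'geocodes' list
        t['jingwei'] = '0,0'
    t['n'] = name[i]
    t['level'] = level[i]
    t['pro'] = pro[i]
    t['city'] = city[i]
    # Write one row per attraction (DataFrame.append is unavailable in pandas 2.0+);
    # columns follow insertion order: jingwei, n, level, pro, city
    pandas.DataFrame([t]).to_csv('543.csv', encoding='utf-8', index=False, mode='a', header=False)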
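
Each `jingwei` value written by the loop above is a single 'longitude,latitude' string, with '0,0' standing in for addresses that could not be geocoded. Before plotting, it helps to split that string into numbers and drop the failed rows. A minimal sketch, with `parse_location` as an illustrative helper name:

```python
def parse_location(jingwei):
    """Split a 'longitude,latitude' string into floats; return None for the '0,0' placeholder."""
    lng_text, lat_text = jingwei.split(',')
    lng, lat = float(lng_text), float(lat_text)
    if lng == 0 and lat == 0:
        return None  # geocoding failed for this row
    return lng, lat
```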

Heat-map visualisations show Beijing as the city richest in tourism resources, with Chongqing, Guangzhou, Tianjin and Suzhou also standing out. Nationwide distribution maps and hexagonal heat maps illustrate regional tourism density.
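
The article does not show how the hexagonal heat map was built, and plotting libraries normally do the binning themselves. To illustrate the idea, the sketch below assigns each coordinate to a pointy-top hexagonal cell using axial coordinates with cube rounding, then counts points per cell. The function names and the cell size are assumptions, and treating degrees as planar coordinates is a simplification.

```python
import math
from collections import Counter


def hex_cell(lng, lat, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing a point.

    `size` is the hexagon's centre-to-corner distance in degrees.
    """
    q = (math.sqrt(3) / 3 * lng - lat / 3) / size
    r = (2 / 3 * lat) / size
    # Round in cube coordinates (x + y + z = 0), then recompute the axis
    # with the largest rounding error so the constraint still holds.
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        pass  # only y would be recomputed, and it is not part of the axial index
    else:
        rz = -rx - ry
    return rx, rz


def hex_counts(points, size=1.0):
    """Count how many (lng, lat) points fall into each hexagonal cell."""
    return Counter(hex_cell(lng, lat, size) for lng, lat in points)
```

The per-cell counts can then be mapped to colours, with denser cells drawn darker.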

Finally, the author recommends visiting Hunan province, especially Changsha, Zhangjiajie, Yongzhou, Huaihua and Chenzhou.

