How to Scrape and Analyze China’s Tourist Attractions Data with Python
This article demonstrates how to crawl Qunar’s nationwide tourism listings with Python, extract key fields such as name, level, location and price, handle anti‑scraping measures, and then perform comprehensive data analysis and visualisation—including sales rankings, popularity scores, geographic distribution and geocoding using the Amap API.
01 Data Scraping
Qunar provides extensive tourism information covering all provinces of China. Since Qunar's ticket service does not offer an API, the author parses the web pages to collect data such as attraction name, level, location, description, price, sales volume and popularity.
Parsing is performed layer by layer; to avoid being blocked, a proxy pool is used. The script extracts the required fields with try/except blocks, resulting in 41,611 attraction records.
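The article does not show how the proxy pool is wired in. The sketch below is one possible arrangement and not the author's code: it rotates through a pool in round-robin order and switches proxy after each failed attempt. The names `rotating_proxies` and `fetch_with_rotation`, the pool addresses, and the `fetch` callback are all hypothetical.

```python
import itertools


def rotating_proxies(pool):
    """Yield proxy mappings in round-robin order, forever."""
    for address in itertools.cycle(pool):
        # Same mapping shape that the requests library accepts via `proxies=`.
        yield {"http": address, "https": address}


def fetch_with_rotation(url, fetch, proxies, max_attempts=3):
    """Call fetch(url, proxy_mapping), moving to the next proxy after a failure."""
    last_error = None
    for _ in range(max_attempts):
        proxy = next(proxies)
        try:
            return fetch(url, proxy)
        except Exception as error:  # any failure: try the next proxy
            last_error = error
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error


# Hypothetical pool addresses; replace with entries from a real pool.
PROXY_POOL = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
```

With `requests`, the `fetch` argument could be something like `lambda url, proxy: requests.get(url, proxies=proxy, timeout=10).text`. The article's own field-extraction loop, which runs on each fetched page, follows.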
# s is assumed to hold the parsed listing elements, one per attraction
for i in s:
    inf = {}
    try:
        # keep only the first character of the level label; '0' means no level found
        inf['level'] = i.find('span', class_='level').text[0]
    except Exception:
        inf['level'] = '0'
    try:
        inf['price'] = i.find('span', class_='sight_item_price').find('em').text
    except Exception:
        inf['price'] = ''
    try:
        inf['name'] = i.find('a', class_='name').text
    except Exception:
        inf['name'] = ''
    try:
        inf['num'] = i.find('span', class_='hot_num').text
    except Exception:
        inf['num'] = ''
    try:
        # the area text looks like 'province·city'; without a separator, use it for both
        parts = i.find('span', class_='area').find('a').text.split('·')
        inf['add_pro'] = parts[0]
        inf['add_city'] = parts[1] if len(parts) > 1 else parts[0]
    except Exception:
        # the area element itself is missing
        inf['add_pro'] = ''
        inf['add_city'] = ''
    try:
        inf['hot'] = i.find('span', class_='product_star_level').find('em').get('title').split(':')[1]
    except Exception:
        inf['hot'] = ''
    try:
        inf['descri'] = i.find('div', class_='intro color999').text
    except Exception:
        inf['descri'] = ''
02 Data Analysis
Analysis of 5A attractions shows that the Terracotta Army leads in sales, with about 5/3 the sales of second‑ranked Chimelong Paradise. Six of the top 20 attractions by sales are amusement parks, suggesting that cities without historic sites can develop tourism through large‑scale parks.
Jiangsu has the most 5A attractions (41), followed by Zhejiang and Guangdong (21 each). Eastern provinces host more 5A sites than western ones.
4A attractions: Chengdu Panda Base ranks highest in sales; amusement parks again occupy a large share. Shandong has the most 4A sites (167), while Tibet has only six.
3A attractions: Bamboo Forest Longevity Mountain leads with 1,326 sales. Shandong has the most 3A sites (211). Popularity scores are low for 3A and many 4A sites.
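The per-level rankings and province counts above could be computed along the following lines. This is a minimal sketch, not the author's analysis code. It assumes that `num` holds the sales figure and that `level` is the single digit produced by the scraper. The records are toy values for illustration only, and the helper names are hypothetical.

```python
from collections import Counter

# Toy records for illustration only; these are not the scraped data.
records = [
    {"name": "Site A", "level": "5", "add_pro": "Jiangsu", "num": "1200"},
    {"name": "Site B", "level": "5", "add_pro": "Zhejiang", "num": "300"},
    {"name": "Site C", "level": "4", "add_pro": "Shandong", "num": ""},
    {"name": "Site D", "level": "5", "add_pro": "Jiangsu", "num": "800"},
    {"name": "Site E", "level": "4", "add_pro": "Shandong", "num": "450"},
]


def to_int(text):
    """Convert a scraped count to int; empty or malformed values become 0."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return 0


def top_by_sales(rows, level, n=20):
    """Return the n best-selling attractions of one level, highest first."""
    selected = [row for row in rows if row["level"] == level]
    return sorted(selected, key=lambda row: to_int(row["num"]), reverse=True)[:n]


def count_by_province(rows, level):
    """Count attractions of one level per province."""
    return Counter(row["add_pro"] for row in rows if row["level"] == level)


for row in top_by_sales(records, "5", n=3):
    print(row["name"], to_int(row["num"]))
print(count_by_province(records, "5").most_common())
```

Treating missing sales figures as 0 keeps such rows in the ranking but at the bottom. Dropping them instead would be an equally reasonable choice.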
Box‑plot comparisons show that 5A attractions have significantly higher popularity and sales than 4A and 3A.
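A box plot draws each group from its minimum, quartiles and maximum. The sketch below computes those numbers for one group with the standard library. The function name is hypothetical, and the article does not say which plotting library or quartile method the author used.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles and max for a group (needs at least two values)."""
    ordered = sorted(values)
    # 'inclusive' treats the data as the whole population, so the quartiles
    # always lie between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3, "max": ordered[-1]}


# Toy popularity scores for one level; illustrative values only.
print(five_number_summary([5, 1, 4, 2, 3]))
```

Running the function once per level (5A, 4A, 3A) gives the numbers that the box plots compare visually.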
Word cloud generated from all attraction descriptions highlights terms such as “culture”, “leisure”, “tourism”, “experience”, “park”, “history”.
Geocoding: The Amap API is used to convert scraped address data into latitude and longitude. Example request URL:
https://restapi.amap.com/v3/geocode/geo?address=<address>&output=XML&key=<your key>&city=<city>
Python code iterates over the address list, sends requests to the Amap service, extracts coordinates, and appends them to a CSV file.
import pandas
import requests

# name, city, level and pro are assumed to be parallel lists built from the scraped data
for i in range(len(name)):
    t = {}
    add = name[i]
    chengshi = city[i]
    # replace the placeholder with your own Amap key
    parameters = {'address': add, 'key': '<your key>', 'city': chengshi}
    html = requests.get('https://restapi.amap.com/v3/geocode/geo', params=parameters).json()
    try:
        t['jingwei'] = html['geocodes'][0]['location']
    except (IndexError, KeyError):
        # no match, or a response without 'geocodes': store a sentinel value
        t['jingwei'] = '0,0'
    finally:
        t['n'] = name[i]
        t['level'] = level[i]
        t['pro'] = pro[i]
        t['city'] = city[i]
    # DataFrame.append was removed in pandas 2.0, so build the one-row frame directly
    x = pandas.DataFrame([t])
    x.to_csv('543.csv', encoding='utf-8', index=False, mode='a', header=False)
Heat‑map visualisations reveal Beijing as the most resource‑rich city, with Chongqing, Guangzhou, Tianjin and Suzhou also attractive. Nationwide distribution maps and hexagonal heat maps illustrate regional tourism density.
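The maps need numeric coordinates, whereas the geocoding loop stores a single comma-separated string per attraction and writes '0,0' when no match is found. A minimal sketch of the conversion step follows. It assumes the string lists longitude first, and the function name `parse_jingwei` is hypothetical.

```python
def parse_jingwei(text):
    """Turn 'lng,lat' into a (lng, lat) tuple of floats, or None if unusable."""
    try:
        lng_text, lat_text = text.split(",")
        lng, lat = float(lng_text), float(lat_text)
    except (AttributeError, ValueError):
        # not a string, wrong number of parts, or non-numeric parts
        return None
    if lng == 0.0 and lat == 0.0:
        # '0,0' is the sentinel written when geocoding found nothing
        return None
    return lng, lat


# Toy rows in the same shape as the CSV's coordinate column.
rows = ["116.481488,39.990464", "0,0", ""]
points = [point for point in map(parse_jingwei, rows) if point is not None]
print(points)
```

Filtering out the sentinel before plotting avoids a spurious cluster at the origin of the map.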
Finally, the author recommends visiting Hunan province, especially Changsha, Zhangjiajie, Yongzhou, Huaihua and Chenzhou.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
