How to Scrape and Visualize 6,000+ Chinese Tourist Spots with Selenium and Python
This article demonstrates how to use Selenium and Python to crawl over 6,000 Chinese tourist attractions from Qunar, extract ratings, popularity and sales data, and visualize the results with pandas, seaborn, matplotlib, and pyecharts, revealing the most visited sites and regional travel trends during the 2019 National Day holiday.
The Ministry of Culture and Tourism predicted nearly 800 million trips during the 2019 National Day holiday. To find out which cities and attractions would be most congested, the author crawled Qunar.com.
Using Selenium, the script opens Qunar's ticket search results, locates each sight_item element, and extracts the attraction name, rating, popularity, address, and sales volume. It requests 13 result pages for each of 34 province-level search keywords and collects 6,630 records in total.
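from tqdm import tqdm
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException
from selenium.webdriver.common.by import By
import pandas as pd
# Province-level search keywords
position = ["北京","天津","上海","重庆","河北","山西","辽宁","吉林","黑龙江","江苏","浙江","安徽","福建","江西","山东","河南","湖北","湖南","广东","海南","四川","贵州","云南","陕西","甘肃","青海","台湾","内蒙古","广西","西藏","宁夏","新疆","香港","澳门"]
name, level, hot, address, num = [], [], [], [], []
def get_one_page(key, page, retries=3):
    """Scrape one result page for `key` and append its rows to the global lists."""
    url = f"http://piao.qunar.com/ticket/list.htm?keyword={key}&region=&from=mpl_search_suggest&page={page}"
    for _ in range(retries):  # bounded retries instead of unbounded recursion
        driver = None
        try:
            option_chrome = webdriver.ChromeOptions()
            option_chrome.add_argument('--headless')
            driver = webdriver.Chrome(options=option_chrome)
            time.sleep(1)
            driver.get(url)
            rows = []
            for item in driver.find_elements(By.CLASS_NAME, "sight_item"):
                try:
                    item_level = item.find_element(By.CLASS_NAME, "level").text
                except NoSuchElementException:
                    item_level = ""  # attraction has no rating level
                try:
                    item_num = item.find_element(By.CLASS_NAME, "hot_num").text
                except NoSuchElementException:
                    item_num = 0  # no sales figure shown
                rows.append((
                    item.find_element(By.CLASS_NAME, "name").text,
                    item_level,
                    item.find_element(By.CLASS_NAME, "product_star_level").text[3:],  # drop the 3-character prefix
                    item.find_element(By.CLASS_NAME, "area").text,
                    item_num,
                ))
        except (TimeoutException, WebDriverException):
            continue  # try this page again
        finally:
            if driver is not None:
                driver.quit()  # always release the browser, even on failure
        # Append only after the whole page parsed, so the five lists stay aligned
        for item_name, item_level, item_hot, item_address, item_num in rows:
            name.append(item_name)
            level.append(item_level)
            hot.append(item_hot)
            address.append(item_address)
            num.append(item_num)
        return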
for key in tqdm(position):
    print(f"Crawling {key}")
    for page in range(1, 14):
        print(f"Page {page}")
        get_one_page(key, page)
sight = {'name': name, 'level': level, 'hot': hot, 'address': address, 'num': num}
sight = pd.DataFrame(sight, columns=['name', 'level', 'hot', 'address', 'num'])
sight.to_csv("sight.csv", encoding="utf_8_sig")
The resulting dataset shows the top 30 most popular attractions, led by the Giant Panda base, the Forbidden City, Zhengzhou Zoo, Mount Emei, and the Terracotta Army. Heatmaps are generated for province‑level and city‑level popularity and sales using the AMap API for geocoding.
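The article does not show the geocoding step, so the following is only a minimal sketch of how an address might be turned into coordinates for a city‑level heatmap. It assumes the AMap Web Service geocoding endpoint at restapi.amap.com/v3/geocode/geo and a response of the form {"status": "1", "geocodes": [{"location": "lng,lat"}]}; the function names and the api_key parameter are hypothetical and not taken from the original script.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed AMap Web Service geocoding endpoint; verify against the current AMap docs.
AMAP_GEOCODE_URL = "https://restapi.amap.com/v3/geocode/geo"


def build_geocode_url(address, api_key):
    """Return the request URL for a single address lookup."""
    query = urlencode({"key": api_key, "address": address, "output": "JSON"})
    return f"{AMAP_GEOCODE_URL}?{query}"


def parse_location(payload):
    """Extract (longitude, latitude) from a decoded response, or None.

    Assumes the shape {"status": "1", "geocodes": [{"location": "lng,lat"}]}.
    """
    if payload.get("status") != "1" or not payload.get("geocodes"):
        return None
    location = payload["geocodes"][0].get("location")
    if not isinstance(location, str) or "," not in location:
        return None
    lng, lat = location.split(",")
    return float(lng), float(lat)


def geocode(address, api_key, timeout=10):
    """Look up one address over the network (requires a valid AMap key)."""
    with urlopen(build_geocode_url(address, api_key), timeout=timeout) as resp:
        return parse_location(json.load(resp))
```

Keeping URL building and response parsing apart from the network call makes it possible to check the parsing logic against a saved response before spending API quota on thousands of addresses.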
Visualization code (pandas, seaborn, matplotlib, pyecharts) processes the CSV, splits addresses into province, city, and area, and creates bar charts for the top‑selling attractions, stacked bar charts for ratings per province, and heatmaps for sales and popularity across China.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("sight.csv")
data = data.fillna(0)
data = data.drop(columns=['Unnamed: 0'])  # index column written by to_csv
# Strip the square brackets, then split each address on "·"
data["address"] = data["address"].astype(str).str.replace("[", "", regex=False).str.replace("]", "", regex=False)
parts = data["address"].str.split("·")
data["province"] = parts.str[0]
data["city"] = parts.str[1]  # NaN when the address has no second part, instead of an IndexError
data["area"] = parts.str[-1]
# Top 30 attractions by sales volume
num_top = data.sort_values(by='num', ascending=False).reset_index(drop=True)
plt.figure(figsize=(15,10))
sns.barplot(x=num_top["name"][:30], y=num_top["num"][:30])
plt.xticks(rotation=90)
plt.show()
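The stacked bar chart of ratings per province is described but not shown. One possible way to build it, sketched here under assumptions, is to cross‑tabulate province against rating level and let pandas draw the stacked bars. The helper names and the "unrated" label are hypothetical, and the sample rows are made up purely for illustration.

```python
import pandas as pd


def rating_counts_by_province(df):
    """Count attractions per (province, rating level).

    Expects the scraped "address" and "level" columns; empty, missing,
    or zero-filled levels are grouped under the label "unrated".
    """
    province = (
        df["address"].astype(str)
        .str.strip("[]")          # remove the surrounding brackets
        .str.split("·").str[0]    # first part is the province
        .rename("province")
    )
    level = df["level"].map(
        lambda v: v if isinstance(v, str) and v else "unrated"
    ).rename("level")
    return pd.crosstab(province, level)


def plot_rating_counts(counts):
    """Draw one stacked bar per province, one segment per rating level."""
    import matplotlib.pyplot as plt  # imported here so counting works without a display

    ax = counts.plot(kind="bar", stacked=True, figsize=(15, 8))
    ax.set_xlabel("province")
    ax.set_ylabel("number of attractions")
    plt.tight_layout()
    plt.show()


# Made-up rows, only to show the expected input shape
sample = pd.DataFrame({
    "address": ["[北京·北京·东城区]", "[北京·北京·延庆区]", "[四川·成都·成华区]"],
    "level": ["5A景区", "", "4A景区"],
})
counts = rating_counts_by_province(sample)
```

Calling plot_rating_counts(counts) on the full dataset would then render the chart; Chinese labels may need a font that covers CJK characters.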
# Heatmap with pyecharts (this is the pyecharts 0.5.x API; 1.x moved Map to pyecharts.charts)
from pyecharts import Map
# Sum sales per province so the map receives one value per region
province_sales = data.groupby("province")["num"].sum()
sales_map = Map("Province Attraction Sales Heatmap", width=1200, height=600, background_color='#404a59')
sales_map.add("", list(province_sales.index), province_sales.tolist(), maptype="china", visual_range=[5000, 80000], is_visualmap=True)
sales_map.render(path="province_sales.html")
Overall, Beijing, Sichuan, and the coastal provinces dominate tourism demand, which suggests that travelers should consider less‑crowded destinations during peak holiday periods.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
