Analyzing National Day Travel Crowds Using Python Web Scraping and Search Index Data

This article describes how to use Python, Selenium, and search‑index services to scrape and visualize popularity data for Chinese tourist spots during the National Day holiday. It presents a ranking of destinations and includes code snippets for collecting, storing, and displaying the data.

Qunar Tech Salon

The author wanted to avoid crowded tourist sites during the upcoming National Day holiday and decided to collect travel‑related data using Python web scraping.

Instead of trying to scrape structured data from travel websites like Ctrip or Mafengwo, the author used search‑index platforms (initially Baidu Index, then Sogou Index) as a proxy to gauge public interest in various scenic spots.

By querying the index for keyword search volumes, the author compiled a list of 100 destinations and grouped them into five popularity tiers, illustrating which locations are likely to be overcrowded.
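
The tiering step can be sketched in a few lines. This is only an illustration, not the author's code: the `split_into_tiers` helper, the placeholder destination names, and the search volumes are all made up for the example, and it assumes each destination has already been reduced to a single average daily search volume.

```python
from typing import Dict, List, Tuple


def split_into_tiers(avg_pv: Dict[str, int], tiers: int = 5) -> List[List[Tuple[str, int]]]:
    """Rank destinations by search volume (highest first) and cut the
    ranking into `tiers` consecutive groups of near-equal size."""
    ranked = sorted(avg_pv.items(), key=lambda item: item[1], reverse=True)
    size, extra = divmod(len(ranked), tiers)
    groups, start = [], 0
    for i in range(tiers):
        # The first `extra` tiers take one additional item each.
        end = start + size + (1 if i < extra else 0)
        groups.append(ranked[start:end])
        start = end
    return groups


# Hypothetical sample data: names and numbers are placeholders.
sample = {f"spot_{n:02d}": 1000 * (10 - n) for n in range(10)}
tiered = split_into_tiers(sample, tiers=5)
for level, group in enumerate(tiered, start=1):
    print(f"tier {level}: {[name for name, _ in group]}")
```

With 100 destinations and five tiers, each group would hold 20 names, and the first group would be the one most likely to be crowded.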

The crawler relies on Selenium for page interaction, regular expressions to pull the embedded data out of the page source, pyecharts for visualization, and MongoDB (accessed via pymongo) for data storage.

Key code snippets are provided below.

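# Code snippet for displaying the data
import os
import pymongo

def show_data(self):
    # Render five charts of 10 destinations each, ranked by average daily search volume.
    for index in range(5):
        # Only keep destinations whose daily average is below 100,000.
        queryArgs = {"day_avg_pv": {"$lt": 100000}}
        rets = self.zfdb.national_month_index.find(queryArgs).sort("day_avg_pv", pymongo.DESCENDING).skip(index*10).limit(10)
        atts = []
        values = []
        file_name = "top" + str(index*10) + "-" + str((index+1)*10) + ".html"
        for ret in rets:
            print(ret)
            atts.append(ret["address"])
            values.append(ret["day_avg_pv"])
        # show_line is not shown in this article; the rename below assumes it writes render.html.
        self.show_line("Average search volume per scenic spot over 30 days", atts, values)
        os.rename("render.html", file_name)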

The second snippet shows the main data‑extraction routine that fetches JSON embedded in the page source, parses it, and inserts both daily and monthly statistics into MongoDB collections.

# Code snippet for scraping the data
import json
import re

def get_index_data(self):
    try:
        for url in self.get_url():
            print("Current URL: " + url)
            self.browser.get(url)
            self.browser.implicitly_wait(10)
            # The page embeds its data as a JavaScript assignment that ends in "}]};".
            ret = re.findall(r'root\.SG\.data = (.*)}]};', self.browser.page_source)
            # Re-append the "}]}" that the pattern matched outside the capture group.
            totalJson = json.loads(ret[0] + "}]}")
            infoList = totalJson["infoList"]
            pvList = totalJson["pvList"]
            for index, info in enumerate(infoList):
                # Store one summary document per keyword, not one per day.
                self.zfdb.national_month_index.insert_one({
                    "address": info["kwdName"],
                    "day_avg_pv": info["avgWapPv"],
                    "sum_pv": info["kwdSumPv"]["sumPv"]
                })
                for pvDate in pvList[index]:
                    print("index => " + str(index) + " address => " + info["kwdName"] + " date => " + str(pvDate["date"]) + " => " + str(pvDate["pv"]))
                    # Store one document per keyword per day.
                    self.zfdb.national_day_index.insert_one({
                        "address": info["kwdName"],
                        "date": pvDate["date"],
                        "day_pv": pvDate["pv"]
                    })
    except Exception as exc:
        # Report what went wrong instead of hiding it behind a bare except.
        print("exception:", exc)
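
Pulling JSON out of a script assignment with a hand-tuned regular expression tends to break when the page layout shifts. One possible alternative, offered here as a sketch rather than as the author's method, is to locate the assignment and let the standard library's `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The `extract_embedded_json` helper and the `sample_page` fragment are hypothetical; the field names only mirror those used in the snippet above, and the real page may be laid out differently.

```python
import json
import re
from typing import Any, Optional

# Assumed marker; the real page may use a different assignment.
MARKER = re.compile(r"root\.SG\.data\s*=\s*")


def extract_embedded_json(page_source: str) -> Optional[Any]:
    """Return the JSON value assigned right after MARKER, or None."""
    match = MARKER.search(page_source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and any further script).
    try:
        value, _end = json.JSONDecoder().raw_decode(page_source, match.end())
    except json.JSONDecodeError:
        return None
    return value


# Made-up page fragment used only to exercise the helper.
sample_page = (
    '<script>root.SG.data = {"infoList": [{"kwdName": "demo", "avgWapPv": 12}],'
    ' "pvList": [[{"date": 20180901, "pv": 10}]]};root.SG.other = 1;</script>'
)
data = extract_embedded_json(sample_page)
print(data["infoList"][0]["kwdName"], data["pvList"][0][0]["pv"])
```

Because the decoder stops at the end of the first complete value, there is no need to guess which closing brackets to strip or re-append.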

The analysis reveals that top‑tier destinations such as Guilin, Sanya, and Mount Tai have extremely high search volumes and are likely to be overcrowded, while lower‑tier spots are safer choices for a quieter holiday.

In the concluding note, the author reflects on the challenges of extracting data from Baidu Index, the usefulness of Sogou Index, and expresses a desire to explore Baidu Index further in future work.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
