How to Scrape Hao123 Travel Site with Python: Step‑by‑Step Guide

This tutorial demonstrates how to use Python's requests, lxml, and pprint libraries to crawl the Hao123 travel website, extract attraction names, opening times, reviews, and prices, and display the results, while providing complete code examples and practical tips.

Python Crawling & Data Mining

Jun 4, 2020

1. Introduction

When traveling, we often want to know the attractions, prices, opening hours, and user reviews of a destination. This article uses Python web‑scraping techniques to obtain such information from the Hao123 travel website.

2. Project Goals

Extract each attraction's name, opening time, highlighted reviews, and price from the site.

3. Required Libraries and Target URL

The target URL is:

https://go.hao123.com/ticket?city=%E5%B9%BF%E5%B7%9E&theme=all&pn=1

Here city=%E5%B9%BF%E5%B7%9E is the URL-encoded (UTF-8) form of 广州 (Guangzhou), and pn is the page number.
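As a small illustration of how that query string is formed, the sketch below uses the standard library's urllib.parse to encode the city name and assemble the listing URL. The helper name build_list_url and the BASE_URL constant are assumptions made for this example, not part of the original project.

```python
from urllib.parse import quote, unquote, urlencode

# Path taken from the target URL above; the constant name is hypothetical.
BASE_URL = "https://go.hao123.com/ticket"


def build_list_url(city, page, theme="all"):
    """Build the listing URL for a city name and a 1-based page number."""
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes.
    query = urlencode({"city": city, "theme": theme, "pn": page})
    return f"{BASE_URL}?{query}"


print(quote("广州"))                    # %E5%B9%BF%E5%B7%9E
print(unquote("%E5%B9%BF%E5%B7%9E"))    # 广州
print(build_list_url("广州", 1))
```

Changing the page argument is enough to move through the listing, since only pn differs between pages.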

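The required Python libraries are requests, lxml, and pprint. pprint ships with the standard library; requests and lxml are third-party packages that must be installed separately.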

4. Implementation Details

1. Import the necessary libraries:

import requests
from lxml import etree
from pprint import pprint

2. Define a class with an __init__ method to set the base URL and request headers.

3. Create a request function that sends an HTTP GET request and returns the response data.

4. Parse the returned HTML using lxml.etree and XPath expressions to locate the required fields.

5. Inspect the list page with the browser's developer tools (F12) to find the XPath of the attraction name links, then use it to extract each attraction's secondary (detail) page link.

6. For each secondary page, send another request, parse the content, and store the attraction name, opening time, highlighted review, and price in a dictionary, handling missing values with conditional checks.

7. Implement a main function that orchestrates the crawling process, iterates over the desired pages, and prints the collected data.
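Steps 2 to 7 can be sketched as one small class. This is a minimal outline under stated assumptions, not the author's original code: the class name TicketSpider, the method names, the User-Agent value, and above all the XPath expressions are placeholders. The real expressions have to be read from the page with F12 as described in step 5.

```python
import requests
from lxml import etree
from pprint import pprint
from urllib.parse import urljoin


class TicketSpider:
    """Sketch of the list page -> detail page crawl described in the steps."""

    # Hypothetical XPath expressions; replace them with the ones found via F12.
    LINK_XPATH = '//a[@class="title"]/@href'
    FIELD_XPATHS = {
        "name": "//h1/text()",
        "opening_time": '//*[@class="open-time"]/text()',
        "review": '//*[@class="review"]/text()',
        "price": '//*[@class="price"]/text()',
    }

    def __init__(self):
        # Step 2: base URL (pn is filled in per page) and request headers.
        self.url = ("https://go.hao123.com/ticket"
                    "?city=%E5%B9%BF%E5%B7%9E&theme=all&pn={}")
        self.headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value

    def get_html(self, url):
        # Step 3: send a GET request and return the response body as text.
        response = requests.get(url, headers=self.headers, timeout=10)
        response.raise_for_status()
        return response.text

    def parse_links(self, html):
        # Steps 4-5: collect the detail-page links from a list page.
        return etree.HTML(html).xpath(self.LINK_XPATH)

    def parse_detail(self, html):
        # Step 6: build one dictionary per attraction; a field that is
        # missing from the page is stored as None instead of raising.
        tree = etree.HTML(html)
        item = {}
        for field, xpath in self.FIELD_XPATHS.items():
            values = tree.xpath(xpath)
            item[field] = values[0].strip() if values else None
        return item

    def run(self, pages):
        # Step 7: walk the requested pages and print every attraction.
        for pn in range(1, pages + 1):
            list_url = self.url.format(pn)
            for link in self.parse_links(self.get_html(list_url)):
                detail_url = urljoin(list_url, link)  # handles relative links
                pprint(self.parse_detail(self.get_html(detail_url)))
```

Calling TicketSpider().run(2) would crawl the first two listing pages. Keeping the parsing methods separate from get_html means they can be tried on a saved HTML file before any request is sent.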

5. Result Demonstration

Running the script (for example, with the green Run button in the IDE) prompts for the number of pages to crawl, then prints the extracted information for each attraction to the console.
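The prompt returns text, so the value has to be converted before it can drive the page loop. The helper below is one possible way to do that; the function name parse_page_count and the default cap of 10 pages are assumptions chosen for this example.

```python
def parse_page_count(raw, max_pages=10):
    """Turn the text typed at the prompt into a safe page count."""
    try:
        pages = int(raw.strip())
    except ValueError:
        raise ValueError(f"expected a whole number, got {raw!r}") from None
    if pages < 1:
        raise ValueError("page count must be at least 1")
    # Cap the value so a typo such as 1000 does not trigger a huge crawl.
    return min(pages, max_pages)
```

A typical call would be parse_page_count(input("Number of pages to crawl: ")).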

6. Conclusion

• Avoid scraping excessive amounts of data to prevent server overload.
• This project helps users understand how to retrieve travel attraction information programmatically.
• The source code is simple, but hands‑on implementation deepens comprehension.
• Interested readers can request the full source by replying with the keyword “旅游”.
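On the first point, a common way to keep the load on the server low is to pause between requests. The sketch below is an illustration only: the function name polite_pause and the delay values are assumptions, and the sleep parameter exists so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_pause(base_delay=1.0, jitter=0.5, sleep=time.sleep):
    """Sleep for base_delay plus a random jitter and return the delay used."""
    delay = base_delay + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling polite_pause() after each page or detail request spaces the requests out by roughly one to one and a half seconds.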
