How to Scrape Hao123 Travel Site with Python: Step‑by‑Step Guide
This tutorial shows how to crawl the Hao123 travel website with Python's requests, lxml, and pprint libraries, extract each attraction's name, opening time, reviews, and price, and print the results. It includes code examples and practical tips.
1. Introduction
When traveling, we often want to know the attractions, prices, opening hours, and user reviews of a destination. This article uses Python web‑scraping techniques to obtain such information from the Hao123 travel website.
2. Project Goals
Extract each attraction's name, opening time, highlighted reviews, and price from the site.
3. Required Libraries and Target URL
The target URL is:
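https://go.hao123.com/ticket?city=%E5%B9%BF%E5%B7%9E&theme=all&pn=1
Here, city=%E5%B9%BF%E5%B7%9E is the URL-encoded form of 广州 (Guangzhou), and pn is the page number.
The required Python libraries are requests, lxml, and pprint. requests and lxml are third-party packages that must be installed; pprint ships with the Python standard library.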
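As a minimal sketch of how the listing URL can be assembled for any city and page, the standard library's urlencode produces the same percent-encoded query string. The helper name build_list_url and the BASE_URL constant are illustrative assumptions, not part of the original script.

```python
from urllib.parse import urlencode

# Path portion of the target URL quoted in the article.
BASE_URL = "https://go.hao123.com/ticket"


def build_list_url(city, page, theme="all"):
    """Hypothetical helper: build the listing URL for a city and page number."""
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII values such as "广州".
    return BASE_URL + "?" + urlencode({"city": city, "theme": theme, "pn": page})


first_page = build_list_url("广州", 1)
print(first_page)
```

Changing the second argument yields the URL for later pages, which is all a page loop needs.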
4. Implementation Details
1. Import the necessary libraries:
import requests
from lxml import etree
from pprint import pprint
2. Define a class with an __init__ method to set the base URL and request headers.
3. Create a request function that sends an HTTP GET request and returns the response data.
4. Parse the returned HTML using lxml.etree and XPath expressions to locate the required fields.
5. Extract the secondary page links for each attraction by inspecting the page with browser developer tools (F12) and locating the XPath for the attraction name links.
6. For each secondary page, send another request, parse the content, and store the attraction name, opening time, highlighted review, and price in a dictionary, handling missing values with conditional checks.
7. Implement a main function that orchestrates the crawling process, iterates over the desired pages, and prints the collected data.
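The steps above can be sketched roughly as follows. This is not the author's full source: the class name, the method names, the User-Agent value, and every XPath expression are placeholder assumptions, and the real expressions must be taken from the page with the browser developer tools as described in step 5.

```python
from pprint import pprint
from urllib.parse import urljoin

import requests
from lxml import etree


class Hao123TicketSpider:
    """Sketch of steps 2-7; all XPath strings below are hypothetical placeholders."""

    # Assumed XPath expressions: replace them with the ones found via F12.
    LINK_XPATH = '//a[@class="ticket-title"]/@href'
    FIELD_XPATHS = {
        "name": '//h1[@class="name"]/text()',
        "opening_time": '//div[@class="open-time"]/text()',
        "review": '//div[@class="highlight"]/text()',
        "price": '//span[@class="price"]/text()',
    }

    def __init__(self):
        # Step 2: base URL from the article; the header value is an assumption.
        self.url = "https://go.hao123.com/ticket?city=%E5%B9%BF%E5%B7%9E&theme=all&pn={}"
        self.headers = {"User-Agent": "Mozilla/5.0"}

    def get_html(self, url):
        # Step 3: send a GET request and return the page text.
        response = requests.get(url, headers=self.headers, timeout=10)
        response.raise_for_status()
        return response.text

    def parse_links(self, html):
        # Steps 4-5: collect the secondary (detail) page links from a listing page.
        return etree.HTML(html).xpath(self.LINK_XPATH)

    def parse_detail(self, html):
        # Step 6: xpath() returns a list, so an empty list means the field is missing.
        tree = etree.HTML(html)
        item = {}
        for field, path in self.FIELD_XPATHS.items():
            values = tree.xpath(path)
            item[field] = values[0].strip() if values else None
        return item

    def run(self, pages):
        # Step 7: walk the listing pages, then fetch and print each detail page.
        for pn in range(1, pages + 1):
            list_url = self.url.format(pn)
            for link in self.parse_links(self.get_html(list_url)):
                detail_url = urljoin(list_url, link)  # also handles relative links
                pprint(self.parse_detail(self.get_html(detail_url)))
```

Keeping the parsing methods separate from the network call means they can be tried on a saved HTML file first. A run would then look like Hao123TicketSpider().run(int(input("Pages to crawl: "))).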
5. Result Demonstration
Running the script (for example, with the Run button in the IDE) prompts for the number of pages to crawl, then prints the extracted information to the console.
6. Conclusion
• Avoid scraping excessive amounts of data to prevent server overload.
• This project helps users understand how to retrieve travel attraction information programmatically.
• The source code is simple, but hands‑on implementation deepens comprehension.
• Interested readers can request the full source by replying with the keyword “旅游” ("travel").
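On the first point, one simple way to limit server load is to pause between consecutive requests. The sketch below is an assumption about how that could be done, not part of the original script; crawl_politely and its parameters are hypothetical names, and fetch stands for whatever function downloads a page.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

A delay of about one second is only a starting point; a suitable value depends on the site and on how many pages are requested.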
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
