How to Scrape Shenzhen Bus Data with Python: Step‑by‑Step Guide
This tutorial walks you through using Python and BeautifulSoup to crawl the 8684.cn website, extract city‑specific bus route information, parse classification pages, and collect detailed schedule and station data, while providing reusable code snippets and data‑cleaning suggestions.
Overview
This article demonstrates how to build a Python web scraper for the public‑transport site 8684.cn, focusing on Shenzhen bus data. It explains how to construct URLs for city pages, navigate classification links, and retrieve detailed route information.
Step‑by‑Step Process
Generate the first‑level URL using the city name (e.g., https://shenzhen.8684.cn/).
Request the page and parse the HTML with BeautifulSoup.
Locate the bus‑layer div and extract the classification section (class pl10).
Identify classification titles (span class="kt") and their corresponding link containers (div class="list").
Build second‑level URLs for each bus category and request those pages.
From each category page, find the list of routes (div class="list clearfix") and collect the route name, href, and title.
Construct third‑level URLs for individual routes and scrape the detailed information, including operating hours, fare, company, update time, and station coordinates.
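The three URL levels described above can be sketched with the standard library alone. This is a minimal illustration, assuming every city lives on its own subdomain as in the Shenzhen example; the helper name city_base_url and the two href values are hypothetical placeholders, not paths taken from the site.

```python
from urllib.parse import urljoin


def city_base_url(city: str) -> str:
    """First-level URL: assumes the city name is used as the subdomain."""
    return f"https://{city}.8684.cn/"


base = city_base_url("shenzhen")

# Hypothetical hrefs, standing in for values read from <a> tags.
category_url = urljoin(base, "/list1")     # second level: one bus category
route_url = urljoin(base, "/x_0000000")    # third level: one route page

print(base)          # https://shenzhen.8684.cn/
print(category_url)  # https://shenzhen.8684.cn/list1
print(route_url)     # https://shenzhen.8684.cn/x_0000000
```

Using urljoin instead of string concatenation keeps the result correct whether the href is relative or already absolute.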
Core Code Snippets
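The snippets in this section call get_ua(), which the article never defines. A minimal sketch, assuming it simply returns a random browser User-Agent string; the USER_AGENTS list and its entries are illustrative and not taken from the original scraper.

```python
import random

# Illustrative User-Agent strings (assumption); swap in current ones as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def get_ua() -> str:
    """Return one User-Agent string chosen at random from USER_AGENTS."""
    return random.choice(USER_AGENTS)
```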
import requests
from bs4 import BeautifulSoup

# get_ua() is not defined in this article; it is assumed to be a helper
# that returns a User-Agent string.
url = 'https://shenzhen.8684.cn/'
response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)

# Parse classification data
soup = BeautifulSoup(response.text, 'lxml')
soup_buslayer = soup.find('div', class_='bus-layer depth w120')
dic_result = {}  # category name -> second-level URL
soup_buslist = soup_buslayer.find_all('div', class_='pl10')
for soup_bus in soup_buslist:
    name = soup_bus.find('span', class_='kt').get_text()
    if '线路分类' in name:  # the "route classification" section
        soup_a_list = soup_bus.find('div', class_='list')
        for soup_a in soup_a_list.find_all('a'):
            text = soup_a.get_text()
            href = soup_a.get('href')
            dic_result[text] = "https://shenzhen.8684.cn" + href
print(dic_result)

# Visit each category page and collect its routes
bus_arr = []  # rows of [title, route name, third-level URL]
for key, value in dic_result.items():
    print('key: ', key, 'value: ', value)
    response = requests.get(url=value, headers={'User-Agent': get_ua()}, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    soup_buslist = soup.find('div', class_='list clearfix')
    for soup_a in soup_buslist.find_all('a'):
        text = soup_a.get_text()
        href = soup_a.get('href')
        title = soup_a.get('title')
        bus_arr.append([title, text, "https://shenzhen.8684.cn" + href])
print(bus_arr)

Data Cleaning Recommendations
Split the operating‑time column into earliest departure, latest departure, start station, and end station.
Extract numeric values from the fare column.
Parse the last‑update column into a proper timestamp.
Analyze the station‑info column to count total stations, detect circular routes, and list traversed districts.
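The recommendations above might be sketched as follows. The article does not show the raw field formats, so every sample string here is hypothetical, and the function names and regular expressions are assumptions that would need adjusting to the text actually scraped. Listing traversed districts is left out because it needs an external station-to-district lookup.

```python
import re
from datetime import datetime

# Assumed shape of one direction: "<station> HH:MM-HH:MM", directions joined by "|".
TIME_SEGMENT = re.compile(
    r"(?P<station>.+?)\s*(?P<first>\d{1,2}:\d{2})\s*-\s*(?P<last>\d{1,2}:\d{2})"
)


def _minutes(hhmm):
    """Convert 'HH:MM' to minutes after midnight for comparison."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def split_operating_time(raw):
    """Return start/end station and earliest/latest departure, or None."""
    segments = [TIME_SEGMENT.search(part) for part in raw.split("|")]
    segments = [m for m in segments if m]
    if not segments:
        return None
    return {
        "start_station": segments[0]["station"].strip(),
        "end_station": segments[-1]["station"].strip(),
        "earliest_departure": min((m["first"] for m in segments), key=_minutes),
        "latest_departure": max((m["last"] for m in segments), key=_minutes),
    }


def extract_fares(raw):
    """Return every number found in the fare text as a float."""
    return [float(value) for value in re.findall(r"\d+(?:\.\d+)?", raw)]


def parse_update_time(raw):
    """Parse a 'YYYY-MM-DD' date, with optional 'HH:MM:SS', found in the text."""
    match = re.search(r"\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2}:\d{2})?", raw)
    if not match:
        return None
    text = match.group(0)
    fmt = "%Y-%m-%d %H:%M:%S" if " " in text else "%Y-%m-%d"
    return datetime.strptime(text, fmt)


def summarize_stations(stations):
    """Count stations and flag a route whose first and last stops match."""
    return {
        "count": len(stations),
        "is_circular": len(stations) > 1 and stations[0] == stations[-1],
    }
```

For example, split_operating_time("Station A 06:30-22:00|Station B 07:00-23:00") yields Station A and Station B as the terminals, with 06:30 and 23:00 as the earliest and latest departures.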
Conclusion
The scraper, although a bit verbose, provides a solid foundation for extracting bus route data and can be extended to capture station coordinates or integrated into larger data‑analysis pipelines. Remember to respect the target site by adding reasonable delays and following ethical scraping practices.
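One way to add the delays mentioned above is a small throttle called before each request. This is only a sketch: the Throttle class and the 1 to 3 second default range are assumptions, not values from the original scraper or from the site's policies.

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between consecutive calls to wait()."""

    def __init__(self, min_delay=1.0, max_delay=3.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep    # injectable so the class can be tested without waiting
        self._clock = clock
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        """Sleep until a random delay in [min_delay, max_delay] has passed."""
        if self._last is not None:
            target = random.uniform(self.min_delay, self.max_delay)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Usage sketch: call throttle.wait() immediately before each requests.get(...).
throttle = Throttle()
```

The first call returns immediately; later calls pause only for whatever part of the chosen delay has not already elapsed.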
Python Crawling & Data Mining
