How to Scrape Shenzhen Bus Data with Python: Step‑by‑Step Guide

This tutorial shows how to crawl 8684.cn with Python, requests, and BeautifulSoup: build the city-specific URL, parse the route-classification pages, and collect each route's schedule and station details. It includes reusable code snippets and data-cleaning suggestions.

Python Crawling & Data Mining

Mar 31, 2022

Overview

This article demonstrates how to build a Python web scraper for the public-transport site 8684.cn, focusing on Shenzhen bus data. It explains how to construct the URL for a city page, follow the classification links, and retrieve detailed route information.

Step‑by‑Step Process

Generate the first‑level URL using the city name (e.g., https://shenzhen.8684.cn/).

Request the page and parse the HTML with BeautifulSoup.

Locate the bus-layer div (class="bus-layer depth w120") and collect the classification sections inside it (div class="pl10").

Identify classification titles (span class="kt") and their corresponding link containers (div class="list").

Build second‑level URLs for each bus category and request those pages.

From each category page, find the list of routes (div class="list clearfix") and collect the route name, href, and title.

Construct third-level URLs for individual routes and scrape the detailed information, including operating hours, fare, company, update time, and station information.
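The three URL levels described above can be sketched as follows. This is a minimal illustration, not the author's code: the helper names build_city_url and build_child_url are hypothetical, and the two href values are made-up placeholders standing in for whatever the site actually returns.

```python
from urllib.parse import urljoin


def build_city_url(city: str) -> str:
    """Level 1: the city's home page, with the city name as the subdomain."""
    return f"https://{city}.8684.cn/"


def build_child_url(city: str, href: str) -> str:
    """Levels 2 and 3: resolve an href scraped from a page against the city root."""
    return urljoin(build_city_url(city), href)


city_url = build_city_url("shenzhen")
# Hypothetical hrefs, for illustration only; real values come from the scraped <a> tags.
category_url = build_child_url("shenzhen", "/example_category")
route_url = build_child_url("shenzhen", "/example_route")
print(city_url)
print(category_url)
print(route_url)
```

Using urljoin instead of plain string concatenation means the result is the same whether the href starts with a slash or not.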

Core Code Snippets

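import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://shenzhen.8684.cn'


def get_ua():
    # Placeholder for the helper the original snippet calls but never defines;
    # assumed to return a browser-like User-Agent string.
    return 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


# Level 1: request the city home page
url = BASE_URL + '/'
response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)
response.raise_for_status()

# Parse classification data
soup = BeautifulSoup(response.text, 'lxml')
soup_buslayer = soup.find('div', class_='bus-layer depth w120')

# Map each category name to its second-level URL
dic_result = {}
soup_buslist = soup_buslayer.find_all('div', class_='pl10')
for soup_bus in soup_buslist:
    soup_title = soup_bus.find('span', class_='kt')
    # Keep only the section titled '线路分类' ("route classification")
    if soup_title is not None and '线路分类' in soup_title.get_text():
        soup_a_list = soup_bus.find('div', class_='list')
        for soup_a in soup_a_list.find_all('a'):
            text = soup_a.get_text()
            href = soup_a.get('href')
            dic_result[text] = BASE_URL + href

print(dic_result)

# Level 2: collect [title, name, third-level URL] for each route in every category
bus_arr = []

for key, value in dic_result.items():
    print('key: ', key, 'value: ', value)
    response = requests.get(url=value, headers={'User-Agent': get_ua()}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    soup_buslist = soup.find('div', class_='list clearfix')
    if soup_buslist is None:  # skip category pages without a route list
        continue
    for soup_a in soup_buslist.find_all('a'):
        text = soup_a.get_text()
        href = soup_a.get('href')
        title = soup_a.get('title')
        bus_arr.append([title, text, BASE_URL + href])

print(bus_arr)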

Data Cleaning Recommendations

Split the operating‑time column into earliest departure, latest departure, start station, and end station.

Extract numeric values from the fare column.

Parse the last‑update column into a proper timestamp.

Analyze the station‑info column to count total stations, detect circular routes, and list traversed districts.
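The recommendations above could be approached along these lines. The article does not show the raw column values, so every input format here is an assumption: operating time as "<station> HH:MM-HH:MM | <station> HH:MM-HH:MM", update time containing "YYYY-MM-DD HH:MM:SS", and station info already split into an ordered list of names. All function names are hypothetical, and the district listing is left out because it would need a station-to-district lookup that the source does not provide.

```python
import re
from datetime import datetime

# Assumed layout of one direction: "<station name> HH:MM-HH:MM"
TIME_SEGMENT = re.compile(r"(.+?)\s*(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})$")


def _minutes(hhmm):
    """Convert 'H:MM' or 'HH:MM' to minutes after midnight for comparison."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def split_operating_time(text):
    """Split the assumed operating-time format into four fields, or return None."""
    parsed = []
    for segment in text.split("|"):
        match = TIME_SEGMENT.match(segment.strip())
        if match is None:
            return None  # unexpected layout: leave the row for manual review
        parsed.append(match.groups())
    return {
        "start_station": parsed[0][0],
        "end_station": parsed[-1][0],
        "earliest_departure": min((p[1] for p in parsed), key=_minutes),
        "latest_departure": max((p[2] for p in parsed), key=_minutes),
    }


def extract_fares(text):
    """Return every number found in the fare text as a float."""
    return [float(value) for value in re.findall(r"\d+(?:\.\d+)?", text)]


def parse_update_time(text):
    """Parse an assumed 'YYYY-MM-DD HH:MM:SS' timestamp embedded in the text."""
    match = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", text)
    if match is None:
        return None
    return datetime.strptime(match.group(), "%Y-%m-%d %H:%M:%S")


def station_stats(stations):
    """Count stations and flag routes whose first and last stop coincide."""
    return {
        "station_count": len(stations),
        "is_circular": len(stations) > 1 and stations[0] == stations[-1],
    }


# Made-up sample values, for illustration only
print(split_operating_time("Station A 06:30-22:30 | Station B 6:00-22:00"))
print(extract_fares("2-10 yuan"))
print(parse_update_time("Last updated: 2022-03-15 10:20:30"))
print(station_stats(["A", "B", "C", "A"]))
```

Returning None for rows that do not match the expected layout keeps malformed values visible for review instead of silently producing wrong fields.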

Conclusion

The scraper, although a bit verbose, provides a solid foundation for extracting bus route data and can be extended to capture station coordinates or integrated into larger data‑analysis pipelines. Remember to respect the target site by adding reasonable delays and following ethical scraping practices.
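One way to add such delays is to route every request through a small helper that pauses between calls. The sketch below is an assumption-laden illustration: fetch_politely is a hypothetical name, the 1 to 3 second range is an arbitrary choice rather than a figure from the site or the article, and the fetch callable is injected so the pacing logic can be exercised without touching the network.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests.

    fetch is any callable that takes a URL and returns the page content.
    sleep is injectable so the pacing can be tested without real waiting.
    """
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            # No pause before the first request, a random pause before each later one
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

In the scraper, fetch could wrap the existing call, for example lambda u: requests.get(u, headers={'User-Agent': get_ua()}, timeout=10).text, with the category or route URLs passed as urls.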
