Master Python Web Scraping: Extract Internship Data from Shixi.com
This tutorial walks beginners through the complete process of building a Python web scraper to collect internship listings from Shixi.com, covering page analysis, static vs dynamic detection, data location with XPath, pagination handling, and storing results in a CSV file using pandas.
1. Introduction
This article is aimed at beginners who want to learn how to write a Python web crawler to fetch internship data from the “Shixi” website.
2. Page Analysis
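The target site is https://www.shixi.com/search/index?key=数据分析 (the search keyword 数据分析 means "data analysis"). The tutorial shows the main (first‑level) page and the detail (second‑level) page, and lists the fields to extract: company name, job title, address, degree requirement, salary, job demand, company field, and company size.
It explains how to determine whether a page is static or dynamic by viewing the page source: if the listing data appears in the raw source, the page is static; if it is missing there and only shows up in the rendered page, it is loaded dynamically.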
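The same check can be automated. The following is a minimal sketch, assuming you already have the raw HTML as a string (for example from `requests.get(url).text`); the helper name `looks_static` and both sample strings are illustrative and do not come from the original script.

```python
def looks_static(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is present in the raw HTML.

    If the text is missing from the raw source, the page most likely
    fills it in with JavaScript, i.e. it is a dynamic page.
    """
    return visible_text in raw_html


# Hypothetical raw sources standing in for a downloaded page.
static_source = '<div class="job-pannel-one"><a>Example Co.</a></div>'
dynamic_source = '<div id="app"></div><script src="/bundle.js"></script>'

static_result = looks_static(static_source, "Example Co.")
dynamic_result = looks_static(dynamic_source, "Example Co.")
print(static_result, dynamic_result)
```

A substring test like this is only a heuristic: it works when you pick a value that is clearly visible in the rendered listing, such as a company name.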
3. Locating Data
Data can be located using XPath, regular expressions, BeautifulSoup or pyquery; the example uses XPath. The steps to open the browser’s developer tools (right‑click → Inspect or press F12) are described.
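To illustrate the XPath approach before touching the live site, here is a small sketch that runs two expressions against an inline HTML fragment. The fragment, job titles and `/example/...` links are made up for illustration; only the class name `job-pannel-list` and the `dt/a` structure mirror the expressions used later in the tutorial.

```python
from lxml import etree

# Hypothetical fragment shaped like the listing markup the tutorial targets.
sample_html = """
<div class="job-pannel-list">
  <div class="job-pannel-one">
    <dl><dt><a href="/example/1">Data Analyst Intern</a></dt></dl>
  </div>
  <div class="job-pannel-one">
    <dl><dt><a href="/example/2">BI Intern</a></dt></dl>
  </div>
</div>
"""

tree = etree.HTML(sample_html)

# text() returns the link text, @href returns the attribute value.
titles = tree.xpath('//div[@class="job-pannel-list"]//dt/a/text()')
links = tree.xpath('//div[@class="job-pannel-list"]//dt/a/@href')

print(titles)
print(links)
```

Both calls return plain Python lists, one entry per matching node, which is why the scraper can later combine them column by column.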
4. Scraper Code Explanation
The script imports pandas, requests, chardet, re, lxml.etree, time and warnings. It disables SSL certificate verification (verify=False on each request) and suppresses warning messages.
import pandas as pd
import requests
import chardet
import re
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")
The script then requests the first‑level page, extracts five fields with XPath, builds the list of second‑level URLs, requests each detail page and extracts three additional fields.
# request first‑level page
url = 'https://www.shixi.com/search/index?key=数据分析&districts=&education=0&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=&lang=zh_cn'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
rqg = requests.get(url, headers=headers, verify=False)
rqg.encoding = chardet.detect(rqg.content)['encoding']
html = etree.HTML(rqg.text)
# extract fields with XPath
company_list = html.xpath('//div[@class="job-pannel-list"]//div[@class="job-pannel-one"]//a/text()')
# ... similar code for job_list, address_list, degree_list, salary_list
# get second‑level URLs
deep_url_list = html.xpath('//div[@class="job-pannel-list"]//dt/a/@href')
deep_url_list = ["https://www.shixi.com" + i for i in deep_url_list]
# request each detail page and collect the three extra fields
demand_list, area_list, scale_list = [], [], []
for deep_url in deep_url_list:
    rqg = requests.get(deep_url, headers=headers, verify=False)
    rqg.encoding = chardet.detect(rqg.content)['encoding']
    html = etree.HTML(rqg.text)
    demand = html.xpath('//div[@class="container-fluid"]//div[@class="intros"]/span[2]/text()')
    area = html.xpath('//div[@class="container-fluid"]//div[@class="detail-intro-title"]//p[1]/span/text()')
    scale = html.xpath('//div[@class="container-fluid"]//div[@class="detail-intro-title"]//p[2]/span/text()')
    demand_list.append(demand)
    area_list.append(area)
    scale_list.append(scale)
Pagination is handled by constructing URLs with a page parameter and looping from 1 to 60.
x = "https://www.shixi.com/search/index?key=数据分析&page="
url_list = [x + str(i) for i in range(1, 61)]
All collected data are assembled into a pandas DataFrame and saved as a CSV file.
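data = pd.DataFrame({
    '公司名': company_list,      # company name
    '岗位名': job_list,          # job title
    '地址': address_list,        # address
    '学历': degree_list,         # degree requirement
    '薪资': salary_list,         # salary
    '岗位需求量': demand_list,   # job demand
    '公司领域': area_list,       # company field
    '公司规模': scale_list       # company size
})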
data.to_csv('aliang.csv', encoding='utf_8_sig')
5. Result
The script outputs a CSV file containing the eight extracted fields for each internship listing.
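One practical caveat: `pd.DataFrame` raises a `ValueError` when the column lists differ in length, which can happen if a listing omits a field. The sketch below shows one possible way to keep rows aligned by normalizing each XPath result list to a single string per listing. The helper names `first_or_blank` and `build_row` and the sample values are assumptions for illustration, not part of the original script.

```python
def first_or_blank(values):
    """Collapse an XPath result list to one stripped string ('' if empty)."""
    return values[0].strip() if values else ""


def build_row(fields):
    """Turn {column: xpath_result_list} into {column: single string}."""
    return {column: first_or_blank(values) for column, values in fields.items()}


# Hypothetical per-listing XPath results; the second listing has no salary.
listings = [
    {"company": ["  Example Co. "], "salary": ["200/day"]},
    {"company": ["Sample Ltd."], "salary": []},
]

rows = [build_row(fields) for fields in listings]
print(rows)
```

Because `pd.DataFrame` also accepts a list of dictionaries, `pd.DataFrame(rows)` could replace the column-by-column construction, with a missing field becoming an empty cell instead of shifting later rows.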