Master Python Web Scraping: Extract Internship Data from Shixi.com

This tutorial walks beginners through the complete process of building a Python web scraper to collect internship listings from Shixi.com, covering page analysis, static vs dynamic detection, data location with XPath, pagination handling, and storing results in a CSV file using pandas.

Python Crawling & Data Mining

May 26, 2021

1. Introduction

This article is for beginners who want to write a Python web crawler that fetches internship data from Shixi.com.

2. Page Analysis

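The target page is the search-results URL https://www.shixi.com/search/index?key=数据分析, where the key value 数据分析 means "data analysis". The tutorial shows the main (first-level) listing page and the detail (second-level) page, and lists the eight fields to extract: company name, job title, address, degree requirement, and salary from the listing page; number of openings (job demand), company field, and company size from the detail page.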

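To tell whether a page is static or dynamic, view the page source: if the data visible in the browser (for example, a company name) appears in the raw HTML, the page is static and can be fetched with a plain HTTP request; if it is missing, the content is loaded later by JavaScript.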
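
As a rough sketch of that check in code: take a few strings you can see in the rendered page and test whether they occur in the raw HTML the server returns. The function name `looks_static` and both sample sources are illustrative assumptions, not part of the original tutorial.

```python
def looks_static(raw_html, visible_snippets):
    """Return True if every snippet seen in the browser also appears in the raw HTML.

    If the text is missing from the raw source, the page is probably filled in
    by JavaScript (dynamic), and a plain HTTP request will not see the data.
    """
    return all(snippet in raw_html for snippet in visible_snippets)


# Hypothetical page sources, standing in for what "View page source" would show.
static_source = '<div class="job"><a>Data Analyst Intern</a></div>'
dynamic_source = '<div id="app"></div><script src="/bundle.js"></script>'

print(looks_static(static_source, ["Data Analyst Intern"]))   # True
print(looks_static(dynamic_source, ["Data Analyst Intern"]))  # False
```

In practice `raw_html` would be the text of the response to a request for the search URL, and the snippets would be values copied from the page as displayed in the browser.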

3. Locating Data

Data can be located using XPath, regular expressions, BeautifulSoup or pyquery; the example uses XPath. The steps to open the browser’s developer tools (right‑click → Inspect or press F12) are described.
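
To make the XPath approach concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live site. The markup, class names, and link paths are invented for illustration; the real page uses its own structure, which you read off in the developer tools.

```python
from lxml import etree

# Hypothetical markup, loosely modelled on a listing card; not copied from the site.
sample = """
<div class="job-list">
  <div class="job-card">
    <h3><a href="/jobs/1">Data Analyst Intern</a></h3>
    <span class="company">Example Co.</span>
  </div>
  <div class="job-card">
    <h3><a href="/jobs/2">BI Intern</a></h3>
    <span class="company">Sample Ltd.</span>
  </div>
</div>
"""

html = etree.HTML(sample)

# text() selects the text inside a node, @href selects an attribute value.
titles = html.xpath('//div[@class="job-card"]//h3/a/text()')
links = html.xpath('//div[@class="job-card"]//h3/a/@href')
companies = html.xpath('//div[@class="job-card"]/span[@class="company"]/text()')

print(titles)
print(links)
print(companies)
```

Each call returns a list with one entry per matching node, which is why the scraper ends up with parallel lists for each field.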

4. Scraper Code Explanation

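The script imports pandas, requests, chardet, re, lxml.etree, time and warnings. Requests are sent with verify=False, which turns off SSL certificate verification, and warnings.filterwarnings("ignore") silences the warning messages this would otherwise produce.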

import pandas as pd
import requests
import chardet
import re
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")

It defines a function to request the first-level page, extract the five listing fields (company name, job title, address, degree requirement, salary) with XPath, build the list of second-level URLs, then request each detail page and extract the three additional fields.

# request first‑level page
url = 'https://www.shixi.com/search/index?key=数据分析&districts=&education=0&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=&lang=zh_cn'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
rqg = requests.get(url, headers=headers, verify=False)
rqg.encoding = chardet.detect(rqg.content)['encoding']
html = etree.HTML(rqg.text)

# extract fields with XPath
company_list = html.xpath('//div[@class="job-pannel-list"]//div[@class="job-pannel-one"]//a/text()')
# ... similar code for job_list, address_list, degree_list, salary_list

# get second‑level URLs
deep_url_list = html.xpath('//div[@class="job-pannel-list"]//dt/a/@href')
deep_url_list = ["https://www.shixi.com" + i for i in deep_url_list]

# one entry per detail page, in the same order as deep_url_list
demand_list, area_list, scale_list = [], [], []

# request each detail page
for deep_url in deep_url_list:
    rqg = requests.get(deep_url, headers=headers, verify=False)
    rqg.encoding = chardet.detect(rqg.content)['encoding']
    html = etree.HTML(rqg.text)
    # xpath() returns a list of matching strings (empty if nothing matches)
    demand = html.xpath('//div[@class="container-fluid"]//div[@class="intros"]/span[2]/text()')
    area = html.xpath('//div[@class="container-fluid"]//div[@class="detail-intro-title"]//p[1]/span/text()')
    scale = html.xpath('//div[@class="container-fluid"]//div[@class="detail-intro-title"]//p[2]/span/text()')
    demand_list.append(demand)
    area_list.append(area)
    scale_list.append(scale)
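
Because `html.xpath(...)` returns a list, the loop above stores a list (possibly empty) for each detail field, and the string concatenation used for the second-level URLs assumes every href is relative. The two helpers below are a hedged sketch of one way to tidy both; `first_text`, `absolute_url`, and the `/jobs/123` path are hypothetical names introduced here, not part of the original script.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.shixi.com"


def first_text(xpath_result, default=""):
    """Collapse an XPath result list to its first non-empty, stripped string."""
    for item in xpath_result:
        text = str(item).strip()
        if text:
            return text
    return default


def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the site root; absolute hrefs pass through."""
    return urljoin(base, href)


print(first_text(["  ", " 3人 "]))   # 3人
print(first_text([]))                # prints an empty line
print(absolute_url("/jobs/123"))     # https://www.shixi.com/jobs/123
```

With these, a line such as `demand_list.append(first_text(demand))` would store one plain string per listing, so every column in the final table holds scalars instead of lists.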

Pagination is handled by appending a page parameter to the search URL and generating one URL per page, from page 1 through page 60 (range(1, 61) stops before 61).

x = "https://www.shixi.com/search/index?key=数据分析&page="
url_list = [x + str(i) for i in range(1, 61)]
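
A slightly more defensive variant is sketched below, using only the standard library: `urlencode` percent-encodes the Chinese keyword, and a small loop pauses between requests. The names `build_page_urls` and `crawl_pages` and the one-second delay are assumptions made for illustration; the original script only shows the list comprehension above.

```python
import time
from urllib.parse import urlencode

SEARCH_URL = "https://www.shixi.com/search/index"


def build_page_urls(keyword, first_page=1, last_page=60):
    """Build one search URL per page; urlencode percent-encodes the keyword."""
    urls = []
    for page in range(first_page, last_page + 1):
        query = urlencode({"key": keyword, "page": page})
        urls.append(f"{SEARCH_URL}?{query}")
    return urls


def crawl_pages(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL in order, pausing between requests.

    `fetch` is any callable that takes a URL and returns parsed results, for
    example a wrapper around the first-level request code shown earlier.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause so the server is not hammered
        results.append(fetch(url))
    return results


urls = build_page_urls("数据分析", last_page=3)
print(len(urls))
print(urls[0])
```

Passing the fetch step in as a function keeps the paging logic separate from the request and parsing code, which also makes it easy to try the loop on a handful of pages before running all 60.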

All collected data are assembled into a pandas DataFrame and saved as a CSV file.

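data = pd.DataFrame({
    '公司名': company_list,      # company name
    '岗位名': job_list,          # job title
    '地址': address_list,        # address
    '学历': degree_list,         # degree requirement
    '薪资': salary_list,         # salary
    '岗位需求量': demand_list,    # number of openings
    '公司领域': area_list,        # company field
    '公司规模': scale_list        # company size
})
# utf_8_sig writes a byte-order mark so Excel displays the Chinese headers correctly
data.to_csv('aliang.csv', encoding='utf_8_sig')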
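
Building a DataFrame from a dict of lists requires every list to have the same length, so a single listing with a missing field can make the constructor raise a ValueError. One alternative, sketched here with invented sample values, is to collect one dict per listing and let pandas fill the gaps with NaN; this is an assumption about how the script could be restructured, not what the original does.

```python
import pandas as pd

# Hypothetical records; in the scraper each dict would be filled per listing.
records = [
    {"公司名": "Example Co.", "岗位名": "Data Analyst Intern", "薪资": "200/day"},
    {"公司名": "Sample Ltd.", "岗位名": "BI Intern"},  # salary missing on the page
]

# Keys missing from a record become NaN in that row.
data = pd.DataFrame(records)

# With no path given, to_csv returns the CSV text; index=False drops the
# row-number column.
csv_text = data.to_csv(index=False)

print(data.shape)
print(data["薪资"].isna().tolist())
print(csv_text)
```

To write a file as the script above does, pass a path and `encoding='utf_8_sig'` to `to_csv`.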

5. Result

The script outputs a CSV file containing the eight extracted fields for each internship listing.
