How to Build a Python Web Scraper for Job Listings and Bypass Anti‑Scraping Measures
This tutorial explains how to crawl 58.com job listings with Python, extract location, company, and salary information, handle anti‑scraping defenses using realistic headers and random User‑Agents, and save the results into a text file.
Introduction
During the pandemic, finding a good job became harder, and many people turned to online recruitment platforms. However, the listings on sites like 58.com are often messy and incomplete, so it helps to automate the data extraction.
Project Goal
The goal is to crawl recruitment listings from 58.com, extract the location, company name, and salary, and save the results to a txt file.
Preparation
Required software: PyCharm. Required libraries: requests, lxml, fake_useragent. Target URL pattern:
https://gz.58.com/job/pn2/?param7503=1&from=yjz2_zhaopin&PGTID=0d302408-0000-3efd-48f6-ff64d26b4b1c&ClickID={}
where {} is the page number.
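As a minimal sketch of how this template might be expanded, the snippet below fills the placeholder for a range of pages with str.format. The helper name build_page_urls and its start/end parameters are hypothetical; only the URL itself is copied from the article.

```python
# URL template copied from the article; {} is the slot filled in per page.
URL_TEMPLATE = (
    "https://gz.58.com/job/pn2/?param7503=1&from=yjz2_zhaopin"
    "&PGTID=0d302408-0000-3efd-48f6-ff64d26b4b1c&ClickID={}"
)


def build_page_urls(start, end):
    """Return one URL per page number in the inclusive range start..end.

    build_page_urls is a hypothetical helper, not part of the original script.
    """
    return [URL_TEMPLATE.format(page) for page in range(start, end + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```

Note that the template keeps the pn2 path segment fixed and varies only ClickID, exactly as the article does; whether the site actually uses that parameter to select the page is not verified here.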
Anti‑scraping Measures
The site blocks requests that lack proper headers and bans IPs after repeated requests. To reduce the chance of being blocked, the script sets realistic HTTP headers and uses fake_useragent to generate a random User‑Agent string for each request.
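The idea of sending realistic headers with a rotating User‑Agent can be sketched with the standard library alone. This is an illustration under stated assumptions, not the article's code: the small hard-coded USER_AGENTS pool stands in for fake_useragent, and the Accept, Accept-Language, and Referer values are guesses at what a browser might send.

```python
import random

# Small illustrative pool; the article draws these from fake_useragent instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(rng=random):
    """Return a header dict with a randomly chosen User-Agent.

    build_headers is a hypothetical helper; every header value other than
    the User-Agent is an assumption, not taken from the article.
    """
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Referer": "https://gz.58.com/job/",
    }


if __name__ == "__main__":
    print(build_headers()["User-Agent"])
```

A dict built this way can be passed as the headers argument of requests.get. Rotating the User‑Agent does not change the client's IP address, so it is unlikely to help against IP-based bans on its own.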
Implementation
1. Define a Zhaopin class with __init__ storing the URL and a main method.
import requests
from lxml import etree
from fake_useragent import UserAgent


class Zhaopin(object):
    def __init__(self):
        # URL template; {} is filled in with the page number.
        self.url = "https://gz.58.com/job/pn2/?param7503=1&from=yjz2_zhaopin&PGTID=0d302408-0000-3efd-48f6-ff64d26b4b1c&ClickID={}"

    def main(self):
        pass


if __name__ == '__main__':
    spider = Zhaopin()
    spider.main()

2. Generate random headers:
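        # Inside main(): create the generator once, then pick a fresh
        # random User-Agent for every page.
        ua = UserAgent()
        for i in range(1, 50):
            self.headers = {
                'User-Agent': ua.random,
            }

3. Fetch a page: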
    def get_page(self, url):
        # Request the page with the current headers and decode the body as UTF-8.
        res = requests.get(url=url, headers=self.headers)
        html = res.content.decode("utf-8")
        return html

4. Parse the HTML with XPath to locate job listings:
    def page_page(self, html):
        parse_html = etree.HTML(html)
        # Each <li> under the left-hand list is one job listing.
        one = parse_html.xpath('//div[@class="main clearfix"]//div[@class="leftCon"]/ul/li')

5. Iterate over each node, extract job title, category, salary, and company, format the result, and write it to a file:
        for l in one:
            # o and t: text of the link's first span and of its "name" span.
            o = l.xpath('.//a/span[1]/text()')[0].strip()
            t = l.xpath('.//a//span[@class="name"]/text()')[0].strip()
            # f: list of salary strings; thr: company name.
            f = l.xpath('.//p[@class="job_salary"]/text()')
            thr = l.xpath('.//div[@class="comp_name"]//a/text()')[0].strip()
            for e in f:
                boss = (
                    "\n%s:||%s:\n"
                    "Company: %s,\n"
                    "Salary: %s yuan/month\n"
                    "=========================================================\n"
                ) % (o, t, thr, e)
                print(boss)
                # Append to g.txt; a separate name keeps the salary list f intact.
                with open('g.txt', 'a', encoding='utf-8') as out:
                    out.write(boss)
                    out.write("\n")

6. Call the methods to run the crawler:
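            # Inside the loop in main(): fill in the page number, then fetch and parse.
            url = self.url.format(i)
            html = self.get_page(url)
            self.page_page(html)

Result Demonstration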
After entering the start and end pages, the program prints the extracted information in the console and saves it to g.txt. The following screenshots show the input UI, console output, the saved txt file, and its content.
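To make the saved format concrete, here is a small sketch of the formatting and appending step, isolated from the network code so that it can be run on its own. The function name append_records, the field names, and the sample values are all hypothetical; the layout is an English rendering of the record format used by the article's script.

```python
import os
import tempfile

# English rendering of the article's record layout; field names are hypothetical.
RECORD_TEMPLATE = (
    "\n{title}:||{category}:\n"
    "Company: {company},\n"
    "Salary: {salary} yuan/month\n"
    "=========================================================\n"
)


def append_records(path, records):
    """Append each record dict to the file at path; return how many were written."""
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(RECORD_TEMPLATE.format(**record))
            out.write("\n")
    return len(records)


if __name__ == "__main__":
    # Made-up placeholder values, not real scraped data.
    sample = [
        {"title": "Example title", "category": "Example category",
         "company": "Example Co.", "salary": "5000-8000"},
    ]
    target = os.path.join(tempfile.mkdtemp(), "g.txt")
    print(append_records(target, sample), "record(s) written to", target)
```

Opening the file once per batch, rather than once per record, keeps the number of open and close calls small when many listings are written.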
Conclusion
Do not scrape excessively to avoid overloading the server. The article demonstrates the main challenges of crawling a recruitment site and provides practical solutions for handling anti‑scraping mechanisms, string concatenation, and list type conversion. The code is simple and intended to help beginners get started.
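One way to act on the advice about not overloading the server is to pause for a random interval between requests. The sketch below is an assumption about how that could look, not something the original script does; crawl_politely, its parameters, and the default delays are hypothetical.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch is any callable that takes a URL and returns the page content.
    sleep is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With the article's class, the get_page method of a Zhaopin instance could be passed as fetch, provided its headers have been set beforehand.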
Python Crawling & Data Mining