How to Build a Python Web Scraper for Job Listings and Bypass Anti‑Scraping Measures

This tutorial explains how to crawl 58.com job listings with Python, extract location, company, and salary information, handle anti‑scraping defenses using realistic headers and random User‑Agents, and save the results into a text file.

Python Crawling & Data Mining

Jul 9, 2020

Introduction

During the pandemic, finding a good job became harder, and many people turned to online recruitment platforms. However, the listings on sites like 58.com are often messy and incomplete, so it helps to automate the data extraction.

Project Goal

The goal is to crawl recruitment listings from 58.com, extract the location, company name, and salary of each listing, and save the results to a txt file.

Preparation

Required software: PyCharm. Required libraries: requests, lxml, fake_useragent. Target URL pattern:

https://gz.58.com/job/pn2/?param7503=1&from=yjz2_zhaopin&PGTID=0d302408-0000-3efd-48f6-ff64d26b4b1c&ClickID={}

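where {} is the placeholder that the script fills with the page number. Note that in this pattern the placeholder sits in the ClickID query parameter, while the pn2 path segment is fixed.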
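As a minimal sketch of how the placeholder gets filled, Python's str.format can produce one URL per request. The names URL_PATTERN and build_urls are illustrative and do not appear in the original script, and the page range used here is an assumption.

```python
# URL pattern copied from the article; {} is filled in for each request.
URL_PATTERN = (
    "https://gz.58.com/job/pn2/?param7503=1&from=yjz2_zhaopin"
    "&PGTID=0d302408-0000-3efd-48f6-ff64d26b4b1c&ClickID={}"
)


def build_urls(start, end):
    """Return the formatted URLs for start..end inclusive (hypothetical helper)."""
    return [URL_PATTERN.format(n) for n in range(start, end + 1)]


if __name__ == "__main__":
    for url in build_urls(1, 3):
        print(url)
```

Building the list up front keeps the URL logic separate from the fetching code, which makes it easy to check the generated addresses before sending any request.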

Anti‑scraping Measures

The site rejects requests that lack browser-like headers and bans IP addresses that make too many requests. To work around this, the script sends realistic HTTP headers and uses fake_useragent to pick a random User-Agent string for each request.
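The idea of rotating User-Agent strings can be sketched with the standard library alone. This is not the article's code: it swaps fake_useragent for a small hard-coded pool so that it runs without extra dependencies. The strings in USER_AGENTS, the Accept-Language value, and the polite_pause helper are all assumptions added for illustration.

```python
import random
import time

# Hypothetical stand-in pool; the article draws these from fake_useragent instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]


def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",  # assumed value
    }


def polite_pause(min_s=1.0, max_s=3.0):
    """Sleep for a random interval so requests are not sent in a tight burst."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling build_headers() before each request and polite_pause() after it spreads the traffic out, which may also reduce the chance of an IP ban.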

Implementation

1. Define a Zhaopin class whose __init__ stores the URL pattern, plus a main method that serves as the entry point.

import requests
from lxml import etree
from fake_useragent import UserAgent


class Zhaopin(object):
    def __init__(self):
        # URL pattern; {} is filled in with str.format for each request
        self.url = "https://gz.58.com/job/pn2/?param7503=1&from=yjz2_zhaopin&PGTID=0d302408-0000-3efd-48f6-ff64d26b4b1c&ClickID={}"

    def main(self):
        # Entry point; the crawl loop is filled in by the following steps
        pass


if __name__ == '__main__':
    spider = Zhaopin()
    spider.main()

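2. Inside main, generate a random User-Agent header for each page:

def main(self):
    ua = UserAgent()  # source of real-world User-Agent strings
    for i in range(1, 50):
        # Pick a fresh random User-Agent for every request
        self.headers = {
            'User-Agent': ua.random,
        }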

3. Fetch a page:

def get_page(self, url):
    res = requests.get(url=url, headers=self.headers)
    html = res.content.decode("utf-8")
    return html

4. Parse the HTML with XPath to locate job listings:

def page_page(self, html):
    parse_html = etree.HTML(html)
    one = parse_html.xpath('//div[@class="main clearfix"]//div[@class="leftCon"]/ul/li')
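To see how such a path expression selects the listing nodes, here is a small sketch that runs on an inline snippet. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, rather than lxml, and it needs well-formed markup. The SAMPLE markup, the field names, and the parse_listings helper are hypothetical; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing markup; the real page differs.
SAMPLE = """
<html><body>
<div class="main clearfix">
  <div class="leftCon">
    <ul>
      <li>
        <a><span>Tianhe</span><span class="name">Sales Clerk</span></a>
        <p class="job_salary">5000-8000</p>
        <div class="comp_name"><a>Example Co.</a></div>
      </li>
      <li>
        <a><span>Haizhu</span><span class="name">Driver</span></a>
        <p class="job_salary">6000-9000</p>
        <div class="comp_name"><a>Sample Ltd.</a></div>
      </li>
    </ul>
  </div>
</div>
</body></html>
"""


def parse_listings(markup):
    """Return one dict per <li> found under the leftCon list."""
    root = ET.fromstring(markup.strip())
    rows = []
    for li in root.findall(".//div[@class='leftCon']/ul/li"):
        rows.append({
            "location": li.findtext("./a/span[1]").strip(),  # assumed meaning
            "title": li.findtext("./a/span[@class='name']").strip(),
            "salary": li.findtext("./p[@class='job_salary']").strip(),
            "company": li.findtext("./div[@class='comp_name']/a").strip(),
        })
    return rows


if __name__ == "__main__":
    for row in parse_listings(SAMPLE):
        print(row)
```

Returning a list of dicts instead of printing inside the parser keeps extraction separate from output, so the same rows can be printed, written to a file, or checked in a test.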

5. Iterate over each node, extract the two text fields from the listing link (the first span and the span with class "name"), the salary, and the company name, then format the result and append it to a file:

for l in one:
    o = l.xpath('.//a/span[1]/text()')[0].strip()
    t = l.xpath('.//a//span[@class="name"]/text()')[0].strip()
    # xpath returns a list here, so each salary entry is handled in turn
    salaries = l.xpath('.//p[@class="job_salary"]/text()')
    thr = l.xpath('.//div[@class="comp_name"]//a/text()')[0].strip()
    for e in salaries:
        boss = '''
 %s:||%s: 
 Company: %s,
 Salary: %s yuan/month
 =========================================================
''' % (o, t, thr, e)
        print(boss)
        # Append inside the loop so every record is saved, not only the last one
        with open('g.txt', 'a', encoding='utf-8') as out:
            out.write(boss)
            out.write("\n")
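The formatting and file-writing part of this step can also be pulled out into two small functions. The following is a sketch under assumptions: the helper names format_record and append_records, the field order, and the separator width are illustrative and not taken from the original script.

```python
import os
import tempfile


def format_record(location, title, company, salary):
    """Build one text block per listing (field order is an assumption)."""
    return (
        "%s | %s\n"
        "Company: %s\n"
        "Salary: %s yuan/month\n"
        "%s\n" % (location, title, company, salary, "=" * 40)
    )


def append_records(path, records):
    """Append every record to the file; 'with' closes it even on errors."""
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(record)
            out.write("\n")
```

For example, append_records("g.txt", [format_record("Tianhe", "Driver", "Example Co.", "6000-9000")]) would add one block to g.txt. Opening the file in append mode means repeated runs accumulate results rather than overwrite them.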

6. Inside the loop, build the URL and call the methods to run the crawler:

url = self.url.format(i)  # i is the loop counter from step 2
html = self.get_page(url)
self.page_page(html)
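The steps above can be tied together in a loop that tolerates individual page failures. This is a sketch, not the article's implementation: the crawl function is hypothetical, it takes the fetch and parse steps as arguments, and it assumes the parse step returns a list of records (the article's page_page prints and writes instead of returning).

```python
import time


def crawl(urls, fetch, parse, delay=0.0):
    """Fetch and parse each URL in turn; skip pages that raise an error."""
    results = []
    for url in urls:
        try:
            html = fetch(url)
        except Exception as exc:  # broad on purpose: any failure skips the page
            print("skipping %s: %s" % (url, exc))
            continue
        results.extend(parse(html))
        time.sleep(delay)  # pause between successful requests
    return results
```

Passing fetch and parse as parameters makes the loop easy to try out with stand-in functions before pointing it at the real site.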

Result Demonstration

After entering the start and end pages, the program prints the extracted information in the console and appends it to g.txt. Screenshots in the original article show the input UI, the console output, the saved txt file, and its content.

Conclusion

Avoid scraping excessively so that you do not overload the server. The article demonstrates the main challenges of crawling a recruitment site and offers practical solutions for handling anti‑scraping mechanisms, string concatenation, and list type conversion. The code is simple and intended to help beginners get started.
