How to Scrape Tencent Job Listings with Scrapy: Step-by-Step Guide

This tutorial walks you through analyzing Tencent's recruitment page, locating the Ajax JSON endpoint, and using Scrapy to create a project, spider, items, pagination, settings, and data export to collect job postings efficiently.

Python Crawling & Data Mining

Sep 17, 2021

Pre-crawl Analysis

Before crawling, we inspect Tencent's technical job page using the browser's developer tools to locate where the job data is rendered.

The data is not present in the static HTML; it is generated via JavaScript, so we look at the Ajax requests.

The response to one of these Ajax requests contains the job information as JSON. A simplified version of the request URL is:

https://careers.tencent.com/tencentcareer/api/post/Query?categoryId=40001001,40001002,40001003,40001004,40001005,40001006&pageIndex=1&pageSize=10&language=zh-cn&area=cn

We can change the pageIndex parameter to paginate.
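
Building the URL for a given page can be sketched as below. This is a minimal illustration that assumes the query parameters keep the meaning they have in the URL above; the helper name build_page_url is hypothetical and is not part of Scrapy or of Tencent's API.

```python
from urllib.parse import urlencode

# Endpoint and parameter values copied from the URL shown above.
BASE_URL = "https://careers.tencent.com/tencentcareer/api/post/Query"
CATEGORY_IDS = "40001001,40001002,40001003,40001004,40001005,40001006"


def build_page_url(page_index, page_size=10):
    """Return the query URL for one page of results (hypothetical helper)."""
    params = {
        "categoryId": CATEGORY_IDS,
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    # safe="," keeps the commas in categoryId unescaped, as in the original URL.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# URLs for the first three pages.
page_urls = [build_page_url(i) for i in range(1, 4)]
```

Only pageIndex differs between the generated URLs, so the same parsing code should apply to every page.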

Hands-on Practice

1. Create Scrapy Project

Run the following command to start a new Scrapy project:

scrapy startproject Tencent

2. Create Spider

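Create a spider with the genspider command:

scrapy genspider tencent careers.tencent.com

This generates tencent.py in the spiders folder. After start_urls is pointed at the Ajax endpoint and parse is filled in, the spider looks like this: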

import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?categoryId=40001001,40001002,40001003,40001004,40001005,40001006&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    def parse(self, response):
        # The endpoint returns JSON; the postings sit under Data -> Posts.
        payload = response.json()
        posts = (payload.get('Data') or {}).get('Posts') or []
        for post in posts:
            item = {}
            item['RecruitPostName'] = post.get('RecruitPostName')
            item['LocationName'] = post.get('LocationName')
            # Remove newlines so each description stays on a single line.
            item['Responsibility'] = (post.get('Responsibility') or '').replace('\n', '')
            yield item

The spider extracts the job title, location, and responsibilities.
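
The spider requests only the page listed in start_urls. One way to reach later pages is to increase pageIndex in the current URL and request the result until a page comes back empty. The helper below is a sketch: next_page_url is a hypothetical name, and treating an empty Posts list as the end of the results is an assumption about the endpoint rather than documented behavior.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(url):
    """Return `url` with its pageIndex query parameter increased by one."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["pageIndex"] = str(int(query.get("pageIndex", "1")) + 1)
    # safe="," leaves the comma-separated categoryId value unescaped.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))


# Inside TencentSpider.parse, after yielding the items of a non-empty page:
#     yield scrapy.Request(next_page_url(response.url), callback=self.parse)
```

Deriving the next URL from response.url keeps the spider stateless, so no page counter has to be stored on the spider instance.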

3. Define Items

In items.py define the fields to store:

import scrapy

class TencentItem(scrapy.Item):
    RecruitPostName = scrapy.Field()  # Job title
    LocationName = scrapy.Field()    # Location
    Responsibility = scrapy.Field() # Job description

4. Extract Data

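The spider code above already parses the JSON response, but it yields plain dicts. To yield TencentItem objects instead, import the class in tencent.py (from Tencent.items import TencentItem) and create item = TencentItem() in place of item = {}.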
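
The extraction step can also be tried without any network access. The sketch below applies the same field selection to a sample payload; the sample data is made up for illustration, and extract_posts is a hypothetical helper, not something the article or Scrapy provides.

```python
import json

# Field names the spider extracts from each posting.
FIELDS = ("RecruitPostName", "LocationName", "Responsibility")


def extract_posts(payload):
    """Return one dict per posting, tolerating a missing Data or Posts value."""
    posts = (payload.get("Data") or {}).get("Posts") or []
    rows = []
    for post in posts:
        row = {field: post.get(field) or "" for field in FIELDS}
        # Drop line breaks so each description stays on one line.
        row["Responsibility"] = row["Responsibility"].replace("\r", "").replace("\n", "")
        rows.append(row)
    return rows


# Made-up sample in the same shape as the endpoint's response (illustrative only).
sample = json.loads(
    '{"Data": {"Posts": [{"RecruitPostName": "Backend Engineer",'
    ' "LocationName": "Shenzhen", "Responsibility": "Build services.\\nWrite tests."}]}}'
)
rows = extract_posts(sample)
```

Each returned dict has exactly the keys declared on TencentItem, so it could be passed to the item constructor unchanged.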

5. Configure settings.py

LOG_LEVEL = "WARNING"
ITEM_PIPELINES = {'Tencent.pipelines.TencentPipeline': 300}
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"

These settings limit log output to warnings and errors, enable the project's item pipeline, and set a browser-like user agent.
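
Two further entries may be worth considering. They are assumptions added here, not part of the original configuration, and recent Scrapy project templates may already include the first one.

```python
# Optional additions to settings.py (assumptions, not from the original article).

# Export feeds as UTF-8 so Chinese text is written as-is in JSON output
# instead of as \uXXXX escape sequences.
FEED_EXPORT_ENCODING = "utf-8"

# Pause between consecutive requests to the same site, in seconds,
# to keep the load on the server low.
DOWNLOAD_DELAY = 1
```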

6. Save Data

Run the spider and export results directly to a file:

scrapy crawl tencent -o tencent.json   # JSON output
scrapy crawl tencent -o tencent.csv    # CSV output
scrapy crawl tencent -o tencent.xml    # XML output

For database or custom file formats, implement the logic in pipelines.py.
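
As one possible shape for such a pipeline, the sketch below writes every item to a JSON Lines file. It relies only on the open_spider, process_item, and close_spider hooks that Scrapy calls on pipeline classes; the output file name is a hypothetical default chosen for this example.

```python
import json


class TencentPipeline:
    """Write each scraped item to a JSON Lines file (one JSON object per line)."""

    def __init__(self, path="tencent.jsonl"):
        # The output path is a hypothetical default chosen for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Because the class is named TencentPipeline, the ITEM_PIPELINES entry from step 5 would pick it up once it is placed in pipelines.py.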

Result Display

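The spider retrieves Tencent's technical job postings. As written, it requests only the first page (pageIndex=1, 10 postings); further pages require changing pageIndex.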
