How to Scrape Tencent Job Listings with Scrapy: Step-by-Step Guide
This tutorial walks you through analyzing Tencent's recruitment page, locating the Ajax JSON endpoint, and using Scrapy to create a project, spider, items, pagination, settings, and data export to collect job postings efficiently.
Pre-crawl Analysis
Before crawling, we inspect Tencent's technical job page using the browser's developer tools to locate where the job data is rendered.
The data is not present in the static HTML; it is generated via JavaScript, so we look at the Ajax requests.
The response to this Ajax request carries the job information as JSON. A simplified version of the request URL is:
https://careers.tencent.com/tencentcareer/api/post/Query?categoryId=40001001,40001002,40001003,40001004,40001005,40001006&pageIndex=1&pageSize=10&language=zh-cn&area=cn
Changing the pageIndex parameter moves through the result pages.
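As a rough illustration of how the query string is put together, the sketch below rebuilds the URL above for an arbitrary page using only the standard library. The helper name build_query_url and its defaults are made up for this example; the parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://careers.tencent.com/tencentcareer/api/post/Query"

# Category IDs copied from the example URL above.
CATEGORY_IDS = "40001001,40001002,40001003,40001004,40001005,40001006"


def build_query_url(page_index, page_size=10):
    """Return the listing URL for one page (hypothetical helper)."""
    params = {
        "categoryId": CATEGORY_IDS,
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    # safe="," keeps the commas in categoryId unescaped, as in the original URL.
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(build_query_url(page))
```

Calling build_query_url(1) reproduces the URL shown above; other page numbers only change the pageIndex value.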
Hands-on Practice
1. Create Scrapy Project
Run the following command to start a new Scrapy project:
scrapy startproject Tencent
2. Create Spider
Create a spider with the command:
scrapy genspider tencent careers.tencent.com
This generates tencent.py in the spiders folder, which we then fill in as follows:
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?categoryId=40001001,40001002,40001003,40001004,40001005,40001006&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    def parse(self, response):
        # The endpoint returns JSON; response.json() requires Scrapy 2.2 or later.
        payload = response.json()
        # Fall back to an empty list if 'Data' or 'Posts' is missing or null.
        posts = (payload.get('Data') or {}).get('Posts') or []
        for post in posts:
            item = {}
            item['RecruitPostName'] = post.get('RecruitPostName')
            item['LocationName'] = post.get('LocationName')
            # Strip line breaks so the description fits on one line.
            item['Responsibility'] = (post.get('Responsibility') or '').replace('\n', '')
            yield item
The spider extracts the job title, location, and responsibilities.
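The extraction step can be tried without network access. The sketch below applies the same field lookups to a hand-written payload. The sample data is invented and only mirrors the 'Data' and 'Posts' nesting that the spider code assumes, and extract_posts is a hypothetical helper name.

```python
def extract_posts(payload):
    """Yield one dict per posting from a decoded API payload."""
    # Fall back to empty containers so a missing or null key does not raise.
    posts = (payload.get("Data") or {}).get("Posts") or []
    for post in posts:
        yield {
            "RecruitPostName": post.get("RecruitPostName"),
            "LocationName": post.get("LocationName"),
            "Responsibility": (post.get("Responsibility") or "").replace("\n", ""),
        }


# Invented sample payload for illustration only.
SAMPLE = {
    "Data": {
        "Posts": [
            {
                "RecruitPostName": "Backend Engineer",
                "LocationName": "Shenzhen",
                "Responsibility": "Build services.\nMaintain APIs.",
            }
        ]
    }
}

if __name__ == "__main__":
    for row in extract_posts(SAMPLE):
        print(row)
```

Keeping the lookups in a plain function like this makes them easy to check against a saved response before running a full crawl.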
3. Define Items
In items.py define the fields to store:
import scrapy


class TencentItem(scrapy.Item):
    RecruitPostName = scrapy.Field()  # Job title
    LocationName = scrapy.Field()  # Location
    Responsibility = scrapy.Field()  # Job description
4. Extract Data
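The parse method of the spider above already performs the extraction: it decodes the JSON response and yields one record per posting. As written it yields plain dicts; to use the TencentItem class from step 3 instead, import it with from Tencent.items import TencentItem and build each record as a TencentItem.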
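The spider as shown fetches only the first page. One possible way to follow the remaining pages is to derive the next URL from the current one, as in the sketch below. The helper name next_page_url is hypothetical, and the stop condition (an empty page means the listing is exhausted) is an assumption about how the endpoint behaves, not something confirmed by the source.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, posts_on_page):
    """Return the URL of the following page, or None to stop (hypothetical helper)."""
    # Assumption: a page with no posts means we are past the last page.
    if posts_on_page == 0:
        return None
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    current_page = int(query.get("pageIndex", ["1"])[0])
    query["pageIndex"] = [str(current_page + 1)]
    # safe="," keeps the commas in categoryId unescaped.
    new_query = urlencode(query, doseq=True, safe=",")
    return urlunsplit(parts._replace(query=new_query))


# Possible use inside TencentSpider.parse, after the items have been yielded
# (posts is the list read from 'Data' and 'Posts' in the response):
#
#     next_url = next_page_url(response.url, len(posts))
#     if next_url:
#         yield scrapy.Request(next_url, callback=self.parse)
```

Deriving the next URL from response.url keeps the spider free of a separate page counter, at the cost of one extra request for the final empty page.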
5. Configure settings.py
LOG_LEVEL = "WARNING"
ITEM_PIPELINES = {'Tencent.pipelines.TencentPipeline': 300}
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
These settings limit log output to warnings and errors, enable the project's item pipeline, and send a browser-like User-Agent header.
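A couple of further settings are often useful for a crawl like this one. The fragment below is a suggestion rather than part of the original configuration, and the delay value is an arbitrary choice.

```python
# Optional additions to settings.py (not part of the original configuration).

# Write exported feeds as UTF-8 so Chinese text in JSON output stays readable
# instead of being written as \uXXXX escape sequences.
FEED_EXPORT_ENCODING = "utf-8"

# Wait between requests to keep load on the server low (value chosen arbitrarily).
DOWNLOAD_DELAY = 1
```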
6. Save Data
Run the spider and export results directly to a file:
scrapy crawl tencent -o tencent.json  # JSON output
scrapy crawl tencent -o tencent.csv  # CSV output
scrapy crawl tencent -o tencent.xml  # XML output
For database or custom file formats, implement the logic in pipelines.py.
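As an example of such custom logic, the sketch below shows what a TencentPipeline in pipelines.py might look like if it wrote one JSON object per line. The output path and the JSON Lines format are assumptions made for this illustration; the source does not show the pipeline's contents.

```python
import json


class TencentPipeline:
    """Write each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="tencent.jsonl"):
        # The output path is an assumption for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict() accepts both plain dicts and scrapy.Item instances.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.file.close()
```

Because the class name matches the ITEM_PIPELINES entry in settings.py, Scrapy would pick it up without further configuration. Returning the item from process_item lets the -o export keep working alongside the pipeline.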
Result Display
The spider successfully retrieves Tencent's technical job postings.
Python Crawling & Data Mining
