
Build a Python Scraper for Lagou.com to Extract Job Requirements with Baidu NLP

This article demonstrates a compact, runnable Python 3 scraper that fetches job listings from Lagou.com based on a keyword, filters by city and salary, extracts detailed job requirements using XPath, and applies Baidu's free NLP service for word segmentation and part‑of‑speech tagging to reveal key skill terms.

MaGe Linux Operations

Apr 2, 2018


Overview

This article presents a short, runnable Python 3 web crawler for Lagou.com. It fetches job listings for a given keyword, filters them by city and salary range, extracts the "job requirements" section of each posting, and passes that text to Baidu's free NLP service for word segmentation and part-of-speech tagging. The goal is to illustrate basic crawling techniques.

Data Source

Lagou.com

Tools

Python 3 with the third-party libraries Requests, lxml, and AipNlp (Baidu's NLP client, shipped in the baidu-aip package). All three can be installed with pip install requests lxml baidu-aip.

Implementation Code

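import re
import time
import requests
from lxml import etree
from aip import AipNlp  # provided by the baidu-aip package

# USER_AGENT, REFERER, COOKIE, CITY, SALARY, KEY, BASE_URL and DETAIL_URL are module-level settings defined by the user.
def fetch_list(page_index):
    headers = {"User-Agent": USER_AGENT, "Referer": REFERER, "Cookie": COOKIE}
    params = {"px": "default", "city": CITY, "yx": SALARY}
    data = {"first": page_index == 1, "pn": page_index, "kd": KEY}
    # POST request to https://www.lagou.com/jobs/positionAjax.json
    response = requests.post(BASE_URL, headers=headers, params=params, data=data)
    return response.json()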

# Example JSON structure returned
{
  "content": {
    "pageNo": ..., 
    "positionResult": {
      "resultSize": ..., 
      "result": [
        {
          "companyFullName": "Company Name",
          "city": "City",
          "education": "Education",
          "salary": "Salary",
          "positionName": "Job Title",
          "positionId": "ID"
        }
      ]
    }
  }
}

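def fetch_detail(position_id):
    # position_id is the positionId field of a list entry (renamed to avoid shadowing the built-in id)
    headers = {"User-Agent": USER_AGENT, "Referer": REFERER, "Cookie": COOKIE}
    url = DETAIL_URL.format(position_id)
    response = requests.get(url, headers=headers)
    return response.text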

# XPath to extract requirement paragraphs
//dd[@class="job_bt"]/div/p/text()

def fetch_requirements(result, segment):
    time.sleep(2)
    content = fetch_detail(result["positionId"])
    details = [d.strip() for d in etree.HTML(content).xpath('//dd[@class="job_bt"]/div/p/text()')]
    is_requirement = False
    requirements = {}
    for detail in details:
        if not detail:
            continue
        if is_requirement:
            m = re.match("([0-9]+|-)", detail)
            if m:
                words = segment(detail[m.end():])
                for w in words:
                    requirements[w] = requirements.get(w, 0) + 1
            else:
                break
        elif re.search(r"\w?[\.:：、 ]?(任职要求|任职资格|我们希望你|任职条件|岗位要求|要求：|职位要求|工作要求|职位需求)", detail):
            is_requirement = True
    return requirements

def init_segment():
    APP_ID = "xxxxxxxxx"
    API_KEY = "xxxxxxxxx"
    SECRET_KEY = "xxxxxxxxx"
    client = AipNlp(APP_ID, API_KEY, SECRET_KEY)
    retains = set(["n", "nr", "ns", "s", "nt", "an", "t", "nw", "vn"])
    def segment(text):
        try:
            items = client.lexer(re.sub(r'\s', '', text))["items"]
            cur = ""
            result = []
            for item in items:
                if item["pos"] in retains:
                    cur += item["item"]
                    continue
                if cur:
                    result.append(cur)
                    cur = ""
                if item.get("ne") or item["pos"] == "nz":
                    result.append(item["item"])
            if cur:
                result.append(cur)
            return result
        except Exception:
            return []
    return segment

Logic Breakdown

4.1 Fetch job list: Build a POST request to Lagou's search API with the keyword, city, and salary filters. The response is paginated JSON; request successive pages until resultSize is zero.
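The paging rule in 4.1 can be sketched as a generator. This is a minimal sketch rather than the article's own driver loop: iter_positions, max_pages, and the fake_fetch stub are hypothetical names, and the stub stands in for fetch_list so the example runs without network access.

```python
def iter_positions(fetch_page, max_pages=30):
    """Yield every job entry, page by page, until a page reports resultSize == 0.

    fetch_page is any callable that takes a 1-based page index and returns the
    parsed JSON dict (fetch_list in the article). max_pages is a hypothetical
    safety cap so an unexpected response cannot loop forever.
    """
    for page_index in range(1, max_pages + 1):
        position_result = fetch_page(page_index)["content"]["positionResult"]
        if position_result["resultSize"] == 0:
            break
        yield from position_result["result"]


def fake_fetch(page_index):
    """Offline stand-in for fetch_list: two pages of made-up data, then empty pages."""
    pages = {
        1: [{"positionId": 1, "positionName": "Crawler Engineer"},
            {"positionId": 2, "positionName": "Python Developer"}],
        2: [{"positionId": 3, "positionName": "Data Engineer"}],
    }
    result = pages.get(page_index, [])
    return {"content": {"pageNo": page_index,
                        "positionResult": {"resultSize": len(result),
                                           "result": result}}}


position_ids = [p["positionId"] for p in iter_positions(fake_fetch)]
print(position_ids)
```

In real use, fetch_list would be passed in place of fake_fetch, ideally with a pause between pages similar to the time.sleep(2) used in fetch_requirements.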

4.2 Fetch job detail: Use the positionId from each list entry to build the detail page URL, request it, and obtain the raw HTML.

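4.3 Extract requirements: Apply the XPath expression //dd[@class="job_bt"]/div/p/text() to pull the text of every paragraph in the job description block, then keep only the numbered lines that follow a requirements heading such as 任职要求 ("job requirements").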
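To see what that path selects, here is a small sketch on a made-up, well-formed fragment. Note the substitution: it uses the standard library's xml.etree.ElementTree so it runs without lxml. ElementTree supports only a subset of XPath (no text() step), so the sketch selects the p elements and reads their .text; the article's code uses lxml's etree.HTML, which also copes with real-world, non-well-formed HTML.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics the structure the XPath expects.
SAMPLE = """
<html><body>
  <dd class="job_bt">
    <div>
      <p>岗位职责：</p>
      <p>1. Build and maintain crawlers</p>
      <p>任职要求：</p>
      <p>1. Familiar with Python and Linux</p>
    </div>
  </dd>
  <dd class="other"><div><p>Company profile</p></div></dd>
</body></html>
"""

root = ET.fromstring(SAMPLE)
# Equivalent of //dd[@class="job_bt"]/div/p/text() in ElementTree's XPath subset.
paragraphs = [(p.text or "").strip()
              for p in root.findall(".//dd[@class='job_bt']/div/p")]
print(paragraphs)
```

Only paragraphs under the dd element with class job_bt are returned; the "Company profile" paragraph in the other dd is ignored.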

4.4 Segment with Baidu NLP: Call the Baidu AipNlp lexer service, keep tokens whose part-of-speech tag is in the retained set, merge consecutive retained tokens into a single term, and also keep named entities and tokens tagged nz. The resulting term list is used to count the occurrences of each skill term.
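The merging rule can be tried offline. The sketch below repeats the loop from segment in a standalone function, here called merge_tokens (a hypothetical name), and feeds it hand-written items. The items only imitate the fields the article's code reads (item, pos, ne); they are not real API output.

```python
RETAINED_POS = {"n", "nr", "ns", "s", "nt", "an", "t", "nw", "vn"}


def merge_tokens(items):
    """Merge runs of retained-POS tokens; keep named entities and nz tokens singly."""
    result, current = [], ""
    for item in items:
        if item["pos"] in RETAINED_POS:
            current += item["item"]  # extend the current run
            continue
        if current:  # a non-retained token closes the run
            result.append(current)
            current = ""
        if item.get("ne") or item["pos"] == "nz":
            result.append(item["item"])
    if current:
        result.append(current)
    return result


# Hand-written tokens for illustration only, not a real lexer response.
items = [
    {"item": "熟悉", "pos": "v", "ne": ""},
    {"item": "数据", "pos": "n", "ne": ""},
    {"item": "结构", "pos": "n", "ne": ""},
    {"item": "和", "pos": "c", "ne": ""},
    {"item": "Python", "pos": "nz", "ne": ""},
    {"item": "开发", "pos": "vn", "ne": ""},
]
tokens = merge_tokens(items)
print(tokens)
```

The two adjacent nouns are joined into one term, the conjunction ends the run, and the nz token is kept as a separate term.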

Sample Extracted Requirements

Solid data structures and algorithm fundamentals

Meticulous work attitude and strong learning ability, familiar with common crawling tools

Familiar with Linux development environment and Python

Understanding of HTTP, HTML, DOM, XPath, Scrapy preferred

Experience with crawling, information extraction, text classification; knowledge of Hadoop, Spark, and streaming frameworks preferred
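Lines like these are selected by the header-and-numbering logic in fetch_requirements. The sketch below isolates that logic so it can be tried on made-up paragraphs. Assumptions: requirement_lines is a hypothetical helper name, the header pattern is the article's keyword list without its optional prefix, and the lstrip call that trims the separator after the number is an addition for readability (the article passes the remainder straight to segment).

```python
import re

# Header keywords taken from the article's fetch_requirements regex.
HEADER = re.compile(
    r"(任职要求|任职资格|我们希望你|任职条件|岗位要求|要求：|职位要求|工作要求|职位需求)"
)
NUMBERED = re.compile(r"([0-9]+|-)")  # a list item starts with digits or a dash


def requirement_lines(paragraphs):
    """Return the numbered lines that follow the first requirements header."""
    lines = []
    in_requirements = False
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue
        if in_requirements:
            m = NUMBERED.match(text)
            if not m:
                break  # the first unnumbered line ends the requirements list
            lines.append(text[m.end():].lstrip(".、) "))
        elif HEADER.search(text):
            in_requirements = True
    return lines


# Made-up description paragraphs for illustration.
sample = [
    "岗位职责：",
    "1. Build and maintain crawlers",
    "任职要求：",
    "1. Solid data structures and algorithm fundamentals",
    "2、Familiar with Linux development environment and Python",
    "",
    "- Scrapy preferred",
    "福利待遇：",
    "1. Free lunch",
]
print(requirement_lines(sample))
```

Only the three items between the 任职要求 header and the next unnumbered line are returned; the duties above the header and the benefits below it are skipped.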

Result Presentation

The script aggregates all extracted terms across jobs, counts their frequencies, sorts them, and prints the top ten most mentioned skills.
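The aggregation step is not shown in the listing above. A minimal sketch of one way to do it, assuming each job yields a term-to-count dict like the one fetch_requirements returns, uses collections.Counter. The function name top_terms and the sample counts are hypothetical.

```python
from collections import Counter


def top_terms(per_job_counts, n=10):
    """Sum per-job term counts and return the n most frequent (term, count) pairs."""
    total = Counter()
    for counts in per_job_counts:
        total.update(counts)  # adds counts term by term
    return total.most_common(n)


# Made-up per-job results for illustration.
per_job = [
    {"Python": 2, "Linux": 1},
    {"Python": 1, "Scrapy": 1},
    {"Linux": 1, "Python": 1, "Hadoop": 1},
]
ranking = top_terms(per_job, n=2)
for term, count in ranking:
    print(term, count)
```

With the default n=10 this returns the ten most frequently mentioned terms, which matches the output described above.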

Conclusion

The provided code forms a basic yet complete pipeline for crawling job postings, extracting requirement texts, and performing lightweight NLP analysis. It can be extended to scrape multiple cities, salary ranges, additional filters, or to separate the crawling and NLP stages for more robust processing.

