
Master JSON and JSONPath in Python: From Basics to Real‑World Scraping

This tutorial explains JSON fundamentals, demonstrates Python's json module functions, introduces the JSONPath library for extracting data from JSON documents, and provides a complete example of crawling city information from a website and parsing it with JSONPath.

Python Crawling & Data Mining

Background Introduction

During web crawling, the fetched page data usually contains more than we need, so it has to be parsed. Common parsing tools include regular expressions, XPath, and BeautifulSoup. This article introduces another one, the jsonpath library, after first explaining what JSON is.

1. Getting to Know JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write and convenient for machines to parse and generate. It is suitable for data exchange scenarios such as front‑end and back‑end communication.

The json module has been part of Python's standard library since Python 2.6, so nothing needs to be installed: simply use import json.

Official documentation: http://docs.python.org/library/json.html

Online JSON parser: http://www.json.cn/

2. Basic Usage of JSON

Overview

JSON essentially consists of objects and arrays, which can represent complex structures.

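Object: written with {} and containing key-value pairs, e.g., {"key": value, ...}. Keys must be double-quoted strings; values can be numbers, strings, booleans, null, arrays, or other objects. In JavaScript a value is accessed as object.key; in Python, after parsing, as obj["key"].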

Array: represented by [], e.g., ["Python", "javascript", "C++"]. Elements are accessed by index.
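As a small illustration of how objects and arrays nest, the sketch below parses a made-up JSON document (the keys and values are invented for this example) and reads values back by key and by index.

```python
import json

# A made-up JSON document: "languages" holds an array,
# "author" holds a nested object.
text = '{"languages": ["Python", "javascript", "C++"], "author": {"name": "Tom", "age": 30}}'

data = json.loads(text)

# Objects become dicts, so values are looked up by key.
author_name = data["author"]["name"]

# Arrays become lists, so elements are looked up by zero-based index.
first_language = data["languages"][0]

print(author_name, first_language)
```

Chaining the lookups (data["author"]["name"]) is how deeper structures are reached in plain Python; JsonPath, introduced later, offers a more compact way to express such paths.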

Functions

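The json module provides four main functions. dumps and dump serialize Python objects to JSON (to a string and to a file, respectively); loads and load parse JSON back into Python objects (from a string and from a file, respectively).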

Conversion mapping:

JSON object → Python dict

JSON array → Python list

JSON string → Python str

JSON number (int) → Python int

JSON number (real) → Python float

JSON true / false → Python True / False

JSON null → Python None

(In Python 2, strings were decoded to unicode and integers to int / long.)
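The mapping can be checked directly. The sketch below parses one made-up value of each JSON kind and records the Python type that json.loads() produces for it.

```python
import json

# One value of each JSON kind; the key names are arbitrary.
sample = '{"obj": {}, "arr": [], "s": "text", "i": 7, "f": 1.5, "t": true, "n": null}'

parsed = json.loads(sample)

# Record the name of the Python type chosen for each value.
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)
```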

1. json.loads()

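Parse a JSON string and convert it to the corresponding Python object.

import json
strDict = '{"city": "广州", "name": "小黑"}'
r = json.loads(strDict)  # the JSON object becomes a dict; JSON strings become str
print(r)

Output:

{'city': '广州', 'name': '小黑'}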
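json.loads() raises json.JSONDecodeError when its input is not valid JSON, which happens often with scraped data. A minimal sketch of guarding against that is shown here; parse_or_none is a hypothetical helper name, not part of the json module.

```python
import json


def parse_or_none(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


good = parse_or_none('{"city": "广州"}')
bad = parse_or_none("{'city': '广州'}")  # single quotes are not valid JSON

print(good, bad)
```

Returning None keeps the sketch short, but it cannot be told apart from a document whose content is the JSON value null; re-raising or logging the error may suit a real crawler better.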

2. json.load()

Read JSON text from a file object and convert it to a Python object.

import json
# The with statement closes the file once loading is done
with open('test.json', 'r', encoding='utf-8') as fp:
    s = json.load(fp)
print(s, type(s))

Output:

{'city': '广州', 'name': '小黑'} <class 'dict'>

3. json.dumps()

Convert Python objects to JSON strings, returning a str object.

import json
listStr = [1, 2, 3, 4]
dictStr = {"city": "北京", "name": "大猫"}
s1 = json.dumps(listStr)
s2 = json.dumps(dictStr, ensure_ascii=False)  # keep the Chinese characters readable
print(s1, type(s1))
print(s2, type(s2))

Output:

[1, 2, 3, 4] <class 'str'>
{"city": "北京", "name": "大猫"} <class 'str'>

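Note: by default (ensure_ascii=True), json.dumps() escapes every non-ASCII character as a \uXXXX sequence. Setting ensure_ascii=False keeps such characters as they are in the returned str; the byte encoding (for example UTF-8) is chosen only when the string is written to a file or sent over the network.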
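To make the effect of ensure_ascii concrete, the following sketch serializes the same small dict both ways and prints the two results. Both strings parse back to the same Python object.

```python
import json

record = {"city": "北京"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep non-ASCII characters

print(escaped)   # {"city": "\u5317\u4eac"}
print(readable)  # {"city": "北京"}

# The two forms are equivalent JSON.
same = json.loads(escaped) == json.loads(readable)
print(same)
```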

4. json.dump()

Serialize a Python object to JSON and write it to a file.

import json
json_info = {'age': '12'}  # a dict, so the file will contain a JSON object
with open('ceshi.json', 'w', encoding='utf-8') as file:
    json.dump(json_info, file)

This creates the file ceshi.json containing {"age": "12"}.
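json.dump() and json.load() are mirror images, so a quick way to check a file was written correctly is to read it back. The sketch below does such a round trip; it writes into a temporary directory (an assumption made here so the example leaves no files behind), and the dict contents are made up.

```python
import json
import os
import tempfile

info = {"age": "12", "city": "北京"}

# Work in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ceshi.json")

    # Serialize the dict to the file, keeping Chinese characters readable.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(info, fp, ensure_ascii=False)

    # Parse the file back into a Python object.
    with open(path, "r", encoding="utf-8") as fp:
        restored = json.load(fp)

print(restored == info)
```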

3. JsonPath

JsonPath is an information‑extraction library for JSON documents, analogous to XPath for XML, with implementations in JavaScript, Python, PHP, and Java.

Download: https://pypi.python.org/pypi/jsonpath

Installation: download the package and run python setup.py install

Documentation: http://goessner.net/articles/JsonPath

JsonPath vs XPath Syntax Comparison

JSON’s clear structure makes matching easy; the list below maps common XPath expressions to their JsonPath equivalents.

/ → $ (Root node)

. → @ (Current node)

/ → . or [] (Select child node)

.. → n/a (Parent node, unsupported)

// → .. (Select all matching nodes regardless of location)

* → * (Match all element nodes)

[] → [] (Iterator notation for indexing or filtering)

| → [,] (Multiple selection within an iterator)

[] → ?() (Filter operation)

n/a → () (Script expression; in XPath, () is used only for grouping)
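The recursive-descent operator (..) is the one used most in scraping, so it helps to see what it asks for. The sketch below is a plain-Python stand-in, not the jsonpath library's implementation: find_all is a hypothetical helper that collects every value stored under a given key at any depth, which is roughly what an expression such as $..price requests. The sample document is made up.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth of dicts and lists."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: the key may appear again further down.
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


doc = {
    "store": {
        "book": [
            {"title": "A", "price": 8},
            {"title": "B", "price": 12},
        ],
        "bicycle": {"price": 20},
    }
}

prices = find_all(doc, "price")
print(prices)
```

Writing this by hand for every query gets tedious quickly, which is the motivation for a path language: filters, wildcards, and index selection would each need more code here, whereas JsonPath expresses them in a single string.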

4. Case Test

We crawled city information from the Taopiaopiao website, saved it as a JSON file, and used JsonPath to extract all city names.

Request

import requests
url = 'https://dianying.taobao.com/cityAction.json?...'
headers = {
    'user-agent': 'Mozilla/5.0 ...'
}
res = requests.get(url, headers=headers)
result = res.content.decode('utf-8')  # decode the raw bytes as UTF-8 text
print(result)  # truncated output

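Note: Sending the full set of headers a browser would send, not only user-agent, makes the request look more like normal browser traffic and can help avoid anti-scraping measures.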

Save Data

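# The response is assumed to be JSONP-style, i.e. callback({...});
# keep only the text between the first '(' and the last ')'.
content = result[result.find('(') + 1:result.rfind(')')]
with open('tpp.json', 'w', encoding='utf-8') as fp:
    fp.write(content)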
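A JSONP-style response wraps the JSON in a callback call, so the payload is whatever sits between the first '(' and the last ')'. As a hedged sketch, the helper below (strip_jsonp, a hypothetical name) extracts that payload and fails loudly when no wrapper is present. The sample string is made up and deliberately contains parentheses inside a value, a case where cutting at the first ')' would truncate the data.

```python
import json


def strip_jsonp(text):
    """Return the JSON payload inside a callback(...) wrapper."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSONP wrapper found")
    return text[start + 1:end]


# Made-up response; note the parentheses inside the "note" value.
sample = 'jsonp1({"returnCode": "0", "note": "a (b) c"});'

payload = json.loads(strip_jsonp(sample))
print(payload["note"])
```

Parsing the extracted text with json.loads() before saving it is a cheap check that the file on disk will actually be valid JSON.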

Resulting JSON file preview:

[Image: JSON file preview]

Parse Data

import json
import jsonpath
with open('tpp.json', 'r', encoding='utf-8') as fp:
    obj = json.load(fp)
# '$..regionName' selects every regionName value at any depth
city_list = jsonpath.jsonpath(obj, '$..regionName')
print(city_list)

Output shows the list of city names:

[Image: city list output]
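One detail to watch for: the jsonpath package on PyPI returns False, rather than an empty list, when an expression matches nothing (this is the behaviour of the Goessner-style port; it is worth confirming for the version you install). A small sketch of normalizing that result follows; as_list is a hypothetical helper, and the values passed to it are stand-ins for real jsonpath.jsonpath() results, not actual site data.

```python
def as_list(result):
    """Map the no-match value False to an empty list; pass real matches through."""
    return [] if result is False else result


# Stand-ins for what jsonpath.jsonpath(obj, '$..regionName') might return.
matched = as_list(["北京", "上海"])
unmatched = as_list(False)

# Looping is now safe in both cases; an empty list simply yields nothing.
for name in unmatched:
    print(name)

print(len(matched), len(unmatched))
```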

5. Summary

JSON is a common data transmission format, and mastering its operations speeds up data extraction in web crawling. This article introduced basic JSON handling and JsonPath usage, and demonstrated a simple extraction of city data from a real website.

