Master Python Web Scraping: Extract JSON Data Efficiently with Real Code

This article walks through extracting JSON data from a web page with Python. It shows several code examples using requests, regular expressions, BeautifulSoup, and the json module, and explains how to simplify storage and avoid common parsing pitfalls.

Python Crawling & Data Mining

Nov 6, 2024

1. Introduction

Hello, I am a Python enthusiast. A reader named Rr asked how to store the JSON data returned by a web page when scraping with Python. This article shares the solution.

2. Solution Process

Initially, the response was saved to a txt file as raw text rather than parsed as JSON, which made the fields awkward to extract.

The first attempt reads that file line by line and pulls out the desc field with a regular expression:

import re

# Compile the pattern once, outside the loop. It captures the text
# between desc":" and ","desc_module".
# regex = re.compile('"summary":"(.*?)"', re.S)
regex = re.compile('desc":"(.*?)","desc_module"', re.S)

with open('Rr.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for item in regex.findall(line):
            print(item)
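
To check what the pattern captures without the saved file, the sketch below runs it against a made-up sample line. The sample string and its field values are assumptions for illustration, not the real response.

```python
import re

# Hypothetical sample resembling one line of the saved response.
sample = '{"title":"demo","desc":"<p>Hello<\\/p>","desc_module":[]}'

# Non-greedy capture of everything between desc":" and ","desc_module".
pattern = re.compile(r'desc":"(.*?)","desc_module"', re.S)

matches = pattern.findall(sample)
print(matches)
```

The captured text still contains the JSON escape \/, which is why the regex route needs extra cleanup afterwards.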

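A simpler idea is to parse the response as JSON, read the desc field directly, and write that to a txt file instead of matching it with a regex. Calling resp.json() only works when the body is pure JSON, though. This endpoint wraps the payload in a callback call, _cb_fn_proj_37754(...), so the wrapper has to be removed before parsing. The next version still relies on a regex, but it runs against the live response and uses BeautifulSoup to strip the HTML tags from the result: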

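import re

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
# Turn escaped slashes (\/) into plain slashes.
text = resp.text.replace('\\/', '/')
# Decode \uXXXX escapes. Note: this round trip garbles any non-ASCII
# characters that are already decoded in the text.
text = text.encode('utf-8').decode('unicode_escape')
regex = re.compile('"detail_top_img":null,"desc":"(.*?)","desc_module"', re.S)
result = re.findall(regex, text)
# result[0] is presumably an HTML fragment; BeautifulSoup strips the tags.
# This line raises IndexError if the pattern does not match.
page = bs(result[0], "lxml")
print(page.text)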
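
The two cleanup lines above re-implement by hand what a JSON parser already does, and the unicode_escape round trip is safe only for pure ASCII input. A small sketch with made-up strings (not taken from the real response) shows the difference:

```python
import json

# A JSON string literal containing \uXXXX escapes and an escaped slash.
raw = '"\\u4e2d\\u6587 <br\\/>"'

# json.loads decodes both kinds of escape by itself.
decoded = json.loads(raw)
print(decoded)

# The encode/decode round trip garbles characters that are already
# non-ASCII, because each UTF-8 byte is re-read as a separate character.
already_decoded = "中文"
garbled = already_decoded.encode("utf-8").decode("unicode_escape")
print(garbled == already_decoded)
```

This is one reason to hand the payload to the json module as early as possible rather than unescaping the raw text first.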

A refined version using the json module:

import json

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
text = resp.text.replace('\\/', '/')
text = text.encode('utf-8').decode('unicode_escape')
# Strip HTML tags from the whole response body.
page = bs(text, "lxml")
# Remove the callback wrapper so that only the JSON object remains.
# Caveat: replace(');', '') also removes any ");" inside the data itself.
data = page.text.replace('_cb_fn_proj_37754(', '').replace(');', '')
json_data = json.loads(data)
print(json_data["detail"]["desc"])
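
Since the original question was about storing the data, one option is to write the parsed object back out as a real JSON file instead of raw text. The sketch below is an illustration under assumptions: a made-up dictionary stands in for json_data, and Rr.json is a hypothetical file name.

```python
import json
import os
import tempfile

# Stand-in for the parsed response; the real json_data has more fields.
json_data = {"detail": {"desc": "示例 description"}}

# Hypothetical output path; a temp directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "Rr.json")

# ensure_ascii=False keeps non-ASCII text readable in the file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(json_data, f, ensure_ascii=False, indent=2)

# Reading the file back yields the same structure, with no regex needed.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored["detail"]["desc"])
```

Storing the data this way means later scripts can use json.load and plain key access rather than pattern matching on a text file.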

Another concise approach:

import json

import requests

resp = requests.get('https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js')
text = resp.text
# Keep only what lies between the first "(" and the last ")".
text = text[text.find('(') + 1:text.rfind(')')]
print(json.loads(text)['detail']['desc'])

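The slicing line keeps everything between the first opening parenthesis and the last closing parenthesis, which is the JSON payload inside the callback wrapper. json.loads then decodes the \/ and \uXXXX escapes itself, so no manual replacement is needed. Sending appropriate request headers, such as a User-Agent, is recommended for polite crawling.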
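
The slicing idea can be wrapped in a small helper. This is a sketch, not the author's code: strip_jsonp, fetch_desc, and the User-Agent value are hypothetical names and values chosen for illustration.

```python
import json


def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP body like callback({...});"""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no callback wrapper found")
    return json.loads(text[start + 1:end])


def fetch_desc(url):
    """Download a JSONP resource and return its detail.desc field."""
    # Imported here so strip_jsonp stays usable without requests installed.
    import requests

    # Illustrative header value; adjust it to identify your own crawler.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return strip_jsonp(resp.text)["detail"]["desc"]


# Offline check with a made-up body. Parentheses inside the data are fine
# because only the first "(" and the last ")" are used.
sample = '_cb_fn_proj_37754({"detail": {"desc": "a (short) note"}});'
print(strip_jsonp(sample)["detail"]["desc"])
```

Calling fetch_desc with the URL from the snippets above would perform the real request. The offline sample only exercises the parsing step.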

3. Conclusion

This article presented four ways to extract JSON data from a web page when scraping with Python, moving from regex matching on a saved text file to unwrapping the callback and parsing the payload with the json module.

