How to Efficiently Extract JSON Data with Python Web Scraping – 4 Proven Methods

This article walks through a Python fan's question about storing JSON returned by web pages during crawling. It presents four step-by-step code solutions, built on regex, BeautifulSoup, and json.loads, and offers practical tips for clean extraction and reliable scraping.

Python Crawling & Data Mining

The author, a Python enthusiast, shares a fan's question about storing JSON data from web pages during Python crawling.

Solution Process

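The fan's original approach wrote the response to a txt file and then ran a regex over it line by line, so no JSON was ever parsed. The author's first recommendation was to call response.json() and read the field directly, for example response['data']['desc']. This only works when the response body is plain JSON; the URL used in the later listings returns the JSON wrapped in a callback, which is why those versions strip the wrapper before parsing.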
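To see why parsing is preferable to pattern matching here, the sketch below compares a regex with json.loads on a made-up body. It assumes, loosely, that response.json() behaves like json.loads applied to the response text; the sample string and its "data"/"desc" keys are hypothetical and only mirror the field names mentioned above.

```python
import json
import re

# Hypothetical response body; the "desc" value contains escaped quotes.
sample_body = r'{"data":{"desc":"He said \"hi\" to all","id":7}}'

# Regex route: the lazy group stops at the first quote it meets,
# which here is the escaped quote inside the value.
regex_result = re.findall(r'"desc":"(.*?)"', sample_body)[0]

# JSON route: the parser understands escapes and nesting.
parsed_result = json.loads(sample_body)["data"]["desc"]

print(repr(regex_result))
print(repr(parsed_result))
```

The regex returns a truncated value ending in a stray backslash, while the parsed value is the full sentence with its quotes restored.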

Original code example:

import re

# Compile the pattern once, outside the loop.
# It captures the text between "desc" and "desc_module".
regex = re.compile('desc":"(.*?)","desc_module"', re.S)

with open('Rr.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # regex = re.compile('"summary":"(.*?)"', re.S)
        for item in regex.findall(line):
            print(item)

Improved version using requests, BeautifulSoup, and regex to fetch and parse the JSON data:

import re

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
# Undo the JSON escaping by hand: "\/" becomes "/", then \uXXXX escapes become characters.
text = resp.text.replace('\\/', '/')
text = text.encode('utf-8').decode('unicode_escape')
# Capture the HTML stored in the "desc" field.
regex = re.compile('"detail_top_img":null,"desc":"(.*?)","desc_module"', re.S)
result = re.findall(regex, text)
if result:
    # Strip the HTML tags and keep only the visible text.
    page = bs(result[0], "lxml")
    print(page.text)
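The two clean-up lines in the listing above undo JSON string escaping by hand. The following sketch shows what each step does on a small made-up string (the real response is not reproduced in the article, so the sample is an assumption), and how json.loads handles both escapes on its own. It also illustrates a caveat: unicode_escape treats bytes that are not escapes as Latin-1, so text that already contains non-ASCII characters comes out garbled.

```python
import json

# Hypothetical sample containing both kinds of escape.
raw = r'{"desc":"<p>\u4f60\u597d<\/p>"}'

# Manual route: replace "\/" with "/", then decode \uXXXX escapes.
manual = raw.replace('\\/', '/').encode('utf-8').decode('unicode_escape')

# json.loads undoes both escapes while parsing.
parsed = json.loads(raw)["desc"]

# Caveat: characters that are already decoded get mangled by unicode_escape.
mangled = "你好".encode('utf-8').decode('unicode_escape')

print(manual)
print(parsed)
print(mangled == "你好")
```

For this reason, letting the JSON parser handle the escapes is usually the safer choice when the payload can be parsed at all.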

Further optimized version using json.loads for cleaner extraction:

import json

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
text = resp.text.replace('\\/', '/')
text = text.encode('utf-8').decode('unicode_escape')
# Strip the HTML tags from the whole response, leaving the callback-wrapped JSON text.
page = bs(text, "lxml")
# Remove the callback wrapper; the callback name is specific to this project ID.
data = page.text.replace('_cb_fn_proj_37754(', '').replace(');', '')
json_data = json.loads(data)
print(json_data["detail"]["desc"])
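An alternative ordering is to parse the JSON first and strip the HTML tags from the desc field afterwards, so that tag removal cannot disturb the JSON syntax. The sketch below is one way this might look; it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the sample JSON string and the html_to_text helper are hypothetical.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.parts)


# Hypothetical JSON with the callback wrapper already removed.
sample_json = r'{"detail":{"desc":"<p>Line one<\/p><p>Line \"two\"<\/p>"}}'

desc_html = json.loads(sample_json)["detail"]["desc"]
desc_text = html_to_text(desc_html)
print(desc_text)
```

With BeautifulSoup installed, bs(desc_html, "lxml").text could be used in place of html_to_text(desc_html).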

Another concise method extracting the JSON substring before loading:

import json

import requests

resp = requests.get('https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js')
text = resp.text
# Keep only what sits between the first "(" and the last ")": the JSON inside the callback wrapper.
text = text[text.find('(')+1:text.rfind(')')]
print(json.loads(text)['detail']['desc'])
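The slicing trick above can be wrapped in a small function that also reports a missing wrapper instead of failing inside json.loads. This is a minimal sketch: parse_jsonp is a hypothetical name and the sample body is made up. It uses the first "(" and the last ")" so that parentheses inside the payload are left alone.

```python
import json


def parse_jsonp(body):
    """Parse a body shaped like callback({...}); into a Python object."""
    start = body.find("(")
    end = body.rfind(")")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no callback wrapper found in response body")
    return json.loads(body[start + 1:end])


# Hypothetical sample; note the parentheses inside the value.
sample = '_cb_fn_demo({"detail": {"desc": "Helping (rural) schools"}});'
parsed = parse_jsonp(sample)
print(parsed["detail"]["desc"])
```

Because the callback name is never spelled out, the same function should work for other project IDs that follow this wrapper shape.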

Note: for longer crawls, send appropriate request headers (such as a browser-like User-Agent) so the crawler is less likely to be flagged.
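One way to apply headers consistently is to keep them in a single dictionary and merge in per-request extras. The sketch below is illustrative only: the header values, DEFAULT_HEADERS, fetch_with_headers, and the stand-in fake_fetch are all assumptions and do not come from the article.

```python
# Hypothetical default headers; the values are placeholders.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
    "Accept": "application/json, text/javascript, */*;q=0.1",
}


def fetch_with_headers(fetch, url, extra_headers=None):
    """Call fetch(url, headers=...) with the defaults plus any extras."""
    headers = {**DEFAULT_HEADERS, **(extra_headers or {})}
    return fetch(url, headers=headers)


def fake_fetch(url, headers):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    return {"url": url, "headers": headers}


demo = fetch_with_headers(
    fake_fetch,
    "https://example.com/data.js",
    {"Referer": "https://example.com/"},
)
print(demo["headers"])
```

With requests, the same helper could be called as fetch_with_headers(requests.get, url), since requests.get accepts a headers keyword argument.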

Conclusion

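The article demonstrates four practical ways to retrieve the JSON description field from a web page using Python: a regex over a saved text file, a regex combined with BeautifulSoup, json.loads after stripping the tags and the callback wrapper, and json.loads on the substring between the parentheses. Together they highlight the importance of proper string cleaning, regex handling, and JSON parsing for reliable data extraction.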
