How to Efficiently Extract JSON Data with Python Web Scraping – 4 Proven Methods
This article walks through a Python fan's question about storing JSON from web pages during crawling, presents four step‑by‑step code solutions—including regex, BeautifulSoup, and json.loads—and offers practical tips for clean extraction and reliable scraping.
The author, a Python enthusiast, shares a fan's question about storing JSON data from web pages during Python crawling.
Solution Process
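The original approach wrote the response to a txt file and then ran a regex over it, so the data was never parsed as JSON. The author recommends parsing the response instead, using response.json() and reading response['data']['desc'] directly. Note that response.json() only succeeds when the body is pure JSON; the endpoint used in the examples below wraps its payload in a callback (_cb_fn_proj_37754(...)), which is why those examples strip the wrapper and call json.loads, reading the field as ['detail']['desc'].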
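To make the difference between pattern matching and real parsing concrete, here is a minimal offline sketch. The payload is a made-up sample (the "data" and "desc_module" keys mirror the names mentioned above, but the content is hypothetical, not the real response). It shows that a regex returns the value with its JSON escapes intact, while json.loads decodes sequences such as \/ and \uXXXX for you.

```python
import json
import re

# Hypothetical sample shaped like the article's data; NOT the real response.
raw = '{"data": {"desc": "<p>\\u4f60\\u597d<\\/p>", "desc_module": []}}'

# Regex on the raw text: the captured value still contains JSON escapes.
by_regex = re.findall(r'"desc":\s*"(.*?)",\s*"desc_module"', raw, re.S)[0]

# JSON parsing: the parser decodes \/ and \uXXXX escapes itself.
by_json = json.loads(raw)["data"]["desc"]

print(by_regex)
print(by_json)
```

With parsing, the manual replace('\\/', '/') and unicode_escape steps become unnecessary for this kind of payload, because the JSON decoder already handles both escape forms.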
Original code example:
import re

with open('Rr.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        # regex = re.compile('"summary":"(.*?)"', re.S)
        # Non-greedy capture of everything between desc":" and ","desc_module"
        regex = re.compile('desc":"(.*?)","desc_module"', re.S)
        result = re.findall(regex, line)
        for item in result:
            print(item)

Improved version using requests, BeautifulSoup, and regex to fetch and parse the JSON data:
import re

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
# Turn the escaped "\/" back into "/".
text = resp.text.replace('\\/', '/')
# Decode \uXXXX sequences; this round trip assumes the body is otherwise ASCII.
text = text.encode('utf-8').decode('unicode_escape')
regex = re.compile('"detail_top_img":null,"desc":"(.*?)","desc_module"', re.S)
result = re.findall(regex, text)
# result[0] is the HTML description; BeautifulSoup strips the tags.
page = bs(result[0], "lxml")
print(page.text)

Further optimized version using json.loads for cleaner extraction:
import json

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
text = resp.text.replace('\\/', '/')
text = text.encode('utf-8').decode('unicode_escape')
# Strip HTML tags from the whole body, leaving the callback-wrapped JSON text.
page = bs(text, "lxml")
# Remove the callback wrapper: _cb_fn_proj_37754( ... );
data = page.text.replace('_cb_fn_proj_37754(', '').replace(');', '')
json_data = json.loads(data)
print(json_data["detail"]["desc"])

Another concise method extracting the JSON substring before loading:
import json

import requests

resp = requests.get('https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js')
text = resp.text
# Keep only what sits between the first "(" and the last ")".
text = text[text.find('(') + 1:text.rfind(')')]
print(json.loads(text)['detail']['desc'])

Note: for longer crawls, send appropriate request headers (for example a User-Agent) to reduce the chance of being flagged.
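As a rough illustration of the headers note, the sketch below attaches headers to a request without sending it. The header values are placeholders chosen for this example, not values the site is known to require. It uses the standard library's urllib so it runs without extra packages; with requests, the same dict can be passed as requests.get(url, headers=HEADERS).

```python
import urllib.request

# Hypothetical header values for illustration; adjust them to the target site.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
    "Referer": "https://gongyi.qq.com/",
}


def build_request(url: str) -> urllib.request.Request:
    """Return a GET request carrying HEADERS; nothing is sent yet."""
    return urllib.request.Request(url, headers=HEADERS)


req = build_request(
    "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```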
Conclusion
The article demonstrates four practical ways to retrieve JSON description fields from a web page using Python, highlighting the importance of proper string cleaning, regex handling, and JSON parsing for reliable data extraction.
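One string-cleaning pitfall is worth a short sketch: removing the callback wrapper with chained replace calls also deletes any ");" that happens to occur inside the data. The helper below is a hypothetical alternative, not part of the original solutions. It anchors the match to the start and end of the body instead, and the sample string is made up to trigger the edge case.

```python
import json
import re

# A callback name, "(", the payload, ")", and an optional trailing ";".
JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\(\s*(.*)\s*\)\s*;?\s*$', re.S)


def unwrap_jsonp(text: str):
    """Parse a callback-wrapped body such as cb({...}); plain JSON passes through."""
    match = JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample whose description contains ");" on purpose.
sample = '_cb_fn_proj_37754({"detail": {"desc": "a); b"}});'
print(unwrap_jsonp(sample)["detail"]["desc"])
```

Because only the outermost wrapper is removed, the description keeps its ");" characters, whereas replace(');', '') would silently drop them.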
Python Crawling & Data Mining
