Master Python Web Scraping: Extract JSON Data Efficiently with Real Code
This article walks through extracting JSON data from a web page using Python, showcasing multiple code examples with requests, regex, BeautifulSoup, and json handling, and explains how to simplify storage and avoid common parsing pitfalls for reliable web scraping.
1. Introduction
Hello, I am a Python enthusiast. A fan named Rr asked about storing JSON data from a web page during Python web scraping. This article shares the solution.
2. Solution Process
Initially the response was written to a txt file and not in JSON format, which was inconvenient.
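For comparison, here is a minimal sketch of storing a parsed payload as a .json file instead of free text, so it can be loaded back as a dict later. The helper names, the file name, and the sample payload are made up for illustration; only the standard json module is used.

```python
import json
import os
import tempfile


def save_json(payload, path):
    """Write a Python object to disk as UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(payload, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read a JSON file back into a Python object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical payload standing in for a scraped response.
sample = {"detail": {"desc": "<p>示例 text</p>"}}
path = os.path.join(tempfile.gettempdir(), "Rr_sample.json")

save_json(sample, path)
restored = load_json(path)
print(restored == sample)
```

Saving the structure rather than the raw text means later steps can index into it directly instead of searching it with a regex.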
The following code snippets demonstrate four approaches, from a regex over the saved file to parsing the payload with the json module.
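# Approach 1: read the saved response and pull out the "desc" field with a regex.
import re

# Compile the pattern once, outside the loop.
# regex = re.compile('"summary":"(.*?)"', re.S)
regex = re.compile('desc":"(.*?)","desc_module"', re.S)

with open('Rr.txt', 'r', encoding='utf-8') as f:
    for line in f:
        result = regex.findall(line)
        for item in result:
            print(item)

A simpler method, when the server returns plain JSON, is to call response.json(), read the field directly (for example response['data']['desc']), and write it to a txt file. The endpoint used below wraps its JSON in a callback, so that wrapper has to be removed before the text can be parsed as JSON.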
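The response.json() route can be sketched as below. To keep the example runnable offline, a hard-coded string stands in for the response body, and json.loads stands in for response.json(), which parses the body in the same way when it is plain JSON. The payload contents and the output file name are invented for illustration.

```python
import json
import os
import tempfile

# Stand-in for resp.text from a plain-JSON endpoint (made-up payload).
body = '{"data": {"desc": "A short project description."}}'

# With requests this step would be: payload = resp.json()
payload = json.loads(body)

# Read the nested field; .get() avoids a KeyError if a key is missing.
desc = payload.get("data", {}).get("desc", "")

# Store the extracted text in a txt file, as in the original workflow.
out_path = os.path.join(tempfile.gettempdir(), "Rr_desc.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(desc)

print(desc)
```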
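# Approach 2: fetch the page and extract the description with a regex.
import re

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
# Turn the escaped slashes (\/) back into plain slashes.
text = resp.text.replace('\\/', '/')
# Decode \uXXXX escapes into real characters (assumes the body is otherwise ASCII).
text = text.encode('utf-8').decode('unicode_escape')
regex = re.compile('"detail_top_img":null,"desc":"(.*?)","desc_module"', re.S)
result = re.findall(regex, text)
# Parse the matched fragment as HTML and print only its text.
page = bs(result[0], "lxml")
print(page.text)

A refined version using the json module: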
# Approach 3: strip the callback wrapper and parse the rest with json.
import json

import requests
from bs4 import BeautifulSoup as bs

url = "https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js"
resp = requests.get(url)
text = resp.text.replace('\\/', '/')
text = text.encode('utf-8').decode('unicode_escape')
# Parsing as HTML and taking .text drops any markup in the response.
page = bs(text, "lxml")
# Remove the callback wrapper: _cb_fn_proj_37754( ... );
data = page.text.replace('_cb_fn_proj_37754(', '').replace(');', '')
json_data = json.loads(data)
print(json_data["detail"]["desc"])

Another concise approach:
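# Approach 4: slice out the text between the outer parentheses.
import json

import requests

resp = requests.get('https://scdn.gongyi.qq.com/json_data/data_detail/54/detail.37754.js')
text = resp.text
# Keep what lies between the first '(' and the last ')'.
text = text[text.find('(') + 1:text.rfind(')')]
print(json.loads(text)['detail']['desc'])

The slicing line keeps everything between the first "(" and the last ")", which is the JSON payload inside the callback wrapper. Adding appropriate request headers is recommended for polite crawling.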
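The slicing step can be wrapped in a small helper, sketched here under the assumption that the response is a single callback(...) wrapper around one JSON value. The function name unwrap_callback and the sample string are illustrative and not part of the original code. Note that json.loads already decodes \/ and \uXXXX escapes, so no separate replace or unicode_escape pass is needed on this path.

```python
import json


def unwrap_callback(text):
    """Return the parsed JSON found between the first '(' and the last ')'."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end <= start:
        raise ValueError("no callback wrapper found")
    return json.loads(text[start + 1:end])


# Made-up response body in the same shape as the endpoint's output.
# The doubled backslashes put literal \u4f60 and \/ escapes into the JSON text.
sample = '_cb_fn_proj_37754({"detail": {"desc": "<p>\\u4f60\\u597d<\\/p>"}});'

data = unwrap_callback(sample)
desc = data["detail"]["desc"]

# json.dumps prints the value with non-ASCII characters escaped again.
print(json.dumps(desc))
```

When fetching with requests, a headers dict can be passed as requests.get(url, headers=headers, timeout=10); which headers a given site expects (a User-Agent, for instance) is something to check per site.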
3. Conclusion
This article presented four methods to extract JSON data from a web page during Python crawling, demonstrating code improvements and best practices for reliable data extraction.
Python Crawling & Data Mining
