How to Scrape NetEase Cloud Music Hot Tracks with Python and XPath
Learn how to extract song names and URLs from NetEase Cloud Music's hot tracks page using Python's requests library and XPath selectors, including handling malformed HTML, code examples, and tips for replacing interfering tags to ensure reliable scraping.
1. Introduction
A user asked how to fetch the names and links of popular songs from NetEase Cloud Music. The original attempt using xpath failed because the response HTML was not well‑formed, so direct XPath selection returned nothing.
2. Implementation
The solution replaces the problematic "<>" fragments in the response, then parses the cleaned HTML with lxml.etree. The script sends a GET request with a random user‑agent, fixes the HTML, and extracts each song name and its full URL via XPath.
# coding:utf-8
# @Time : 2022/5/11 12:46
# @Author: 皮皮
# @公众号: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(xpath).py
# @Software: PyCharm
import requests, re
from lxml import etree
from fake_useragent import UserAgent
class Wangyiyun(object):
def __init__(self):
self.base_url = 'https://music.163.com/discover/artist'
self.headers = {
'user-agent': UserAgent().random,
'referer': 'https://music.163.com/',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
}
def get_xpath(self, url):
res = requests.get(url, headers=self.headers)
html = res.text.replace('<适合才重要>', '适合才重要')
return etree.HTML(html)
def singers_parse(self, url, items):
html = self.get_xpath(url)
song_dict = {}
a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
for a in a_lis:
song_name = a.xpath('.//text()')[0]
print(song_name)
song_url = 'https://music.163.com' + a.xpath('./@href')[0]
print(song_url)
items['所有歌曲:'] = song_dict
Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})The script runs successfully and prints each song name and its corresponding link. The key point is removing the interfering "<>" characters that cause the HTML parser to misinterpret the content.
3. Conclusion
The XPath‑based approach works for scraping NetEase Cloud Music hot tracks. The difficulty lies in cleaning the malformed HTML before parsing. Future articles will demonstrate similar tasks using regular expressions, BeautifulSoup, and pyquery to reinforce Python selector fundamentals.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
