Scrape NetEase Cloud Music Hot Tracks with Python and html5lib
This article demonstrates how to scrape popular song titles and links from NetEase Cloud Music using Python's html5lib parser, covering the challenges of XPath and pyquery selectors, code implementation, encoding fixes, and a complete working example.
1. Introduction
A user asked how to obtain the names and URLs of popular songs from NetEase Cloud Music. The author tried XPath but could not retrieve the data, even though the source HTML was visible.
2. Implementation
The solution uses the html5lib parser to clean the HTML, then extracts song information with XPath, BeautifulSoup (bs4) or pyquery. Below is a complete script that works with html5lib and resolves the parsing issue.
# coding:utf-8
# @Time : 2022/5/10 10:46
# @Author: 皮皮
# @WeChat official account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(html5lib).py
# @Software: PyCharm
import requests
from lxml import etree
from fake_useragent import UserAgent
import html5lib


class Wangyiyun(object):
    def __init__(self):
        self.base_url = 'https://music.163.com/discover/artist'
        self.headers = {
            'user-agent': UserAgent().random,
            'referer': 'https://music.163.com/',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
        }

    def get_xpath(self, url):
        # Let html5lib repair the markup, then re-parse the serialized tree with lxml
        res = requests.get(url, headers=self.headers)
        return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml')))

    def singers_parse(self, url, items):
        html = self.get_xpath(url)
        song_dict = {}
        # Each hot track is an <a> inside the pre-cached song list
        a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
        for a in a_lis:
            song_name = a.xpath('.//text()')[0]
            print(song_name)
            song_url = 'https://music.163.com' + a.xpath('./@href')[0]
            print(song_url)
            # Record the pair; the original listing left song_dict empty
            song_dict[song_name] = song_url
        # The key means "All songs:"
        items['所有歌曲:'] = song_dict


Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})

The script runs successfully and prints the song names and URLs.
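The script builds each song URL by concatenating the site root with the scraped href. As a hedged alternative, the sketch below uses urljoin from the standard library, which also copes with hrefs that are already absolute. The helper name absolute_song_url and the song id 12345 are made up for illustration and do not come from the original script.

```python
from urllib.parse import urljoin

# Site root used by the script; hrefs scraped from the page are relative to it.
BASE = "https://music.163.com"


def absolute_song_url(href, base=BASE):
    """Resolve a scraped href against the site root (hypothetical helper)."""
    return urljoin(base, href)


# The song id below is a made-up placeholder, not a real track.
print(absolute_song_url("/song?id=12345"))
print(absolute_song_url("https://music.163.com/song?id=12345"))
```

Both calls yield the same absolute URL, so the loop would not need to care whether the page emits relative or absolute links.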
A common error at this step is encoding-related. Passing an explicit encoding argument to etree.tostring, the call that wraps html5lib.parse, resolves it.
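One plausible cause, which the source does not confirm, is that requests decoded a UTF-8 response as ISO-8859-1, its fallback for text responses whose Content-Type header names no charset. The standard-library sketch below reproduces that kind of garbling and shows that re-encoding with the same wrong codec recovers the original bytes. The sample string and the helper name fix_mojibake are assumptions for illustration only.

```python
def fix_mojibake(text, wrong="iso8859-1", right="utf-8"):
    """Undo a decode that used the wrong codec (hypothetical helper).

    Encoding with the wrong codec restores the original bytes, which can
    then be decoded with the right one.
    """
    return text.encode(wrong).decode(right)


# Sample text chosen for illustration; it is not taken from the site.
original = "热门歌曲"
raw_bytes = original.encode("utf-8")    # what a UTF-8 server would send
garbled = raw_bytes.decode("iso8859-1")  # what a wrong charset guess produces

assert garbled != original
assert fix_mojibake(garbled) == original
print(fix_mojibake(garbled))
```

If that is what happens here, serializing the parsed tree with encoding='iso8859-1' performs the same re-encoding step, so lxml receives the original bytes rather than the garbled text.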
The modified return statement in get_xpath:

        return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml'), encoding='iso8859-1'))

3. Conclusion
The author notes that the pyquery selector was the hardest part to get right. With the html5lib-based code and the encoding fix above, scraping NetEase Cloud Music becomes straightforward. Earlier articles implemented the same task with regular expressions, XPath, and BeautifulSoup, and readers are encouraged to experiment with different parsers.
