How to Scrape NetEase Cloud Music Hot Songs with Python and html5lib
This article explains how to retrieve the names and links of hot songs from NetEase Cloud Music using Python's requests library together with html5lib for HTML parsing, providing full code, handling encoding issues, and comparing it with previous regex, xpath, bs4, and pyquery approaches.
1. Introduction
Recently a fan asked how to fetch the names and links of hot songs from NetEase Cloud Music. Previous articles covered regex, xpath, bs4, and pyquery. This article demonstrates using the html5lib library to parse the page.
2. Implementation
The following code defines a Wangyiyun class that sends a request to the artist page, parses the HTML with html5lib, and extracts song names and URLs using XPath.
# coding:utf-8
# @Time : 2022/5/10 10:46
# @Author: 皮皮
# @公众号: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(html5lib).py
# @Software: PyCharm
import requests, re
from lxml import etree
from fake_useragent import UserAgent
import html5lib
class Wangyiyun(object):
def __init__(self):
self.base_url = 'https://music.163.com/discover/artist'
self.headers = {
'user-agent': UserAgent().random,
'referer': 'https://music.163.com/',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
}
def get_xpath(self, url):
res = requests.get(url, headers=self.headers)
return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml')))
def singers_parse(self, url, items):
html = self.get_xpath(url)
song_dict = {}
a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
for a in a_lis:
song_name = a.xpath('.//text()')[0]
print(song_name)
song_url = 'https://music.163.com' + a.xpath('./@href')[0]
print(song_url)
items['所有歌曲:'] = song_dict
Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})The script runs successfully; the resulting output is shown below.
If an encoding error occurs, add the encoding='iso8859-1' argument to the html5lib.parse call, as illustrated.
return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml', encoding='iso8859-1')))3. Conclusion
The html5lib‑based approach reliably retrieves hot song titles and links. The main difficulty lies in mastering pyquery selectors, but this method complements earlier solutions using regex, xpath, bs4, and pyquery.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
