How to Scrape NetEase Cloud Music Hot Songs with Python and html5lib

This article shows how to retrieve the names and links of hot songs from NetEase Cloud Music using Python's requests library together with html5lib for HTML parsing. It includes the full code and a fix for encoding errors, and it follows earlier articles that solved the same task with regex, xpath, bs4, and pyquery.

Python Crawling & Data Mining

1. Introduction

Recently a fan asked how to fetch the names and links of hot songs from NetEase Cloud Music. Previous articles covered regex, xpath, bs4, and pyquery. This article demonstrates using the html5lib library to parse the page.

2. Implementation

The following code defines a Wangyiyun class that sends a request to the artist page, parses the HTML with html5lib, and extracts song names and URLs using XPath.

# coding:utf-8

# @Time : 2022/5/10 10:46
# @Author: 皮皮
# @WeChat official account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(html5lib).py
# @Software: PyCharm

import requests
from lxml import etree
from fake_useragent import UserAgent
import html5lib


class Wangyiyun(object):
    def __init__(self):
        self.base_url = 'https://music.163.com/discover/artist'
        self.headers = {
            # Random browser user agent for each run
            'user-agent': UserAgent().random,
            'referer': 'https://music.163.com/',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
        }

    def get_xpath(self, url):
        res = requests.get(url, headers=self.headers)
        # Parse with html5lib into an lxml tree, then serialize and re-parse
        # with etree.HTML to get an element the XPath queries below can run on.
        return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml')))

    def singers_parse(self, url, items):
        html = self.get_xpath(url)
        song_dict = {}
        # Each <a> under the song-list-pre-cache list is one song
        a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
        for a in a_lis:
            song_name = a.xpath('.//text()')[0]
            print(song_name)
            song_url = 'https://music.163.com' + a.xpath('./@href')[0]
            print(song_url)
            song_dict[song_name] = song_url  # collect name -> URL
        items['所有歌曲:'] = song_dict  # the key means "All songs:"
        return items


Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})

Running the script prints each song name followed by its full URL.
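
To try the parsing step on its own, without a network request, a minimal sketch along these lines may help. It is not the author's code: the sample markup, the song ids, and the extract_songs helper are made up for illustration, and it uses html5lib's default ElementTree tree builder instead of lxml so that only html5lib is required.

```python
from urllib.parse import urljoin

import html5lib

BASE_URL = 'https://music.163.com'

# Made-up markup that mimics the structure targeted by the article's XPath.
SAMPLE_HTML = """
<div id="song-list-pre-cache">
  <ul>
    <li><a href="/song?id=1">Song A</a></li>
    <li><a href="/song?id=2">Song B</a></li>
  </ul>
</div>
"""


def extract_songs(html_text):
    """Return {song name: absolute URL} for the links in the song list."""
    # namespaceHTMLElements=False keeps tag names plain ('div', not '{ns}div'),
    # so the ElementTree path below can match them directly.
    root = html5lib.parse(html_text, namespaceHTMLElements=False)
    songs = {}
    for a in root.findall(".//div[@id='song-list-pre-cache']/ul/li/a"):
        name = ''.join(a.itertext()).strip()
        songs[name] = urljoin(BASE_URL, a.get('href'))
    return songs


songs = extract_songs(SAMPLE_HTML)
for name, url in songs.items():
    print(name, url)
```

ElementTree supports only a subset of XPath, which is enough for this path; the article's lxml-based version remains the better fit for more complex queries.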

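If an encoding error occurs, add the encoding='iso8859-1' argument to the html5lib.parse call, as in the line below. Note that this keyword belongs to older html5lib releases; to my knowledge, html5lib 1.x renamed the encoding options (for example transport_encoding and override_encoding) and accepts them only for bytes input, so there the call would need res.content rather than res.text.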

return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml', encoding='iso8859-1')))
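
With html5lib 1.x, one way to deal with the same problem is to hand html5lib the raw bytes and state their encoding through transport_encoding. The sketch below is an assumption about how the fix could be adapted, not the author's code: parse_bytes and the sample bytes are hypothetical, and the default ElementTree tree builder is used so the example runs without lxml.

```python
import html5lib


def parse_bytes(raw, encoding='utf-8'):
    """Parse raw response bytes, telling html5lib which encoding they use.

    Hypothetical helper: `raw` plays the role of `res.content` from requests.
    """
    return html5lib.parse(
        raw,
        transport_encoding=encoding,   # encoding declared by the transport layer
        namespaceHTMLElements=False,   # plain tag names such as 'a'
    )


# Made-up bytes standing in for a downloaded page.
raw_page = '<ul><li><a href="/song?id=1">晴天</a></li></ul>'.encode('utf-8')
root = parse_bytes(raw_page)
link = root.find('.//a')
print(link.text, link.get('href'))
```

In the scraper this would correspond to passing res.content instead of res.text, so that html5lib decodes the bytes itself rather than relying on the encoding that requests guessed.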

3. Conclusion

The html5lib-based approach retrieves the hot song titles and links. The main difficulty lies in parsing the page with html5lib and writing the XPath selector for the song list. This method complements the earlier solutions using regex, xpath, bs4, and pyquery.

Written by

Python Crawling & Data Mining

