Backend Development 6 min read

How to Scrape NetEase Cloud Music Hot Songs with Python and html5lib

This article explains how to retrieve the names and links of hot songs from NetEase Cloud Music using Python's requests library together with html5lib for HTML parsing, providing full code, handling encoding issues, and comparing it with previous regex, xpath, bs4, and pyquery approaches.

Python Crawling & Data Mining

Apr 9, 2025

How to Scrape NetEase Cloud Music Hot Songs with Python and html5lib

1. Introduction

Recently a fan asked how to fetch the names and links of hot songs from NetEase Cloud Music. Previous articles covered regex, xpath, bs4, and pyquery. This article demonstrates using the html5lib library to parse the page.

2. Implementation

The following code defines a Wangyiyun class that sends a request to the artist page, parses the HTML with html5lib, and extracts song names and URLs using XPath.

# coding:utf-8

# @Time : 2022/5/10 10:46
# @Author: 皮皮
# @公众号: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(html5lib).py
# @Software: PyCharm

import requests, re
from lxml import etree
from fake_useragent import UserAgent
import html5lib

class Wangyiyun(object):
    def __init__(self):
        self.base_url = 'https://music.163.com/discover/artist'
        self.headers = {
            'user-agent': UserAgent().random,
            'referer': 'https://music.163.com/',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
        }
    def get_xpath(self, url):
        res = requests.get(url, headers=self.headers)
        return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml')))
    def singers_parse(self, url, items):
        html = self.get_xpath(url)
        song_dict = {}
        a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
        for a in a_lis:
            song_name = a.xpath('.//text()')[0]
            print(song_name)
            song_url = 'https://music.163.com' + a.xpath('./@href')[0]
            print(song_url)
        items['所有歌曲：'] = song_dict

Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})

The script runs successfully; the resulting output is shown below.

If an encoding error occurs, add the encoding='iso8859-1' argument to the html5lib.parse call, as illustrated.

return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml', encoding='iso8859-1')))

3. Conclusion

The html5lib‑based approach reliably retrieves hot song titles and links. The main difficulty lies in mastering pyquery selectors, but this method complements earlier solutions using regex, xpath, bs4, and pyquery.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

web-scraping html5lib netease-music

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.