
Scrape NetEase Cloud Music Hot Tracks with Python and html5lib

This article demonstrates how to scrape popular song titles and links from NetEase Cloud Music using Python's html5lib parser, covering the challenges of XPath and pyquery selectors, code implementation, encoding fixes, and a complete working example.

Python Crawling & Data Mining

May 23, 2022

1. Introduction

A user asked how to obtain the names and URLs of popular songs from NetEase Cloud Music. The author first tried XPath but could not retrieve the data, even though the song list was visible in the page source.

2. Implementation

The solution uses the html5lib parser to clean the HTML, after which the song information can be extracted with XPath, BeautifulSoup (bs4), or pyquery. Below is a complete script that combines html5lib with XPath and resolves the parsing issue.

# coding:utf-8

# @Time : 2022/5/10 10:46
# @Author: 皮皮
# @WeChat official account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(html5lib).py
# @Software: PyCharm

import requests
from lxml import etree
from fake_useragent import UserAgent
import html5lib

class Wangyiyun(object):
    def __init__(self):
        self.base_url = 'https://music.163.com/discover/artist'
        self.headers = {
            'user-agent': UserAgent().random,
            'referer': 'https://music.163.com/',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
        }

    def get_xpath(self, url):
        # Let html5lib repair the markup, then serialize the tree and re-parse it with lxml for XPath.
        res = requests.get(url, headers=self.headers)
        return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml')))

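    def singers_parse(self, url, items):
        html = self.get_xpath(url)
        song_dict = {}
        # Each <a> inside the pre-cached song list holds one track.
        a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
        for a in a_lis:
            song_name = a.xpath('.//text()')[0]
            print(song_name)
            song_url = 'https://music.163.com' + a.xpath('./@href')[0]
            print(song_url)
            # Record the track so the caller receives it through items.
            song_dict[song_name] = song_url
        items['所有歌曲：'] = song_dict  # the key means "All songs:"
        return items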

Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})

The script runs successfully and prints the song names and URLs.
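
To try the parsing step without hitting the network, the same html5lib-then-XPath round trip can be run on an inline snippet. This is a minimal sketch, not the author's code: SAMPLE_HTML is a hand-written stand-in whose structure is assumed to mirror the song list, and to_xpath_tree and extract_songs are hypothetical helper names. It also swaps the string concatenation for urllib.parse.urljoin when building absolute URLs.

```python
from urllib.parse import urljoin

import html5lib
from lxml import etree

# Hypothetical stand-in for the artist page; the real markup may differ.
SAMPLE_HTML = """
<div id="song-list-pre-cache">
  <ul class="f-hide">
    <li><a href="/song?id=1">Song A</a></li>
    <li><a href="/song?id=2">Song B</a>
  </ul>
</div>
"""

BASE_URL = 'https://music.163.com'


def to_xpath_tree(html_text):
    """Repair the markup with html5lib, then re-parse the serialized tree with lxml."""
    repaired = html5lib.parse(html_text, treebuilder='lxml')
    return etree.HTML(etree.tostring(repaired))


def extract_songs(html_text):
    """Return {song name: absolute URL} for every link in the song list."""
    tree = to_xpath_tree(html_text)
    songs = {}
    for a in tree.xpath('//div[@id="song-list-pre-cache"]/ul/li/a'):
        name = ''.join(a.xpath('.//text()')).strip()
        songs[name] = urljoin(BASE_URL, a.get('href'))
    return songs


if __name__ == '__main__':
    for name, url in extract_songs(SAMPLE_HTML).items():
        print(name, url)
```

The second list item is deliberately missing its closing </li> tag; html5lib closes it the way a browser would before lxml sees the markup.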

A common error arises from encoding issues. Passing an explicit encoding argument to etree.tostring, the call that serializes the html5lib tree, resolves it:

return etree.HTML(etree.tostring(html5lib.parse(res.text, treebuilder='lxml'), encoding='iso8859-1'))
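
The effect of the encoding argument can be checked offline. The sketch below is an illustration under assumptions: the fragment and the reparse helper are made up for this example, and the fragment is built with etree.fromstring rather than html5lib to keep it short. With a Latin-1 target, lxml writes characters that Latin-1 cannot represent as numeric character references, so for this fragment the serialized bytes are pure ASCII and the text survives the second parse.

```python
from lxml import etree

# Hypothetical fragment with non-Latin text, standing in for the html5lib tree.
FRAGMENT = '<ul><li><a href="/song?id=1">晴天</a></li></ul>'


def reparse(element, encoding='iso8859-1'):
    """Serialize with an explicit encoding, then parse the bytes as HTML."""
    raw = etree.tostring(element, encoding=encoding)
    return raw, etree.HTML(raw)


root = etree.fromstring(FRAGMENT)
raw, tree = reparse(root)
names = tree.xpath('//a/text()')

if __name__ == '__main__':
    print(raw)    # serialized bytes, with the song title as character references
    print(names)  # the title as read back by the HTML parser
```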

3. Conclusion

The author confirms that the pyquery selector is the most challenging part, but with the provided code and encoding fix, scraping NetEase Cloud Music becomes straightforward. The article also references earlier implementations using regular expressions, XPath, and BeautifulSoup, encouraging readers to experiment with different parsers.
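
For readers who want to try the BeautifulSoup route mentioned above, a minimal sketch follows. It assumes the bs4 and html5lib packages are installed, runs on a hand-written SAMPLE_HTML snippet rather than the live page, and uses a hypothetical helper name, extract_with_bs4. It is not the article's earlier BeautifulSoup implementation.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the artist page; the real markup may differ.
SAMPLE_HTML = """
<div id="song-list-pre-cache">
  <ul class="f-hide">
    <li><a href="/song?id=1">Song A</a></li>
    <li><a href="/song?id=2">Song B</a></li>
  </ul>
</div>
"""


def extract_with_bs4(html_text):
    """Return {song name: absolute URL} using BeautifulSoup with the html5lib parser."""
    soup = BeautifulSoup(html_text, 'html5lib')
    links = soup.select('div#song-list-pre-cache > ul > li > a')
    return {a.get_text(strip=True): 'https://music.163.com' + a['href'] for a in links}


if __name__ == '__main__':
    for name, url in extract_with_bs4(SAMPLE_HTML).items():
        print(name, url)
```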
