How to Scrape NetEase Cloud Music Hot Tracks with Python and XPath

Learn how to extract song names and URLs from NetEase Cloud Music's hot tracks page using Python's requests library and XPath selectors, including handling malformed HTML, code examples, and tips for replacing interfering tags to ensure reliable scraping.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
How to Scrape NetEase Cloud Music Hot Tracks with Python and XPath

1. Introduction

A user asked how to fetch the names and links of popular songs from NetEase Cloud Music. The original attempt using xpath failed because the response HTML was not well‑formed, so direct XPath selection returned nothing.

2. Implementation

The solution replaces the problematic "<>" fragments in the response, then parses the cleaned HTML with lxml.etree. The script sends a GET request with a random user‑agent, fixes the HTML, and extracts each song name and its full URL via XPath.

# coding:utf-8
# @Time : 2022/5/11 12:46
# @Author: 皮皮
# @公众号: Python共享之家
# @website : http://pdcfighting.com/
# @File : 网易云音乐热门作品名字和链接(xpath).py
# @Software: PyCharm

import requests, re
from lxml import etree
from fake_useragent import UserAgent

class Wangyiyun(object):
    def __init__(self):
        self.base_url = 'https://music.163.com/discover/artist'
        self.headers = {
            'user-agent': UserAgent().random,
            'referer': 'https://music.163.com/',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
        }
    def get_xpath(self, url):
        res = requests.get(url, headers=self.headers)
        html = res.text.replace('<适合才重要>', '适合才重要')
        return etree.HTML(html)
    def singers_parse(self, url, items):
        html = self.get_xpath(url)
        song_dict = {}
        a_lis = html.xpath('//div[@id="song-list-pre-cache"]/ul/li/a')
        for a in a_lis:
            song_name = a.xpath('.//text()')[0]
            print(song_name)
            song_url = 'https://music.163.com' + a.xpath('./@href')[0]
            print(song_url)
        items['所有歌曲:'] = song_dict

Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542', items={})

The script runs successfully and prints each song name and its corresponding link. The key point is removing the interfering "<>" characters that cause the HTML parser to misinterpret the content.

3. Conclusion

The XPath‑based approach works for scraping NetEase Cloud Music hot tracks. The difficulty lies in cleaning the malformed HTML before parsing. Future articles will demonstrate similar tasks using regular expressions, BeautifulSoup, and pyquery to reinforce Python selector fundamentals.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

HTML ParsingPythonWeb ScrapingXPathnetease-music
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.