How to Scrape NetEase Cloud Music Hot Tracks with Python Regex (Step‑by‑Step)

This article explains why XPath fails on NetEase Cloud Music's hot‑song page, presents a regular‑expression solution in Python, provides a complete script, and shows the resulting song names and URLs, while previewing upcoming tutorials using xpath, bs4, and pyquery.

Python Crawling & Data Mining

May 20, 2022

1. Introduction

A user in a Python community asked how to retrieve the names and links of an artist's hot songs from NetEase Cloud Music. The data is visible in the page source, but XPath cannot extract it because the response is not well‑formed HTML.

2. Implementation

Because the response is not well‑formed HTML, XPath fails, so a regular expression is used instead. Below is the full Python script: it fetches the page, replaces one angle‑bracketed string in the response with its plain text, applies a regex pattern, and prints each song URL and name.

# coding:utf-8
# @Time : 2022/5/9 23:46
# @Author: 皮皮
# @WeChat official account: Python共享之家
# @website : http://pdcfighting.com/
# @File: 网易云音乐热门作品名字和链接(正则表达式).py
# @Software: PyCharm

import re

import requests
from fake_useragent import UserAgent


class Wangyiyun:
    def __init__(self):
        self.base_url = 'https://music.163.com/discover/artist'
        self.headers = {
            'user-agent': UserAgent().random,
            'referer': 'https://music.163.com/',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
        }

    def get_html(self, url):
        # Fetch the page; the timeout keeps a stalled request from hanging forever.
        res = requests.get(url, headers=self.headers, timeout=10)
        res.raise_for_status()
        # Drop the angle brackets around this literal text so it no longer looks like a tag.
        return res.text.replace('<适合才重要>', '适合才重要')

    def singers_parse(self, url):
        html = self.get_html(url)
        # Each song is expected as <li><a href="URL">NAME</a></li>, directly after a '>'.
        # The lookbehind checks that '>' without consuming it. A literal leading '>' would
        # swallow the '>' closing the previous item and skip every other adjacent item.
        pattern = re.compile(r'(?<=>)<li><a href="(?P<song_url>.*?)">(?P<song_name>.*?)</a></li>', re.S)
        for item in pattern.finditer(html):
            song_url = item.group('song_url')
            song_name = item.group('song_name')
            print(song_url, song_name)


if __name__ == '__main__':
    Wangyiyun().singers_parse(url='https://music.163.com/artist?id=50653542')

The script prints each song's URL and name; the original post shows the output in a screenshot.
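
The pattern can also be checked offline against a small hand-written snippet. The sketch below is illustrative only: SAMPLE_HTML is invented markup (made-up ids and names), extract_songs is a hypothetical helper, and the base URL passed to urllib.parse.urljoin assumes the hrefs are site-relative. It compares two variants of the pattern. If the list items sit back to back with no whitespace between them (an assumption about the markup), a pattern that consumes a literal leading `>` uses up the `>` that closes each matched item, so the following item has nothing to anchor on and is skipped. A lookbehind `(?<=>)` checks for the `>` without consuming it.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup: three items back to back, with invented ids and names.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/song?id=1">Song A</a></li>'
    '<li><a href="/song?id=2">Song B</a></li>'
    '<li><a href="/song?id=3">Song C</a></li>'
    '</ul>'
)

ITEM = r'<li><a href="(?P<song_url>.*?)">(?P<song_name>.*?)</a></li>'
consuming = re.compile('>' + ITEM, re.S)        # the literal '>' is part of the match
lookbehind = re.compile('(?<=>)' + ITEM, re.S)  # '>' is checked but not consumed


def extract_songs(pattern, html, base='https://music.163.com/'):
    """Return (absolute_url, name) pairs for every match of pattern in html."""
    return [
        (urljoin(base, m.group('song_url')), m.group('song_name'))
        for m in pattern.finditer(html)
    ]


print([name for _, name in extract_songs(consuming, SAMPLE_HTML)])
# ['Song A', 'Song C']
print([name for _, name in extract_songs(lookbehind, SAMPLE_HTML)])
# ['Song A', 'Song B', 'Song C']
```

Running a pattern against a fixed string like this makes it easy to count the matches before pointing the script at the live page.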

3. Summary

The regex‑based method effectively extracts hot‑song names and links from NetEase Cloud Music; the main challenge is constructing the correct regular expression. Upcoming articles will cover similar extraction tasks using xpath, bs4, and pyquery to help readers strengthen their Python selector skills.
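
One common pitfall when building such an expression is quantifier greediness. As a minimal sketch on an invented snippet (SNIPPET is not taken from the real page), the example below contrasts a lazy `.*?`, a greedy `.*`, and a negated character class `[^"]*` for capturing the href value.

```python
import re

# Hypothetical one-line snippet with two links; ids and names are invented.
SNIPPET = (
    '<li><a href="/song?id=1">Song A</a></li>'
    '<li><a href="/song?id=2">Song B</a></li>'
)

lazy = re.findall(r'<a href="(.*?)">', SNIPPET)     # stops at the first '">'
greedy = re.findall(r'<a href="(.*)">', SNIPPET)    # runs on to the last '">'
strict = re.findall(r'<a href="([^"]*)">', SNIPPET)  # cannot cross a quote at all

print(lazy)    # ['/song?id=1', '/song?id=2']
print(len(greedy))  # 1
print(strict)  # ['/song?id=1', '/song?id=2']
```

The greedy version returns a single oversized match that swallows both links, which is why the script's pattern relies on `.*?`. The negated class is a stricter alternative when the captured value must not contain a double quote.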

