
How to Scrape 700K Ximalaya Audio Records with Python and MongoDB

This article details a step‑by‑step process for crawling all popular Ximalaya channels, extracting each audio's metadata, and storing roughly 700,000 records in MongoDB, while also showing how to speed up the crawl with asynchronous requests.

MaGe Linux Operations

1. Introduction

This project crawls all channels under Ximalaya's popular section, retrieving every audio's download URL, channel information, description, and other metadata, and saves the data into MongoDB for later analysis. The total dataset is about 700,000 records.

The author undertook this crawl to meet the requirements of an AI‑big‑data company interview, hoping to demonstrate experience with audio data extraction.

2. Runtime Environment

IDE: PyCharm 2017

Python 3.6

pymongo 3.4.0

requests 2.14.2

lxml 3.7.2

BeautifulSoup 4.5.3

3. Example Analysis

1. Open the main page http://www.ximalaya.com/dq/all/. Each page lists 12 channels; each channel contains many audios, and its audio list may span several pages. The plan is to loop through all 84 listing pages, parse each one, and store every channel's name, image link, and channel link in MongoDB.
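The page loop in step 1 can be sketched roughly as follows. This is a minimal illustration, not the author's code: the "<base>/<page number>" URL pattern, the User-Agent header, and the one-second delay are assumptions, and the helper names are hypothetical.

```python
import time

BASE_URL = "http://www.ximalaya.com/dq/all/"
TOTAL_PAGES = 84  # page count stated in the article


def build_page_urls(base_url=BASE_URL, total_pages=TOTAL_PAGES):
    """Return one listing URL per page (assumed pattern: base URL + page number)."""
    return [f"{base_url}{page}" for page in range(1, total_pages + 1)]


def fetch_html(url):
    """Download one page with requests and return its HTML text."""
    import requests  # third-party; imported here so the URL helper has no dependencies

    headers = {"User-Agent": "Mozilla/5.0"}  # assumed; many sites reject the default agent
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


def crawl_listing_pages(handle_page, fetch=fetch_html, delay=1.0):
    """Fetch every listing page in order and hand its HTML to `handle_page`."""
    for url in build_page_urls():
        handle_page(fetch(url))
        time.sleep(delay)  # pause between requests to stay polite
```

Passing the fetch function as a parameter keeps the loop easy to try out against saved HTML before running it against the live site.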

2. Open the browser's developer tools, inspect the page, and locate the elements that hold the desired data. The code for this step, shown as images in the original post, extracts all popular channel information and saves it to MongoDB.
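A hedged sketch of what step 2 might look like with BeautifulSoup and pymongo is given here. The CSS class name "albumfaceOutter", the assumption that each card holds an <a> and an <img> tag, and the database and collection names in the usage comment are illustrative guesses, not values confirmed by the article.

```python
def parse_channels(html):
    """Extract name, image link, and channel link for each channel card."""
    from bs4 import BeautifulSoup  # third-party; imported lazily so save_channels works without it

    soup = BeautifulSoup(html, "lxml")
    channels = []
    for card in soup.find_all(class_="albumfaceOutter"):  # assumed class name
        channels.append({
            "title": card.img["alt"],    # channel name, assumed to be the image alt text
            "img_url": card.img["src"],  # image link
            "href": card.a["href"],      # channel link
        })
    return channels


def save_channels(collection, channels):
    """Insert channel documents into a pymongo collection; return how many were written."""
    if not channels:
        return 0  # insert_many rejects an empty list
    result = collection.insert_many(channels)
    return len(result.inserted_ids)


# Usage (assumes a local MongoDB; the names "ximalaya" and "channels" are hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient("localhost", 27017)["ximalaya"]["channels"]
#   save_channels(collection, parse_channels(html))
```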

3. To fetch all audio data within a channel, first obtain the channel link, then analyze its structure. Each audio has a unique ID found in a div attribute, which can be extracted using split() and int().
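The split() and int() step can be illustrated with a small helper. It assumes the div attribute holds the IDs as one comma-separated string; the exact attribute name and delimiter are not given in the article, so treat both as assumptions.

```python
def parse_sound_ids(attr_value):
    """Turn a delimited ID string such as "101,102,103" into a list of ints."""
    # Skip empty pieces so a trailing comma or an empty attribute does not raise.
    return [int(part) for part in attr_value.split(",") if part.strip()]
```

Converting to int early also acts as a cheap validity check: a non-numeric piece raises ValueError instead of producing a bad request later.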

4. Click an audio link, refresh the XHR tab in developer tools, and open the JSON response to view the full audio details.
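Once the XHR request has been identified, fetching one audio's details reduces to requesting a JSON URL and decoding it. The sketch below assumes an endpoint of the form "/tracks/<id>.json"; that pattern and the function names are hypothetical stand-ins for whatever URL the XHR tab actually shows.

```python
import json

TRACK_URL = "http://www.ximalaya.com/tracks/{}.json"  # assumed endpoint pattern


def track_url(track_id):
    """Build the JSON URL for one audio ID."""
    return TRACK_URL.format(int(track_id))


def fetch_track(track_id, get_text):
    """Download and decode one audio's JSON record.

    `get_text` is any callable that maps a URL to the response body, e.g.
    lambda url: requests.get(url, headers=headers, timeout=10).text
    """
    return json.loads(get_text(track_url(track_id)))
```

The decoded dict can be passed straight to a pymongo collection's insert_one, since MongoDB stores documents in the same key-value shape.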

5. A channel's audio list is paginated, so every page must be visited to extract all of its audios; the original post illustrated this handling with images.
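One way to handle the pagination is to read the last page number from the channel's paging bar and then generate a URL for each page. The sketch assumes a "?page=N" query parameter and that page 1 is the bare channel URL; neither detail is stated in the article.

```python
def channel_page_urls(channel_url, last_page):
    """Return the URL of every audio-list page in a channel.

    `last_page` is the highest page number shown in the paging bar,
    or 1 when the channel has no paging bar at all.
    """
    if last_page < 1:
        raise ValueError("last_page must be at least 1")
    # Page 1 is the channel URL itself; later pages add the assumed query parameter.
    return [channel_url] + [f"{channel_url}?page={n}" for n in range(2, last_page + 1)]
```

Note that the range ends at last_page + 1: stopping at last_page would silently skip the final page of every channel.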

6. The complete source code is available at github.com/rieuse/learnPython.

7. Converting the crawler to an asynchronous version can increase the speed by about 100 records per minute; the async code is also hosted on GitHub.
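The general shape of an asynchronous crawl can be sketched with asyncio alone. This is not the author's GitHub code: the HTTP client is left as an injected coroutine (for example, one built on an async HTTP library), and the concurrency limit of 10 is an arbitrary assumption.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=10):
    """Run `fetch(url)` for every URL with at most `max_concurrency` requests in flight.

    `fetch` must be a coroutine function; results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:  # wait for a free slot before starting the request
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```

On Python 3.7 or later this can be driven with asyncio.run(fetch_all(urls, fetch)). The semaphore matters in practice: without a cap, hundreds of thousands of simultaneous requests would likely trigger rate limiting or exhaust local sockets.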

4. Summary

The crawl collected roughly 700,000 audio records, which can later be used for analyses such as play‑count rankings, time‑segment statistics, and channel audio counts. Future work includes applying scientific computing and visualization tools for data cleaning and analysis.
