How to Scrape 700K Ximalaya Audio Records with Python and MongoDB
This article details a step‑by‑step process for crawling all popular Ximalaya channels, extracting each audio's metadata, and storing roughly 700,000 records in MongoDB, while also showing how to speed up the crawl with asynchronous requests.
1. Introduction
This project crawls all channels under Ximalaya's popular section, retrieving every audio's download URL, channel information, description, and other metadata, and saves the data into MongoDB for later analysis. The total dataset is about 700,000 records.
The author undertook this crawl to meet the requirements of an AI‑big‑data company interview, hoping to demonstrate experience with audio data extraction.
2. Runtime Environment
IDE: PyCharm 2017
Python 3.6
pymongo 3.4.0
requests 2.14.2
lxml 3.7.2
BeautifulSoup 4.5.3
3. Example Analysis
1. Open the main page http://www.ximalaya.com/dq/all/. Each page lists 12 channels; each channel contains many audios, and its audio list may be paginated. The plan is to loop through all 84 pages, parse each one, and store every channel's name, image link, and channel link in MongoDB.
2. Enable developer mode in the browser, inspect the page, and locate the desired data. The code for this step, which the original article showed as images, extracts all popular channel information and saves it to MongoDB.
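A minimal sketch of the listing crawl described in steps 1 and 2 might look like the following. It is not the author's code: the page-number URL suffix, the `div.channel-item` selector, the field names, and the User-Agent header are all assumptions, and they should be replaced with whatever the developer tools actually show.

```python
import time

BASE_URL = "http://www.ximalaya.com/dq/all/"
TOTAL_PAGES = 84  # page count reported in the article


def listing_urls(base=BASE_URL, pages=TOTAL_PAGES):
    """Return one URL per listing page (assumed pattern: base + page number)."""
    return [f"{base}{n}" for n in range(1, pages + 1)]


def parse_channels(html):
    """Extract name, image link and channel link for every channel on a page."""
    # Imported here so listing_urls() stays usable without BeautifulSoup.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")
    channels = []
    # Hypothetical selector: use the class seen in the developer tools.
    for item in soup.select("div.channel-item"):
        link, image = item.find("a"), item.find("img")
        if link is None or image is None:
            continue  # skip blocks that do not look like a channel card
        channels.append({
            "name": image.get("alt"),
            "img_url": image.get("src"),
            "href": link.get("href"),
        })
    return channels


def crawl_channels(collection, delay=1.0):
    """Walk every listing page and upsert each channel into a MongoDB collection."""
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}  # assumed to be enough for the site
    for url in listing_urls():
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        for channel in parse_channels(response.text):
            # Upsert on the channel link so a rerun does not create duplicates.
            collection.update_one(
                {"href": channel["href"]}, {"$set": channel}, upsert=True
            )
        time.sleep(delay)  # pause between pages to stay polite
```

With a local MongoDB running, `crawl_channels(MongoClient()["ximalaya"]["channels"])` (after `from pymongo import MongoClient`, with database and collection names chosen here only for illustration) would walk all 84 pages. Upserting on the channel link is a design choice of this sketch so that an interrupted crawl can be restarted safely.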
3. To fetch all audio data within a channel, first obtain the channel link, then analyze its structure. Each audio has a unique ID found in a div attribute, which can be extracted using split() and int().
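The split() and int() step can be sketched as below, assuming the div attribute holds a comma-separated list of IDs. The attribute name `sound_ids` is a hypothetical placeholder; the real name has to be read from the page source.

```python
def parse_sound_ids(raw):
    """Turn a comma-separated attribute value into a list of integer IDs."""
    return [int(part) for part in raw.split(",") if part.strip()]


def sound_ids_from_html(html, attribute="sound_ids"):
    """Read the ID list from the first div carrying the given attribute."""
    # Imported here so parse_sound_ids() can be used without lxml installed.
    from lxml import etree

    tree = etree.HTML(html)
    # `attribute` is a hypothetical name; check the real one in the page source.
    values = tree.xpath(f"//div[@{attribute}]/@{attribute}")
    return parse_sound_ids(values[0]) if values else []
```

For example, `parse_sound_ids("101,202,303")` yields the three integers 101, 202 and 303, and an empty attribute yields an empty list.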
4. Click an audio link, refresh the XHR tab in developer tools, and open the JSON response to view the full audio details.
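Once the XHR request is known, each audio's JSON can be fetched by ID and stored as a MongoDB document. The sketch below assumes a per-track JSON endpoint of the form shown in `TRACK_URL`; that pattern, the User-Agent header, and the choice of the audio ID as `_id` are assumptions of this example, not details confirmed by the article.

```python
# Assumed URL pattern; copy the real one from the XHR tab.
TRACK_URL = "http://www.ximalaya.com/tracks/{audio_id}.json"


def track_url(audio_id):
    """Build the JSON URL for one audio ID."""
    return TRACK_URL.format(audio_id=audio_id)


def to_document(audio_id, payload):
    """Copy the JSON payload and key it by audio ID for MongoDB."""
    document = dict(payload)
    document["_id"] = audio_id  # makes repeated inserts of the same audio idempotent
    return document


def fetch_and_store(audio_id, collection, session=None):
    """Download one audio's JSON and upsert it into the given collection."""
    if session is None:
        import requests

        session = requests.Session()
    response = session.get(
        track_url(audio_id),
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    response.raise_for_status()
    document = to_document(audio_id, response.json())
    collection.replace_one({"_id": audio_id}, document, upsert=True)
    return document
```

Passing a shared `requests.Session()` as `session` reuses the connection across many audios, which matters when the same host is queried hundreds of thousands of times.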
5. Audio listings are paginated, so extracting every audio from a channel means visiting each of its pages; the original article illustrated the pagination handling with images.
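One way to handle the pagination is to read the highest page number from the paging bar and then build one URL per page. The following is a sketch under assumptions: the `?page=N` query parameter, the `pagingBar_wrapper` class, and the `data-page` attribute are hypothetical and need to be checked against the real channel page.

```python
def last_page_number(page_values):
    """Return the largest numeric page value, or 1 when there is no paging bar."""
    numbers = [int(value) for value in page_values if str(value).isdigit()]
    return max(numbers, default=1)


def channel_page_urls(channel_url, last_page):
    """Page 1 is the bare channel URL; later pages add ?page=N (assumed scheme)."""
    return [channel_url] + [
        f"{channel_url}?page={n}" for n in range(2, last_page + 1)
    ]


def pagination_values(html):
    """Collect the page values advertised by the paging bar of a channel page."""
    # Imported here so the two helpers above work without lxml installed.
    from lxml import etree

    tree = etree.HTML(html)
    # Hypothetical class and attribute names; verify them in the developer tools.
    return tree.xpath('//div[@class="pagingBar_wrapper"]/a/@data-page')
```

A channel's first page would be downloaded once, passed through `pagination_values` and `last_page_number`, and the resulting URLs from `channel_page_urls` would then be parsed for audio IDs one by one. A channel without a paging bar simply produces a single URL.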
6. The complete source code is available at github.com/rieuse/learnPython.
7. Converting the crawler to an asynchronous version can raise throughput by about 100 records per minute; the async code is also hosted on GitHub.
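The asynchronous idea can be sketched with asyncio and a semaphore that caps the number of requests in flight. This is an illustration rather than the code in the repository: the use of aiohttp (which is not in the environment list above), the track URL pattern, and the limit of 10 concurrent requests are all assumptions.

```python
import asyncio


async def gather_limited(items, worker, limit=10):
    """Run worker(item) for every item, at most `limit` at a time, keeping order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:  # wait here when `limit` workers are already running
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fetch_json(session, url):
    """Download one JSON document; `session` is an aiohttp.ClientSession."""
    async with session.get(url) as response:
        response.raise_for_status()
        # content_type=None skips aiohttp's check of the Content-Type header.
        return await response.json(content_type=None)


async def crawl_tracks(audio_ids, limit=10):
    """Fetch the JSON of many audios concurrently and return the payloads."""
    import aiohttp  # assumed dependency, imported lazily

    async with aiohttp.ClientSession() as session:
        return await gather_limited(
            audio_ids,
            # Assumed URL pattern; copy the real one from the XHR tab.
            lambda audio_id: fetch_json(
                session, f"http://www.ximalaya.com/tracks/{audio_id}.json"
            ),
            limit,
        )
```

On Python 3.7 or later, `asyncio.run(crawl_tracks(ids))` drives the crawl; on Python 3.6 an event loop has to be created and run with `run_until_complete` instead. Keeping the limit modest avoids overwhelming the server while still overlapping network waits.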
5. Summary
The crawl collected roughly 700,000 audio records, which can later be used for analyses such as play‑count rankings, time‑segment statistics, and channel audio counts. Future work includes applying scientific computing and visualization tools for data cleaning and analysis.
MaGe Linux Operations