How to Scrape and Analyze Quanmin K‑Song User Data with Python
This tutorial walks through using Python and BeautifulSoup to crawl user profiles, fan lists, and song information from the Quanmin K‑Song app, clean and store the data in MongoDB, handle pagination, and prepare the dataset for further analysis.
Python crawler to get user data
By visiting a user's personal center on the Quanmin K‑Song app, the required fields (age, address, fan count, follow count, gender) can be identified. Gender is encoded in the class name of an icon ("icon_boy" or "icon_girl"). These fields are extracted with BeautifulSoup and saved to a MongoDB database for later analysis.
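The extraction step might look roughly like the sketch below. Only the "icon_boy" and "icon_girl" class names come from the text above. The surrounding markup, the other class names, and the helper names parse_gender, text_of and parse_profile are assumptions made for illustration; the real page structure would need to be checked in the browser's inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical profile markup: only the "icon_boy"/"icon_girl" class names
# come from the article; every other tag and class here is an assumption.
SAMPLE_HTML = """
<div class="profile">
  <i class="icon icon_girl"></i>
  <span class="age">25</span>
  <span class="address">Guangdong Shenzhen</span>
  <span class="fans">1234</span>
  <span class="follows">56</span>
</div>
"""


def parse_gender(soup):
    """Return 'male', 'female', or None based on the icon's class name."""
    if soup.find(class_="icon_boy"):
        return "male"
    if soup.find(class_="icon_girl"):
        return "female"
    return None


def text_of(soup, css_class):
    """Return the stripped text of the first tag with this class, or None."""
    tag = soup.find(class_=css_class)
    return tag.get_text(strip=True) if tag else None


def parse_profile(html):
    """Collect the five profile fields into one dict ready for storage."""
    soup = BeautifulSoup(html, "html.parser")
    fans = text_of(soup, "fans")
    follows = text_of(soup, "follows")
    return {
        "gender": parse_gender(soup),
        "age": text_of(soup, "age"),
        "address": text_of(soup, "address"),
        "fans": int(fans) if fans else None,
        "follows": int(follows) if follows else None,
    }


profile = parse_profile(SAMPLE_HTML)
```

Converting the counts to integers at parse time keeps the stored documents ready for numeric queries later.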
Getting multiple users' data
To collect more users, the script starts from a seed user (User A), retrieves the IDs of their fans, then visits each fan's profile to obtain further fan lists, forming a recursive crawl. Because the fan list loads asynchronously, packet capture is used to intercept the JSON responses that contain batches of 20 fans. The pagination key last_tm is extracted from each response and used for the next request.
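The crawl described above can be sketched as two small functions. The network call is left out on purpose: fetch_page is a hypothetical callable standing in for the captured request, and the response shape ({"fans": [...], "last_tm": ...}) and the starting cursor of 0 are assumptions, since the actual JSON layout is not given here. The sketch also uses a queue instead of literal recursion, which visits the same users while avoiding deep call stacks.

```python
from collections import deque


def iter_fans(uid, fetch_page):
    """Yield fan IDs for one user, following the last_tm cursor page by page.

    fetch_page(uid, last_tm) is a hypothetical callable that returns a dict
    shaped like {"fans": [ids...], "last_tm": <next cursor or None>}.
    """
    last_tm = 0  # assumed cursor value for the first page
    while True:
        page = fetch_page(uid, last_tm)
        fans = page.get("fans", [])
        if not fans:
            return
        yield from fans
        next_tm = page.get("last_tm")
        # Stop when there is no further cursor or it did not advance.
        if not next_tm or next_tm == last_tm:
            return
        last_tm = next_tm


def crawl(seed_uid, fetch_page, max_users=1000):
    """Breadth-first walk over fan lists, starting from the seed user."""
    seen = {seed_uid}
    queue = deque([seed_uid])
    while queue and len(seen) < max_users:
        uid = queue.popleft()
        for fan in iter_fans(uid, fetch_page):
            if fan not in seen:
                seen.add(fan)
                queue.append(fan)
                if len(seen) >= max_users:
                    break
    return seen
```

The seen set prevents revisiting users who follow each other, and max_users caps the crawl so it stops at a chosen dataset size.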
The number of pages required is calculated by dividing the total fan count by 20 and taking the integer part. The final script collects 8,671 user records, which is sufficient for basic analysis.
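As a worked example of the page arithmetic, the sketch below compares the integer part of the division with a rounded-up count. The function names are illustrative. Which one fits depends on an assumption the text does not settle: if every fan has to come from the paged endpoint, a fan count that is not a multiple of 20 leaves a final partial page that only the rounded-up count covers.

```python
import math

FANS_PER_PAGE = 20  # batch size reported in the article


def full_pages(total_fans, per_page=FANS_PER_PAGE):
    """Integer part of the division: counts only completely filled pages."""
    return total_fans // per_page


def pages_needed(total_fans, per_page=FANS_PER_PAGE):
    """Round up so that a trailing partial page is requested as well."""
    return math.ceil(total_fans / per_page)


# 45 fans fill two pages of 20 and leave 5 fans on a third page.
print(full_pages(45), pages_needed(45))  # prints: 2 3
```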
Python crawler to get song data
Using the previously gathered user IDs, the script fetches each user's uploaded songs. Important fields include the timestamp of activity and the mobile device model, which are extracted from the song page. Song lists also load asynchronously, so the same packet‑capture technique is applied, parsing JSON payloads for each page (8 songs per page).
Pagination is handled by incrementing the start parameter; the required number of pages is the total number of songs divided by 8, rounded up so that a final partial page is not missed. A dedicated function extracts the activity time from each song entry.
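A minimal sketch of the song pagination and the time-extraction helper follows. Several details are assumptions rather than facts from the capture: start is treated as the offset of the first song on a page (it could equally be a page index), the JSON keys "songs", "time", "title" and "device" are hypothetical, the timestamp is assumed to be Unix seconds, and fetch_songs stands in for the real request.

```python
from datetime import datetime, timezone

SONGS_PER_PAGE = 8  # page size reported in the article


def page_starts(total_songs, per_page=SONGS_PER_PAGE):
    """Offsets to request: 0, 8, 16, ... including a final partial page."""
    return list(range(0, total_songs, per_page))


def activity_time(song):
    """Convert a song entry's Unix timestamp into a readable UTC string.

    The "time" key is a hypothetical field name.
    """
    moment = datetime.fromtimestamp(int(song["time"]), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def iter_songs(uid, total_songs, fetch_songs):
    """Yield one flat record per song across all pages of a user's list.

    fetch_songs(uid, start) is a hypothetical callable returning a dict
    shaped like {"songs": [{"title": ..., "time": ..., "device": ...}]}.
    """
    for start in page_starts(total_songs):
        payload = fetch_songs(uid, start)
        for song in payload.get("songs", []):
            yield {
                "uid": uid,
                "title": song.get("title"),
                "device": song.get("device"),
                "activity_time": activity_time(song),
            }
```

Using range with a step of 8 covers a trailing partial page without separate rounding logic.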
All retrieved data is stored back into MongoDB. The final dataset contains 840,000 records.
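Storing the records could be done along these lines. This is a sketch under assumptions: the connection URI, the database name "kge", the collection name "songs" and the "uid" key are placeholders, and the upsert is one reasonable choice for avoiding duplicates on a re-crawl, not necessarily what the original script did. update_one with upsert=True is a standard pymongo collection method.

```python
def get_collection(uri="mongodb://localhost:27017", db="kge", name="songs"):
    """Open a collection; URI, database and collection names are placeholders."""
    from pymongo import MongoClient  # imported lazily; requires pymongo

    return MongoClient(uri)[db][name]


def save_record(collection, record, key="uid"):
    """Upsert one record keyed on `key` so a re-crawl does not duplicate it."""
    collection.update_one({key: record[key]}, {"$set": record}, upsert=True)


def save_all(collection, records, key="uid"):
    """Store every record and return how many were written."""
    count = 0
    for record in records:
        save_record(collection, record, key=key)
        count += 1
    return count
```

With a MongoDB server running, save_all(get_collection(), records) would write the crawled records. Song records would need a key that is unique per song rather than per user.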
Summary
The main difficulty encountered was locating the last_tm value needed for fan‑list pagination; after a day’s break the missing value was discovered in the previous page's response. The experience highlighted the importance of cleaning and normalizing data at the time of storage, such as separating combined fields, to simplify downstream analysis.
MaGe Linux Operations
