How to Scrape and Analyze Quanmin K‑Song User Data with Python

This tutorial walks through using Python and BeautifulSoup to crawl user profiles, fan lists, and song information from the Quanmin K‑Song app, clean and store the data in MongoDB, handle pagination, and prepare the dataset for further analysis.

MaGe Linux Operations

Sep 19, 2017

Python crawler to get user data

By visiting a user's personal center on the Quanmin K‑Song app, the required fields (age, address, fan count, follow count, gender) can be identified. Gender is encoded in the class name of an icon ("icon_boy" or "icon_girl"). These fields are extracted with BeautifulSoup and saved to a MongoDB database for later analysis.
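The gender lookup described above can be sketched as follows. This is a minimal illustration, not the article's actual script: the function names and the "male"/"female" labels are assumptions, and only the two class names "icon_boy" and "icon_girl" come from the text. BeautifulSoup (the third-party beautifulsoup4 package) is imported inside the function that needs it.

```python
# Class names taken from the article; the output labels are an assumption.
GENDER_BY_CLASS = {"icon_boy": "male", "icon_girl": "female"}


def gender_from_classes(classes):
    """Map an element's CSS class list to a gender label, or None if absent."""
    for name in classes or []:
        if name in GENDER_BY_CLASS:
            return GENDER_BY_CLASS[name]
    return None


def find_gender(html):
    """Return the first gender encoded in any tag's class attribute."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        # BeautifulSoup returns "class" as a list of names, or None if missing
        gender = gender_from_classes(tag.get("class"))
        if gender is not None:
            return gender
    return None
```

Scanning every tag avoids guessing at the page's real markup, which the article does not show. With a known selector the loop could be replaced by a single lookup.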

Getting multiple users' data

To collect more users, the script starts from a seed user (User A), retrieves the IDs of their fans, then visits each fan's profile to obtain further fan lists, forming a recursive crawl. Because the fan list loads asynchronously, packet capture is used to intercept the JSON responses that contain batches of 20 fans. The pagination key last_tm is extracted from each response and used for the next request.
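The cursor-style pagination and the fan-to-fan crawl can be sketched roughly as below. Several things here are assumptions rather than details from the article: the response shape (a "fans" list plus a "last_tm" value), the starting cursor of 0, and the fetch_page and get_fans callables, which stand in for the real HTTP requests. The crawl uses an explicit queue instead of literal recursion, so a long chain of fans cannot exhaust the call stack.

```python
from collections import deque


def iter_fans(uid, fetch_page, max_pages):
    """Yield fan IDs for one user, following the last_tm cursor page by page."""
    last_tm = 0  # assumed cursor value for the first request
    for _ in range(max_pages):
        page = fetch_page(uid, last_tm)  # hypothetical: returns parsed JSON
        fans = page.get("fans", [])      # hypothetical key
        if not fans:
            break
        yield from fans
        last_tm = page.get("last_tm")    # cursor for the next request
        if last_tm is None:
            break


def crawl(seed_uid, get_fans, limit):
    """Breadth-first walk over fan lists, visiting each user at most once."""
    seen = {seed_uid}
    queue = deque([seed_uid])
    while queue and len(seen) < limit:
        uid = queue.popleft()
        for fan in get_fans(uid):
            if fan not in seen:
                seen.add(fan)
                queue.append(fan)
                if len(seen) >= limit:
                    break
    return seen
```

The seen set matters because fan relationships can be mutual. Without it, two users who follow each other would keep the crawl looping.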

The number of pages required is calculated by dividing the total fan count by 20 and taking the integer part. The final script collects 8,671 user records, which is sufficient for basic analysis.
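As a small worked example of the page arithmetic: the integer part counts only full pages of 20. Whether a trailing partial page needs its own request depends on how the first page is delivered, which the article does not say, so the sketch below returns both figures. The function name is hypothetical.

```python
import math

FANS_PER_PAGE = 20  # batch size stated in the article


def page_counts(total_fans, page_size=FANS_PER_PAGE):
    """Return (full pages, pages including a trailing partial page)."""
    full_pages = total_fans // page_size            # integer part
    all_pages = math.ceil(total_fans / page_size)   # rounds a remainder up
    return full_pages, all_pages


print(page_counts(45))  # (2, 3): two full pages plus one page of 5 fans
print(page_counts(40))  # (2, 2): no remainder, both figures agree
```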

Python crawler to get song data

Using the previously gathered user IDs, the script fetches each user's uploaded songs. Important fields include the timestamp of activity and the mobile device model, which are extracted from the song page. Song lists also load asynchronously, so the same packet‑capture technique is applied, parsing JSON payloads for each page (8 songs per page).

Pagination is handled by incrementing the start parameter; the required number of pages equals the total number of songs divided by 8. A dedicated function extracts the activity time from each song entry.
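A rough sketch of the two steps just described, under stated assumptions: start is treated as an item offset that advances by 8 (it could equally be a page index in the real API), and each song entry is assumed to carry its activity time as Unix seconds under a hypothetical "time" key. Neither detail is confirmed by the article.

```python
from datetime import datetime, timezone

SONGS_PER_PAGE = 8  # page size stated in the article


def start_offsets(total_songs, page_size=SONGS_PER_PAGE):
    """Return the start value for every page, assuming start is an item offset."""
    return list(range(0, total_songs, page_size))


def activity_time(song):
    """Convert a song entry's timestamp to a timezone-aware datetime, or None."""
    raw = song.get("time")  # hypothetical key holding Unix seconds
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


print(start_offsets(20))  # [0, 8, 16]
```

Using range with a step covers a trailing partial page automatically, so no separate rounding is needed.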

All retrieved data are stored back into MongoDB. The final dataset contains 840,000 records.
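One way to write records so that re-crawling a user does not create duplicates is an upsert, sketched below. This is an assumption about how storage could be done, not the article's code: the "user_id" key and the function name are hypothetical, and the collection argument is expected to be a PyMongo collection.

```python
def save_user(collection, record):
    """Insert the record, or update it in place if this user_id already exists."""
    # collection would typically come from pymongo, for example:
    #   collection = pymongo.MongoClient()["some_db"]["users"]  # placeholder names
    return collection.update_one(
        {"user_id": record["user_id"]},  # filter on a hypothetical unique key
        {"$set": record},                # overwrite the stored fields
        upsert=True,                     # insert when no document matches
    )
```

Passing the collection in as a parameter keeps the function easy to exercise with a stand-in object, without a running MongoDB server.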

Summary

The main difficulty was locating the last_tm value needed for fan‑list pagination. After a day's break, it turned out that the value for each request is carried in the previous page's response. The experience also showed the value of cleaning and normalizing data at the time of storage, for example by separating combined fields, which simplifies downstream analysis.
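Separating a combined field before storage might look like the sketch below. The article does not say which fields were combined, so the example is purely illustrative: it assumes an address string holding a province and a city separated by whitespace, and the function and key names are hypothetical.

```python
def split_location(raw):
    """Split 'Province City' into separate fields; missing parts become None."""
    parts = (raw or "").split(maxsplit=1)
    province = parts[0] if parts else None
    city = parts[1].strip() if len(parts) > 1 else None
    return {"province": province, "city": city}


print(split_location("Guangdong Shenzhen"))
# {'province': 'Guangdong', 'city': 'Shenzhen'}
```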

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
