Multithreaded Python Crawl of Xiaomi App Store Games
This tutorial demonstrates how to use Python's requests, threading, and queue modules to build a multithreaded crawler that extracts game names and download links from the Xiaomi App Store and reports the crawl's total execution time, complete with code examples and performance tips.
Project Background
Xiaomi App Store offers a wide range of Android apps and games, but manually searching for each game is time‑consuming and the site can be slow.
Project Goal
Automatically retrieve app information for one category (here "Chat & Social"), specifically each app's name and download link, and print it to the console so the user can follow the links to download.
Libraries and Tools
Requests
Threading
Queue
JSON
Time
PyCharm (IDE)
Project Analysis
The site loads data dynamically, so we capture network packets using Chrome DevTools to find the JSON API endpoint.
Example API URL:
http://app.mi.com/categotyAllListApi?page={}&categoryId=2&pageSize=30
The key query parameters are page, categoryId, and pageSize. By iterating over the page value we can fetch multiple pages of JSON data.
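The paging scheme can be sketched as a small helper. This is a minimal illustration rather than a documented API: `build_page_urls` is a hypothetical name, and the parameter values are simply the ones from the example URL.

```python
# Template copied from the example URL; only the page number varies.
API_TEMPLATE = 'http://app.mi.com/categotyAllListApi?page={}&categoryId=2&pageSize=30'


def build_page_urls(template, page_count):
    """Return one URL per page, for pages 0 .. page_count - 1."""
    # Format the untouched template each time; never overwrite it.
    return [template.format(page) for page in range(page_count)]


if __name__ == '__main__':
    for url in build_page_urls(API_TEMPLATE, 3):
        print(url)
```

Keeping the template unchanged and formatting it into a fresh string per page means every page number actually ends up in its own URL.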
Implementation
1. Define the Spider Class
import requests
from threading import Thread
from queue import Queue
import json
import time


class XiaomiSpider(object):
    def __init__(self):
        # Minimal browser-like header sent with every request
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        # URL template; {} is filled with the page number
        self.url = 'http://app.mi.com/categotyAllListApi?page={}&categoryId=15&pageSize=30'

    def main(self):
        pass


if __name__ == '__main__':
    spider = XiaomiSpider()
    spider.main()
2. URL Queue
        # Inside __init__: a thread-safe queue that holds the page URLs
        self.url_queue = Queue()
3. Enqueue URLs
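    def url_in(self):
        # Generate URLs for pages 0-66 and put them into the queue
        for i in range(67):
            # Format into a local variable; overwriting self.url would
            # discard the {} placeholder after the first page.
            url = self.url.format(i)
            self.url_queue.put(url)
4. Thread Worker to Fetch Pages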
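The worker loop for this step can be tried offline before wiring it to the network. The following is a minimal sketch, not the tutorial's own class: `fake_fetch`, `worker`, and `run_workers` are hypothetical names, and `fake_fetch` stands in for `requests.get(url).text`. It takes items with `get_nowait()` and stops on `queue.Empty`, so a worker cannot block forever when several threads race for the last URL.

```python
from queue import Queue, Empty
from threading import Thread, Lock


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).text; no network access."""
    return 'body of ' + url


def worker(url_queue, results, lock):
    """Take URLs until the queue is empty, then exit."""
    while True:
        try:
            # Non-blocking get: raises Empty instead of waiting forever
            # if another thread took the last item first.
            url = url_queue.get_nowait()
        except Empty:
            break
        body = fake_fetch(url)
        with lock:
            results.append(body)


def run_workers(urls, thread_count=4):
    """Fill a queue with urls, drain it with thread_count workers."""
    url_queue = Queue()
    for url in urls:
        url_queue.put(url)
    results, lock = [], Lock()
    threads = [Thread(target=worker, args=(url_queue, results, lock))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    pages = run_workers(['page-%d' % i for i in range(10)])
    print(len(pages), 'pages fetched')
```

The order of `results` depends on thread scheduling, so callers should not rely on it. The spider's own worker method for this step: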
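    def get_page(self):
        # Imported here so this snippet stands alone; it can also sit
        # with the imports at the top of the file.
        from queue import Empty
        while True:
            try:
                # get_nowait() avoids the race where another thread drains
                # the queue between an empty() check and a blocking get().
                url = self.url_queue.get_nowait()
            except Empty:
                break
            html = requests.get(url, headers=self.headers).text
            self.parse_page(html)
5. Parse JSON and Extract Data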
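The extraction logic for this step can be checked against a hand-written payload first. This is a minimal sketch under stated assumptions: `SAMPLE` is a hypothetical payload containing only the fields this tutorial reads (`data`, `displayName`, `packageName`), and `extract_apps` is an illustrative name. The real response may carry more keys.

```python
import json

# Hypothetical sample shaped like the fields used in this tutorial;
# the real response may contain more keys.
SAMPLE = '{"data": [{"displayName": "Demo App", "packageName": "com.example.demo"}]}'


def extract_apps(text):
    """Return a list of {'name', 'link'} dicts from one page of JSON text."""
    payload = json.loads(text)
    apps = []
    # .get() with a default keeps a page without 'data' from raising KeyError.
    for app in payload.get('data', []):
        apps.append({
            'name': app['displayName'],
            'link': 'http://app.mi.com/details?id={}'.format(app['packageName']),
        })
    return apps


if __name__ == '__main__':
    print(extract_apps(SAMPLE))
```

Returning a list instead of printing inside the loop makes the function easy to test. The spider's own parsing method prints each entry directly: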
    def parse_page(self, html):
        # The response body is JSON text; the app entries are under 'data'
        app_json = json.loads(html)
        for app in app_json['data']:
            name = app['displayName']
            link = 'http://app.mi.com/details?id={}'.format(app['packageName'])
            print({'name': name, 'link': link})
6. Launch Multiple Threads
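    def main(self):
        # Fill the queue before any worker starts
        self.url_in()
        t_list = []
        # Start 10 worker threads
        for _ in range(10):
            t = Thread(target=self.get_page)
            t.start()
            t_list.append(t)
        # Wait for every worker to finish
        for t in t_list:
            t.join()
7. Measure Execution Time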
if __name__ == '__main__':
    start = time.time()
    spider = XiaomiSpider()
    spider.main()
    end = time.time()
    print('Execution time: %.2f s' % (end - start))
Result Display
Running the script prints each game's name, download URL, and the total execution time in the console. Sample screenshots show the output and the clickable download links.
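To get a feel for the kind of timing difference threads can make, the same measurement idea can be applied without touching the network. This is a rough sketch, not a benchmark of the real crawler: `slow_task` is a hypothetical stand-in that simulates a request with `time.sleep`, and the function names are illustrative.

```python
import time
from threading import Thread


def slow_task(delay):
    """Simulate one network request by sleeping (an assumption, not real I/O)."""
    time.sleep(delay)


def timed_sequential(count, delay):
    """Run the tasks one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(count):
        slow_task(delay)
    return time.perf_counter() - start


def timed_threaded(count, delay):
    """Run each task in its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [Thread(target=slow_task, args=(delay,)) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential: %.2f s' % timed_sequential(5, 0.1))
    print('threaded:   %.2f s' % timed_threaded(5, 0.1))
```

The sequential run takes roughly the sum of the delays, while the threaded run takes roughly one delay plus thread overhead. Real requests vary in latency, so actual numbers will differ.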
Conclusion
Do not overload the server with excessive requests; a moderate crawl is sufficient.
Python multithreading can significantly speed up I/O‑bound tasks like web crawling.
A single-threaded crawler sits idle while each request is in flight; with multiple threads, CPython releases the global interpreter lock during blocking network I/O, so other threads can make progress in the meantime.
Feel free to adapt this approach to other categories; hands‑on practice deepens understanding.
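One way to keep a crawl moderate is to cap the number of concurrent requests and pause briefly before each one. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` from the standard library in place of the manual Thread and Queue code used in this tutorial. `polite_fetch` and `crawl` are hypothetical names, the delay and worker count are arbitrary example values, and no network access is made.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def polite_fetch(url, delay=0.05):
    """Hypothetical fetch: pause, then return a placeholder body (no network)."""
    time.sleep(delay)  # per-request pause to keep the request rate modest
    return 'body of ' + url


def crawl(urls, max_workers=3):
    """Fetch all urls with at most max_workers requests in flight."""
    # max_workers caps how many requests run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(polite_fetch, urls))


if __name__ == '__main__':
    print(crawl(['page-%d' % i for i in range(6)]))
```

In a real crawler, `polite_fetch` would call `requests.get` with a timeout; suitable values for the delay and the worker count depend on the target site.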
Python Crawling & Data Mining