How to Crawl Taobao Product Data with Python: From Login to Excel Export
This tutorial walks you through logging into Taobao with Python requests, handling anti‑scraping measures, extracting product information via the PC search API, parsing JSON data, and saving the results to Excel, while also covering common pitfalls like sliders and proxy management.
Warning: This tutorial is for learning and exchange only; do not use it for commercial profit.
1. Review Taobao Login
We previously introduced how to log in to Taobao using the Python requests library. If the login fails with “apply st code failed”, replace all request parameters in the _verify_password method. In version 2.0 we added cookie serialization to avoid anti‑scraping mechanisms when the same IP logs in frequently.
The login success rate is high; if it fails, adjust the parameters as described.
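The cookie serialization mentioned above could be sketched as follows. This is a minimal illustration, not the author's implementation: the file name taobao_cookies.json and the helper names save_cookies and load_cookies are assumptions. Only cookie names and values are stored, so domain and path information is lost.

```python
import json
from pathlib import Path

import requests

COOKIE_FILE = Path("taobao_cookies.json")  # hypothetical file name


def save_cookies(session: requests.Session, path: Path = COOKIE_FILE) -> None:
    """Write the session's cookies to disk as a name -> value mapping."""
    cookies = requests.utils.dict_from_cookiejar(session.cookies)
    path.write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(session: requests.Session, path: Path = COOKIE_FILE) -> bool:
    """Restore saved cookies into the session; return False if no file exists."""
    if not path.exists():
        return False
    cookies = json.loads(path.read_text(encoding="utf-8"))
    session.cookies.update(cookies)
    return True
```

A login routine might call load_cookies first and only perform a full login when it returns False or the restored cookies have expired.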
2. Crawling Taobao Product Information
This article explains how to fetch product data; analysis will be in the next article. The goal is to make the process understandable for beginners.
We call the Taobao PC search interface, extract the returned data, and save it to an Excel file.
3. Crawling a Single Page
1. Find the data‑loading URL
Open Taobao, log in, open Chrome DevTools, go to the Network tab, enable “Preserve log”, and search for a product. The first page request returns the product information embedded in the HTML, not as pure JSON.
2. Is there a pure JSON endpoint?
The request for the second page of results returns pure JSON, and its URL contains ajax=true. Adding this parameter lets us request JSON directly, but the first page still comes back as an HTML page with a link that triggers slider verification, so the shortcut does not help for page one.
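As a rough sketch of what toggling ajax=true looks like in code, the helper below builds a search URL with and without the parameter. The endpoint https://s.taobao.com/search and the keyword parameter q are assumptions for illustration; the article does not spell out the URL, so check the request you see in DevTools.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://s.taobao.com/search"  # assumed endpoint, not given in the article


def build_search_url(keyword: str, ajax: bool = False) -> str:
    """Build a search URL, optionally adding ajax=true to ask for JSON."""
    params = {"q": keyword}  # "q" is an assumed parameter name
    if ajax:
        params["ajax"] = "true"
    return f"{SEARCH_URL}?{urlencode(params)}"
```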
3. Use the HTML page request
When ajax=true is not present we receive the full page and must extract the data ourselves.
4. Extracting Product Attributes
After obtaining the page, the JavaScript variable g_page_config contains the product information in JSON format. A regular expression can capture it:
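goods_match = re.search(r'g_page_config = (.*?)}};', response.text)
The capture group stops just before the closing }}, so append '}}' to goods_match.group(1) before passing it to json.loads. Parsing this JSON reveals the structure, allowing us to extract price and other attributes.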
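The extraction step could look roughly like this. The regular expression is the one from the article; the function names and the JSON layout used in extract_goods (mods, itemlist, data, auctions, raw_title, view_price) are assumptions for illustration and should be checked against the JSON you actually receive.

```python
import json
import re


def extract_page_config(html: str) -> dict:
    """Pull the g_page_config object out of the page and parse it as JSON."""
    match = re.search(r"g_page_config = (.*?)}};", html, re.S)
    if match is None:
        # Typically means we got a login or slider page instead of results.
        raise ValueError("g_page_config not found in response")
    # The lazy group excludes the final "}}", so put it back before parsing.
    return json.loads(match.group(1) + "}}")


def extract_goods(config: dict) -> list:
    """Collect title and price per product (assumed, hypothetical layout)."""
    auctions = (
        config.get("mods", {})
        .get("itemlist", {})
        .get("data", {})
        .get("auctions", [])
    )
    return [
        {"title": item.get("raw_title"), "price": item.get("view_price")}
        for item in auctions
    ]
```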
5. Saving to Excel
1. Install required libraries
pip install xlrd
pip install openpyxl
pip install numpy
pip install pandas
2. Write to Excel
Pandas is used because it is convenient for data analysis. Note that pandas does not support an append mode directly; you must read the existing file, append rows, and write back.
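The read, append, write-back cycle described above might be sketched like this. It assumes an .xlsx file handled through openpyxl; the helper name append_to_excel and the column names in the usage note are hypothetical.

```python
from pathlib import Path

import pandas as pd


def append_to_excel(rows: list, path: str) -> int:
    """Append rows (a list of dicts) to an Excel file; return the total row count."""
    new_df = pd.DataFrame(rows)
    file = Path(path)
    if file.exists():
        # No row-append mode: load what is there and concatenate.
        old_df = pd.read_excel(file)
        new_df = pd.concat([old_df, new_df], ignore_index=True)
    # Rewrite the whole file without the DataFrame index column.
    new_df.to_excel(file, index=False)
    return len(new_df)
```

Calling append_to_excel([{"title": "Phone case", "price": 9.9}], "goods.xlsx") after each page keeps earlier results on disk if a later request fails, at the cost of rewriting the file every time.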
6. Batch Crawling
After the single-page workflow (crawl, extract, save) is completed, we can loop over many pages. A pause of 3–10 seconds between requests is recommended to avoid frequent captcha challenges. Using this approach we have collected over two thousand records.
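A batch loop with a randomized pause could be sketched as below. The fetch_page callable is a hypothetical stand-in for the single-page workflow (request, extract, return a list of records), and the sleep parameter exists only so the pause can be replaced in tests.

```python
import random
import time
from typing import Callable, Iterable


def crawl_pages(
    fetch_page: Callable[[int], list],
    pages: Iterable[int],
    min_delay: float = 3.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Fetch each page in turn, pausing a random 3-10 s after every request."""
    results = []
    for page in pages:
        results.extend(fetch_page(page))
        # Randomized pause so requests do not arrive at a fixed rhythm.
        sleep(random.uniform(min_delay, max_delay))
    return results
```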
7. Issues Encountered
Login problems
If “apply st code failed”, replace all request parameters in _verify_password. With correct parameters the login usually succeeds.
Proxy pool
High‑quality proxies are required; free proxies often fail. The site ip.zdaye.com provides hourly updated proxies that work for Taobao.
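One simple way to use a proxy pool with requests is to pick an address at random for each request. The sketch below assumes the pool is a plain list of "host:port" strings for HTTP proxies; the helper name pick_proxy and the address in the usage note are illustrative.

```python
import random


def pick_proxy(pool: list) -> dict:
    """Choose one proxy at random, formatted for the proxies argument of requests."""
    if not pool:
        raise ValueError("proxy pool is empty")
    address = random.choice(pool)
    # The same HTTP proxy is used for both http and https traffic.
    return {"http": f"http://{address}", "https": f"http://{address}"}
```

A request would then pass the result along, for example requests.get(url, proxies=pick_proxy(["203.0.113.10:8080"]), timeout=10).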
Retry mechanism
We added a retry mechanism using the retry library to handle occasional request failures.
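To show what such a retry mechanism does, here is a small standard-library stand-in written for illustration. It is not the retry library's code, and the parameter names tries and delay are chosen for this sketch.

```python
import functools
import time


def retry(tries: int = 3, delay: float = 1.0, exceptions=(Exception,)):
    """Re-run the decorated function up to `tries` times on the given exceptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator
```

Decorating the function that sends the search request with @retry(tries=3, delay=2) would repeat a failed call twice before giving up.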
pip install retry
Slider verification
Even with the above measures, a slider captcha may appear after 20–40 requests. Currently it cannot be solved with requests; future work may involve Selenium.
Incomplete crawler
The current script is a half‑finished prototype. Future improvements include automatic proxy pool maintenance, multithreaded segmented crawling, and slider solving.
MaGe Linux Operations
