
How to Crawl Taobao Product Data with Python: From Login to Excel Export

This tutorial walks you through logging into Taobao with Python requests, handling anti‑scraping measures, extracting product information via the PC search API, parsing JSON data, and saving the results to Excel, while also covering common pitfalls like sliders and proxy management.

MaGe Linux Operations

Sep 8, 2019

Apologies for the long gap since the previous article.

Warning: This tutorial is for learning and exchange only; do not use it for commercial profit.

1. Review Taobao Login

We previously introduced how to log in to Taobao using the Python requests library. If the login fails with “apply st code failed”, replace all request parameters in the _verify_password method. In version 2.0 we added cookie serialization to avoid anti‑scraping mechanisms when the same IP logs in frequently.

The login success rate is high; if it fails, adjust the parameters as described.
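The cookie serialization idea can be sketched as follows. This is a minimal illustration, not the code from the login script: the file name and both helper functions are hypothetical, and JSON is chosen here only because it is easy to inspect. With requests, the cookie mapping could be obtained with requests.utils.dict_from_cookiejar(session.cookies) after a successful login and restored with session.cookies.update(load_cookies()) on the next run.

```python
import json
from pathlib import Path

COOKIE_FILE = Path("taobao_cookies.json")  # hypothetical file name


def save_cookies(cookies: dict, path: Path = COOKIE_FILE) -> None:
    """Write a name -> value cookie mapping to disk as JSON."""
    path.write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path: Path = COOKIE_FILE) -> dict:
    """Return the saved cookies, or an empty dict if nothing was saved yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

If load_cookies returns an empty dict, or the restored cookies have expired, the script would fall back to a full login and save the fresh cookies again.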

2. Crawling Taobao Product Information

This article explains how to fetch product data; analysis will be in the next article. The goal is to make the process understandable for beginners.

We call the Taobao PC search interface, extract the returned data, and save it to an Excel file.

3. Crawling a Single Page

1. Find the data‑loading URL

Open Taobao, log in, open Chrome DevTools, go to the Network tab, enable “Preserve log”, and search for a product. The first page request returns the product information embedded in the HTML, not as pure JSON.

2. Is there a pure JSON endpoint?

Inspecting the request for the second results page reveals a pure JSON response whenever the URL contains ajax=true. In principle this parameter lets us request JSON directly, but in practice the first-page request still comes back as an HTML page containing a link that triggers slider verification.

3. Use the HTML page request

When ajax=true is not present we receive the full page and must extract the data ourselves.
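A request URL for the full HTML page might be built like this. The endpoint, the q (keyword) and s (item offset) parameters, and the page size of 44 are assumptions based on what the browser showed at the time of writing; check them in the DevTools Network tab before relying on them. The helper name is hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://s.taobao.com/search"  # assumed PC search endpoint
ITEMS_PER_PAGE = 44  # assumed number of items per results page


def build_search_url(keyword: str, page: int = 1) -> str:
    """Build the URL for one results page (page numbers start at 1)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    # No ajax=true here, so the server answers with the full HTML page.
    params = {"q": keyword, "s": (page - 1) * ITEMS_PER_PAGE}
    return f"{SEARCH_URL}?{urlencode(params)}"
```

The resulting URL would then be fetched with the logged-in requests session so that the login cookies are sent along.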

4. Extracting Product Attributes

After obtaining the page, the JavaScript variable g_page_config contains the product information in JSON format. A regular expression can capture it:

goods_match = re.search(r'g_page_config = (.*?)}};', response.text)

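Because the non-greedy group stops just before the closing }};, the captured text is missing its last two braces; append '}}' to it before passing it to json.loads. Parsing this JSON reveals the structure, allowing us to extract price and other attributes.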
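The extraction step could look roughly like the sketch below. The regular expression is the one shown above; the path mods -> itemlist -> data -> auctions and the field names raw_title and view_price are assumptions about the page structure, so verify them against a real response. Both function names are hypothetical.

```python
import json
import re

# Same pattern as above: the lazy group stops just before the final "}};".
PAGE_CONFIG_RE = re.compile(r"g_page_config = (.*?)}};")


def extract_page_config(html: str) -> dict:
    """Return the g_page_config object embedded in the search page."""
    match = PAGE_CONFIG_RE.search(html)
    if match is None:
        raise ValueError("g_page_config not found; this may be a login or slider page")
    # Restore the two closing braces that the pattern leaves out of the group.
    return json.loads(match.group(1) + "}}")


def extract_items(config: dict) -> list:
    """Pull title and price from each product entry (assumed structure)."""
    auctions = (
        config.get("mods", {}).get("itemlist", {}).get("data", {}).get("auctions", [])
    )
    return [
        {"title": a.get("raw_title"), "price": a.get("view_price")}
        for a in auctions
    ]
```

Using .get with defaults means a page without product entries yields an empty list rather than a KeyError.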

5. Saving to Excel

1. Install required libraries

pip install xlrd
pip install openpyxl
pip install numpy
pip install pandas

2. Write to Excel

Pandas is used because it is convenient for data analysis. Note that pandas cannot append rows to an existing sheet directly; you must read the existing file, append the new rows, and write the whole table back.
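A minimal sketch of that read, append, write-back cycle is shown below. The function name is hypothetical, and the code assumes an .xlsx file with a single sheet and that openpyxl is installed so pandas can read and write it.

```python
from pathlib import Path

import pandas as pd


def append_to_excel(rows: list, path: str) -> int:
    """Append rows (a list of dicts) to an .xlsx file and return the new row count."""
    new_df = pd.DataFrame(rows)
    if Path(path).exists():
        # Read what is already there and put the new rows after it.
        old_df = pd.read_excel(path)
        new_df = pd.concat([old_df, new_df], ignore_index=True)
    # to_excel overwrites the file, so the combined table is written back whole.
    new_df.to_excel(path, index=False)
    return len(new_df)
```

Rewriting the whole file on every call is fine for a few thousand rows; for much larger runs it would be cheaper to collect rows in memory and write once at the end.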

6. Batch Crawling

Once the single‑page workflow (crawl, extract, save) works, we can loop over many pages. Pause 3–10 seconds between requests so that captcha challenges are triggered less often. With this approach we collected over two thousand records.
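The loop with a pause between requests can be sketched as follows. The function and its parameters are hypothetical; fetch_page stands for whatever function downloads and parses one page and returns a list of items. Choosing the pause at random within the 3–10 second range is an assumption made here so that requests are not evenly spaced.

```python
import random
import time


def crawl_pages(fetch_page, pages, min_delay=3.0, max_delay=10.0, sleep=time.sleep):
    """Call fetch_page(page) for each page, pausing between consecutive calls."""
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.extend(fetch_page(page))
    return results
```

The sleep argument exists only so the loop can be exercised without real waiting; in normal use it is left at its default.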

7. Issues Encountered

Login problems

If login fails with “apply st code failed”, replace all request parameters in the _verify_password method. With correct parameters the login usually succeeds.

Proxy pool

High‑quality proxies are required; free proxies often fail. The site ip.zdaye.com provides hourly updated proxies that work for Taobao.
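Picking a proxy for each request might look like the sketch below. The function name and the "host:port" pool format are assumptions; the returned mapping has the shape that requests accepts in its proxies argument, for example session.get(url, proxies=pick_proxy(pool)).

```python
import random


def pick_proxy(pool):
    """Pick one "host:port" entry and format it for the requests proxies argument."""
    if not pool:
        raise ValueError("proxy pool is empty")
    address = random.choice(pool)
    # The same HTTP proxy is used for both plain and TLS traffic.
    return {"http": f"http://{address}", "https": f"http://{address}"}
```

A proxy that fails repeatedly should be dropped from the pool; that bookkeeping is left out here.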

Retry mechanism

We added a retry mechanism using the retry library to handle occasional request failures.

pip install retry
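To show what such a retry mechanism does, here is a dependency‑free stand‑in written with the standard library. It is not the retry package itself, and the name retry_on_error is hypothetical; the package's own decorator accepts similar tries and delay arguments.

```python
import functools
import time


def retry_on_error(tries=3, delay=0.0, exceptions=(Exception,)):
    """Re-run the decorated function up to `tries` times when it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: let the last error through
                    time.sleep(delay)
        return wrapper
    return decorator
```

Decorating the function that fetches one page means a single dropped connection or dead proxy does not abort the whole crawl.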

Slider verification

Even with the above measures, a slider captcha may appear after 20–40 requests. Currently it cannot be solved with requests; future work may involve Selenium.

Incomplete crawler

The current script is a half‑finished prototype. Future improvements include automatic proxy pool maintenance, multithreaded segmented crawling, and slider solving.
