Using Python Regex to Crawl Taobao Product Information

This article demonstrates how to use Python's requests and regular‑expression libraries to crawl Taobao product listings, extract product titles and prices, handle pagination, and store the results, providing complete sample code for each step.

Python Programming Learning Circle

Mar 2, 2020

To scrape product information from Taobao with Python, first identify the search URL pattern. The keyword is passed in the q parameter, so a search for python uses https://s.taobao.com/search?q=python. Pagination is controlled by the s parameter, which acts as an item offset: each page displays 44 items, so s is 0 for the first page, 44 for the second, 88 for the third, and so on.
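
The URL pattern can be sketched as a small helper. This is only an illustration and not part of the original script: the names build_search_url, BASE_URL, and ITEMS_PER_PAGE are hypothetical, and urllib.parse.urlencode is used so that non-ASCII keywords are percent-encoded.

```python
from urllib.parse import urlencode

BASE_URL = 'https://s.taobao.com/search'
ITEMS_PER_PAGE = 44  # page size described in the text


def build_search_url(keyword, page=0):
    """Return the search URL for a zero-based page index (hypothetical helper)."""
    params = {'q': keyword, 's': ITEMS_PER_PAGE * page}
    return BASE_URL + '?' + urlencode(params)


# Print the URLs for the first three result pages
for page in range(3):
    print(build_search_url('python', page))
```

The three printed URLs end in s=0, s=44, and s=88 respectively.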

Inspecting the page source reveals that product titles are stored under the JSON key raw_title and prices under view_price. Because each appears in the source as a literal "key":"value" pair, the values can be extracted directly with regular expressions.
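
As a rough illustration of that extraction, the snippet below runs two patterns against a made-up fragment shaped like the embedded JSON. The sample string and its values are invented for the example; only the key names raw_title and view_price come from the page source described here. Adding a capture group to each pattern makes re.findall return just the value, without the key or the quotes.

```python
import re

# Invented sample that mimics the structure described in the text; not real data.
sample = (
    '{"raw_title":"Glass cup 500ml","view_price":"19.90"},'
    '{"raw_title":"Steel bottle","view_price":"35.00"}'
)

# The parentheses capture only the value between the quotes.
titles = re.findall(r'"raw_title":"(.*?)"', sample)
prices = re.findall(r'"view_price":"([\d.]*)"', sample)

print(titles)  # ['Glass cup 500ml', 'Steel bottle']
print(prices)  # ['19.90', '35.00']
```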

Below is a simple script that fetches a page, extracts the title and price lists, and prints them:

# coding:utf-8
import requests
import re

goods = '水杯'
url = 'https://s.taobao.com/search?q=' + goods
r = requests.get(url=url, timeout=10)
html = r.text

tlist = re.findall(r'"raw_title":".*?"', html)  # extract product titles
plist = re.findall(r'"view_price":"[\d\.]*"', html)  # extract product prices

print(tlist)
print(plist)
print(type(plist))  # both are stored as lists

To combine each title with its corresponding price, a loop can be used to build a list of [title, price] pairs:

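import json

goodlist = []
# zip pairs items by position and stops at the shorter list
for t, p in zip(tlist, plist):
    # split on the first colon only, then decode the quoted JSON string;
    # unlike eval, json.loads never executes the downloaded text
    title = json.loads(t.split(':', 1)[1])
    price = json.loads(p.split(':', 1)[1])
    goodlist.append([title, price])
print(goodlist)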
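
Pairing by position only works when both lists have the same length; if one item is missing a price, every later pair shifts by one. A defensive sketch (the helper name pair_fields is hypothetical) checks the counts before pairing:

```python
def pair_fields(titles, prices):
    """Pair titles with prices, refusing to guess when the counts differ."""
    if len(titles) != len(prices):
        raise ValueError(
            'got %d titles but %d prices' % (len(titles), len(prices))
        )
    return [[t, p] for t, p in zip(titles, prices)]


# Invented values, used only to show the shape of the result
print(pair_fields(['Glass cup', 'Steel bottle'], ['19.90', '35.00']))
```

Whether to raise, log, or skip the page on a mismatch is a design choice; raising simply makes the problem visible early.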

For a more structured approach, the same logic can be split into reusable functions:

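def get_html(url):
    """Fetch the HTML source of a URL; return '' if the request fails"""
    try:
        r = requests.get(url=url, timeout=10)
        r.raise_for_status()  # treat HTTP error codes as failures
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as exc:
        print("Failed to retrieve", url, exc)
        return ''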

def get_data(html, goodlist):
    """Parse product titles and prices using regular expressions"""
    # Capture groups return only the value, so no eval() or splitting is needed
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    # zip stops at the shorter list, avoiding an IndexError on a mismatch
    for title, price in zip(titles, prices):
        goodlist.append([title, price])

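def write_data(lst, num):
    """Append the first num entries of lst to the output file, one per line"""
    # Open the file once; utf-8 keeps Chinese titles intact on any platform
    with open('E:/Crawler/case/taob.txt', 'a', encoding='utf-8') as data:
        for u in lst[:num]:
            print(u, file=data)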

def main():
    goods = '水杯'
    depth = 3  # number of pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)
            html = get_html(url)
            get_data(html, infoList)
        except Exception as exc:
            # report the failing page instead of skipping it silently
            print('Skipping page', i + 1, exc)
            continue
    write_data(infoList, len(infoList))

if __name__ == '__main__':
    main()

This complete script demonstrates how to control crawl depth, fetch each page, extract the required fields with regex, accumulate the results, and finally write them to a text file.
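
If the results are meant for a spreadsheet or further analysis, CSV may be more convenient than one printed list per line. The following is a sketch using the standard csv module; the function name write_csv, the column headers, and the sample rows are assumptions made for illustration.

```python
import csv
import io


def write_csv(rows, stream):
    """Write [title, price] rows to a text stream as CSV (hypothetical helper)."""
    writer = csv.writer(stream)
    writer.writerow(['title', 'price'])  # header row; column names are assumed
    writer.writerows(rows)


# Invented rows in the same [title, price] shape the script accumulates
rows = [['Glass cup 500ml', '19.90'], ['Steel bottle', '35.00']]

buffer = io.StringIO()  # in-memory stand-in for a file
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='', encoding='utf-8') in place of the StringIO buffer.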
