Using Python Regex to Crawl Taobao Product Information
This article demonstrates how to use Python's requests and regular‑expression libraries to crawl Taobao product listings, extract product titles and prices, handle pagination, and store the results, providing complete sample code for each step.
To scrape product information from Taobao with Python, the first step is to identify the search URL pattern. The keyword parameter is q=, so the base URL becomes https://s.taobao.com/search?q=python. Pagination is controlled by the s parameter, which increases by 44 for each subsequent page because each page displays 44 items.
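The URL pattern described above can be sketched as a small helper. The names `build_search_url`, `BASE_URL`, and `PAGE_SIZE` are hypothetical and introduced only for illustration. The sketch assumes that passing s=0 for the first page behaves the same as omitting the parameter.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"
PAGE_SIZE = 44  # items per results page, as stated in the text


def build_search_url(keyword, page):
    """Return the search URL for a zero-based page index (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII keywords such as Chinese search terms
    params = {"q": keyword, "s": PAGE_SIZE * page}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for page in range(3):
        print(build_search_url("python", page))
```

Building the query string with `urlencode` rather than string concatenation keeps the URL valid when the keyword contains spaces or non-ASCII characters.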
Inspecting the page source reveals that product titles are stored under the JSON key raw_title and prices under view_price. These keys can be extracted directly with regular expressions.
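To illustrate how these two keys can be pulled out, here is a minimal sketch that runs the patterns against a made-up fragment. The `sample` string is an assumption that only imitates the shape of the embedded JSON and is not real Taobao output. `extract_pairs` is a hypothetical helper name.

```python
import re

# Hypothetical fragment imitating the embedded JSON; not real Taobao output.
sample = (
    '{"raw_title":"Glass cup 500ml","view_price":"19.90"},'
    '{"raw_title":"Steel mug","view_price":"35.00"}'
)

# Capture groups make findall() return only the value between the quotes.
TITLE_RE = re.compile(r'"raw_title":"(.*?)"')
PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')


def extract_pairs(text):
    """Return (title, price) tuples found in text, paired by position."""
    titles = TITLE_RE.findall(text)
    prices = PRICE_RE.findall(text)
    # zip() stops at the shorter list if the two counts differ
    return list(zip(titles, prices))


if __name__ == "__main__":
    print(extract_pairs(sample))
```

With capture groups, the matched values need no further quote stripping before they are stored.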
Below is a simple script that fetches a page, extracts the title and price lists, and prints them:
# coding:utf-8
import requests
import re
goods = '水杯'  # search keyword ("water cup")
url = 'https://s.taobao.com/search?q=' + goods
r = requests.get(url=url, timeout=10)
html = r.text
tlist = re.findall(r'"raw_title":".*?"', html) # extract product titles
plist = re.findall(r'"view_price":"[\d\.]*"', html) # extract product prices
print(tlist)
print(plist)
print(type(plist))  # both are stored as lists
To combine each title with its corresponding price, a loop can be used to build a list of [title, price] pairs:
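goodlist = []
# zip() stops at the shorter list, so a missing price cannot raise IndexError
for t, p in zip(tlist, plist):
    # split on the first colon only, then strip the surrounding quotes;
    # this avoids calling eval() on text taken from a web page
    title = t.split(':', 1)[1].strip('"')
    price = p.split(':', 1)[1].strip('"')
    goodlist.append([title, price])
print(goodlist)
For a more structured approach, the same logic can be split into reusable functions: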
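def get_html(url):
    """Fetch the HTML source of a URL; return an empty string on failure."""
    try:
        r = requests.get(url=url, timeout=10)
        r.raise_for_status()  # treat HTTP error codes as failures
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("Failed to retrieve", url)
        return ""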
def get_data(html, goodlist):
    """Parse product titles and prices using regular expressions"""
    # capture groups return the bare values, so no eval() is needed
    tlist = re.findall(r'"raw_title":"(.*?)"', html)
    plist = re.findall(r'"view_price":"([\d.]*)"', html)
    # zip() pairs items by position and stops at the shorter list
    for title, price in zip(tlist, plist):
        goodlist.append([title, price])
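def write_data(lst, num):
    """Append the first num items of lst to a text file, one per line."""
    # open the file once rather than once per item; UTF-8 keeps Chinese titles intact
    with open('E:/Crawler/case/taob.txt', 'a', encoding='utf-8') as data:
        for u in lst[:num]:
            print(u, file=data)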
def main():
    goods = '水杯'  # search keyword ("water cup")
    depth = 3  # number of pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        url = start_url + '&s=' + str(44 * i)
        try:
            html = get_html(url)
            if not html:  # skip pages that could not be fetched
                continue
            get_data(html, infoList)
        except Exception as exc:
            # report the problem instead of silently swallowing it
            print('Skipping page', i, exc)
            continue
    write_data(infoList, len(infoList))
if __name__ == '__main__':
    main()
This complete script controls the crawl depth, fetches each page, extracts the required fields with regex, accumulates the results, and finally writes them to a text file.
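As an alternative to a plain text file, the collected [title, price] pairs could be stored as CSV so that they open directly in a spreadsheet. This is a minimal sketch using Python's standard csv module. The function name `write_csv` and the column names are assumptions, not part of the original script.

```python
import csv


def write_csv(rows, path):
    """Write [title, price] rows to a UTF-8 CSV file with a header row."""
    # newline="" lets the csv module control line endings on every platform
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "price"])  # hypothetical column names
        writer.writerows(rows)
```

Under these assumptions, a call such as `write_csv(infoList, 'taobao.csv')` at the end of `main()` would replace the text output. The csv module quotes titles that contain commas, so each row keeps exactly two columns.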
Python Programming Learning Circle
