How to Scrape Recipes from XiaChuFang with Python: A Step‑by‑Step Guide
This tutorial walks you through building a Python web scraper that extracts recipe names, ingredients, and download links from the XiaChuFang cooking website. The scraper handles anti-scraping measures with custom headers and random user agents, and saves the collected data to a document file for future use.
Introduction
This article explains how to use Python to crawl the XiaChuFang cooking website, extract recipe information, and store it in a Word document.
Project Goal
Collect recipe names, ingredients, and download links from multiple pages and save them to a file named 菜谱.doc ("recipes"). The script writes plain UTF-8 text to this file and gives it a .doc extension.
Preparation
Software: PyCharm
Required libraries: requests, lxml, and fake_useragent (third-party), plus time (standard library)
Handling Anti‑Scraping Measures
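Two main issues are addressed: the site returns no data without proper HTTP headers, and repeated requests from the same IP can be blocked. For the first, the script sends a request header whose User-Agent string is generated at random by fake_useragent. A random User-Agent does not change the client's IP address, so for the second the script also pauses between requests to keep the request rate low.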
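As a minimal sketch of the header approach, the function below builds a browser-like header set with a randomly chosen User-Agent. The FALLBACK_USER_AGENTS list, the build_headers name, and the extra Accept, Accept-Language, and Referer values are illustrative assumptions, not part of the original script, which takes its User-Agent from fake_useragent instead.

```python
import random

# Hypothetical fallback pool (assumption); the article's script uses
# fake_useragent.UserAgent().random to obtain the string instead.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(user_agents=FALLBACK_USER_AGENTS, rng=random):
    """Return request headers with a User-Agent picked at random."""
    return {
        "User-Agent": rng.choice(user_agents),
        # The values below are assumed, browser-like defaults.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Referer": "https://www.xiachufang.com/",
    }


headers = build_headers()
print(headers["User-Agent"])
```

The resulting dictionary can be passed to requests.get(url, headers=headers) in place of a single-key header.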
Implementation
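import time
from urllib.parse import urljoin

import requests
from fake_useragent import UserAgent
from lxml import etree


class kitchen(object):
    def __init__(self):
        # Listing pages are addressed by a page number in the query string.
        self.url = "https://www.xiachufang.com/explore/?page={}"
        self.u = 0  # running count of recipes processed
        self.headers = {}
        self.ua = UserAgent()

    def set_headers(self):
        # Pick a fresh random User-Agent for each listing page.
        self.headers = {"User-Agent": self.ua.random}

    def get_page(self, url):
        # A timeout keeps the script from hanging on a stalled connection.
        res = requests.get(url=url, headers=self.headers, timeout=10)
        html = res.content.decode("utf-8")
        return html

    def parse_page(self, html):
        # Collect the links to the recipe detail pages.
        parse_html = etree.HTML(html)
        image_src_list = parse_html.xpath('//li/div/a/@href')
        return image_src_list

    def run(self, start_page, end_page):
        for page in range(start_page, end_page + 1):
            self.set_headers()
            url = self.url.format(page)
            html = self.get_page(url)
            src_list = self.parse_page(html)
            for i in src_list:
                # urljoin avoids a doubled slash when the href starts with "/".
                detail_url = urljoin("https://www.xiachufang.com/", i)
                detail_html = self.get_page(detail_url)
                detail_tree = etree.HTML(detail_html)
                # Steps heading text; guarded so a missing heading does not raise IndexError.
                steps = detail_tree.xpath('.//h2[@id="steps"]/text()')
                num = steps[0].strip() if steps else ""
                # xpath() returns lists, so join them into readable strings.
                name = "".join(detail_tree.xpath('.//li[@class="container"]/p/text()')).strip()
                ingredients = ", ".join(
                    t.strip() for t in detail_tree.xpath('.//td//a/text()') if t.strip()
                )
                self.u += 1
                # Labels in order: "No. N", "Dish name", "Ingredients", "Download link".
                food_info = f"""第 {self.u} 种
菜 名 : {name}
原 料 : {ingredients}
下 载 链 接 : {detail_url}
=================================================================
"""
                print(food_info)  # show progress in the console
                # Append the record as plain text to 菜谱.doc ("recipes").
                with open('菜谱.doc', 'a', encoding='utf-8') as f:
                    f.write(food_info)
                time.sleep(1.4)  # pause between requests


if __name__ == '__main__':
    spider = kitchen()
    spider.run(start_page=1, end_page=5)
Optimization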
The script adds a short delay (time.sleep(1.4)) between requests and uses a counter variable, self.u, to track the number of recipes processed.
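The fixed pause can be sketched as a small helper that also adds random jitter, so requests do not arrive at a perfectly regular interval. The jitter, the polite_sleep name, and its parameters are assumptions added for illustration; the original script simply calls time.sleep(1.4).

```python
import random
import time


def polite_sleep(base=1.4, jitter=0.6, sleep=time.sleep, rng=random):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    `sleep` and `rng` are injectable so the helper can be exercised
    without actually waiting.
    """
    delay = base + rng.uniform(0, jitter)
    sleep(delay)
    return delay


# Usage: count processed items and pause between them, as the script does.
processed = 0
for _ in range(3):
    processed += 1
    waited = polite_sleep(base=0.01, jitter=0.01)  # tiny values for the demo
    print(f"item {processed}: waited {waited:.3f}s")
```

In the scraper, a call such as polite_sleep() would replace time.sleep(1.4) inside the inner loop, with self.u playing the role of the counter.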
Result Display
Running the script shows progress in the console, and the extracted recipes are saved in 菜谱.doc. Screenshots of the console output and the generated document are included.
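One quick way to check the result is to read the output back and count the records. The sketch below assumes each record ends with the run of "=" characters that the script writes as a separator. split_records is a hypothetical helper name, and the sample text, including its example.com links, is a made-up stand-in for the real file contents.

```python
import re


def split_records(text):
    """Split saved output into records on separator runs of '=' signs."""
    # A run of 10 or more '=' marks the end of one record.
    parts = re.split(r"={10,}", text)
    return [part.strip() for part in parts if part.strip()]


# Sample text in the same layout the script writes (placeholder values).
sample = (
    "第 1 种\n菜 名 : A\n原 料 : x, y\n下 载 链 接 : https://example.com/recipe/1\n"
    "=================================================================\n"
    "第 2 种\n菜 名 : B\n原 料 : z\n下 载 链 接 : https://example.com/recipe/2\n"
    "=================================================================\n"
)

records = split_records(sample)
print(len(records))
```

For the real output, the text would come from open('菜谱.doc', encoding='utf-8').read(), and the record count should match the final value of self.u.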
Conclusion
The guide demonstrates a simple yet effective Python web‑scraping workflow for gathering cooking recipes, handling anti‑scraping defenses, and exporting the data for personal use.
