How to Scrape and Visualize 3,000 Chinese Recipes with Python
This article shows how to crawl 3,032 Chinese recipe entries from Douguo.com with Python, clean the data with Pandas, and visualize it with pyecharts (rating distributions, cuisine comparisons, and ingredient word clouds). It includes the core code snippets and a short analysis of the results.
Introduction
To explore the rapid changes in Chinese food culture, the author crawled the latest 3,032 recipe entries of various Chinese cuisines from Douguo.com, then cleaned and visualized the data to gain insights into recipe popularity, ratings, and ingredient usage.
Data Acquisition
The crawling process is straightforward; the core code is shown below.
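import random
import time

# Main function: fetch and parse one listing page for a cuisine.
# get_page and parse_page are the author's helpers and are not shown in this article.
def main(caipu, x):
    # The last URL segment is an offset that advances by 20 per page
    url = 'https://www.douguo.com/caipu/{}/0/{}'.format(caipu, x * 20)
    print(url)
    html = get_page(url)
    parse_page(html, caipu)

if __name__ == '__main__':
    caipu_list = ['川菜', '湘菜', '粤菜', '东北菜', '鲁菜', '浙菜', '湖北菜', '清真菜']
    start = time.time()
    for caipu in caipu_list:
        for i in range(22):
            main(caipu, i)
            time.sleep(random.uniform(1, 2))  # pause 1-2 s between requests
            print(caipu, 'page', i + 1, 'done')
    end = time.time()
    print('Total time:', round((end - start) / 60, 2), 'minutes')

This script iterates over eight Chinese cuisines (Sichuan, Hunan, Cantonese, Northeastern, Shandong, Zhejiang, Hubei, and Halal) and fetches 22 pages per cuisine.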
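The helpers get_page and parse_page are not shown in the article. As a rough sketch of the fetching half only, the following uses the standard library to build the 22 paginated URLs for one cuisine and to download a page. The names page_urls and fetch_page, the User-Agent string, and the timeout are assumptions for illustration, not the author's code; parsing is left out because the page structure is not described here.

```python
import urllib.request
from urllib.parse import quote

BASE = 'https://www.douguo.com/caipu/{}/0/{}'  # URL pattern used by the crawl script

def page_urls(cuisine, pages=22, page_size=20):
    """Return the listing URLs for one cuisine; the offset grows by page_size."""
    name = quote(cuisine)  # percent-encode the Chinese cuisine name
    return [BASE.format(name, page * page_size) for page in range(pages)]

def fetch_page(url, timeout=10):
    """Hypothetical stand-in for get_page: download one page as text."""
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')

urls = page_urls('川菜')
print(len(urls), urls[-1])
```

Building the URL list up front keeps pagination separate from network access, so the offsets can be checked without sending any requests.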
Data Cleaning
Using Pandas, the raw CSV is loaded, duplicate entries are removed, missing values are dropped, rating strings are cleaned and converted to numeric types, and a new column counting the number of ingredients per recipe is added.
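To make the effect of each step concrete, here is a minimal sketch that runs the same kind of cleaning on a few made-up rows instead of the crawled CSV. The sample data is invented, and the comma separator for ingredients is an assumption carried over from the article's illustrative code.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from the crawled CSV.
raw = pd.DataFrame({
    '菜系': ['川菜', '川菜', '粤菜', '湘菜'],        # cuisine
    '评分': ['4.8分', '4.8分', '5分', None],         # rating as scraped text
    '用料': ['花椒,豆瓣酱,干辣椒', '花椒,豆瓣酱,干辣椒',
             '五花肉,白糖', '辣椒,大蒜'],            # comma-separated ingredients
})

# Row 1 duplicates row 0 and row 3 has no rating, so two rows remain.
df = raw.drop_duplicates().dropna().copy()

# Strip the trailing '分' and convert the rating text to a float.
df['评分'] = pd.to_numeric(df['评分'].str.replace('分', '', regex=False))

# Count the ingredients in each recipe.
df['用料数'] = df['用料'].str.split(',').str.len()

print(df)
```

Splitting on the separator and taking the list length gives the same count as counting commas and adding one, as long as the ingredient field is never empty.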
# Example cleaning steps (illustrative)
import pandas as pd

df = pd.read_csv('recipes.csv')
df = df.drop_duplicates()
df = df.dropna()
# Clean rating column ('评分'): strip the trailing '分' and convert to float
df['评分'] = df['评分'].str.replace('分', '').astype(float)
# Count ingredients ('用料'), assuming they are comma-separated
df['用料数'] = df['用料'].apply(lambda x: x.count(',') + 1)

Data Visualization
Visualization is performed with the pyecharts library.
Rating Distribution (Rose Chart)
# Rose chart code (simplified)
from pyecharts import options as opts
from pyecharts.charts import Pie

# Map a numeric rating to one of four bands
def cut(x):
    if x < 4:
        return '4分以下'      # below 4
    elif x <= 4.5:
        return '4.0-4.5分'    # a rating of exactly 4.0 lands here
    elif x <= 4.9:
        return '4.6-4.9分'
    else:
        return '5分'

df['评分分布'] = df['评分'].map(cut)
df2 = df.groupby('评分分布')['评分'].count().sort_values(ascending=False)
pie = (Pie()
       .add('', [list(z) for z in zip(df2.index.tolist(), df2.tolist())],
            radius=['20%', '80%'], rosetype='area')
       .set_global_opts(title_opts=opts.TitleOpts(title='菜谱评分分布'))  # "Recipe rating distribution"
       .set_series_opts(label_opts=opts.LabelOpts(formatter='{b}:{d}%')))
pie.render_notebook()

The chart shows that recipes scoring below 4 points account for less than 2%, while perfect-score recipes reach 32.6%.
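The percentages on the rose chart are each band's count divided by the total number of recipes; in the label formatter, {b} is the band name and {d} is that percentage. As a small sketch of the arithmetic, the following buckets a handful of made-up ratings and computes the shares in plain Python. The ratings and the English band labels are assumptions for illustration only.

```python
from collections import Counter

def bucket(score):
    """Assign a rating to one of the four bands used on the rose chart."""
    if score < 4:
        return 'below 4'
    elif score <= 4.5:
        return '4.0-4.5'
    elif score <= 4.9:
        return '4.6-4.9'
    return '5'

# Made-up ratings for illustration only.
scores = [5.0, 4.8, 4.7, 5.0, 4.2, 3.9, 4.9, 4.0, 5.0, 4.6]

counts = Counter(bucket(s) for s in scores)
# Share of each band as a percentage of all ratings.
shares = {band: round(100 * n / len(scores), 1) for band, n in counts.items()}
print(shares)
```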
Cuisine Recipe Count (Pie Chart)
# Pie chart for recipe count per cuisine
from pyecharts.charts import Pie

df2 = df.groupby('菜系')['评分'].count().sort_values(ascending=False)
pie = (Pie()
       .add('', [list(z) for z in zip(df2.index.tolist(), df2.tolist())])
       # Title: "Share of recipes per cuisine"; subtitle: "Data source: Douguo"
       .set_global_opts(title_opts=opts.TitleOpts(title='各菜系菜谱数量占比', subtitle='数据来源:豆果美食')))
pie.render_notebook()

Results indicate that Sichuan and Cantonese cuisines have the most recipes, while Hubei and Halal cuisines are less represented.
Average Rating per Cuisine (Ring Chart)
# Ring chart for average rating
from pyecharts.charts import Pie

df2 = df.groupby('菜系')['评分'].mean().sort_values(ascending=False).round(2)
pie = (Pie()
       .add('', [list(z) for z in zip(df2.index.tolist(), df2.tolist())], radius=['40%', '75%'])
       .set_global_opts(title_opts=opts.TitleOpts(title='各菜系平均评分')))  # "Average rating per cuisine"
pie.render_notebook()

Every cuisine averages above 4.6, indicating uniformly high user ratings.
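The recipe counts and the average ratings both come from the same grouping by cuisine. As a sketch, the two aggregates can be computed in a single pass with agg; the rows below are made up, and only the column names follow the article's DataFrame.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the article's DataFrame.
df = pd.DataFrame({
    '菜系': ['川菜', '川菜', '粤菜', '粤菜', '湘菜'],   # cuisine
    '评分': [4.8, 5.0, 4.6, 4.8, 4.5],                 # rating
})

# One groupby yields both the recipe count and the mean rating per cuisine.
summary = (df.groupby('菜系')['评分']
             .agg(['count', 'mean'])
             .round(2)
             .sort_values('mean', ascending=False))
print(summary)
```

Each column of summary can then be passed to a chart in the same way as df2 in the snippets above.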
Ingredient Count per Cuisine (Bar Chart)
# Bar chart for average ingredient count
from pyecharts.charts import Bar

df1 = df.groupby('菜系')['用料数'].mean().sort_values(ascending=False).round(0)
bar = (Bar()
       .add_xaxis(df1.index.tolist())
       .add_yaxis('用料数量', df1.tolist())  # series name: "Number of ingredients"
       # Title: "Ingredient count per cuisine"; subtitle: "Data source: Douguo"
       .set_global_opts(title_opts=opts.TitleOpts(title='各菜系用料数量', subtitle='数据来源:豆果美食')))
bar.render_notebook()

Sichuan and Northeastern recipes use the most ingredients on average, which may reflect their rich and hearty cooking styles.
Ingredient Word Clouds
Word clouds for each cuisine highlight characteristic ingredients.
Sichuan: peppercorn, doubanjiang, dried chilies.
Cantonese: pepper, pork belly, sugar.
Hunan: chilies, garlic, peppercorn.
Northeastern: potatoes, flour, carrots.
Hubei: glutinous rice, pepper, flour.
Zhejiang: sugar, rock sugar, pepper.
Shandong: flour, carrots, oyster sauce.
Halal: protein, egg white, flour.
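The article does not show the word-cloud code. A word cloud is driven by word frequencies, so one plausible preparation step is to count how often each ingredient appears within each cuisine. The sketch below does this with the standard library on made-up rows; the sample data and the comma separator are assumptions. Lists of (word, count) pairs like these are the kind of input a word-cloud chart such as the pyecharts WordCloud takes.

```python
from collections import Counter, defaultdict

# Made-up (cuisine, ingredients) rows for illustration only.
rows = [
    ('川菜', '花椒,豆瓣酱,干辣椒'),
    ('川菜', '花椒,干辣椒,盐'),
    ('粤菜', '胡椒粉,五花肉,白糖'),
    ('粤菜', '白糖,盐'),
]

freq = defaultdict(Counter)
for cuisine, ingredients in rows:
    # Strip whitespace and skip empty fragments left by stray commas.
    words = [w.strip() for w in ingredients.split(',') if w.strip()]
    freq[cuisine].update(words)

# (word, count) pairs, most frequent first, one list per cuisine.
top = {cuisine: counter.most_common(3) for cuisine, counter in freq.items()}
print(top['川菜'])
```

Ubiquitous seasonings such as salt tend to dominate raw counts, so filtering them out before plotting may make the cuisine-specific ingredients easier to see.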
Key Findings
Overall recipe ratings are very high, with a large proportion of perfect‑score dishes.
Sichuan and Cantonese dominate in recipe count, reflecting their status among the traditional “Eight Major Cuisines”.
Ingredient analysis reveals distinct culinary traits: Sichuan emphasizes spices, Cantonese prefers lighter seasoning, Halal cuisine adheres to dietary restrictions, etc.
Conclusion
The project demonstrates a complete pipeline—from web crawling to data cleaning, analysis, and visualization—providing a practical example for learners interested in Python data mining and culinary data exploration.
Python Crawling & Data Mining
