How to Scrape and Visualize 3,000 Chinese Recipes with Python

This article demonstrates how to use Python to crawl 3,032 Chinese recipe entries from Douguo.com, clean the data with Pandas, and create insightful visualizations—including rating distributions, cuisine comparisons, and ingredient word clouds—using pyecharts, providing complete code snippets and analysis of the results.

Python Crawling & Data Mining

Sep 4, 2020

Introduction

To explore the rapid changes in Chinese food culture, the author crawled the latest 3,032 recipe entries of various Chinese cuisines from Douguo.com, then cleaned and visualized the data to gain insights into recipe popularity, ratings, and ingredient usage.

Data Acquisition

The crawling process is straightforward; the core code is shown below.

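import random
import time

# Main function: fetch and parse one listing page for a given cuisine.
# get_page and parse_page are defined elsewhere in the original script (not shown).
def main(caipu, x):
    # Each listing page advances the URL offset by 20.
    url = 'https://www.douguo.com/caipu/{}/0/{}'.format(caipu, x * 20)
    print(url)
    html = get_page(url)
    parse_page(html, caipu)

if __name__ == '__main__':
    # Sichuan, Hunan, Cantonese, Northeastern, Shandong, Zhejiang, Hubei, Halal
    caipu_list = ['川菜', '湘菜', '粤菜', '东北菜', '鲁菜', '浙菜', '湖北菜', '清真菜']
    start = time.time()
    for caipu in caipu_list:
        for i in range(22):
            main(caipu, i)
            # Pause 1-2 seconds between requests to avoid hammering the site.
            time.sleep(random.uniform(1, 2))
            print(caipu, 'page', i + 1, 'done')
    end = time.time()
    print('Total time:', round((end - start) / 60, 2), 'minutes')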

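This script iterates over eight cuisine categories (Sichuan, Hunan, Cantonese, Northeastern, Shandong, Zhejiang, Hubei, and Halal) and fetches 22 listing pages per cuisine, advancing the URL offset by 20 for each page and pausing 1 to 2 seconds between requests.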
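The crawl loop calls `get_page` and `parse_page`, which the article does not show. As a rough sketch of what the fetch half might look like, the block below builds the listing URL and downloads it with the standard library. The `build_url` helper, the `PAGE_SIZE` constant (inferred from the `x*20` offset), the User-Agent string, and the timeout are all assumptions for illustration, not the author's code.

```python
import urllib.parse
import urllib.request

BASE_URL = 'https://www.douguo.com/caipu/{}/0/{}'  # pattern taken from the article
PAGE_SIZE = 20  # assumption: inferred from the x*20 offset in the original loop


def build_url(cuisine, page):
    """Return the listing URL for a zero-based page of one cuisine."""
    # Percent-encode the Chinese cuisine name so the URL is plain ASCII.
    return BASE_URL.format(urllib.parse.quote(cuisine), page * PAGE_SIZE)


def get_page(url, timeout=10):
    """Hypothetical stand-in for the article's get_page: return HTML or None."""
    # Assumption: a browser-like User-Agent; the article does not say which headers it sent.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8', errors='replace')
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print('request failed:', url, exc)
        return None
```

Returning `None` on failure lets the caller skip a page instead of aborting the whole run; a parser written against this sketch would need to check for `None` before parsing.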

Data Cleaning

Using Pandas, the raw CSV is loaded, duplicate entries are removed, missing values are dropped, rating strings are cleaned and converted to numeric types, and a new column counting the number of ingredients per recipe is added.

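# Example cleaning steps (illustrative)
import pandas as pd

df = pd.read_csv('recipes.csv')
df = df.drop_duplicates()
df = df.dropna()
# Clean the rating column ('评分'): strip the '分' ("points") suffix, then cast to float
df['评分'] = df['评分'].str.replace('分', '', regex=False).astype(float)
# Count ingredients: '用料' holds the ingredient list, '用料数' the resulting count
# (assumes ingredients are separated by ASCII commas)
df['用料数'] = df['用料'].apply(lambda x: x.count(',') + 1)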
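The comma-counting lambda above returns 1 for an empty string and over-counts when a string ends with a separator. A slightly more defensive counter might look like the sketch below. The `count_ingredients` name and the set of separators are assumptions, since the article does not show the raw ingredient format.

```python
import re

# Assumption: ingredients may be separated by ASCII commas, full-width commas,
# or the ideographic comma; the article does not show the raw format.
SEPARATORS = re.compile(r'[,，、]')


def count_ingredients(text):
    """Count the non-empty items in a delimited ingredient string."""
    if not isinstance(text, str):
        return 0  # e.g. NaN from a missing cell
    items = [part.strip() for part in SEPARATORS.split(text)]
    return sum(1 for item in items if item)
```

It could be dropped into the cleaning step as `df['用料数'] = df['用料'].apply(count_ingredients)`.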

Data Visualization

Visualization is performed with the pyecharts library.

Rating Distribution (Rose Chart)

# Rose chart code (simplified)
from pyecharts import options as opts
from pyecharts.charts import Pie

def cut(x):
    # Bucket a numeric rating into a labeled range.
    if x < 4:
        return 'Below 4'
    elif x <= 4.5:
        return '4.0-4.5'
    elif x <= 4.9:
        return '4.6-4.9'
    else:
        return '5'

# '评分' is the rating column; '评分分布' holds each recipe's rating bucket.
df['评分分布'] = df['评分'].map(cut)
df2 = df.groupby('评分分布')['评分'].count().sort_values(ascending=False)
pie = (Pie()
       .add('', [list(z) for z in zip(df2.index.tolist(), df2.tolist())], radius=['20%', '80%'], rosetype='area')
       .set_global_opts(title_opts=opts.TitleOpts(title='Recipe rating distribution'))
       .set_series_opts(label_opts=opts.LabelOpts(formatter='{b}:{d}%')))
pie.render_notebook()

The chart shows that recipes scoring below 4 points account for less than 2%, while perfect‑score recipes reach 32.6%.
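The percentages quoted above are the `{d}` values that the pie chart computes from the bucket counts. To check such shares without rendering a chart, something like the following sketch could be used. It reuses the article's bucketing thresholds with English labels; `bucket_shares` is a hypothetical helper, and the sample ratings are invented, not the scraped data.

```python
from collections import Counter


def cut(score):
    """Bucket a rating using the same thresholds as the article's cut function."""
    if score < 4:
        return 'Below 4'
    elif score <= 4.5:
        return '4.0-4.5'
    elif score <= 4.9:
        return '4.6-4.9'
    return '5'


def bucket_shares(scores):
    """Return each bucket's share of all ratings, as a percentage."""
    counts = Counter(cut(s) for s in scores)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Made-up ratings for illustration only; these are not the scraped data.
sample = [5.0, 5.0, 4.8, 4.7, 4.7, 4.6, 4.3, 3.9]
print(bucket_shares(sample))
```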

Cuisine Recipe Count (Pie Chart)

# Pie chart for recipe count per cuisine
from pyecharts import options as opts
from pyecharts.charts import Pie

# '菜系' is the cuisine column; counting '评分' (rating) gives recipes per cuisine.
df2 = df.groupby('菜系')['评分'].count().sort_values(ascending=False)
pie = (Pie()
       .add('', [list(z) for z in zip(df2.index.tolist(), df2.tolist())])
       .set_global_opts(title_opts=opts.TitleOpts(title='Share of recipes by cuisine', subtitle='Source: Douguo')))
pie.render_notebook()

Results indicate that Sichuan and Cantonese cuisines have the most recipes, while Hubei and Halal cuisines are less represented.

Average Rating per Cuisine (Ring Chart)

# Ring chart for average rating
from pyecharts import options as opts
from pyecharts.charts import Pie

# Mean of '评分' (rating) per '菜系' (cuisine), rounded to two decimals.
df2 = df.groupby('菜系')['评分'].mean().sort_values(ascending=False).round(2)
pie = (Pie()
       .add('', [list(z) for z in zip(df2.index.tolist(), df2.tolist())], radius=['40%', '75%'])
       .set_global_opts(title_opts=opts.TitleOpts(title='Average rating by cuisine')))
pie.render_notebook()

All cuisines score above 4.6, indicating uniformly high user satisfaction.

Ingredient Count per Cuisine (Bar Chart)

# Bar chart for average ingredient count
from pyecharts import options as opts
from pyecharts.charts import Bar

# Mean of '用料数' (ingredient count) per '菜系' (cuisine), rounded to whole numbers.
df1 = df.groupby('菜系')['用料数'].mean().sort_values(ascending=False).round(0)
bar = (Bar()
       .add_xaxis(df1.index.tolist())
       .add_yaxis('Ingredient count', df1.tolist())
       .set_global_opts(title_opts=opts.TitleOpts(title='Average ingredient count by cuisine', subtitle='Source: Douguo')))
bar.render_notebook()

Sichuan and Northeastern cuisines use the most ingredients, reflecting their rich and hearty cooking styles.

Ingredient Word Clouds

Word clouds for each cuisine highlight characteristic ingredients.

Sichuan: peppercorn, doubanjiang, dried chilies.

Cantonese: pepper, pork belly, sugar.

Hunan: chilies, garlic, peppercorn.

Northeastern: potatoes, flour, carrots.

Hubei: glutinous rice, pepper, flour.

Zhejiang: sugar, ice sugar, pepper.

Shandong: flour, carrots, oyster sauce.

Halal: protein, egg white, flour.
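The article does not include the word cloud code. A word cloud needs per-cuisine ingredient frequencies as input, so one plausible approach is sketched below: count ingredients with `collections.Counter`, then hand the (word, count) pairs to pyecharts' `WordCloud` chart. The `top_ingredients` and `render_word_cloud` helpers, the input shape, and the sample rows are all assumptions for illustration.

```python
from collections import Counter


def top_ingredients(rows, cuisine, n=3):
    """Return the n most frequent ingredients for one cuisine as (word, count) pairs.

    rows is an iterable of (cuisine, ingredients) tuples, where ingredients is
    a list of strings. This input shape is an assumption for the sketch.
    """
    counts = Counter()
    for row_cuisine, ingredients in rows:
        if row_cuisine == cuisine:
            counts.update(ingredients)
    return counts.most_common(n)


def render_word_cloud(pairs, path='wordcloud.html'):
    """Render (word, count) pairs with pyecharts; requires pyecharts to be installed."""
    from pyecharts.charts import WordCloud  # imported lazily so counting works without it
    cloud = WordCloud().add('', pairs, word_size_range=[20, 100])
    return cloud.render(path)


# Made-up rows for illustration only; these are not the scraped data.
rows = [
    ('Sichuan', ['peppercorn', 'doubanjiang', 'dried chilies']),
    ('Sichuan', ['peppercorn', 'doubanjiang']),
    ('Sichuan', ['peppercorn']),
    ('Cantonese', ['sugar', 'pork belly']),
]
print(top_ingredients(rows, 'Sichuan'))
```

For a real cloud, `n` would be much larger than 3 so that less frequent ingredients still appear at smaller sizes.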

Key Findings

Overall recipe ratings are very high, with a large proportion of perfect‑score dishes.

Sichuan and Cantonese dominate in recipe count, reflecting their status among the traditional “Eight Major Cuisines”.

Ingredient analysis points to distinct culinary traits: Sichuan recipes emphasize spices, Cantonese recipes favor lighter seasoning, and Halal recipes follow dietary restrictions.

Conclusion

The project demonstrates a complete pipeline—from web crawling to data cleaning, analysis, and visualization—providing a practical example for learners interested in Python data mining and culinary data exploration.
