
Analyzing and Visualizing Maoyan Movie Reviews for “Chinese Doctors” Using Python

This tutorial demonstrates how to crawl approximately 40,000 Maoyan movie reviews for the film “Chinese Doctors,” preprocess the data, and create visualizations such as rating pie charts, city distribution maps, top‑viewer bar charts, and a word cloud using Python libraries like requests, pyecharts, and wordcloud.

Python Programming Learning Circle

This article explains a complete workflow for collecting, processing, and visualizing user comments of the Chinese movie "Chinese Doctors" from the Maoyan platform. It begins with a web‑scraping step that retrieves comment JSON data, including user ID, nickname, city, content, score, and timestamp.

Data acquisition is performed with a simple function that sends an HTTP GET request with appropriate headers:

# Fetch the raw response text for the given URL
import requests

def get_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.123 Safari/537.36',
    }
    response = requests.get(url=url, headers=headers)
    return response.text

The JSON response is parsed to extract the cmts list, and each comment is transformed into a dictionary containing the fields mentioned above:

# Parse the JSON response and keep only the fields we need
import json

def parse_data(html):
    data = json.loads(html)['cmts']
    comments = []
    for item in data:
        comment = {
            'id': item['id'],
            'nickName': item['nickName'],
            'cityName': item.get('cityName', ''),  # not every comment carries a city
            'content': item['content'].replace('\n', ' '),  # keep one comment per line
            'score': item['score'],
            'startTime': item['startTime']
        }
        comments.append(comment)
    return comments

All comments are saved to a plain‑text file 中国医生.txt for later analysis:

# Crawl backwards in time and append each page of comments to a text file
import time
from datetime import datetime, timedelta

def save_to_data():
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    end_time = '2021-07-09 00:00:00'  # the film's release date; stop once we reach it
    # zero-padded '%Y-%m-%d %H:%M:%S' strings compare correctly as plain strings
    while start_time > end_time:
        url = 'https://m.maoyan.com/mmdb/comments/movie/1337700.json?_v_=yes&offset=0&startTime=' + start_time.replace(' ', '%20')
        try:
            html = get_data(url)
        except Exception:
            time.sleep(0.5)  # back off briefly, then retry once
            html = get_data(url)
        else:
            time.sleep(0.1)  # throttle requests between pages
        comments = parse_data(html)
        if not comments:
            break
        with open('中国医生.txt', 'a', encoding='utf-8') as f:
            for item in comments:
                f.write(str(item['id']) + ',' + item['nickName'] + ',' + item['cityName'] + ',' +
                        item['content'] + ',' + str(item['score']) + ',' + item['startTime'] + '\n')
        # the API returns up to 15 comments per page; move the time window to
        # just before the last comment on this page
        start_time = comments[-1]['startTime']
        start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') - timedelta(seconds=1)
        start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')

Once the data file is ready, the script performs several analyses:

Rating distribution: the score field is read, grouped into five star levels, and visualized with a pie chart using pyecharts.

Geographic distribution: the cityName field is counted, the top 25 cities are selected, and a map of China is rendered.

Top‑viewer ranking: the same city counts are displayed as a bar chart.

Word cloud: all comment texts are concatenated, segmented with jieba, filtered by a custom stop-word list, and rendered with wordcloud on a background image.
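The article does not show how the aggregated inputs for these charts (the star-level counts and the per-city counts) are built from 中国医生.txt. A minimal sketch with collections.Counter, assuming the comma-separated line layout written by save_to_data and a hypothetical helper name load_counts, might look like this:

```python
# Hypothetical aggregation step: build star-level and city counts from the
# comma-separated lines written by save_to_data().
from collections import Counter

def load_counts(path='中国医生.txt'):
    score_counter = Counter()   # star level -> number of reviews
    city_counter = Counter()    # city name -> number of reviews
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(',')
            if len(parts) < 6:
                continue        # skip malformed lines
            # id,nickName,cityName come first; score and startTime come last,
            # so extra commas inside the comment text do not shift them
            city, score = parts[2], parts[-2]
            if city:
                city_counter[city] += 1
            try:
                stars = int(float(score) + 0.5)  # scores run 0.5-5.0; round to whole stars
                score_counter[f'{stars}星'] += 1
            except ValueError:
                pass
    return score_counter, city_counter
```

The pie chart's attr/value pair would then come from score_counter, and the map and bar chart from city_counter.most_common(25).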

Example code for the rating pie chart:

# Pie chart of the rating distribution; `attr` holds the five star levels
# and `value` the matching review counts
from pyecharts import options as opts
from pyecharts.charts import Pie

pie = (
    Pie()
    .add("", [list(z) for z in zip(attr, value)])
    .set_global_opts(title_opts=opts.TitleOpts(title="《中国医生》评分比例饼图"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
)
pie.render('评分.html')

Example code for the city distribution map:

# Map of China showing where reviewers come from; `data` is a list of
# (city, count) pairs, and min2/max2 bound the piecewise visual map
from pyecharts import options as opts
from pyecharts.charts import Geo

def geo_base():
    c = (
        Geo()
        .add_schema(maptype="china")
        .add("", [list(x) for x in data])
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(is_piecewise=True, min_=min2, max_=max2),
            title_opts=opts.TitleOpts(title="《中国医生》观影用户分布", subtitle="数据来源:猫眼 -- Dragon少年", pos_left="300px")
        )
    )
    return c

Word‑cloud generation snippet:

# Segment the concatenated comment text; `comments` is the list of comment
# texts read from 中国医生.txt and `bg_image` is the mask image array
import jieba
from wordcloud import WordCloud, STOPWORDS

comment_after_split = jieba.cut(str(comments), cut_all=False)
words = ' '.join(comment_after_split)
# Add common Chinese function words and generic movie terms to the stop list
stopwords = STOPWORDS.copy()
for w in ['电影','我','我们','的','是','了','没有','什么','有点','不是','真的','感觉','觉得','还是','但是']:
    stopwords.add(w)
# Render the word cloud onto the background-image mask
wc = WordCloud(width=1024, height=768, background_color='white', mask=bg_image,
               font_path='STKAITI.TTF', stopwords=stopwords, max_font_size=400, random_state=50)
wc.generate_from_text(words)
wc.to_file('词云图.jpg')
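The snippet above assumes `comments` and `bg_image` already exist. A sketch of that setup, assuming the line layout written by save_to_data, a hypothetical helper name load_wordcloud_inputs, and a placeholder image path, might be:

```python
# Hypothetical setup for the word-cloud snippet: load the comment texts
# from the crawl output and the background image used as a mask.
import numpy as np
from PIL import Image

def load_wordcloud_inputs(text_path='中国医生.txt', image_path='bg.jpg'):
    comments = []
    with open(text_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(',')
            if len(parts) >= 6:
                # fields between cityName and score form the comment body
                # (commas inside the comment were split apart; rejoin them)
                comments.append(','.join(parts[3:-2]))
    bg_image = np.array(Image.open(image_path))
    return comments, bg_image
```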

Finally, the article shows the resulting visualizations: a pie chart in which five-star reviews account for about 89.5%, a bar chart of the top 25 cities, a geographic heat map of viewer distribution, and a word cloud highlighting frequent terms in the comments. The analysis concludes that the film enjoys a strongly positive reception and that the most active viewing cities correspond to regions with higher GDP.

Written by Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full-stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.