
2018 Chinese Variety Show Data Analysis: Web Scraping, Rankings, and Reviews

This article demonstrates how to scrape the full 2018 Chinese variety‑show list from Douban using Python Selenium and BeautifulSoup, compile detailed metadata and actor information into Excel, and then analyze popularity rankings, rating distributions, frequent celebrity appearances, and common negative feedback.

Tencent Cloud Developer

The analysis covers Chinese variety shows released in 2018, using data scraped from Douban. Python drives a Chrome browser through Selenium to load the listing and detail pages, and BeautifulSoup parses each show's detail page.

Data Collection Methodology:

The first step involves obtaining the complete list of 2018 Chinese variety shows from Douban:

import json
import os
import time

import pandas as pd
from bs4 import BeautifulSoup
from pyecharts import Bar, Line, Overlap  # pyecharts 0.x import path; not used in this step
from selenium import webdriver
from selenium.webdriver.common.by import By

os.chdir('D:/爬虫/综艺')

# Scrape the list of 2018 mainland-China variety shows and export it to an Excel file
driver = webdriver.Chrome()
driver.maximize_window()

url = 'https://movie.douban.com/tag/#/?sort=U&range=2,10&tags=2018,%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86,%E7%BB%BC%E8%89%BA'
# Open the list in a new window, close the old one, and switch to the one that remains
js = 'window.open("' + url + '")'
driver.execute_script(js)
driver.close()
driver.switch_to.window(driver.window_handles[0])

# Scroll to the bottom and click "more" until the button can no longer be found
while True:
    try:
        js = "var q=document.documentElement.scrollTop=100000000"
        driver.execute_script(js)
        driver.find_element(By.CLASS_NAME, 'more').click()
        time.sleep(2)
    except Exception:
        break

name = [k.text for k in driver.find_elements(By.CLASS_NAME, 'title')]
score = [k.text for k in driver.find_elements(By.CLASS_NAME, 'rate')]
url = [k.get_attribute('href') for k in driver.find_elements(By.CLASS_NAME, 'item')]
pd.DataFrame({'name': name, 'score': score, 'url': url}).to_excel('综艺名称.xlsx')

The second code block shows how to collect detailed information for each show:

drama_list = pd.read_excel('综艺名称.xlsx')
driver = webdriver.Chrome()
driver.maximize_window()

drama_info = pd.DataFrame(columns=['id', 'name', 'image', 'score', 'count', 'year',
                                   'content', 'short', 'publish'])
actor_info = pd.DataFrame(columns=['name', 'url', 'drama_id', 'score', 'drama', 'rank', 'count'])
err = []  # URLs of pages that failed to load or parse

for i in range(drama_list.shape[0]):
    try:
        url = drama_list['url'][i]
        # Open the show page in a new window, close the old one, and switch over
        js = 'window.open("' + url + '")'
        driver.execute_script(js)
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        time.sleep(2)  # give the page time to load before reading its source
        bsObj = BeautifulSoup(driver.page_source, "html.parser")

        # Show metadata is embedded as JSON-LD; newlines and spaces are stripped before parsing
        data = json.loads(bsObj.find('script', attrs={'type': 'application/ld+json'})
                          .contents[0].replace('\n', '').replace(' ', ''))
        actor_name = [k['name'] for k in data['actor']]
        actor_url = [k['url'] for k in data['actor']]
        drama_score = data['aggregateRating']['ratingValue']
        drama_count = data['aggregateRating']['ratingCount']
        drama_name = data['name']
        drama_genre = data['genre']
        drama_image = data['image']
        drama_publish = data['datePublished']
        # The year span reads like "(2018)"; the slice keeps the four digits
        drama_year = bsObj.find('span', attrs={"class": "year"}).text[1:5]
        drama_content = bsObj.find('span', attrs={"property": "v:summary"}).text.replace('\n', '')
        drama_short = [k.text for k in bsObj.find_all('span', attrs={"class": "short"})]

        # DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
        row = {'id': drama_list['url'][i], 'name': drama_name, 'image': drama_image,
               'score': drama_score, 'count': drama_count,
               'year': drama_year, 'content': drama_content,
               'short': drama_short, 'publish': drama_publish}
        drama_info = pd.concat([drama_info, pd.DataFrame([row])], ignore_index=True)

        # One row per cast member; rank is the billing order on the page
        this_actors = pd.DataFrame({'name': actor_name, 'url': actor_url,
                                    'drama_id': drama_list['url'][i], 'score': drama_score,
                                    'drama': drama_name, 'rank': list(range(len(actor_name))),
                                    'count': drama_count})
        actor_info = pd.concat([actor_info, this_actors])
        print(str(i))
    except Exception:
        # Log the show that failed and keep its URL for a later retry
        print(drama_list['name'][i])
        err.append(drama_list['url'][i])
        continue
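The JSON-LD extraction step can be tried without a browser. The following is a minimal sketch on a hand-written HTML fragment: `SAMPLE_HTML`, the helper name `parse_ld_json`, and all values in it are hypothetical, and real Douban markup may differ. It also shows one alternative to stripping every space before parsing: `json.loads(..., strict=False)` accepts raw line breaks inside strings, so names that contain spaces survive intact.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page fragment for illustration only; not copied from Douban.
# Note the raw line break inside the "description" string.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{
  "name": "Example Show",
  "datePublished": "2018-01-01",
  "description": "First line
second line",
  "actor": [{"name": "Actor A", "url": "/celebrity/1/"},
            {"name": "Actor B", "url": "/celebrity/2/"}],
  "aggregateRating": {"ratingCount": "1234", "ratingValue": "8.1"}
}
</script>
</head><body></body></html>
"""


def parse_ld_json(html):
    """Return selected fields from the first JSON-LD block in an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", attrs={"type": "application/ld+json"})
    if tag is None or tag.string is None:
        raise ValueError("no application/ld+json block found")
    # strict=False allows control characters (such as newlines) inside strings
    data = json.loads(tag.string, strict=False)
    rating = data["aggregateRating"]
    return {
        "name": data["name"],
        "publish": data["datePublished"],
        "actors": [a["name"] for a in data["actor"]],
        "score": float(rating["ratingValue"]),
        "count": int(rating["ratingCount"]),
    }


info = parse_ld_json(SAMPLE_HTML)
print(info["name"], info["score"], info["count"], info["actors"])
```

Converting the score and count to numbers at parse time is a design choice of this sketch; the scraping code above stores them as they appear in the page.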

Key Findings:

The analysis covers multiple dimensions including popularity rankings (based on comment counts), ratings analysis, and most frequent celebrity appearances. Top shows by popularity include "Idol Producer" (偶像练习生), "Street Dance of China" (这!就是街舞), and "Keep Running" (奔跑吧). Cultural programs like "National Treasure" (国家宝藏) and "Reader" (朗读者) received the highest ratings.
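As a rough illustration of how these three rankings could be computed with pandas, here is a minimal sketch. The column names mirror the `drama_info` and `actor_info` tables built during scraping, but every row below is made-up toy data, and treating the scraped `count` column as the popularity measure is an assumption of this sketch rather than something the article confirms.

```python
import pandas as pd

# Hypothetical toy rows standing in for the scraped tables; not real Douban data.
drama_info = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C"],
    "score": ["8.5", "6.2", "9.1"],       # scraped values arrive as strings
    "count": ["1200", "54000", "800"],
})
actor_info = pd.DataFrame({
    "name": ["Host X", "Host Y", "Host X", "Host Z", "Host X"],
    "drama": ["Show A", "Show A", "Show B", "Show B", "Show C"],
})

# Convert text columns to numbers; unparseable values become NaN
drama_info["score"] = pd.to_numeric(drama_info["score"], errors="coerce")
drama_info["count"] = pd.to_numeric(drama_info["count"], errors="coerce")

# Popularity ranking: most counted shows first (assumed popularity measure)
by_popularity = drama_info.sort_values("count", ascending=False)["name"].tolist()

# Rating ranking: highest score first
by_rating = drama_info.sort_values("score", ascending=False)["name"].tolist()

# Celebrity frequency: number of distinct shows each person appears in
appearances = (actor_info.groupby("name")["drama"]
               .nunique()
               .sort_values(ascending=False))

print(by_popularity)
print(by_rating)
print(appearances)
```

Counting distinct shows with `nunique` rather than raw rows guards against a person being listed twice on the same show page.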

The article also analyzes negative reviews and identifies common complaints about various shows, providing a balanced perspective on 2018 Chinese variety programming.
