2018 Chinese Variety Show Data Analysis: Web Scraping, Rankings, and Reviews

This article demonstrates how to scrape the full 2018 Chinese variety‑show list from Douban using Python Selenium and BeautifulSoup, compile detailed metadata and actor information into Excel, and then analyze popularity rankings, rating distributions, frequent celebrity appearances, and common negative feedback.

Tencent Cloud Developer

Jan 10, 2019

This article presents a comprehensive data analysis of Chinese variety shows from 2018, using web scraping techniques to collect data from Douban. The author demonstrates how to scrape the complete list of 2018 Chinese variety programs and their detailed information using Python with Selenium and BeautifulSoup.

Data Collection Methodology:

The first step involves obtaining the complete list of 2018 Chinese variety shows from Douban:

import json
import os
import time

import pandas as pd
from bs4 import BeautifulSoup
from pyecharts import Bar, Line, Overlap  # pyecharts 0.5.x import style; not used in the scraping code shown here
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

os.chdir('D:/爬虫/综艺')  # working directory for the output files

## Scrape the list of 2018 mainland Chinese variety shows and export it to an Excel file
driver = webdriver.Chrome()
driver.maximize_window()
url = 'https://movie.douban.com/tag/#/?sort=U&range=2,10&tags=2018,%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86,%E7%BB%BC%E8%89%BA'
driver.get(url)
time.sleep(2)  # give the JavaScript-rendered list time to load

# Scroll to the bottom and click the "more" button until it can no longer be found or clicked
while True:
    try:
        driver.execute_script("document.documentElement.scrollTop = 100000000")
        driver.find_element(By.CLASS_NAME, 'more').click()
        time.sleep(2)
    except WebDriverException:
        break

# Selenium 4 removed the find_elements_by_* helpers; use find_elements(By..., ...) instead
names = [k.text for k in driver.find_elements(By.CLASS_NAME, 'title')]
scores = [k.text for k in driver.find_elements(By.CLASS_NAME, 'rate')]
urls = [k.get_attribute('href') for k in driver.find_elements(By.CLASS_NAME, 'item')]
pd.DataFrame({'name': names, 'score': scores, 'url': urls}).to_excel('综艺名称.xlsx')
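
The listing URL in the code above carries its filter tags in percent-encoded form. As a minimal sketch of how such a URL can be assembled and decoded with the standard library, the helper below rebuilds it from readable tags. The function name `build_tag_url` and its parameter names are illustrative assumptions, not part of the original script, and the query keys `sort` and `range` are copied from the URL without asserting what Douban does with them.

```python
from urllib.parse import quote, unquote

BASE = 'https://movie.douban.com/tag/#/'


def build_tag_url(tags, sort='U', range_='2,10'):
    """Hypothetical helper: join percent-encoded tags into the listing URL."""
    encoded = ','.join(quote(tag) for tag in tags)
    return f'{BASE}?sort={sort}&range={range_}&tags={encoded}'


# Readable form of the three tags used in the article's URL
listing_url = build_tag_url(['2018', '中国大陆', '综艺'])

# Decoding the tags portion recovers the readable tags
decoded_tags = unquote(listing_url.split('tags=')[1]).split(',')

print(listing_url)
print(decoded_tags)
```

Decoding shows that the scraper filters on the year 2018, mainland China (中国大陆), and variety shows (综艺).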

The second step visits each show's page and reads its metadata and cast list from the JSON-LD data embedded in the page:

drama_list = pd.read_excel('综艺名称.xlsx')
driver = webdriver.Chrome()
driver.maximize_window()

drama_rows = []    # one dict of show metadata per page
actor_frames = []  # one DataFrame of cast members per show
err = []           # URLs of pages that failed to load or parse

for i in range(drama_list.shape[0]):
    url = drama_list['url'][i]
    try:
        driver.get(url)
        time.sleep(2)  # let the page finish rendering before reading its source
        bsObj = BeautifulSoup(driver.page_source, "html.parser")
        # Show metadata is embedded in the page as a JSON-LD <script> tag
        raw = bsObj.find('script', attrs={'type': 'application/ld+json'}).contents[0]
        data = json.loads(raw.replace('\n', '').replace(' ', ''))
        actor_name = [k['name'] for k in data['actor']]
        actor_url = [k['url'] for k in data['actor']]
        drama_score = data['aggregateRating']['ratingValue']
        drama_count = data['aggregateRating']['ratingCount']
        drama_name = data['name']
        drama_genre = data['genre']  # read but not stored in the output tables
        drama_image = data['image']
        drama_publish = data['datePublished']
        drama_year = bsObj.find('span', attrs={"class": "year"}).text[1:5]
        drama_content = bsObj.find('span', attrs={"property": "v:summary"}).text.replace('\n', '')
        drama_short = [k.text for k in bsObj.find_all('span', attrs={"class": "short"})]
        drama_rows.append({'id': url, 'name': drama_name, 'image': drama_image,
                           'score': drama_score, 'count': drama_count,
                           'year': drama_year, 'content': drama_content,
                           'short': drama_short, 'publish': drama_publish})
        actor_frames.append(pd.DataFrame({'name': actor_name, 'url': actor_url, 'drama_id': url,
                                          'score': drama_score, 'drama': drama_name,
                                          'rank': list(range(len(actor_name))), 'count': drama_count}))
        print(i)
    except Exception:
        # Log the show that failed and keep going; the URLs in err can be retried later
        print(drama_list['name'][i])
        err.append(url)

# DataFrame.append was removed in pandas 2.0, so both tables are built once after the loop
drama_info = pd.DataFrame(drama_rows, columns=['id', 'name', 'image', 'score', 'count', 'year',
                                               'content', 'short', 'publish'])
actor_columns = ['name', 'url', 'drama_id', 'score', 'drama', 'rank', 'count']
actor_info = (pd.concat(actor_frames, ignore_index=True) if actor_frames
              else pd.DataFrame(columns=actor_columns))
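
The JSON-LD extraction step can be tried offline, without a browser. The sketch below runs the same BeautifulSoup and `json` calls against a small hand-written HTML fragment. The fragment, the show name, and the helper `parse_show` are hypothetical stand-ins for illustration only, not real Douban markup or data.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page fragment that mimics the fields the scraper reads
SAMPLE_HTML = '''
<html><head>
<script type="application/ld+json">
{
  "name": "Example Show",
  "actor": [
    {"name": "Performer A", "url": "/celebrity/1/"},
    {"name": "Performer B", "url": "/celebrity/2/"}
  ],
  "aggregateRating": {"ratingValue": "8.1", "ratingCount": "1234"}
}
</script>
</head><body><span class="year">(2018)</span></body></html>
'''


def parse_show(html):
    """Illustrative helper: pull name, rating, year, and cast from one page."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('script', attrs={'type': 'application/ld+json'})
    data = json.loads(tag.string)
    rating = data.get('aggregateRating', {})
    return {
        'name': data['name'],
        'score': rating.get('ratingValue'),
        'count': rating.get('ratingCount'),
        # The year span reads "(2018)", so characters 1..4 hold the year
        'year': soup.find('span', attrs={'class': 'year'}).text[1:5],
        'actors': [a['name'] for a in data.get('actor', [])],
    }


show = parse_show(SAMPLE_HTML)
print(show)
```

Using `.get` with defaults here is a design choice of this sketch: a page without ratings or cast yields empty values instead of raising, whereas the loop above sends such pages to `err`.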

Key Findings:

The analysis covers multiple dimensions including popularity rankings (based on comment counts), ratings analysis, and most frequent celebrity appearances. Top shows by popularity include "Idol Producer" (偶像练习生), "Street Dance of China" (这！就是街舞), and "Keep Running" (奔跑吧). Cultural programs like "National Treasure" (国家宝藏) and "Reader" (朗读者) received the highest ratings.
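
The article does not show its analysis code, so the following is only a minimal pandas sketch of how the three rankings could be derived from tables shaped like `drama_info` and `actor_info`. It assumes the `count` column (Douban's `ratingCount`) serves as the popularity measure, and every row below is a placeholder, not real Douban data.

```python
import pandas as pd

# Placeholder rows, NOT real Douban data
drama_info = pd.DataFrame({
    'name': ['Show A', 'Show B', 'Show C'],
    'score': ['6.5', '9.1', '7.8'],       # JSON-LD values arrive as strings
    'count': ['52000', '8000', '21000'],
})
actor_info = pd.DataFrame({
    'name': ['Host X', 'Host Y', 'Host X', 'Host Z', 'Host X'],
    'drama': ['Show A', 'Show A', 'Show B', 'Show C', 'Show C'],
})

# Convert the scraped strings to numbers before sorting
drama_info['score'] = pd.to_numeric(drama_info['score'], errors='coerce')
drama_info['count'] = pd.to_numeric(drama_info['count'], errors='coerce')

# Popularity ranking: most-rated shows first
by_popularity = drama_info.sort_values('count', ascending=False)['name'].tolist()

# Rating ranking: highest score first
by_rating = drama_info.sort_values('score', ascending=False)['name'].tolist()

# Celebrity frequency: number of distinct shows each person appears in
appearances = (actor_info.groupby('name')['drama']
               .nunique()
               .sort_values(ascending=False))

print(by_popularity)
print(by_rating)
print(appearances)
```

Sorting the raw strings would order `'8000'` above `'52000'`, which is why the numeric conversion comes first.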

The article also analyzes negative reviews and identifies common complaints about various shows, providing a balanced perspective on 2018 Chinese variety programming.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
