How to Scrape Douyin Comments and Reveal Hidden Topics with Python LDA
This article demonstrates how to collect Douyin video comments using Python, perform exploratory data analysis, and apply LDA topic modeling to uncover the main themes and user interests hidden in thousands of short comments.
The author crawled comments from a popular Douyin video (released on 11‑17) that received millions of likes and a rapid increase in followers. Using the web version of Douyin, the comment API was identified by filtering network requests for the keyword comment, and a Python script was written to fetch the data responsibly.
1. Data Collection
Comments were retrieved via simulated HTTP requests, with requests spaced out to avoid overloading the service. Two practical tips were noted: retry when the API returns empty data (often after passing a human verification step), and remove the duplicate comments that can appear across pages when paging through the results.
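The two tips above can be sketched as a paging loop. This is a minimal sketch under stated assumptions, not the author's script: `fetch_page` is a hypothetical callable standing in for whatever performs the actual HTTP request, and the `cid` field name, the delay, and the retry count are illustrative choices that do not come from the article.

```python
import time
from typing import Callable, Dict, List


def collect_comments(
    fetch_page: Callable[[int], List[Dict]],  # hypothetical: returns one page of comment dicts
    max_pages: int,
    delay_s: float = 2.0,   # assumed pause between requests
    max_retries: int = 3,   # assumed number of attempts per page
    id_key: str = "cid",    # assumed name of the unique comment id field
) -> List[Dict]:
    """Page through comments politely, retrying empty pages and dropping duplicates."""
    seen = set()
    comments = []
    for page in range(max_pages):
        batch = []
        for attempt in range(max_retries):
            batch = fetch_page(page)
            if batch:
                break
            # Empty response: back off a little longer before each retry.
            time.sleep(delay_s * (attempt + 1))
        if not batch:
            break  # still empty after all attempts; assume there is no more data
        for comment in batch:
            # The same comment can show up on more than one page; keep the first copy.
            if comment[id_key] not in seen:
                seen.add(comment[id_key])
                comments.append(comment)
        time.sleep(delay_s)  # spacing between pages to avoid loading the service
    return comments
```

Passing the fetcher in as a parameter keeps the retry and deduplication logic independent of any particular endpoint, and makes it easy to exercise with a fake fetcher before sending real requests.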
2. Exploratory Data Analysis (EDA)
Approximately 12,000 comments were available; about 10,000 were sampled for analysis. The text column contains the comment content. Basic statistics and visualizations were generated using the ProfileReport tool.
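# eda
from ydata_profiling import ProfileReport  # older installs: from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Zhang Douyin Comment Data', explorative=True)
profile

Key observations: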
Comment volume peaked on the release days (17‑18) but remained high even weeks later.
Most comments are short, typically under 20 characters.
99.8% of commenters are non‑verified users.
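The observations above can also be checked directly with a few pandas operations. The following is a minimal sketch on toy data: only the `text` column is named in the article, so `create_time` and `is_verified` are hypothetical column names, and the rows (including the year in the timestamps) are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the comment table. `create_time` and `is_verified`
# are hypothetical column names; the rows are made up for illustration.
df = pd.DataFrame({
    "text": ["哈哈哈", "这是喂狗的吗", "钥匙放哪里了", "拍得真好"],
    "create_time": pd.to_datetime([
        "2021-11-17 10:00", "2021-11-17 21:30",
        "2021-11-18 08:15", "2021-12-01 12:00",
    ]),
    "is_verified": [False, False, False, True],
})

# Comments per calendar day, to see when volume peaks.
daily_counts = df.groupby(df["create_time"].dt.date).size()

# Comment length in characters, and the share of comments under 20 characters.
text_len = df["text"].str.len()
share_short = (text_len < 20).mean()

# Share of comments written by non-verified users.
share_unverified = (~df["is_verified"]).mean()

print(daily_counts)
print(f"under 20 chars: {share_short:.1%}, non-verified: {share_unverified:.1%}")
```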
3. LDA Topic Modeling
To move beyond coarse statistics, the comments were clustered using Latent Dirichlet Allocation (LDA). After preprocessing (tokenization with jieba, removal of stopwords, emojis, and punctuation), the cleaned text was stored in the text_wd column.
# tokenization and stopword removal
import jieba

emoji = {...} # set of emoji strings
with open('stop_words.txt', encoding='UTF-8') as f:
    stopwords = {line.strip() for line in f}  # a set gives fast membership tests

def fen_ci(x):
    """Tokenize one comment with jieba and drop stopwords, emojis, and brackets."""
    res = []
    for token in jieba.cut(x):
        if token in stopwords or token in emoji or token in ['[', ']']:
            continue
        res.append(token)
    return ' '.join(res)

df['text_wd'] = df['text'].apply(fen_ci)

The LDA model was built with eight topics, which provided the most interpretable results.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

def run_lda(corpus, k):
    # bag-of-words counts; min_df=2 drops words that occur in fewer than 2 comments
    cntvec = CountVectorizer(min_df=2, token_pattern=r'\w+')
    cnttf = cntvec.fit_transform(corpus)
    lda = LatentDirichletAllocation(n_components=k)
    docres = lda.fit_transform(cnttf)  # per-comment topic distribution
    return cntvec, cnttf, docres, lda

cntvec, cnttf, docres, lda = run_lda(df['text_wd'].values, 8)

Top‑20 words for each topic were extracted and manually interpreted, yielding themes such as "watching the video", "key location", "rural life", "feeding dogs", "filming techniques", "locking doors", "adding salt to eggs", and "socks under pillows". Topic 3 ("feeding dogs") had the highest proportion, reflecting many users' surprise that the video involved feeding dogs rather than personal consumption.
Topic distribution was visualized, showing a relatively balanced share among the remaining topics.
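How the topic proportions were computed is not stated. A simple option, assumed here, is to assign each comment to its highest-probability topic and count the assignments. The sketch below uses a small made-up document-topic matrix in place of `docres`; the helper name `topic_shares` is hypothetical.

```python
import numpy as np


def topic_shares(doc_topic):
    """Share of documents whose dominant topic is k, for each topic k."""
    doc_topic = np.asarray(doc_topic)
    dominant = doc_topic.argmax(axis=1)  # most probable topic per document
    counts = np.bincount(dominant, minlength=doc_topic.shape[1])
    return counts / counts.sum()


# Toy document-topic matrix (4 documents, 3 topics); each row sums to 1.
toy_doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])
shares = topic_shares(toy_doc_topic)
print(shares)
```

An alternative is to average the columns of the matrix directly (`doc_topic.mean(axis=0)`), which weights every comment's full topic mixture rather than only its dominant topic. The two can rank topics differently when many comments are split across topics.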
4. Results and Insights
The analysis revealed that while the video’s rural setting attracted attention, the most discussed aspect was the unconventional scenes, especially those involving dogs. A hierarchical tree diagram (not shown) illustrated representative comments for each topic.
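Since the tree diagram is not shown, here is one way representative comments could be selected for each topic: take the comments with the highest probability for that topic. This is a sketch of one plausible selection rule, not the author's method; the helper name `representative_docs` and the toy inputs are hypothetical.

```python
import numpy as np


def representative_docs(doc_topic, texts, n=3):
    """Map each topic index to the n texts with the highest probability for it."""
    doc_topic = np.asarray(doc_topic)
    reps = {}
    for k in range(doc_topic.shape[1]):
        top_idx = np.argsort(doc_topic[:, k])[::-1][:n]  # highest probability first
        reps[k] = [texts[i] for i in top_idx]
    return reps


# Toy inputs standing in for docres and df['text'].
toy_texts = ["comment a", "comment b", "comment c", "comment d"]
toy_doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])
reps = representative_docs(toy_doc_topic, toy_texts, n=1)
for topic, examples in reps.items():
    print(topic, examples)
```

The resulting mapping from topic to example comments is the kind of two-level structure a tree diagram would display, with topics as branches and comments as leaves.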
Requests should include reasonable delays to avoid impacting the service.
The core code snippets are included above; the full script is being organized for future reference.
Python Crawling & Data Mining
