How to Crawl Bilibili Video Danmaku Data Using Python
This tutorial explains how to locate Bilibili video danmaku (bullet‑comment) APIs, extract the CID, and use Python libraries such as requests, BeautifulSoup, and pandas to download, clean, and save the comment data to CSV files, with an optional API‑based shortcut.
The article introduces two methods for obtaining Bilibili danmaku data: directly accessing the XML or .so API endpoints using the video’s CID, and using a third‑party Python API library. It walks through finding the CID via the browser’s Network panel, constructing the request URL, and retrieving the XML file.
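The URL construction step can be sketched as follows. This is a minimal illustration, not the article's own code: the helper names are hypothetical, the XML pattern is the one used in the script later in this article, and the `.so` pattern is an assumption that should be checked against what the browser's Network panel actually shows.

```python
# Sketch: build danmaku request URLs from a CID (no network access needed).
# Function names are hypothetical helpers introduced for illustration.


def xml_danmaku_url(cid):
    """URL pattern used in this article: comment.bilibili.com/<cid>.xml."""
    return "https://comment.bilibili.com/{}.xml".format(cid)


def so_danmaku_url(cid):
    """Assumed pattern for the .so endpoint; verify it in the Network panel."""
    return "https://api.bilibili.com/x/v1/dm/list.so?oid={}".format(cid)


if __name__ == "__main__":
    cid = 47506569  # the CID used in the script later in this article
    print(xml_danmaku_url(cid))
    print(so_danmaku_url(cid))
```

Keeping URL construction in small functions makes it easy to swap in a different CID once it has been read from the Network panel.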
After acquiring the XML, the guide shows a complete Python script that fetches the file, parses all d tags with BeautifulSoup, removes extra whitespace using regular expressions, and writes the cleaned comments to a CSV file with pandas. The script also prints the number of comments retrieved.
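Before the full script, the parse-and-clean step can be tried offline on a small sample. The sketch below is an illustration under stated assumptions: the sample XML is made up (real files carry extra attributes on each d element), and it uses the standard library's xml.etree.ElementTree rather than the BeautifulSoup parser used in the article. It also shows why the cleaned strings have to be collected into a new list: re.sub returns a new string and never modifies its input.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample in the same general shape as a danmaku file:
# one <d> element per comment under a single root element.
SAMPLE_XML = """<i>
  <d>hello   world</d>
  <d> second
 comment </d>
</i>"""


def extract_danmaku(xml_text):
    """Return the text of every <d> element in the document."""
    root = ET.fromstring(xml_text)
    return [d.text or "" for d in root.iter("d")]


def strip_whitespace(comments):
    """Remove all whitespace runs; re.sub returns new strings, so build a new list."""
    return [re.sub(r"\s+", "", c) for c in comments]


comments = strip_whitespace(extract_danmaku(SAMPLE_XML))
print(comments)
```

Note that removing every whitespace run also joins space-separated words, which is harmless for most Chinese danmaku but changes English text; replacing each run with a single space is an alternative.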
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
# File to save the danmaku to
file_name = '刺客伍六七第一集.csv'
# Fetch the danmaku XML for the given CID
cid = 47506569
url = "https://comment.bilibili.com/" + str(cid) + ".xml"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
request = requests.get(url=url, headers=headers)
request.encoding = 'utf-8'
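# Extract the danmaku: each comment is a <d> element
soup = BeautifulSoup(request.text, 'lxml')
results = soup.find_all('d')
# Collect the text of each comment
data = [d.text for d in results]
# Strip extra spaces and line breaks; re.sub returns a new string, so rebuild the list
data = [re.sub(r'\s+', '', text) for text in data]
print("Number of danmaku: {}".format(len(data)))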
# Write the comments to the CSV file
df = pd.DataFrame(data)
df.to_csv(file_name, index=False, header=False, encoding="utf_8_sig")
print("File written successfully")
For a simpler approach, the article suggests installing the bilibili_api package and using its VideoInfo class to fetch danmaku directly from the video's BV ID. The same cleaning and CSV export steps are then applied.
pip install bilibili_api
from bilibili_api import video
import re
import pandas as pd
BVid = "BV1oW41157Na"
file_name = '刺客伍六七第一集.csv'
my_video = video.VideoInfo(bvid=BVid)
danmu = my_video.get_danmaku()
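data = [d.text for d in danmu]
# Strip extra spaces and line breaks; re.sub returns a new string, so rebuild the list
data = [re.sub(r'\s+', '', text) for text in data]
print("Number of danmaku: {}".format(len(data)))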
df = pd.DataFrame(data)
df.to_csv(file_name, index=False, header=False, encoding="utf_8_sig")
print("File written successfully")
The guide also notes that Bilibili caps each video's danmaku pool at a size that depends on the video's length. Once the cap is exceeded, the oldest comments are discarded, so the most recent comments are always the ones displayed.
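The discard-oldest behaviour of a capped pool can be illustrated with a bounded queue. This is only a sketch of the idea: the limit of 3 is a hypothetical value chosen to keep the example short, not one of Bilibili's actual limits, and it says nothing about how Bilibili implements the pool.

```python
from collections import deque


def make_pool(limit):
    """Create a pool that keeps at most `limit` comments, dropping the oldest first."""
    return deque(maxlen=limit)


pool = make_pool(3)  # hypothetical limit, for illustration only
for text in ["a", "b", "c", "d", "e"]:
    pool.append(text)  # once full, each append evicts the oldest entry

print(list(pool))
```

In practice this means a scrape of the pool returns only the most recent comments, so earlier danmaku on a popular video may no longer be retrievable this way.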
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.