Python Web Scraper for VIP Anime Collection
This article demonstrates how to build a Python web scraper using requests, lxml, regular expressions, and tqdm to locate, extract, and download video files from a VIP anime website, covering header configuration, XPath parsing, URL reconstruction, and file saving.
The tutorial walks through creating a Python 3.7 web scraper on Windows 10 with PyCharm, employing the requests, lxml, re, and tqdm libraries to fetch anime episode lists and download video files.
First, custom HTTP headers are defined to mimic a browser, and the target page is requested. XPath expressions extract chapter names and relative URLs:
import os
import re
import requests
from lxml import etree
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Referer': 'http://www.imomoe.la/search.asp'
}
url = 'http://www.imomoe.la/view/8024.html'
response = requests.get(url, headers=headers)
# The site is GBK-encoded, so decode explicitly before parsing
html_data = etree.HTML(response.content.decode('gbk'))
chapter_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/text()')
chapter_url_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/@href')[0]  # first episode's href

The relative chapter URL is combined with the base domain to fetch the detail page, from which a JavaScript source URL is extracted:
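Plain string concatenation works here because the extracted hrefs are root-relative; as an alternative, the standard library's `urllib.parse.urljoin` handles root-relative and already-absolute hrefs alike (the hrefs below reuse the view URL from this article for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.imomoe.la'
# Root-relative href, as returned by the XPath above
print(urljoin(base, '/view/8024.html'))   # http://www.imomoe.la/view/8024.html
# An already-absolute href passes through unchanged
print(urljoin(base, 'http://www.imomoe.la/view/8024.html'))
```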
new_url = 'http://www.imomoe.la' + chapter_url_list
response = requests.get(new_url, headers=headers)
html = etree.HTML(response.content.decode('gbk'))
data_url = 'http://www.imomoe.la' + html.xpath('//div[@class="player"]/script[1]/@src')[0]
res = requests.get(data_url, headers=headers).text

Regular expressions locate the actual video URLs (both direct MP4 links and m3u8 playlists):
play_url_list = re.findall(r'\$(.*?)\$flv', res)  # raw string avoids escape warnings
print(play_url_list)

Using tqdm for progress, each MP4 URL is downloaded and saved to a folder named after the anime:
os.makedirs('终末的女武神', exist_ok=True)  # output folder must exist before writing
for chapter, play_url in tqdm(zip(chapter_list, play_url_list)):
    result = requests.get(play_url, headers=headers).content
    with open('终末的女武神/' + chapter + '.mp4', 'wb') as f:  # close the file after each episode
        f.write(result)

If m3u8 streams are found, the script builds the full CDN URL, fetches the playlist, extracts TS segment URLs, and downloads each segment, appending them to the corresponding MP4 file:
m3u8_url_list = re.findall(r'\$(.*?)\$bdhd', res)
for m3u8_url, chapter in zip(m3u8_url_list, chapter_list):
    data = requests.get(m3u8_url, headers=headers)
    new_m3u8_url = 'https://cdn.605-zy.com/' + re.findall(r'/(.*?m3u8)', data.text)[0]
    ts_data = requests.get(new_m3u8_url, headers=headers)
    ts_url_list = re.findall(r'/(.*?ts)', ts_data.text)
    print('Downloading:', chapter)
    for ts_url in tqdm(ts_url_list):
        result = requests.get('https://cdn.605-zy.com/' + ts_url).content
        with open('斗破苍穹/' + chapter + '.mp4', 'ab') as f:  # append each TS segment
            f.write(result)

The article concludes with a concise workflow: locate the anime page, extract the detail-page URLs, fetch the static JavaScript, parse out the direct video or m3u8 links, and save the media files locally.
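The parsing steps in this workflow can be exercised offline before hitting the live site. The sample strings below are synthetic stand-ins for the site's real responses, constructed to follow the `$...$flv` / `$...$bdhd` markers and m3u8 layout shown above:

```python
import re

# Synthetic player-JavaScript payload (illustrative, not real site data)
js_payload = 'Episode 01$http://example.com/ep01.mp4$flv'
play_url_list = re.findall(r'\$(.*?)\$flv', js_payload)
print(play_url_list)  # ['http://example.com/ep01.mp4']

# Synthetic m3u8 playlist: the regex drops the leading slash, which the
# CDN prefix ('https://cdn.605-zy.com/') adds back when downloading
playlist = '#EXTM3U\n#EXTINF:4.0,\n/20210712/abc/0000.ts\n#EXTINF:4.0,\n/20210712/abc/0001.ts\n'
ts_url_list = re.findall(r'/(.*?ts)', playlist)
print(ts_url_list)  # ['20210712/abc/0000.ts', '20210712/abc/0001.ts']
```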
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.