Python Web Scraper for VIP Anime Collection
This article demonstrates how to build a Python web scraper using requests, lxml, regular expressions, and tqdm to locate, extract, and download video files from a VIP anime website, covering header configuration, XPath parsing, URL reconstruction, and file saving.
The tutorial walks through creating a Python 3.7 web scraper on Windows 10 with PyCharm, employing the requests, lxml, re, and tqdm libraries to fetch anime episode lists and download video files.
First, custom HTTP headers are defined to mimic a browser, and the target page is requested. XPath expressions extract chapter names and relative URLs:
import os
import re
import requests
from lxml import etree
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Referer': 'http://www.imomoe.la/search.asp'
}
url = 'http://www.imomoe.la/view/8024.html'
response = requests.get(url, headers=headers)
# The site is GBK-encoded, so decode explicitly before parsing
html_data = etree.HTML(response.content.decode('gbk'))
chapter_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/text()')
chapter_url_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/@href')[0]  # first episode's href

The relative chapter URL is combined with the base domain to fetch the detail page, from which a JavaScript source URL is extracted:
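Plain string concatenation works here because the extracted hrefs are root-relative; as an alternative, the standard library's `urllib.parse.urljoin` handles root-relative and already-absolute hrefs alike (the hrefs below reuse the view URL from this article for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.imomoe.la'
# Root-relative href, as returned by the XPath above
print(urljoin(base, '/view/8024.html'))   # http://www.imomoe.la/view/8024.html
# An already-absolute href passes through unchanged
print(urljoin(base, 'http://www.imomoe.la/view/8024.html'))
```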
new_url = 'http://www.imomoe.la' + chapter_url_list
response = requests.get(new_url, headers=headers)
html = etree.HTML(response.content.decode('gbk'))
data_url = 'http://www.imomoe.la' + html.xpath('//div[@class="player"]/script[1]/@src')[0]
res = requests.get(data_url, headers=headers).text

Regular expressions locate the actual video URLs (both direct MP4 links and m3u8 playlists):
play_url_list = re.findall(r'\$(.*?)\$flv', res)  # raw string avoids escape warnings
print(play_url_list)

Using tqdm for progress, each MP4 URL is downloaded and saved to a folder named after the anime:
os.makedirs('终末的女武神', exist_ok=True)  # output folder must exist before writing
for chapter, play_url in tqdm(zip(chapter_list, play_url_list)):
    result = requests.get(play_url, headers=headers).content
    with open('终末的女武神/' + chapter + '.mp4', 'wb') as f:  # close the file after each episode
        f.write(result)

If m3u8 streams are found, the script builds the full CDN URL, fetches the playlist, extracts TS segment URLs, and downloads each segment, appending them to the corresponding MP4 file:
m3u8_url_list = re.findall(r'\$(.*?)\$bdhd', res)
for m3u8_url, chapter in zip(m3u8_url_list, chapter_list):
    data = requests.get(m3u8_url, headers=headers)
    new_m3u8_url = 'https://cdn.605-zy.com/' + re.findall(r'/(.*?m3u8)', data.text)[0]
    ts_data = requests.get(new_m3u8_url, headers=headers)
    ts_url_list = re.findall(r'/(.*?ts)', ts_data.text)
    print('Downloading:', chapter)
    for ts_url in tqdm(ts_url_list):
        result = requests.get('https://cdn.605-zy.com/' + ts_url).content
        with open('斗破苍穹/' + chapter + '.mp4', 'ab') as f:  # append each TS segment
            f.write(result)

The article concludes with a concise workflow: locate the anime page, extract the detail-page URLs, fetch the static JavaScript, parse out the direct video or m3u8 links, and save the media files locally.
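The parsing steps in this workflow can be exercised offline before hitting the live site. The sample strings below are synthetic stand-ins for the site's real responses, constructed to follow the `$...$flv` / `$...$bdhd` markers and m3u8 layout shown above:

```python
import re

# Synthetic player-JavaScript payload (illustrative, not real site data)
js_payload = 'Episode 01$http://example.com/ep01.mp4$flv'
play_url_list = re.findall(r'\$(.*?)\$flv', js_payload)
print(play_url_list)  # ['http://example.com/ep01.mp4']

# Synthetic m3u8 playlist: the regex drops the leading slash, which the
# CDN prefix ('https://cdn.605-zy.com/') adds back when downloading
playlist = '#EXTM3U\n#EXTINF:4.0,\n/20210712/abc/0000.ts\n#EXTINF:4.0,\n/20210712/abc/0001.ts\n'
ts_url_list = re.findall(r'/(.*?ts)', playlist)
print(ts_url_list)  # ['20210712/abc/0000.ts', '20210712/abc/0001.ts']
```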
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.