How to Scrape Upcoming Movies from Maoyan with Python: A Step‑by‑Step Guide
Learn how to build a Python web scraper that fetches upcoming movie details from Maoyan.com, covering environment setup, URL pagination, random User‑Agent handling, HTML parsing with XPath, data extraction, and result display, while highlighting best practices and limitations.
Introduction
Many cinemas closed because of the pandemic, and new movies are only now being released again. This tutorial shows how to obtain details of upcoming movies from the Maoyan platform.
Project Goal
Retrieve the upcoming movie information (title, starring actors, genre, and detail link) from Maoyan.
Preparation
Software: PyCharm
Required libraries: requests, lxml, random, time
Plugin: XPath
Target website:
https://maoyan.com/films?showType=2&offset={}
Implementation
1. Define a MaoyanSpider class that stores the URL template and provides a main method.
import requests
from lxml import etree
import time, random

class MaoyanSpider(object):
    def __init__(self):
        self.url = "https://maoyan.com/films?showType=2&offset={}"

    def main(self):
        pass

if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.main()
2. Generate a random User‑Agent for each request.
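The `ua.random` call in this step relies on a User-Agent generator object. Where one is not available, a similar effect can be approximated with the standard library alone. The sketch below is an assumption rather than the original author's code: `USER_AGENTS` is a small hand-written pool and `random_headers` is a hypothetical helper that picks one entry per call.

```python
import random

# Hypothetical pool of User-Agent strings; extend it with current browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

Calling `random_headers()` before each request gives every request its own header dictionary, which matches the intent of this step.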
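# Assumption: ua is a UserAgent instance from the fake_useragent package
from fake_useragent import UserAgent
ua = UserAgent()
for i in range(1, 50):
    # Set the header inside the loop so each request gets a new random UA
    self.headers = {'User-Agent': ua.random}
3. Send the HTTP request and obtain the page content.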
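A slightly more defensive variant of this step adds a timeout and a status check, neither of which appears in the original code. This is only a sketch: `fetch_html` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the `get` parameter exists solely so the function can be exercised without network access.

```python
def fetch_html(url, headers, get=None, timeout=10):
    """Download a page and return its text decoded as UTF-8."""
    if get is None:
        # Imported lazily so the helper can be tried with a stand-in callable.
        import requests
        get = requests.get
    res = get(url, headers=headers, timeout=timeout)
    res.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    res.encoding = "utf-8"
    return res.text
```

Returning the text, rather than calling the parser from inside the fetch function, keeps downloading and parsing separate, which makes each part easier to test.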
def get_page(self, url):
    # self.headers carries the random User-Agent set in step 2
    res = requests.get(url, headers=self.headers)
    res.encoding = 'utf-8'
    html = res.text
    # Hand the HTML straight to the parser
    self.parse_page(html)
4. Parse the first‑level page with XPath.
# Create the parsing object
parse_html = etree.HTML(html)
# Base XPath: the list of <dd> nodes, one per movie
dd_list = parse_html.xpath('//dl[@class="movie-list"]//dd')
5. Iterate over each node to extract data.
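Indexing an XPath result with `[0]` or `[1]`, as the loop in this step does, raises `IndexError` when a node is missing. That can happen if a listing lacks a field or the page layout changes. The sketch below shows one way to guard against it. `first_text` is a hypothetical helper, and `SAMPLE` is a made-up fragment that only imitates the class names used in this tutorial. It is not real Maoyan markup.

```python
from lxml import etree

# Made-up fragment: the second <dd> deliberately lacks a title and a link.
SAMPLE = '''
<dl class="movie-list">
  <dd>
    <div class="movie-item-hover">
      <a href="/films/1"><span class="name noscore">Film A</span></a>
    </div>
  </dd>
  <dd><div class="movie-item-hover"></div></dd>
</dl>
'''


def first_text(node, path, default=""):
    """Return the first stripped XPath match, or default when nothing matches."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


root = etree.HTML(SAMPLE)
rows = []
for dd in root.xpath('//dl[@class="movie-list"]//dd'):
    name = first_text(dd, './/span[@class="name noscore"]/text()', "unknown")
    link = first_text(dd, './/div[@class="movie-item-hover"]/a/@href')
    rows.append((name, link))
```

With a helper like this, an incomplete listing yields a placeholder value instead of stopping the whole crawl.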
for dd in dd_list:
    name = dd.xpath('.//div[@class="movie-hover-title"]//span[@class="name noscore"]/text()')[0].strip()
    star = dd.xpath('.//div[@class="movie-hover-info"]//div[@class="movie-hover-title"][3]/text()')[1].strip()
    genre = dd.xpath('.//div[@class="movie-hover-info"]//div[@class="movie-hover-title"][2]/text()')[1].strip()
    download = dd.xpath('.//div[@class="movie-item-hover"]/a/@href')[0].strip()
    movie = '''【Upcoming】
Title: %s
Stars: %s
Genre: %s
Detail link: https://maoyan.com%s
=================================================''' % (name, star, genre, download)
    print(movie)
6. Add a random delay between requests.
time.sleep(random.randint(1, 3))
7. Call the methods to fetch and parse pages for a range of offsets.
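Combining the URL template, the random delay, and the fetch call, the crawl loop might look like the sketch below. Several details are assumptions. The step of 30 between offsets is a guess at the page size and should be checked against the site. `crawl` and its `fetch` and `sleep` parameters are hypothetical names, and both callables are injected so the loop can be tried without touching the network.

```python
import random
import time

URL_TEMPLATE = "https://maoyan.com/films?showType=2&offset={}"


def crawl(pages, fetch, page_size=30, sleep=time.sleep):
    """Call fetch(url) for each page, pausing 1 to 3 seconds between requests."""
    urls = []
    for page in range(pages):
        url = URL_TEMPLATE.format(page * page_size)  # offsets 0, 30, 60, ...
        fetch(url)
        urls.append(url)
        if page < pages - 1:
            sleep(random.uniform(1, 3))  # random delay, skipped after the last page
    return urls
```

Inside the spider this could be called as `crawl(5, spider.get_page)`, since `get_page` takes a single URL argument.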
# get_page already passes the HTML to parse_page and returns nothing,
# so a single call per URL is enough
self.get_page(url)
Result Demonstration
Running the script prompts the user for a start and end page, then prints the extracted movie information in the console. Screenshots of the console output and the generated result are shown below.
Conclusion
Do not scrape excessive amounts of data to avoid overloading the server.
This tutorial demonstrates a Python web‑crawling solution for Maoyan movie listings.
Hands‑on practice is essential; implementing the code yourself reveals many nuances.
Python Crawling & Data Mining
