How to Scrape Upcoming Movies from Maoyan with Python: A Step‑by‑Step Guide

Learn how to build a Python web scraper that fetches upcoming movie details from Maoyan.com, covering environment setup, URL pagination, random User‑Agent handling, HTML parsing with XPath, data extraction, and result display, while highlighting best practices and limitations.

Python Crawling & Data Mining

Jul 25, 2020

Introduction

Many cinemas closed because of the pandemic, but new movies are now being released again. This tutorial shows how to scrape details of upcoming movies from the Maoyan platform.

Project Goal

Retrieve the upcoming movie information (title, starring actors, genre, and detail link) from Maoyan.

Preparation

Software: PyCharm

Required libraries: requests, lxml (third-party); random, time (standard library)

Plugin: XPath

Target website:

https://maoyan.com/films?showType=2&offset={}
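
The {} placeholder in the template is filled with a different offset for each listing page. A minimal sketch of generating those URLs follows. It assumes the listing shows 30 films per page, so offsets advance in steps of 30; that step is an assumption to verify against the site's own pagination links, and build_page_urls and page_size are hypothetical names.

```python
URL_TEMPLATE = "https://maoyan.com/films?showType=2&offset={}"


def build_page_urls(pages, page_size=30):
    """Return one listing URL per page.

    page_size=30 is an assumption about how many films the site shows per
    page; adjust it if the site's pagination links use another step.
    """
    return [URL_TEMPLATE.format(page * page_size) for page in range(pages)]


if __name__ == "__main__":
    for url in build_page_urls(3):
        print(url)
```

Keeping the step in a parameter means only one value changes if the site's page size turns out to differ.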

Implementation

1. Define a MaoyanSpider class that stores the URL template and provides a main method.

import random
import time

import requests
from lxml import etree


class MaoyanSpider(object):
    def __init__(self):
        # URL template; {} is filled with the page offset
        self.url = "https://maoyan.com/films?showType=2&offset={}"

    def main(self):
        pass


if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.main()

2. Generate a random User‑Agent for each request.

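# Assumption: ua = UserAgent(), imported from the fake_useragent package
# This loop belongs inside MaoyanSpider.main()
for i in range(1, 50):
    # Draw a fresh random User-Agent for every request
    self.headers = {'User-Agent': ua.random}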
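
Where adding a User-Agent library is not an option, the same rotation can be approximated with the standard library alone. The following is a minimal sketch, not the article's method: the three User-Agent strings are only illustrative, and random_headers is a hypothetical name.

```python
import random

# Illustrative User-Agent strings; any list of real browser UAs would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]


def random_headers():
    """Build request headers with a User-Agent picked at random."""
    return {"User-Agent": random.choice(USER_AGENTS)}


if __name__ == "__main__":
    print(random_headers())
```

Calling random_headers() once per request gives each request its own header dictionary, which can be passed straight to requests.get(url, headers=...).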

3. Send the HTTP request and obtain the page content.

def get_page(self, url):
    # self.headers already holds the random User-Agent chosen for this request
    res = requests.get(url, headers=self.headers, timeout=10)
    res.encoding = 'utf-8'
    html = res.text
    # Return the HTML so the caller can pass it to parse_page
    return html

4. Parse the first‑level page with XPath.

# Create parsing object
parse_html = etree.HTML(html)
# Base XPath node list
dd_list = parse_html.xpath('//dl[@class="movie-list"]//dd')

5. Iterate over each node to extract data.

for dd in dd_list:
    # Film title
    name = dd.xpath('.//div[@class="movie-hover-title"]//span[@class="name noscore"]/text()')[0].strip()
    # Cast: second text node of the third hover-title block
    star = dd.xpath('.//div[@class="movie-hover-info"]//div[@class="movie-hover-title"][3]/text()')[1].strip()
    # Genre: second text node of the second hover-title block
    genre = dd.xpath('.//div[@class="movie-hover-info"]//div[@class="movie-hover-title"][2]/text()')[1].strip()
    # Relative link to the film's detail page
    link = dd.xpath('.//div[@class="movie-item-hover"]/a/@href')[0].strip()
    movie = '''【Upcoming】
Title: %s
Stars: %s
Genre: %s
Detail link: https://maoyan.com%s
=================================================''' % (name, star, genre, link)
    print(movie)
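
Indexing an xpath() result with [0] or [1] raises IndexError whenever a film's card lacks that node, for example a listing with no cast line. A minimal sketch of a defensive helper follows. It assumes an empty string is an acceptable fallback; pick_text is a hypothetical name, and the lists in the demo merely stand in for what dd.xpath(...) would return.

```python
def pick_text(nodes, index=0, default=""):
    """Return the stripped text at `index` in an XPath result list.

    Falls back to `default` when the list is too short, instead of
    raising IndexError.
    """
    if 0 <= index < len(nodes):
        return str(nodes[index]).strip()
    return default


if __name__ == "__main__":
    # Stand-ins for the lists that dd.xpath(...) would return.
    star_nodes = ["\n", "\n  Actor A, Actor B\n"]
    missing_nodes = []

    print(repr(pick_text(star_nodes, 1)))
    print(repr(pick_text(missing_nodes, 1, default="N/A")))
```

With such a helper, one incomplete card no longer aborts the whole page; the field is simply left empty or set to the chosen default.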

6. Add a random delay between requests.

time.sleep(random.randint(1, 3))
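
random.randint(1, 3) only ever yields 1, 2, or 3, so the pauses fall on whole seconds. A minimal sketch of a fractional alternative using random.uniform follows; polite_pause and its parameters are hypothetical names, and the 1 to 3 second range is simply carried over from the line above.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random, fractional number of seconds and return the delay.

    `sleep` is injectable so the function can be exercised without waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    waited = polite_pause(0.1, 0.3)
    print("paused for %.2f seconds" % waited)
```

Returning the delay makes it easy to log how long the crawler actually waited between requests.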

7. Call the methods to fetch and parse pages for a range of offsets.

html = self.get_page(url)
self.parse_page(html)
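
The fetch, parse, and pause steps can be tied together in one loop over the page offsets. The following is a minimal sketch under stated assumptions, not the article's full main method: crawl is a hypothetical name, fetch stands in for a get_page that returns HTML, and parse stands in for a parse_page that returns a list of records instead of printing them.

```python
import random
import time

URL_TEMPLATE = "https://maoyan.com/films?showType=2&offset={}"


def crawl(offsets, fetch, parse, sleep=time.sleep):
    """Fetch and parse one listing page per offset.

    fetch(url) should return HTML text and parse(html) a list of records;
    both are passed in so the loop itself stays independent of the network.
    """
    records = []
    for offset in offsets:
        html = fetch(URL_TEMPLATE.format(offset))
        records.extend(parse(html))
        # Pause between requests to keep the load on the server low.
        sleep(random.uniform(1, 3))
    return records


if __name__ == "__main__":
    # Dry run with stand-in functions; no request is sent.
    demo = crawl([0, 30], fetch=lambda url: url, parse=lambda html: [html],
                 sleep=lambda seconds: None)
    print(demo)
```

Passing the fetch and parse functions in as arguments lets the loop be tried out with stand-ins before it is pointed at the live site.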

Result Demonstration

Running the script prompts the user for a start and end page, then prints the extracted movie information in the console. Screenshots of the console output and the generated result are shown below.

Conclusion

To avoid overloading the server, do not scrape excessive amounts of data.

This tutorial demonstrates a Python web‑crawling solution for Maoyan movie listings.

Hands‑on practice is essential; implementing the code yourself reveals many nuances.
