How to Efficiently Scrape Novel Rankings with Python: De‑duplication and Speed Tips

This guide explains how to extract novel titles and links from a structured ranking website, remove duplicate entries using a set, handle HTML tags, and improve crawling speed with multithreading or the Scrapy framework, all while keeping the code modular and reusable.

MaGe Linux Operations

Oct 29, 2018

1. Goal

Starting from the ranking page (http://www.qu.la/paihangbang/), collect each novel's name and its link on the site.

2. Observe the Webpage Structure

Each category is wrapped in a consistent HTML block, making it easy for a crawler to locate and collect all novel links, then store them in a list.

3. Small Trick for List De‑duplication

The same novel can appear in several categories, so crawling the combined list as-is repeats requests, and the waste adds up at large volumes. A single line of code solves this: unique_list = set(original_list). A set cannot hold duplicate elements, so the conversion drops them. Note that the result is a set rather than a list, and it does not preserve the original order.
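
As a small illustration of the trick above, the sketch below contrasts set() with an order-preserving alternative. The sample URLs and the dedupe_keep_order helper are made up for this example and are not part of the original code.

```python
def dedupe_keep_order(items):
    """Return the unique items as a list, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Hypothetical links gathered from two ranking categories.
original_list = [
    "http://www.qu.la/book/1/",
    "http://www.qu.la/book/2/",
    "http://www.qu.la/book/1/",  # same novel, listed in a second category
]

unique_set = set(original_list)                 # no duplicates, arbitrary order
unique_list = dedupe_keep_order(original_list)  # no duplicates, original order

print(len(original_list), len(unique_set), unique_list)
```

If the crawl order matters (for example, to keep the ranking order in the output file), the order-preserving variant is the safer choice.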

4. Code Implementation

A modular, function-based style is recommended: write each independent feature as its own function so the code stays simple and reusable.

1. Web Scraping Header
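
The code for this step is not reproduced in the text. A minimal sketch of a request helper, using only the standard library, might look like the following. The get_html name is taken from the snippet in step 4; the build_request helper, the User-Agent string, the timeout, and the UTF-8 default are assumptions for illustration.

```python
import urllib.request

# Assumed header: a browser-like User-Agent string chosen for this sketch.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(url, headers=None):
    """Create a Request object that carries the scraping headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def get_html(url, timeout=30, encoding="utf-8"):
    """Fetch a page and return its body as text.

    The encoding is an assumption; undecodable bytes are replaced
    instead of raising an error.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode(encoding, errors="replace")
```

Keeping the header logic in one helper means every later function can call get_html(url) without repeating request details.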

2. Get Ranking Novels and Their Links

Iterate over each ranking category, write every result to a file in order (novel name plus novel link), collect the links in a list, and return that list of URLs.
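
The step described above could be sketched as follows. The page markup is not shown in the text, so the sample HTML, the function names, and the tab-separated file format are all assumptions; a real page also contains navigation links, which would have to be filtered out by restricting the search to the ranking blocks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href="..."> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.links.append((title, self._href))
            self._href = None


def get_ranking_links(html, base_url):
    """Return unique (title, absolute_url) pairs in first-seen order."""
    parser = AnchorCollector()
    parser.feed(html)
    seen, result = set(), []
    for title, href in parser.links:
        url = urljoin(base_url, href)
        if url not in seen:  # the same novel may appear in several categories
            seen.add(url)
            result.append((title, url))
    return result


def save_links(links, path):
    """Write one 'name<TAB>link' line per novel."""
    with open(path, "w", encoding="utf-8") as fh:
        for title, url in links:
            fh.write(f"{title}\t{url}\n")


# Hypothetical markup standing in for two ranking categories.
SAMPLE_HTML = """
<div><h2>Category one</h2><ul>
  <li><a href="/book/1/">Novel A</a></li>
  <li><a href="/book/2/">Novel B</a></li>
</ul></div>
<div><h2>Category two</h2><ul>
  <li><a href="/book/1/">Novel A</a></li>
</ul></div>
"""

links = get_ranking_links(SAMPLE_HTML, "http://www.qu.la/paihangbang/")
print(links)
```

De-duplicating while parsing keeps the ranking order intact, which a plain set() would not.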

3. Get All Chapter Links of a Single Novel

Collect the URL of every chapter of the novel and create the file that will hold its text.
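
A rough sketch of the chapter-link step is given below. It assumes, purely for illustration, that chapter links are relative hrefs ending in .html; the regular expression, the function name, and the sample markup are hypothetical and would need adjusting to the real page.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: an <a> tag whose href ends in ".html".
CHAPTER_LINK = re.compile(
    r'<a\s+[^>]*href="([^"]+\.html)"[^>]*>(.*?)</a>', re.S
)


def get_chapter_links(index_html, novel_url):
    """Return unique (chapter_title, absolute_url) pairs in page order."""
    chapters, seen = [], set()
    for href, title in CHAPTER_LINK.findall(index_html):
        url = urljoin(novel_url, href)
        if url not in seen:  # some sites repeat the latest chapters at the top
            seen.add(url)
            chapters.append((title.strip(), url))
    return chapters


# Hypothetical chapter list for one novel.
SAMPLE_INDEX = (
    '<dl>'
    '<dd><a href="101.html">Chapter 1</a></dd>'
    '<dd><a href="102.html">Chapter 2</a></dd>'
    '<dd><a href="101.html">Chapter 1</a></dd>'
    '</dl>'
)

chapters = get_chapter_links(SAMPLE_INDEX, "http://www.qu.la/book/1/")
print(chapters)
```

A regular expression is enough for a page this regular; for messier markup, a real HTML parser is more robust.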

4. Get Single Page Content and Save Locally

Downloaded pages often contain formatting tags such as <br/>. A simple replacement removes them: html = get_html(url).replace('<br/>', '\n'). This turns each <br/> into a newline, so paragraph breaks are preserved.
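
The replace() call above only matches the exact spelling <br/>. A slightly more tolerant sketch, which also handles <br> and <br />, strips any remaining tags, and decodes entities such as &nbsp;, could look like this. The clean_chapter_text name and the sample string are made up for the example.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", re.I)   # <br>, <br/>, <br />, <BR>
ANY_TAG = re.compile(r"<[^>]+>")          # any other leftover tag


def clean_chapter_text(raw):
    """Turn a chapter's HTML fragment into plain text with line breaks."""
    text = BR_TAG.sub("\n", raw)           # keep paragraph breaks
    text = ANY_TAG.sub("", text)           # drop remaining markup
    text = html.unescape(text)             # &nbsp; &amp; etc. -> characters
    text = text.replace("\xa0", " ")       # non-breaking space -> plain space
    return text.strip()


raw = "First line.<br/>Second line.<br />&nbsp;&nbsp;Third <b>line</b>."
cleaned = clean_chapter_text(raw)
print(cleaned)
```

Entities are decoded only after the tags are stripped, so text that legitimately contains an escaped angle bracket is not mistaken for markup.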

5. Drawbacks

The crawler works smoothly because the target site lacks anti‑scraping measures and has a clean, well‑structured layout. However, crawling an entire novel (≈1000 pages) takes about 8.5 minutes, and all 60 novels in the ranking would require roughly 8.5 hours with a single‑threaded approach.

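To boost speed, consider writing a multithreaded module or, better yet, using the Scrapy framework. Because most of the time is spent waiting on the network, issuing requests concurrently can make the crawl many times faster; the original estimate is dozens or even hundreds of times, though the real gain depends on the site and the connection.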
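
As one possible shape for the multithreaded module mentioned above, the sketch below uses the standard library's thread pool. The fetch_all name, the worker count, and the fake_fetch stand-in (used here so the example runs without network access) are assumptions; in a real crawl the fetch argument would be a page-download function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Download many pages concurrently.

    `fetch` is any function mapping a URL to its content. Results come
    back in the same order as `urls`, so chapters stay in sequence.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a network call: sleeps briefly, then returns text."""
    time.sleep(0.05)
    return f"content of {url}"


if __name__ == "__main__":
    urls = [f"http://www.qu.la/book/1/{n}.html" for n in range(1, 17)]
    start = time.perf_counter()
    pages = fetch_all(urls, fake_fetch)
    elapsed = time.perf_counter() - start
    print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Threads suit this workload because each request spends most of its time waiting on I/O. A modest worker count also keeps the load on the target site reasonable.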

6. Main Function

7. Output Results
