Scrape Unlimited Novel Site and Auto‑Download Texts with Python

This tutorial explains how to use Python's requests, lxml, and fake_useragent libraries to crawl the free novel website "无限小说网", extract each novel's download link, and automatically download the corresponding text files.

Python Crawling & Data Mining

Jun 23, 2020

1. Introduction

With the rise of online reading, many people prefer web novels, but most require payment. This tutorial shows how to crawl the free novel site "无限小说网" and directly download the text files.

2. Project Goal

Obtain the download link for a given novel and download the corresponding .txt file.

3. Preparation

Tools: PyCharm. Required libraries: requests, lxml, fake_useragent. Target URL pattern: https://www.555x.org/html/wuxiaxianxia/list_29_{page}.html where {page} is the page number.
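
As a small sketch of how the URL pattern above expands into concrete listing pages (the helper name `page_urls` is hypothetical and not part of the original script):

```python
# URL pattern quoted in the Preparation section; {page} is the page number.
LIST_URL = "https://www.555x.org/html/wuxiaxianxia/list_29_{page}.html"


def page_urls(start, end):
    """Return the listing-page URLs for pages start..end, inclusive."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return [LIST_URL.format(page=n) for n in range(start, end + 1)]


if __name__ == "__main__":
    for url in page_urls(1, 3):
        print(url)
```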

4. Implementation

4.1 Define spider class

import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class xiaoshuo(object):
    def __init__(self):
        # Listing-page URL from section 3; {} is filled with the page number.
        self.url = "https://www.555x.org/html/wuxiaxianxia/list_29_{}.html"
        self.headers = {}

    def main(self):
        pass


if __name__ == '__main__':
    spider = xiaoshuo()
    spider.main()

4.2 Random User‑Agent

ua = UserAgent()
for i in range(1, 50):
    # Pick a fresh random User-Agent on each iteration.
    self.headers = {'User-Agent': ua.random}
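
The same idea can be wrapped in a standalone helper. This is only a sketch: the function name `random_headers` and the fallback pool are assumptions added for illustration, so that the helper still returns something if fake_useragent cannot be imported or cannot load its data.

```python
import random

# Hypothetical fallback pool, used only if fake_useragent is unavailable.
FALLBACK_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent string."""
    try:
        from fake_useragent import UserAgent
        agent = UserAgent().random
    except Exception:
        # Import or data-loading failure: fall back to the static pool.
        agent = random.choice(FALLBACK_AGENTS)
    return {"User-Agent": agent}


if __name__ == "__main__":
    print(random_headers())
```

Calling `random_headers()` before each request gives every request its own header dictionary instead of mutating shared state.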

4.3 Request page

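def get_page(self, url):
    # Send the request with the current User-Agent header; the timeout
    # keeps a stalled connection from blocking the crawl indefinitely.
    res = requests.get(url=url, headers=self.headers, timeout=10)
    # Decode the raw response bytes as UTF-8.
    html = res.content.decode("utf-8")
    return html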
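
The request method above hard-codes UTF-8. If a page turns out to use another encoding (GBK is common on Chinese-language sites, though the original does not say whether this site uses it), a fallback decoder along these lines may help. The name `decode_body` is hypothetical, and the approach is heuristic: some GBK byte sequences are also valid UTF-8, so the first encoding that succeeds is not guaranteed to be the right one.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Decode response bytes, trying each encoding in order."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters instead of failing.
    return raw.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    sample = "无限小说网下载链接"
    print(decode_body(sample.encode("utf-8")))
    print(decode_body(sample.encode("gbk")))
```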

4.4 Parse first‑level page with XPath

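Use the browser's developer tools to inspect the first-level (listing) page and find the XPath for each novel's second-level (detail) page URL. On the detail page, do the same to find the URL behind the download button, which leads to the third-level page.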
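
A minimal sketch of moving from one level to the next with lxml follows. The sample markup, the `listbg` class name, and the helper `extract_links` are assumptions made for illustration; the real page structure has to be read from the developer tools.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical listing-page markup; the real site's structure may differ.
SAMPLE_LIST_HTML = """
<html><body>
  <div class="listbg">
    <a href="/html/wuxiaxianxia/txt100.html">Novel A</a>
    <a href="/html/wuxiaxianxia/txt101.html">Novel B</a>
  </div>
</body></html>
"""


def extract_links(html, xpath, base_url):
    """Return absolute URLs for every href matched by the XPath expression."""
    tree = etree.HTML(html)
    return [urljoin(base_url, href.strip()) for href in tree.xpath(xpath)]


if __name__ == "__main__":
    links = extract_links(
        SAMPLE_LIST_HTML,
        '//div[@class="listbg"]/a/@href',
        "https://www.555x.org/html/wuxiaxianxia/list_29_1.html",
    )
    for link in links:
        print(link)
```

Resolving each `href` with `urljoin` means relative links on the page become full URLs that can be passed straight to the next request.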

4.5 Parse third‑level page and extract title and download link

# `three` holds the nodes selected from the third-level page.
for rd in three:
    # Novel title
    b = rd.xpath('..//div[@class="shutou"]//b/text()')[0].strip()
    # Download link
    tress = rd.xpath('..//div[@class="shuji"]//ul/li/a/@href')[0].strip()
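
Indexing with `[0]` raises `IndexError` whenever an XPath query matches nothing. A guarded variant might look like the sketch below. The XPath expressions reuse the `shutou` and `shuji` class names from the loop above, while the sample markup and the helpers `first_or_none` and `parse_detail` are assumptions.

```python
from lxml import etree

# Hypothetical third-level markup built around the class names used above.
SAMPLE_DETAIL_HTML = """
<html><body>
  <div class="shutou"><h1><b> Sample Novel </b></h1></div>
  <div class="shuji">
    <ul><li><a href="https://example.com/sample.txt">Download</a></li></ul>
  </div>
</body></html>
"""


def first_or_none(matches):
    """Return the first XPath match stripped of whitespace, or None."""
    return matches[0].strip() if matches else None


def parse_detail(html):
    """Return (title, download_link), or None if either is missing."""
    tree = etree.HTML(html)
    title = first_or_none(tree.xpath('//div[@class="shutou"]//b/text()'))
    link = first_or_none(tree.xpath('//div[@class="shuji"]//ul/li/a/@href'))
    if title is None or link is None:
        return None
    return title, link


if __name__ == "__main__":
    print(parse_detail(SAMPLE_DETAIL_HTML))
```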

4.6 Save result

# Format the title and download link ("下载链接" means "download link"), then print them.
read = '''《%s》 下载链接 : %s ''' % (b, tress)
print(read)

4.7 Execute workflow

html = self.get_page(url)
self.parse_page(html)
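
Putting the two calls into a page loop could look like the following sketch. The function `crawl` and its `fetch`, `parse`, and `delay` parameters are hypothetical; in the real script, `fetch` and `parse` would be the spider's own methods and the page range would come from user input. The pause between requests is there to keep the load on the server low.

```python
import time

# Listing-page URL pattern from the Preparation section.
LIST_URL = "https://www.555x.org/html/wuxiaxianxia/list_29_{}.html"


def crawl(start, end, fetch, parse, delay=1.0):
    """Fetch and parse listing pages start..end, pausing between requests."""
    results = []
    for page in range(start, end + 1):
        html = fetch(LIST_URL.format(page))
        results.extend(parse(html))
        if page < end:
            # Wait before the next request to avoid hammering the server.
            time.sleep(delay)
    return results


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without touching the network.
    def fake_fetch(url):
        return url

    def fake_parse(html):
        return [html]

    for item in crawl(1, 3, fake_fetch, fake_parse, delay=0):
        print(item)
```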

5. Result

Run the script and enter the start and end pages; the console then prints each novel's title and download URL. Opening a link downloads the text file, which can then be read locally.
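
To save the files from the script rather than opening each link by hand, a streaming download along these lines is one option. It is a sketch under assumptions: it presumes the extracted link points directly at a .txt file, which the original does not confirm, and the helpers `safe_filename` and `download_txt` are hypothetical names.

```python
import os
import re


def safe_filename(title):
    """Replace characters that are invalid in common filesystems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + ".txt"


def download_txt(url, title, out_dir=".", headers=None, session=None):
    """Stream the file at url into out_dir and return the saved path."""
    if session is None:
        # Imported lazily so the helpers can be exercised with a stub session.
        import requests
        session = requests.Session()
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, safe_filename(title))
    res = session.get(url, headers=headers, stream=True, timeout=30)
    res.raise_for_status()
    with open(path, "wb") as fh:
        # Write the body in chunks so large files are not held in memory.
        for chunk in res.iter_content(chunk_size=8192):
            fh.write(chunk)
    return path
```

A call such as `download_txt(tress, b, out_dir="novels")` would then store each novel under its title, with `tress` and `b` being the link and title extracted in section 4.5.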

6. Conclusion

Avoid excessive crawling to reduce server load.

The project demonstrates how to use Python’s requests, lxml and fake_useragent to fetch novel download links.

Hands‑on practice helps deepen understanding of web scraping.
