Automate Daily Report Downloads with a Python Web Scraper – Full Code Explained

This article walks through building a Python web scraper that automatically logs in, fetches daily free reports from a registration‑required site, saves each as a PDF with appropriate filenames, and includes complete source code with detailed explanations for handling retries, cookies, and file management.

Python Crawling & Data Mining

Jul 20, 2023

In a Python community, the author was asked to create a script that automatically collects and downloads the free reports posted each day on a website that requires phone‑number registration, saving each report as a PDF named after its title.

1. Introduction

The task initially looked as though it would require reverse engineering, but the site turned out to be straightforward, so a web-scraping solution came together quickly.

2. Implementation

The core code uses requests for HTTP requests, parsel.Selector for HTML parsing, and the Windows registry (through winreg) to locate the desktop folder. It sends custom headers and cookies, retries failed downloads up to ten times, and prints the details of any exception. The script iterates over the listing pages, extracts each report's title and download link, and saves the files to a folder named “今日研报” (“Today's Research Reports”) on the desktop, skipping files that have already been downloaded.
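The retry behaviour described above can be sketched in isolation. This is a minimal illustration rather than the author's code: the helper name fetch_with_retries and its parameters are hypothetical, and the request itself is passed in as a callable so the logic can be exercised without network access.

```python
import time


def fetch_with_retries(fetch, max_attempts=10, delay=5.0, sleep=time.sleep):
    """Call fetch() until it yields HTTP 200 or 404, or the attempts run out.

    fetch is any zero-argument callable that returns an object with a
    status_code attribute (for example, a lambda wrapping requests.get).
    Returns the response on HTTP 200, otherwise None.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = fetch()
        except Exception as exc:  # network errors, timeouts, and so on
            print(f"attempt {attempt} failed: {exc!r}")
            sleep(delay)
            continue
        if response.status_code == 200:
            return response
        if response.status_code == 404:
            # The file does not exist, so retrying cannot help.
            return None
        print(f"attempt {attempt} got HTTP {response.status_code}")
        sleep(delay)
    return None
```

With requests installed, a call might look like fetch_with_retries(lambda: requests.get(url, headers=headers, cookies=cookies, timeout=10)). Returning None for both a 404 and an exhausted retry budget lets the caller treat "nothing to save" as a single case.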

# -*- coding:utf-8 -*-
"""
Development environment: Python 3.86
Script name: 2023-07-07 报告厅baogaoting
Created: 2023-07-07
"""
import datetime
import os, sys, time, traceback
import pathlib
from parsel import Selector
import requests

# Placeholders: replace these with dicts holding your own request headers and login cookies
headers = {"xxx"}
cookies = {"xxx"}

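def _down_file(dow_url, file_name):
    """
    :param dow_url:   download link
    :param file_name: absolute path of the target file, including the extension
    :return:
    """
    re_重试次数 = 0  # retry counter
    while True:
        try:
            response = requests.get(dow_url, headers=headers, cookies=cookies, timeout=10)
            if response.status_code == 200:
                break
            elif response.status_code == 404:
                # The file does not exist, so retrying will not help
                break
            else:
                print(response.status_code)
                re_重试次数 += 1
        except Exception as e:
            # Report the line number at which the exception was raised
            adress = sys.exc_info()[-1]
            line_error = traceback.extract_tb(adress, limit=1)[-1][1]
            print(f"===================\n[Exception reason]: {e}\n[Exception type]: {type(e)}\n[Exception location]: {line_error}\n===================")
            time.sleep(5)
            re_重试次数 += 1
        if re_重试次数 >= 10:
            # Give up after ten failed attempts
            response = None
            break
    # A requests Response is truthy only when its status code is below 400,
    # so a 404 response is treated as a failed download here as well
    if response:
        with open(file_name, "wb") as f:
            f.write(response.content)
    else:
        print(f"[Skipped]: download failed {file_name}")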

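import winreg  # Windows-only standard library module

def get_desktop():
    """Return the current user's desktop path as stored in the Windows registry."""
    key = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
    return winreg.QueryValueEx(key, "Desktop")[0]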

if __name__ == '__main__':
    zm_path = get_desktop()
    # Reports are saved to a folder named 今日研报 ("Today's Research Reports") on the desktop
    save_dir = pathlib.Path(zm_path) / "今日研报"
    save_dir.mkdir(parents=True, exist_ok=True)
    # Index the files already in that folder so repeat downloads can be skipped
    list_file_path = list(save_dir.rglob("*"))
    dict_file_path = {i.name: True for i in list_file_path}

    url = "https://www.baogaoting.com/space/30909237"
    for page in range(1, 3):
        params = {"page": page, "size": "15"}
        response = requests.get(url, headers=headers, cookies=cookies, params=params, timeout=10)
        # ... (omitted for brevity)
        # The omitted part is expected to define title, h3, and href_download for each report
        print(f"[{title}]: popularity {h3}, uploaded today, preparing to download {href_download}")
        if href_download:
            # Reuse the extension from the download link, then strip characters
            # that are not allowed in Windows file names
            title = title + pathlib.Path(href_download).suffix
            for k in ["<", ">", "|", '"', "*", '\\', ":", "/", "?", "\n", "\r", "\t", "！", "☆"]:
                title = title.replace(k, '').strip()
            if dict_file_path.get(title):
                print("[Status]: already downloaded, skipping")
                continue
            else:
                _down_file(href_download, str(save_dir / title))
        else:
            print(f"[Status]: link {href_download} for {title} is invalid, skipping")
        time.sleep(0.5)
    input("[Status]: all done, press Enter to exit")
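The file-naming and duplicate-skipping steps in the script can also be pulled out into small helpers. The following is a sketch under stated assumptions, not the author's implementation: the names FORBIDDEN, safe_file_name, and already_downloaded are hypothetical, and the example.com URL in the usage note is only a stand-in for a real download link. Unlike the script, it cleans the title before appending the extension, so the extension itself is never altered.

```python
import pathlib

# Characters Windows forbids in file names, plus the whitespace and
# decorative symbols that the script strips as well.
FORBIDDEN = ["<", ">", "|", '"', "*", "\\", ":", "/", "?", "\n", "\r", "\t", "！", "☆"]


def safe_file_name(title, download_url):
    """Build a file name from a report title and the extension found in its URL."""
    suffix = pathlib.PurePosixPath(download_url).suffix
    for ch in FORBIDDEN:
        title = title.replace(ch, "")
    return title.strip() + suffix


def already_downloaded(folder, file_name):
    """Return True if file_name already exists inside folder."""
    return (pathlib.Path(folder) / file_name).exists()
```

For example, safe_file_name('AI: 2023 "Outlook"?', "https://example.com/files/report.pdf") produces "AI 2023 Outlook.pdf". Checking the target folder directly, rather than building an index up front, keeps the check correct even if files are added while the script is running.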

3. Conclusion

The script successfully automates the daily download of research reports, eliminating the need to manually locate and save each file.

