Why Does My Python Scraper Save Only the Last Page? Fix It Now

This article walks through a common Python web‑scraping issue where data from multiple pages is overwritten, explains the root cause, and provides a complete, runnable script using requests, parsel, and openpyxl to correctly collect and store each page's information.

Python Crawling & Data Mining

Jun 8, 2024

Preface

In a Python community chat, a member asked why their crawler saved only the last page's data when scraping multiple pages. The problem was reproduced, and the original code was shared.

Implementation

The following script demonstrates a working solution. It fetches each page, parses the food items, and writes the results to an Excel file.

# encoding: UTF-8
# create time: 2024/05/30/0030 16:26:03

import time
from urllib.parse import urljoin

import openpyxl
import parsel
import requests

def get_page(page: int):
    """Request one listing page and return its parsed rows.
    :param page: page number (1-based)
    :return: list of [name, link] rows; empty list if the request fails"""
    url = f"https://www.xiachufang.com/category/40071/?page={page}"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'}
    try:
        # Without a timeout, requests never raises a timeout error.
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as e:
        print("Request failed:", e)
        return []
    if response.status_code != 200:
        print("Request failed, status code: " + str(response.status_code))
        return []
    return parse_page(response)

def parse_page(response):
    """Parse one page and return a list of [name, link] rows."""
    rows = []
    base_url = 'https://www.xiachufang.com'
    html = parsel.Selector(response.text)
    for food in html.css('div.info p.name'):
        rows.append([
            # The slice offsets come from the original script and depend on this page's markup.
            food.css('a::text').extract()[0][16:-14].strip(),
            urljoin(base_url, food.css('a::attr(href)').extract()[0])
        ])
    return rows

def save1(items):
    """Write every row to data.txt, one row per line."""
    with open('data.txt', 'w', encoding='utf-8') as f:
        for item in items:
            f.write(str(item) + '\n')

def save_data(items):
    """Write every row to a single Excel workbook."""
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(['ID', 'Dish Name', 'Link'])
    for item in items:
        ws.append(item)
    wb.save('xiachufang_breakfast.xlsx')

def main() -> None:
    total_pages = 3
    items = []  # rows from every page accumulate here
    for current_page in range(1, total_pages + 1):
        print("Current page:\t" + str(current_page))
        for name, link in get_page(current_page):
            # Number rows across all pages instead of restarting on each page.
            items.append([len(items) + 1, name, link])
        time.sleep(2)
    # Save once, after the loop, so no page overwrites another.
    save1(items)
    save_data(items)

if __name__ == '__main__':
    main()

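The original script saved only the last page because it wrote its output inside the per-page loop. Opening data.txt in 'w' mode truncates the file, and creating a fresh Workbook and saving it under the same file name replaces the earlier spreadsheet, so each page erased the one before it. The fix is to accumulate the rows from every page in one list, save once after the loop finishes, and print progress for each page.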
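
The overwrite can be reproduced without any network access. The sketch below is illustrative only: save_page, save_all, read_lines, and demo are hypothetical helpers, and PAGES holds made-up rows standing in for three scraped pages. It writes to a temporary directory, so nothing is left on disk.

```python
import os
import tempfile

# Made-up rows standing in for three scraped pages (not real site data).
PAGES = [["dish A", "dish B"], ["dish C"], ["dish D", "dish E"]]


def save_page(path, rows):
    """Buggy pattern: mode 'w' truncates the file on every call."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(row + "\n")


def save_all(path, pages):
    """Fixed pattern: collect every page first, then write once."""
    all_rows = []
    for rows in pages:
        all_rows.extend(rows)
    with open(path, "w", encoding="utf-8") as f:
        for row in all_rows:
            f.write(row + "\n")


def read_lines(path):
    """Return the file's lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def demo():
    """Run both patterns and return (buggy_lines, fixed_lines)."""
    with tempfile.TemporaryDirectory() as tmp:
        buggy = os.path.join(tmp, "buggy.txt")
        fixed = os.path.join(tmp, "fixed.txt")
        for rows in PAGES:
            save_page(buggy, rows)  # each call replaces the previous file
        save_all(fixed, PAGES)
        return read_lines(buggy), read_lines(fixed)


if __name__ == "__main__":
    buggy_lines, fixed_lines = demo()
    print(buggy_lines)  # only the last page survives
    print(fixed_lines)  # all five rows
```

The buggy file ends up holding only the rows from the third page, which matches the symptom reported in the chat.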

Conclusion

The article demonstrates how to diagnose a missing‑page problem in a Python web scraper, presents a complete, functional script, and highlights best practices such as proper file handling and incremental data storage.
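
For longer crawls, one alternative to holding every row in memory is to append each page to the output file as it arrives. The following is a minimal sketch of that idea, assuming CSV output is acceptable in place of Excel. append_rows and demo are hypothetical helpers, and the rows and example.com links are placeholder data, not results from the site.

```python
import csv
import os
import tempfile


def append_rows(path, rows, header=("Dish Name", "Link")):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Mode 'a' adds to the end of the file instead of truncating it.
    # newline="" is what the csv module expects for files it writes.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


def demo():
    """Append two 'pages' in separate calls and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dishes.csv")
        append_rows(path, [["dish A", "https://example.com/a"]])
        append_rows(path, [["dish B", "https://example.com/b"]])
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f))


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Because append mode never clears the file, running a crawl twice against the same path would duplicate rows, so the output file should be removed or renamed before a fresh run.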
