Backend Development 6 min read

How to Fix Common Python Web Scraping URL Errors and Build a Recipe Crawler

This article walks through a real-world Python web‑scraping issue where an extra slash in the URL caused failures, demonstrates debugging with try/except, shows a complete recipe‑crawling script using requests, lxml, and UserAgent, and explains proper URL joining with urllib.parse.urljoin.

Python Crawling & Data Mining

Apr 18, 2022

How to Fix Common Python Web Scraping URL Errors and Build a Recipe Crawler

Introduction

In a recent Python community discussion a user asked about a web‑scraping problem. The author shares the issue and a step‑by‑step solution to help readers learn how to troubleshoot similar errors.

Problem Description

The target page’s URL was constructed with an extra /, causing the request to fail and raising an index‑out‑of‑range error when using xpath selectors.

Solution Process

The fix is to remove the stray slash from the URL string and add a short sleep between requests to avoid being blocked. Below is the complete Python script that crawls recipes from https://www.xiachufang.com, extracts the step number, dish name, ingredients, and saves the data to a text file.

import requests
from lxml import etree
from fake_useragent import UserAgent
import time

class kitchen(object):
    u = 0
    def __init__(self):
        self.url = "https://www.xiachufang.com/category/40076/"
        ua = UserAgent(verify_ssl=False)
        for i in range(1, 50):
            self.headers = {'User-Agent': ua.random}
    def get_page(self, url):
        res = requests.get(url=url, headers=self.headers)
        html = res.content.decode("utf-8")
        time.sleep(2)
        return html
    def parse_page(self, html):
        parse_html = etree.HTML(html)
        image_src_list = parse_html.xpath('//li/div/a/@href')
        for i in image_src_list:
            try:
                url = "https://www.xiachufang.com" + i
                html1 = self.get_page(url)
                parse_html1 = etree.HTML(html1)
                num = parse_html1.xpath('.//h2[@id="steps"]/text()')[0].strip()
                name = parse_html1.xpath('.//li[@class="container"]/p/text()')
                ingredients = parse_html1.xpath('.//td//a/text()')
                self.u += 1
                food_info = "
    第 %s 种
    菜 名 : %s
    原 料 : %s
    下载 链 接 : %s
    =============================================================
" % (str(self.u), num, ingredients, url)
                f = open('下厨房菜谱.txt', 'a', encoding='utf-8')
                f.write(str(food_info))
                print(str(food_info))
                f.close()
            except:
                print('xpath没获取到内容！')
    def main(self):
        startPage = int(input("起始页:"))
        endPage = int(input("终止页:"))
        for page in range(startPage, endPage + 1):
            url = self.url.format(page)
            html = self.get_page(url)
            self.parse_page(html)
            time.sleep(2.4)
            print("====================================第 %s 页 爬 取 成 功====================================" % page)

if __name__ == '__main__':
    imageSpider = kitchen()
    imageSpider.main()

The script writes each recipe’s information to a .txt file, as shown in the screenshot below.

Using urljoin for Robust URL Construction

When concatenating URLs, it is safer to use urljoin from urllib.parse to correctly handle missing or extra slashes.

from urllib.parse import urljoin
source_url = 'https://www.baidu.com/'
child_url1 = '/robots.txt'
child_url2 = 'robots.txt'
final_url1 = urljoin(source_url, child_url1)
final_url2 = urljoin(source_url, child_url2)
print(final_url1)
print(final_url2)

Conclusion

The article identifies a common URL‑construction mistake in Python web scraping, provides a full working crawler, and recommends using urljoin for reliable URL handling, helping readers avoid similar pitfalls.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

requests XPath lxml urljoin

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.