Backend Development 6 min read

How to Fix Common URL Construction Errors in Python Web Scraping

This article walks through a typical Python web‑scraping issue caused by incorrect URL concatenation and XPath mismatches, explains the debugging process, and provides complete, ready‑to‑run code along with a practical urljoin example to prevent similar errors.

Python Crawling & Data Mining

Jan 9, 2025

How to Fix Common URL Construction Errors in Python Web Scraping

Introduction

Hello, I’m PiPi. A follower asked a Python web‑scraping question in a chat group, so I’m sharing the problem and solution here.

Problem

The target page structure had changed, causing XPath selectors to miss elements and raise list‑index‑out‑of‑range errors. A simple try block avoided the crash but still returned no data.

Further inspection revealed that the constructed url contained an extra “/”, which broke the request.

Solution

Removing the stray slash and adding a short sleep between requests fixes the issue. Below is the complete script that crawls recipes from the site and saves them to a text file.

import requests
from lxml import etree
from fake_useragent import UserAgent
import time

class kitchen(object):
    u = 0
    def __init__(self):
        self.url = "https://www.xiachufang.com/category/40076/"
        ua = UserAgent(verify_ssl=False)
        for i in range(1, 50):
            self.headers = {'User-Agent': ua.random}
    def get_page(self, url):
        res = requests.get(url=url, headers=self.headers)
        html = res.content.decode("utf-8")
        time.sleep(2)
        return html
    def parse_page(self, html):
        parse_html = etree.HTML(html)
        image_src_list = parse_html.xpath('//li/div/a/@href')
        for i in image_src_list:
            try:
                url = "https://www.xiachufang.com" + i
                html1 = self.get_page(url)
                parse_html1 = etree.HTML(html1)
                num = parse_html1.xpath('.//h2[@id="steps"]/text()')[0].strip()
                name = parse_html1.xpath('.//li[@class="container"]/p/text()')
                ingredients = parse_html1.xpath('.//td//a/text()')
                self.u += 1
                food_info = "
第 %s 种
菜 名 : %s
原 料 : %s
下载 链 接 : %s
" % (str(self.u), num, ingredients, url)
                f = open('下厨房菜谱.txt', 'a', encoding='utf-8')
                f.write(str(food_info))
                print(str(food_info))
                f.close()
            except:
                print('xpath没获取到内容！')
    def main(self):
        startPage = int(input("起始页:"))
        endPage = int(input("终止页:"))
        for page in range(startPage, endPage + 1):
            url = self.url.format(page)
            html = self.get_page(url)
            self.parse_page(html)
            time.sleep(2.4)
            print("====================================第 %s 页 爬 取 成 功====================================" % page)

if __name__ == '__main__':
    imageSpider = kitchen()
    imageSpider.main()

Result

The script writes each recipe’s details to 下厨房菜谱.txt. An example of the saved output is shown below.

URL‑joining tip

When constructing URLs, use urljoin to combine a base URL with a relative path safely. This avoids missing or duplicate slashes.

from urllib.parse import urljoin
source_url = 'https://www.baidu.com/'
child_url1 = '/robots.txt'
child_url2 = 'robots.txt'
final_url1 = urljoin(source_url, child_url1)
final_url2 = urljoin(source_url, child_url2)
print(final_url1)
print(final_url2)

urljoin

fills in missing parts from the base URL; if the second argument is an absolute URL, it takes precedence.

Conclusion

The article identified a common web‑scraping error—incorrect URL concatenation—provided a step‑by‑step fix, demonstrated a full Python crawler, and highlighted the usefulness of urljoin for reliable URL construction.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Error handling requests lxml urljoin

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.