
Master Python Web Scraping & Data Extraction with Requests, lxml, pandas

This article walks through a Python web-scraping solution that fetches GDP data from a website with the requests library and parses the HTML with lxml. It shows two approaches, manual XPath extraction and a shorter pandas.read_html method, with complete code for each, including pagination and saving the results to CSV or Excel.

Python Crawling & Data Mining

Feb 13, 2023


1. Introduction

A member of a Python community recently asked for help with a web-scraping problem. The solution is shared here.

The original code uses requests, lxml, and csv to crawl GDP data from a website, extract rows via XPath, and write them to a CSV file.

import csv

import requests
from lxml import etree

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}

# Open the CSV once. The columns are rank, region, GDP, year.
file = open('data.csv', mode='a', encoding='utf-8', newline='')
csv_write = csv.DictWriter(file, fieldnames=['排名', '地区', 'GDP', '年份'])
csv_write.writeheader()


def first(nodes):
    # xpath() returns a list; keep the first text node, or '' when nothing matched.
    return nodes[0].strip() if nodes else ''


def data(text):
    e = etree.HTML(text)
    # Only rows whose class attribute is exactly "even" match this rule;
    # rows with any other class value are skipped.
    lst = e.xpath('//*[@id="tablepress-48"]/tbody/tr[@class="even"]')
    for l in lst:
        no = first(l.xpath('./td[1]/center/span/text()'))
        name = first(l.xpath('./td[2]/a/center/text()'))
        team = first(l.xpath('./td[3]/center/text()'))
        year = first(l.xpath('./td[4]/center/text()'))
        data_dict = {'排名': no, '地区': name, 'GDP': team, '年份': year}
        print(f'排名：{no} 地区:{name} GDP:{team} 年份:{year} ')
        csv_write.writerow(data_dict)


def gdpData(page):
    url = f'https://www.hongheiku.com/category/gdjsgdp/page/{page}'
    resp = requests.get(url, headers=HEADERS)
    data(resp.text)


# First page: write its rows, then read the page count from the pagination bar.
url = 'https://www.hongheiku.com/category/gdjsgdp'
resp = requests.get(url, headers=HEADERS)
data(resp.text)
e = etree.HTML(resp.text)
count = e.xpath('//div[@class="pagination pagination-multi"][last()]/ul/li[last()]/span/text()')[0].split(' ')[1]
# Remaining pages: 2 through count.
for index in range(int(count) - 1):
    gdpData(index + 2)

# Close the file only after every page has been written.
file.close()
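
The CSV-writing step can also be isolated in a small function that owns the file handle. The sketch below is one possible arrangement, not the original author's code: the function name append_rows and the English column names are hypothetical. It uses a with block so the file is closed as soon as the rows are written, and it writes the header only when the file is new or empty, since opening in append mode and calling writeheader() on every run would otherwise repeat the header line.

```python
import csv
import os

# Hypothetical English column names, used for illustration only.
FIELDS = ['rank', 'region', 'gdp', 'year']


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header at most once."""
    # A missing or empty file still needs its header line.
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # The with block closes the file even if writing raises an exception.
    with open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Example call (the values are made up):
# append_rows('data.csv', [{'rank': '1', 'region': 'A', 'gdp': '10', 'year': '2022'}])
```

Calling append_rows once per scraped page keeps the scraping loop free of file-handling details.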

2. Implementation

The original script runs, but it sometimes fails to retrieve data because its XPath rules are brittle: they assume a fixed row class and a fixed nesting of tags inside each cell.
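
To make the fragility concrete, here is a minimal sketch run against a made-up HTML fragment (the SAMPLE markup is an assumption and not the real page). It compares the exact-match rule tr[@class="even"] with a more tolerant rule that takes every row containing data cells and reads each cell with string(.), which concatenates all descendant text regardless of nesting.

```python
from lxml import etree

# Hypothetical markup: row classes and cell nesting differ between rows.
SAMPLE = """
<table id="tablepress-48"><tbody>
<tr class="row-2 even"><td><center><span>1</span></center></td><td><a>Region A</a></td></tr>
<tr class="row-3 odd"><td><center>2</center></td><td>Region B</td></tr>
</tbody></table>
"""


def strict_rows(doc):
    # Same shape as the original rule: the class must equal "even" exactly.
    return doc.xpath('//*[@id="tablepress-48"]/tbody/tr[@class="even"]')


def tolerant_rows(doc):
    # Every row that has at least one <td>, whatever its class is.
    return doc.xpath('//*[@id="tablepress-48"]//tr[td]')


def cell_texts(row):
    # string(.) joins all text under the cell, so extra wrappers do not matter.
    return [td.xpath('string(.)').strip() for td in row.xpath('./td')]


doc = etree.HTML(SAMPLE)
strict = strict_rows(doc)
tolerant = [cell_texts(row) for row in tolerant_rows(doc)]
print(len(strict), tolerant)
```

In this sample the exact-match rule finds no rows, because "row-2 even" is not equal to "even", while the tolerant rule returns both rows. Whether the real page behaves the same way depends on its actual markup.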

A revised solution uses pandas to read the HTML table directly, which simplifies extraction and avoids the rows the XPath version missed.

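from io import StringIO

import pandas as pd
import requests

url = 'https://www.hongheiku.com/category/gdjsgdp'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
resp = requests.get(url, headers=headers)
# read_html returns one DataFrame per <table>; take the first one.
# Wrapping the text in StringIO avoids the deprecation warning that newer
# pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(resp.text))[0].dropna()
# Writing .xlsx requires an Excel engine such as openpyxl to be installed.
df.to_excel('1.xlsx', index=False)
df  # in a notebook, this displays the result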
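
The pandas snippet above reads only the first page. A possible way to cover the remaining pages is sketched below, reusing the /page/N URL pattern from the original script. The function names, the shortened user-agent placeholder, the last_page parameter, and the one-second delay are all assumptions for illustration; the page count would still have to be obtained separately, for example from the pagination bar as the original script does.

```python
import time
from io import StringIO

import pandas as pd
import requests

BASE_URL = 'https://www.hongheiku.com/category/gdjsgdp'
# Placeholder value; in practice reuse the full user-agent string shown earlier.
HEADERS = {'user-agent': 'Mozilla/5.0'}


def page_url(page):
    # Page 1 is the category URL itself; later pages live under /page/N.
    return BASE_URL if page == 1 else f'{BASE_URL}/page/{page}'


def first_table(html):
    # read_html returns one DataFrame per <table>; keep the first.
    return pd.read_html(StringIO(html))[0]


def scrape(last_page, delay=1.0):
    """Fetch pages 1..last_page and stack their first tables into one frame."""
    frames = []
    for page in range(1, last_page + 1):
        resp = requests.get(page_url(page), headers=HEADERS, timeout=10)
        resp.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        frames.append(first_table(resp.text))
        time.sleep(delay)  # pause between requests to stay polite to the server
    return pd.concat(frames, ignore_index=True)
```

With a known page count, scrape(count).to_excel('1.xlsx', index=False) would save the combined table in one step.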

The pandas approach produces a complete dataset without gaps.
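
One caveat is worth checking before trusting that result: dropna() with default arguments removes every row that contains at least one missing cell, so legitimate rows with a single empty field would be dropped too. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical) to compare the default with how='all', which removes only rows that are entirely empty.

```python
import pandas as pd

# Made-up data: one complete row, one row with a missing GDP, one empty row.
df = pd.DataFrame({
    'rank': [1, 2, None],
    'region': ['A', 'B', None],
    'gdp': [10.0, None, None],
})

# Default: drop any row with at least one missing value.
strict = df.dropna()

# how='all': drop a row only when every cell in it is missing.
lenient = df.dropna(how='all')

print(len(strict), len(lenient))
```

Comparing the row counts before and after dropna() on the scraped table is a quick way to see whether any real rows were removed.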

Another participant also wrote a working script that yields the expected results.

3. Conclusion

This article worked through a Python web-scraping problem: it analyzed why the XPath-based script missed data, gave code for both the XPath and the pandas.read_html approaches, and helped the user obtain the desired data.

