Backend Development 8 min read

How to Scrape GDP Data with Python and Save to CSV in Minutes

This article demonstrates how to use Python's requests, lxml, and pandas libraries to crawl GDP data from a website, parse the HTML tables, and efficiently write the extracted rankings, regions, GDP values, and years into a CSV file, providing a complete, runnable example for web scraping beginners.

Python Crawling & Data Mining

Apr 18, 2024

How to Scrape GDP Data with Python and Save to CSV in Minutes

Introduction

The author received a request to modify a Python web‑scraping script that originally used pandas to fetch GDP data but stored the results in an inconvenient way. The goal is to retrieve ranking, region, GDP, and year information from a target site and write it directly to a CSV file.

import requests
from lxml import etree
import csv
import time
import pandas as pd

def gdpData(page):
    url = f'https://www.hongheiku.com/category/gdjsgdp/page/{page}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    data(resp.text)
file = open('data.csv', mode='a', encoding='utf-8', newline='')
csv_write = csv.DictWriter(file, fieldnames=['排名', '地区', 'GDP', '年份'])
csv_write.writeheader()

def data(text):
    e = etree.HTML(text)
    lst = e.xpath('//*[@id="tablepress-48"]/tbody/tr[@class="even"]')
    for l in lst:
        no = l.xpath('./td[1]/center/span/text()')
        name = l.xpath('./td[2]/a/center/text()')
        team = l.xpath('./td[3]/center/text()')
        year = l.xpath('./td[4]/center/text()')
        data_dict = {'排名': no, '地区': name, 'GDP': team, '年份': year}
        print(f'排名：{no} 地区:{name} GDP:{team} 年份:{year} ')
        csv_write.writerow(data_dict)
file.close()
url = 'https://www.hongheiku.com/category/gdjsgdp'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
resp = requests.get(url, headers=headers)
# print(resp.text)
data(resp.text)
e = etree.HTML(resp.text)
#//*[@id="tablepress-48"]/tbody/tr[192]/td[3]/center
count = e.xpath('//div[@class="pagination pagination-multi"][last()]/ul/li[last()]/span/text()')[0].split(' ')[1]
for index in range(int(count) - 1):
    gdpData(index + 2)

Implementation

The revised script moves the CSV file handling inside the data function, ensures each field is joined into a plain string, and adds explicit header writing before looping through the rows. This makes the code self‑contained and ready to run.

import requests
from lxml import etree
import csv
import time
import pandas as pd

def gdpData(page):
    url = f'https://www.hongheiku.com/category/gdjsgdp/page/{page}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    data(resp.text)

def data(text):
    file = open('data.csv', mode='a', encoding='utf-8', newline='')
    csv_write = csv.DictWriter(file, fieldnames=['排名', '地区', 'GDP', '年份'])
    csv_write.writeheader()
    e = etree.HTML(text)
    lst = e.xpath('//*[@id="tablepress-48"]/tbody/tr[@class="even"]')
    for l in lst:
        no = ''.join(l.xpath('./td[1]/center/span/text()'))
        name = ''.join(l.xpath('./td[2]/a/center/text()')[0])
        team = ''.join(l.xpath('./td[3]/center/text()'))
        year = ''.join(l.xpath('./td[4]/center/text()'))
        data_dict = {'排名': no, '地区': name, 'GDP': team, '年份': year}
        print(f'排名：{no} 地区:{name} GDP:{team} 年份:{year} ')
        csv_write.writerow(data_dict)
    file.close()

url = 'https://www.hongheiku.com/category/gdjsgdp'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
resp = requests.get(url, headers=headers)
# print(resp.text)
data(resp.text)

e = etree.HTML(resp.text)
#//*[@id="tablepress-48"]/tbody/tr[192]/td[3]/center
count = e.xpath('//div[@class="pagination pagination-multi"][last()]/ul/li[last()]/span/text()')[0].split(' ')[1]
for index in range(int(count) - 1):
    gdpData(index + 2)

Running the script writes all extracted rows into data.csv, which can then be opened in Excel or processed further.

Conclusion

The article provides a complete, step‑by‑step solution for a Python web‑scraping task that fetches GDP rankings from a public website and stores the results in a CSV file, illustrating how to combine requests, lxml, and the standard csv module for reliable data extraction.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python CSV Web Scraping lxml

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.