Boost Python Data Loading Speed 10×: 5 Proven Methods
This article demonstrates five practical techniques for loading Excel and CSV data with pandas in Python, including smarter DataFrame construction, parallel loading with Joblib, and pickle-based storage, achieving speedups of up to three orders of magnitude.
5 Ways to Load Data in Python
As a Python user, I often need to load and store data from Excel or CSV files, but native Excel handling can be slow.
This article presents five methods to load data in Python, achieving up to three orders of magnitude speedup.
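The benchmarks below read ten files named "Dummy 0" through "Dummy 9". A minimal sketch to generate matching dummy data (the column layout, random contents, and seed are assumptions, not the article's exact dataset):

```python
import numpy as np
import pandas as pd

# Generate ten dummy CSV files of 20,000 rows each, mirroring the
# benchmark setup (columns and contents are illustrative assumptions).
rng = np.random.default_rng(seed=0)
for file_number in range(10):
    df = pd.DataFrame(rng.integers(0, 100, size=(20_000, 5)),
                      columns=list("ABCDE"))
    df.to_csv(f"Dummy {file_number}.csv", index=False)
    # For the Excel benchmark, df.to_excel(f"Dummy {file_number}.xlsx",
    # index=False) writes the .xlsx counterparts (requires openpyxl).
```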
1. Load Excel files
The simple approach uses pandas.read_excel and concatenates each file onto a growing DataFrame. The benchmark takes about 53 seconds for ten 20,000-row files.
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

start = time.time()
df = pd.read_excel("Dummy 0.xlsx")
for file_number in range(1, 10):
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    df = pd.concat([df, pd.read_excel(f"Dummy {file_number}.xlsx")])
end = time.time()
print("Excel:", end - start)
# >> Excel: 53.4

2. Load CSV files
Switching to pandas.read_csv speeds up loading dramatically: about 0.63 seconds for the same ten files, nearly two orders of magnitude faster than Excel.
start = time.time()
df = pd.read_csv("Dummy 0.csv")
for file_number in range(1, 10):
    df = pd.concat([df, pd.read_csv(f"Dummy {file_number}.csv")])
end = time.time()
print("CSV:", end - start)
# >> CSV: 0.632

CSV files are usually larger on disk than .xlsx, but the speed gain far outweighs the size increase.
3. Build smarter DataFrames
Load each file into its own DataFrame, collect them in a list, then concatenate once at the end. Repeatedly concatenating onto a growing DataFrame copies all the accumulated data on every iteration; a single pd.concat avoids that overhead.
start = time.time()
frames = []
for file_number in range(10):
    frames.append(pd.read_csv(f"Dummy {file_number}.csv"))
df = pd.concat(frames, ignore_index=True)
end = time.time()
print("CSV2:", end - start)
# >> CSV2: 0.619

4. Parallelize CSV loading with Joblib
Joblib can read the files in parallel, cutting load time from about 0.62 to 0.39 seconds on an 8-core M1 Mac.
def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")

start = time.time()
frames = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(i) for i in range(10))
df = pd.concat(frames, ignore_index=True)
end = time.time()
print("CSV//:", end - start)
# >> CSV//: 0.386

5. Faster storage with Pickle
Saving DataFrames as pickle files cuts loading to about 0.07 seconds, an 80% reduction over parallel CSV, though pickle files are not human-readable and may take more disk space.
def loop(file_number):
    return pd.read_pickle(f"Dummy {file_number}.pickle")

start = time.time()
frames = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(i) for i in range(10))
df = pd.concat(frames, ignore_index=True)
end = time.time()
print("Pickle//:", end - start)
# >> Pickle//: 0.072

Use pickles when you need to reload the same data repeatedly or when the data will only be consumed by Python.
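The parallel reader above assumes the .pickle files already exist, written once with DataFrame.to_pickle. A minimal round-trip sketch with a toy DataFrame (the file name "demo.pickle" is illustrative, not part of the benchmark):

```python
import pandas as pd

# Round-trip sketch: to_pickle writes a binary dump of the DataFrame,
# and read_pickle restores it without any text parsing.
df = pd.DataFrame({"A": range(5), "B": [x * 2 for x in range(5)]})
df.to_pickle("demo.pickle")
restored = pd.read_pickle("demo.pickle")
print(restored.equals(df))  # True: values and dtypes survive exactly
```

A one-time pass of read_csv followed by to_pickle over the ten benchmark files converts them the same way.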
Bonus: Parallel Excel loading
Joblib can also parallelize Excel loading, cutting load time from 50 seconds to about 13 seconds.
def loop(file_number):
    return pd.read_excel(f"Dummy {file_number}.xlsx")

start = time.time()
frames = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(i) for i in range(10))
df = pd.concat(frames, ignore_index=True)
end = time.time()
print("Excel//:", end - start)
# >> Excel//: 13.4

These techniques let you load large tabular datasets in Python orders of magnitude faster.