Boost Pandas Data Processing Speed Up to 315× with Vectorized Techniques
This article walks through several pandas performance‑boosting methods—from naive for‑loops and iterrows to apply, .isin, pd.cut, and NumPy digitize—showing timing results and demonstrating how vectorized operations can accelerate hourly tariff calculations by hundreds of times.
Introduction
The previous post demonstrated a 50× speed‑up using datetime tricks; this article shares even more common acceleration techniques for pandas data processing.
Naive for‑loop
A simple for loop that applies a custom apply_tariff function to each row takes several seconds for 8,760 rows.
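The apply_tariff helper itself is not shown in this article. A minimal sketch, assuming it uses the same hour bands and cents-per-kWh rates that appear in the vectorized versions later on (12 for hours 0 to 6, 20 for hours 7 to 16, 28 for hours 17 to 23), might look like the following; the error handling is an addition for illustration only.

```python
def apply_tariff(kwh, hour):
    """Return the cost in cents for `kwh` of energy used during `hour` (0-23).

    Assumed rates, taken from the vectorized examples in this article.
    """
    if 0 <= hour < 7:
        rate = 12   # off-peak
    elif 7 <= hour < 17:
        rate = 20   # shoulder
    elif 17 <= hour < 24:
        rate = 28   # peak
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh


# Example: 2 kWh used at 18:00 falls in the peak band.
example_cost = apply_tariff(kwh=2.0, hour=18)
```

Because the function branches on a single scalar hour, it can only be called one row at a time, which is what the loop-based versions do.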
def apply_tariff_loop(df):
    """Calculate costs with a plain Python loop and add them to the DataFrame."""
    energy_cost_list = []
    for i in range(len(df)):
        # Look up the energy used and the hour of day for row i
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list
Using iterrows
Replacing the range‑based loop with df.iterrows() reduces the runtime to about 0.7 seconds.
def apply_tariff_iterrows(df):
    energy_cost_list = []
    # iterrows() yields (index, row) pairs, so no positional lookup is needed
    for index, row in df.iterrows():
        energy_used = row['energy_kwh']
        hour = row['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list
Using pandas apply
Applying the function with df.apply(..., axis=1) cuts the time further to roughly 0.27 seconds.
def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(
        lambda row: apply_tariff(kwh=row['energy_kwh'], hour=row['date_time'].hour),
        axis=1)
Vectorized .isin method
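Creating Boolean masks for peak, shoulder, and off‑peak hours and assigning values with df.loc brings the runtime down to 0.01 seconds, a 315× improvement over the naive loop. Unlike the earlier functions, this one reads the hour from df.index, so date_time must first be set as the DataFrame's index (for example with df.set_index('date_time')).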
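As a quick illustration of the masking idea, the sketch below builds the three masks for a single day of hourly timestamps and checks that every hour falls into exactly one band. The 24-row index is hypothetical sample data, not the article's dataset.

```python
import pandas as pd

# One illustrative day of hourly timestamps (hypothetical data).
index = pd.date_range("2013-01-01", periods=24, freq=pd.Timedelta(hours=1))
hours = index.hour

peak = hours.isin(range(17, 24))      # hours 17-23
shoulder = hours.isin(range(7, 17))   # hours 7-16
off_peak = hours.isin(range(0, 7))    # hours 0-6

# Each hour should belong to exactly one band, so the masks sum to 1 everywhere.
bands_per_hour = peak.astype(int) + shoulder.astype(int) + off_peak.astype(int)
print(bands_per_hour.min(), bands_per_hour.max())
print(peak.sum(), shoulder.sum(), off_peak.sum())
```

Since the masks are mutually exclusive and cover all 24 hours, three df.loc assignments are enough to fill every row of the cost column.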
def apply_tariff_isin(df):
    # Assumes df has a DatetimeIndex (date_time set as the index).
    # Each mask is a Boolean array marking the rows in one tariff band.
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
Using pd.cut
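Leveraging pd.cut to bin the hours and multiply by the matching rate handles all three tariff bands in a single pass. The 0.272-second average quoted for this step is identical to the apply timing and would be slower than the 0.01 seconds of the .isin version, so it should be read with caution; a vectorized pd.cut over 8,760 rows would be expected to finish in a few milliseconds.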
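One detail worth checking with pd.cut is which side of each bin is closed, because that decides where the boundary hours 7 and 17 land. The sketch below compares the default right-closed bins with left-closed bins (right=False) on the hours 0 to 23; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

hours = np.arange(24)
bins = [0, 7, 17, 24]
rates = [12, 20, 28]

# Default right=True gives [0, 7], (7, 17], (17, 24] once include_lowest=True
# pulls hour 0 into the first bin.
right_closed = pd.cut(hours, bins=bins, include_lowest=True, labels=rates)

# right=False gives [0, 7), [7, 17), [17, 24).
left_closed = pd.cut(hours, bins=bins, right=False, labels=rates)

rates_right = np.asarray(right_closed).astype(int)
rates_left = np.asarray(left_closed).astype(int)

# Only the boundary hours differ between the two settings.
print(np.flatnonzero(rates_right != rates_left))
```

With right-closed bins, hour 7 is billed at the off-peak rate and hour 17 at the shoulder rate; with right=False they are billed at the shoulder and peak rates, which is what range(7, 17) and range(17, 24) select in the .isin version.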
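def apply_tariff_cut(df):
    # right=False makes the bins [0, 7), [7, 17), [17, 24), so hours 7 and 17
    # fall in the shoulder and peak bands, matching the .isin version.
    cents_per_kwh = pd.cut(
        x=df.index.hour,
        bins=[0, 7, 17, 24],
        right=False,
        labels=[12, 20, 28]
    ).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']
Using NumPy digitize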
Applying np.digitize with a price array yields the fastest result—about 0.002 seconds for the same dataset.
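To see why indexing a price array with the result of np.digitize works, the sketch below runs it on a handful of boundary hours. The sample hours are chosen for illustration.

```python
import numpy as np

prices = np.array([12, 20, 28])
hours = np.array([0, 6, 7, 16, 17, 23])

# With the default right=False, np.digitize returns i such that
# bins[i-1] <= x < bins[i]; values below bins[0] map to 0.
band_index = np.digitize(hours, bins=[7, 17, 24])

# Fancy indexing turns each band index into its rate.
rates = prices[band_index]
print(band_index)  # [0 0 1 1 2 2]
print(rates)       # [12 12 20 20 28 28]
```

The band index doubles as a position in the price array, so the lookup is a single NumPy indexing operation with no Python-level branching.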
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    # digitize maps hours 0-6 to 0, 7-16 to 1, and 17-23 to 2
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values
Conclusion
Vectorized operations, especially those based on Boolean indexing, pd.cut, or np.digitize, dramatically outperform explicit Python loops and even the apply method, making them the preferred choice for large‑scale time‑based calculations in pandas.
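To reproduce the comparison on your own machine, the following sketch builds a hypothetical year of hourly readings (random values, not the article's dataset), checks that the three vectorized approaches agree, and times each with timeit. The cost_* helper names are illustrative, the pd.cut variant uses left-closed bins (right=False) so its bands match the other two, and absolute timings will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 8,760 hourly readings with a DatetimeIndex.
rng = np.random.default_rng(0)
index = pd.date_range("2013-01-01", periods=8760, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"energy_kwh": rng.uniform(0.1, 3.0, size=8760)}, index=index)


def cost_isin(df):
    hours = df.index.hour
    energy = df["energy_kwh"].to_numpy()
    cost = np.empty(len(df))
    for rate, band in [(12, range(0, 7)), (20, range(7, 17)), (28, range(17, 24))]:
        mask = hours.isin(band)          # Boolean array for this tariff band
        cost[mask] = energy[mask] * rate
    return cost


def cost_cut(df):
    # Left-closed bins so that hours 7 and 17 start the shoulder and peak bands.
    binned = pd.cut(df.index.hour, bins=[0, 7, 17, 24], right=False,
                    labels=[12, 20, 28])
    return np.asarray(binned).astype(int) * df["energy_kwh"].to_numpy()


def cost_digitize(df):
    prices = np.array([12, 20, 28])
    band_index = np.digitize(df.index.hour.to_numpy(), bins=[7, 17, 24])
    return prices[band_index] * df["energy_kwh"].to_numpy()


results = {f.__name__: f(df) for f in (cost_isin, cost_cut, cost_digitize)}
all_agree = (np.allclose(results["cost_isin"], results["cost_cut"])
             and np.allclose(results["cost_isin"], results["cost_digitize"]))
print("all three agree:", all_agree)

for func in (cost_isin, cost_cut, cost_digitize):
    seconds = timeit.timeit(lambda: func(df), number=20) / 20
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

Checking that the variants return the same values before comparing speed guards against a faster version that silently bills boundary hours at a different rate.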
