How to Speed Up Excel Sheet Row Counting with Pandas, Polars, and Calamine

This article walks through a Python approach for quickly counting the rows in each sheet of a large Excel workbook. It compares pandas with the Calamine engine against Polars and covers optimization tips that cut processing time from dozens of seconds to just a few.

Python Crawling & Data Mining

Sep 7, 2024

1. Introduction

A user asked how to efficiently count the rows in each sheet of an Excel file using pandas. For a workbook of about 130,000 rows, the default engine took about 50 seconds, while the Calamine engine reduced that to about 10 seconds. The user's code was:

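import time

import pandas as pd
import polars as pl

# Raw string so the backslashes in the Windows path are not treated as escapes.
path = r'G:\input\测试.xlsx'
start_time = time.time()

# First read: pandas loads every sheet, here only to discover the sheet names.
df = pd.read_excel(path, sheet_name=None, dtype=str, engine='calamine')
sheet_names = list(df.keys())
for sheet_name in sheet_names:
    # Second read: Polars loads each sheet again to get its row count.
    df_sheet = pl.read_excel(path, sheet_name=sheet_name)
    print(f'{sheet_name}----------{df_sheet.height}')
end_time = time.time()
time_taken = end_time - start_time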

The goal was to obtain the row count for each sheet more quickly.

2. Implementation Process

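Community members pointed out that the original code reads the workbook twice: once with pandas to get the sheet names, and again with Polars for each sheet. Reading the entire workbook once with pandas (sheet_name=None returns a dictionary mapping each sheet name to its DataFrame) and iterating over that dictionary makes the extra Polars read unnecessary.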

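import time

import pandas as pd

start_time = time.time()
# Read every sheet once; raw string keeps the Windows path backslashes literal.
df = pd.read_excel(r'G:\input\测试.xlsx', sheet_name=None, dtype=str, engine='calamine')
for sheet_name, dataframe in df.items():
    print(f'{sheet_name}-----------{dataframe.shape[0]}')  # shape[0] is the row count
end_time = time.time()
time_taken = end_time - start_time
print(f'calamine----{time_taken}')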

This change reduced the processing time to about 5 seconds, compared with roughly 25 seconds without Calamine. Additional discussion covered:

Whether using dtype or info() affects speed.

Pandas runs on a single CPU core, making it slower for large files.

Polars offers the best speed but requires more code changes.

Modin did not provide a speed benefit in this case.

Converting Excel files to CSV or reading from memory can further improve performance.
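As a rough illustration of the CSV suggestion: once a sheet has been exported to CSV, its rows can be counted by streaming the file with the standard library, without building a DataFrame at all. The sketch below is an assumption about how that might look, not code from the discussion. The helper count_csv_rows and the sample file are hypothetical, and the first row is assumed to be a header.

```python
import csv
import os
import tempfile


def count_csv_rows(csv_path, has_header=True):
    """Count data rows in a CSV file by streaming it (hypothetical helper)."""
    # newline='' lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, newline='', encoding='utf-8') as handle:
        total = sum(1 for _ in csv.reader(handle))
    if has_header and total > 0:
        total -= 1  # do not count the header row
    return total


# Demo on a small temporary file standing in for an exported sheet.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sheet1.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['id', 'note'])
        writer.writerow(['1', 'plain'])
        writer.writerow(['2', 'line one\nline two'])  # quoted multi-line cell
        writer.writerow(['3', 'last'])
    row_count = count_csv_rows(sample_path)
    print(f'sheet1----------{row_count}')
```

The conversion itself still has to read the Excel file once, so this is likely to pay off mainly when the same workbook is counted or processed repeatedly.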

The user’s final requirement was to merge hundreds of different sheets into a single unified table and verify row counts, for which the Calamine engine proved to be a practical speed‑up without extensive code modifications.
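The merge-and-verify step could be sketched as follows. This is a minimal illustration under assumptions, not the user's actual code: merge_sheets and the source_sheet column are hypothetical names, and the small in-memory frames stand in for the dictionary that pd.read_excel(..., sheet_name=None, engine='calamine') returns.

```python
import pandas as pd


def merge_sheets(sheets):
    """Stack all sheets into one table and check that no rows were lost.

    `sheets` is a dict of {sheet name: DataFrame}. Hypothetical helper.
    """
    counts = {name: len(frame) for name, frame in sheets.items()}
    # Tag each row with its sheet of origin; columns missing from a sheet become NaN.
    tagged = [frame.assign(source_sheet=name) for name, frame in sheets.items()]
    merged = pd.concat(tagged, ignore_index=True)
    expected = sum(counts.values())
    if len(merged) != expected:
        raise ValueError(f'expected {expected} rows, got {len(merged)}')
    return merged, counts


# Stand-in data; sheets with different columns are allowed.
sheets = {
    'Jan': pd.DataFrame({'id': ['1', '2'], 'amount': ['10', '20']}),
    'Feb': pd.DataFrame({'id': ['3'], 'amount': ['30'], 'note': ['late']}),
}
merged, counts = merge_sheets(sheets)
for sheet_name, row_count in counts.items():
    print(f'{sheet_name}----------{row_count}')
print(f'merged----------{len(merged)}')
```

For a plain concatenation the row-count check always passes. It becomes a useful guard once filtering or de-duplication is added between reading and merging.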

3. Summary

The discussion produced a concrete fix for a pandas-based Excel row-counting problem: switching to the Calamine engine and reading the workbook only once cut processing time from dozens of seconds to about 5. It also highlighted alternative tools such as Polars and strategies such as CSV conversion for handling large datasets.
