Why Polars Beats Pandas and PySpark in Single‑Node Benchmarks – A Deep Dive
This article compares Pandas, Polars, and PySpark across five dataset sizes, showing how Polars' eager and lazy modes dramatically outperform the other tools, and discusses when each framework is the most suitable choice for data processing workloads.
Background
The author reflects on personal experience: Pandas was the go‑to tool for feature engineering during a sentiment‑analysis project; Spark (via PySpark) became central to daily ETL pipelines after joining a company; and Polars was recently adopted for processing millions of rows with impressive speed.
Pandas
Pandas is the mainstream tool for data manipulation, exploration and analysis. It integrates seamlessly with many machine‑learning libraries such as scikit‑learn.
NumPy provides the underlying arrays, linear algebra, and numerical routines; Pandas builds its data structures on top of NumPy.
Scikit‑learn is the reference library for machine‑learning applications; data is typically loaded, visualized and analyzed with Pandas or NumPy.
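To ground the comparison, here is a minimal Pandas sketch of the read/filter/aggregate/write pattern that the benchmark later measures; the file name `transactions.csv` and the `amount`/`category` columns are illustrative assumptions, not the benchmark's actual schema.

```python
import pandas as pd

# Illustrative file and column names (not the benchmark's real schema).
df = pd.read_csv("transactions.csv")                      # read
large = df[df["amount"] > 100]                            # filter
summary = (
    large.groupby("category", as_index=False)["amount"]
         .mean()                                          # aggregate
)
summary.to_csv("summary.csv", index=False)                # write
```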
PySpark
Spark is an open‑source distributed computing platform; PySpark is its Python API, which reshaped the paradigm of big‑data processing.
It offers a unified compute engine with the following characteristics (a minimal usage sketch follows the list):
In‑memory processing: data stays in RAM for fast access.
Fault tolerance: built‑in mechanisms ensure reliable processing.
Scalability: horizontal scaling across clusters for large datasets.
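A minimal single‑node PySpark sketch of the same read/filter/aggregate/write pattern; file and column names are again illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session using all cores, matching a single-node benchmark setup.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Illustrative file and column names (not the benchmark's real schema).
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)  # read
summary = (
    df.filter(F.col("amount") > 100)                                    # filter
      .groupBy("category")
      .agg(F.avg("amount").alias("avg_amount"))                         # aggregate
)
summary.write.mode("overwrite").parquet("summary_parquet")              # write
spark.stop()
```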
Polars
Polars is a Python library built on Rust, combining Python’s ease of use with Rust’s speed and safety.
It uses the Apache Arrow columnar memory format as the foundation of its vectorized query engine, supports both eager and lazy execution, and its expression syntax reads much like SQL, making complex data transformations easy to express (see the sketch below).
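The difference between the two modes is easiest to see side by side. A minimal sketch, again with illustrative file and column names; lazy mode executes only when `.collect()` is called, which lets Polars optimize the whole query plan (e.g., predicate pushdown) first.

```python
import polars as pl

# Eager mode: each step runs immediately, like Pandas.
df = pl.read_csv("transactions.csv")  # illustrative file name
eager_summary = (
    df.filter(pl.col("amount") > 100)
      .group_by("category")           # named groupby in older Polars versions
      .agg(pl.col("amount").mean().alias("avg_amount"))
)

# Lazy mode: scan_csv defers reading; nothing runs until .collect(),
# so Polars can optimize the entire plan before execution.
lazy_summary = (
    pl.scan_csv("transactions.csv")
      .filter(pl.col("amount") > 100)
      .group_by("category")
      .agg(pl.col("amount").mean().alias("avg_amount"))
      .collect()
)
```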
Performance Test
Setup
GitHub repository: https://github.com/NachoCP/Pandas-Polars-PySpark-BenchMark
Four notebooks (one each for Pandas, PySpark, and Polars in eager and lazy mode) measure five tasks: read, filter, aggregate, join, and write, on five dataset sizes (50 k, 250 k, 1 M, 5 M, and 25 M rows) derived from a Kaggle financial dataset.
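The notebooks time each task individually; a minimal sketch of how such a harness might look (the actual notebooks in the repository may differ):

```python
import time

def time_task(label, fn):
    """Run fn once and report wall-clock seconds, as in the results table."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

# Hypothetical usage with Polars (file name is illustrative):
# import polars as pl
# df = time_task("read", lambda: pl.read_csv("transactions_25m.csv"))
# _  = time_task("filter", lambda: df.filter(pl.col("amount") > 100))
```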
Test machine:
macOS Sonoma
Apple M1 Pro
32 GB RAM
Results
| Rows | Pandas (s) | Polars eager (s) | Polars lazy (s) | PySpark (s) |
|------|-----------:|-----------------:|----------------:|------------:|
| 50 k | 0.368 | 0.132 | 0.078 | 1.216 |
| 250 k | 1.249 | 0.096 | 0.156 | 0.917 |
| 1 M | 4.899 | 0.302 | 0.300 | 1.850 |
| 5 M | 24.320 | 1.605 | 1.484 | 7.372 |
| 25 M | 187.383 | 13.001 | 11.662 | 44.724 |

Analysis
Pandas performs poorly as dataset size grows, though it is acceptable for small data.
PySpark is slower than Pandas on the smallest dataset, where its startup and scheduling overhead dominates, but from 250 k rows onward it shows a significant improvement over Pandas even on a single node.
Polars outperforms both: reading the figures off the table, at the two largest sizes it cuts runtime by roughly 93‑94 % relative to Pandas and 70‑80 % relative to PySpark, confirming its efficiency for large single‑node workloads (a quick check follows).
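These percentages follow directly from the table as 1 − (Polars time / baseline time); for example, at 25 M rows with Polars lazy:

```python
# Seconds taken from the results table, 25M-row dataset.
pandas_s, pyspark_s, polars_lazy_s = 187.383, 44.724, 11.662

vs_pandas = 1 - polars_lazy_s / pandas_s    # ~0.938, i.e. ~93.8% less runtime
vs_pyspark = 1 - polars_lazy_s / pyspark_s  # ~0.739, i.e. ~73.9% less runtime
print(f"vs Pandas: {vs_pandas:.1%}, vs PySpark: {vs_pyspark:.1%}")
```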
Conclusion
The benchmark clearly shows the performance and scalability differences of three widely used data‑processing tools (with Polars measured in both eager and lazy modes) across varying dataset sizes.
Key takeaways:
Pandas is great for small datasets but does not scale well for large volumes.
Polars (both eager and lazy) consistently delivers superior performance, making it a strong choice for large datasets, though it may not yet be production‑ready.
Tool selection should match project requirements: Polars for medium‑to‑large workloads that still fit on a single machine, PySpark for distributed large‑scale processing, and Pandas for quick prototyping on small data.
As data volumes continue to grow, choosing the right tool becomes increasingly critical.