Polars vs Pandas: Is Switching Worth It for Ten‑Million‑Row Datasets?
The article shows that Polars, a query-compiling DataFrame library, can accelerate ten-million-row GroupBy workloads by 6-10× compared with Pandas. It explains the underlying optimizer, the Arrow columnar engine, and Rust parallelism; provides a 20-item syntax map, three real migration scenarios, streaming for out-of-memory data, and AI-pipeline use cases; and closes with a step-by-step migration guide.
Performance Gap
On a 24 GB-RAM MacBook Pro, a 10 M-row groupby runs in 13 s using 3 GB with Polars, versus 87 s and 8 GB with Pandas, a 6.7× speedup. The bottleneck is how Pandas executes the work (eagerly, statement by statement, on a single core), not Python itself.
At this scale you run into the physical limit of that execution model.
Pandas 2.0 adds a PyArrow backend that reduces memory usage but retains eager, single‑core execution. Each statement triggers a full‑table scan, e.g.:
# 3 lines = 3 independent full scans, no optimisation
df_filtered = df[df['amount'] > 100] # scan 1: filter
df_grouped = df_filtered.groupby('cat') # scan 2: group
result = df_grouped['amount'].mean() # scan 3: aggregate
For 5 M rows this means ~15 M Python object operations (3 passes × 5 M rows), which is inherently slow.
Polars Query Optimizer
Polars builds a logical plan and lets the optimizer decide the execution order. It performs three key transformations:
Predicate Push-Down: Filters are moved to the Parquet scan, allowing the file reader to skip whole row groups.
Projection Push-Down: Only the columns referenced in the query are read.
Common Sub-Plan Elimination (CSE): Identical sub-plans are computed once and reused via a CACHE node; repeated sub-expressions are likewise evaluated only once.
Running lf.explain() on a lazy query prints the optimized plan, e.g.:
AGGREGATE
[col("amount").mean()] BY [col("category")]
FILTER col("amount") > 100 ← pushed down
SCAN data.parquet
PROJECT 2/52 columns ← only needed columns read
Because Pandas is eager, it requires manual optimisation (e.g., pre-filtering); Polars does it automatically, adding an extra 3-5× boost in lazy mode.
Columnar Engine and SIMD
Polars relies on Apache Arrow’s column‑oriented layout and Rust SIMD vectorisation. Arrow stores each column contiguously, so a single cache line brings many values (e.g., amount[0] … amount[7]) into L1 cache at once. SIMD lets a single CPU instruction operate on multiple values simultaneously; Polars’ Rust core exploits SIMD for arithmetic, comparisons, and string matching, which Pandas cannot do at this level.
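The layout argument can be illustrated with a rough sketch that uses a NumPy array as a stand-in for an Arrow column buffer (an assumption for illustration; Polars manages its own Arrow memory). It contrasts a list of row dictionaries with one contiguous column and computes how many 8-byte values fit in a cache line, assuming the common 64-byte line size.

```python
import numpy as np

NUM_ROWS = 1_000

# Row-oriented: every record is a separate Python object, so reading one
# field means chasing a pointer per row.
rows = [{"amount": i, "category": "A", "note": "x" * 8} for i in range(NUM_ROWS)]
row_total = sum(record["amount"] for record in rows)

# Column-oriented: all "amount" values sit next to each other in memory.
amount_column = np.arange(NUM_ROWS, dtype=np.int64)
column_total = int(amount_column.sum())

# The column is one contiguous block of 8-byte integers.
is_contiguous = bool(amount_column.flags["C_CONTIGUOUS"])

# Assuming 64-byte cache lines, one line holds eight int64 values,
# i.e. amount[0] through amount[7].
CACHE_LINE_BYTES = 64
values_per_cache_line = CACHE_LINE_BYTES // amount_column.itemsize

print(row_total, column_total, is_contiguous, values_per_cache_line)
```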
Parallel Execution
Polars’ expression engine automatically parallelises independent expressions across all CPU cores. Example:
df.with_columns([
    (pl.col("revenue") - pl.col("cost")).alias("profit"),
    # "profit" does not exist yet inside this call, so margin is derived from the source columns
    ((pl.col("revenue") - pl.col("cost")) / pl.col("revenue")).alias("margin"),
    (pl.col("amount") * 1.13).alias("amount_with_tax")
])
In Pandas the three operations would be written as three separate statements, each waiting for the previous one to finish.
Core Syntax Mapping (Pandas → Polars)
Read CSV: pd.read_csv("f.csv") → pl.read_csv("f.csv")
Read Parquet (lazy): pd.read_parquet("f.pq") → pl.scan_parquet("f.pq")
Select columns: df[["a","b"]] → df.select("a","b")
Filter: df[df.x > 5] → df.filter(pl.col("x") > 5)
Add column: df["new"] = df.x + 1 → df.with_columns((pl.col("x") + 1).alias("new"))
GroupBy + mean: df.groupby("g").y.mean() → df.group_by("g").agg(pl.col("y").mean())
Rename: df.rename(columns={"o":"n"}) → df.rename({"o":"n"})
Sort: df.sort_values("col") → df.sort("col")
Join: pd.merge(a, b, on="k") → a.join(b, on="k")
Fill NA: df["col"].fillna(0) → df.with_columns(pl.col("col").fill_null(0))
Drop duplicates: df["col"].unique() → df.select("col").unique()
String contains: df.col.str.contains("x") → pl.col("col").str.contains("x")
To datetime: pd.to_datetime(df.col) → pl.col("col").str.to_datetime()
Rolling mean: df.col.rolling(3).mean() → pl.col("col").rolling_mean(window_size=3)
Cumulative sum: df.col.cumsum() → pl.col("col").cum_sum()
Write Parquet: df.to_parquet("o.pq") → df.write_parquet("o.pq")
Drop rows with NA: df.dropna(subset=["c"]) → df.drop_nulls("c")
Pivot: df.pivot_table(...) → df.pivot(on="c", index="i", values="v") (older Polars releases name the first argument columns="c")
Head N rows: df.head(10) → df.head(10)
Pandas ↔ Polars conversion: pl.from_pandas(pd_df) / pl_df.to_pandas()
Migration Scenarios
Scenario 1 – CSV → Aggregation ETL
# Pandas (3 full scans)
import pandas as pd
df = pd.read_csv("sales_2026.csv")
df = df[df['date'] >= '2026-01-01']
monthly = df.groupby('month')['revenue'].sum()
# Polars (single scan, optimizer rewrites)
import polars as pl
monthly = (
pl.scan_csv("sales_2026.csv")
.filter(pl.col("date") >= "2026-01-01")
.group_by("month")
.agg(pl.col("revenue").sum())
.collect()
)
Scenario 2 – Multi-table Join
# Polars lazy join with pre‑filter
result = (
orders.lazy()
.filter(pl.col("status") == "completed")
.join(users.lazy().select("user_id", "tier"), on="user_id")
.group_by("tier")
.agg(pl.col("amount").sum())
.collect()
)
Scenario 3 – Heavy Lifting + Final Visualisation
# Heavy aggregation in Polars
import polars as pl
import pandas as pd
df_heavy = (
pl.scan_parquet("huge_transactions/*.parquet")
.filter(pl.col("amount") > 1000)
.group_by(["user_id", "date"])
.agg([
pl.col("amount").sum().alias("total_spent"),
pl.col("order_id").n_unique().alias("order_count")
])
.collect()
)
# Convert to Pandas for plotting
pdf = df_heavy.to_pandas()
pdf.plot(kind="bar")
Streaming for Out-of-Memory Data
When the dataset exceeds RAM (e.g., 50 GB on a 24 GB machine), Polars can process it with .collect(streaming=True); recent releases spell the same option .collect(engine="streaming") and deprecate the older flag. The query plan is executed in micro-batches, keeping memory usage low without a distributed system.
result = (
pl.scan_parquet("giant_table/*.parquet")
.filter(pl.col("date").is_between("2025-01-01", "2026-05-01"))
.group_by(["category", "month"])
.agg([
pl.col("amount").sum(),
pl.col("quantity").mean()
])
.sort("amount", descending=True)
.collect(streaming=True)
)
Global sort, pivot, and rolling_mean are not yet streaming-compatible; support is being added in successive releases.
Polars in AI Data Pipelines
Polars serves as a fast pre‑processing layer for large‑scale LLM logs, RAG document cleaning, and embedding post‑processing, leveraging Arrow‑native format that vector databases such as LanceDB can consume without copying.
# LLM log analysis – filter slow requests and aggregate
logs = (
pl.scan_parquet("llm_api_logs/2026/*.parquet")
.filter(
pl.col("model").is_in(["claude-sonnet-4-6", "gpt-4o"]),
pl.col("latency_ms") > 5000
)
.group_by([
"model",
pl.col("timestamp").dt.hour().alias("hour")
])
.agg([
pl.col("latency_ms").mean().alias("avg_latency"),
pl.col("tokens_total").sum().alias("total_tokens"),
pl.len().alias("request_count")
])
.sort(["model", "hour"])
.collect()
)
print(logs)
# RAG document preprocessing – filter length, compute stats
docs = (
pl.scan_parquet("knowledge_base/*.parquet")
.with_columns([
pl.col("content").str.len_chars().alias("char_count"),
pl.col("content").str.split(" ").list.len().alias("word_count")
])
.filter(
pl.col("char_count") > 200,
pl.col("char_count") < 8000
)
.with_columns((pl.col("char_count") / 1000).alias("char_count_k"))
.collect()
)
print(f"Retained {docs.height} valid chunks, avg length {docs['char_count'].mean():.0f} chars")
Benchmark Script
The following script reproduces a group‑by benchmark and prints the speed‑up factor.
"""Polars vs Pandas performance comparison script"""
import time, numpy as np, pandas as pd, polars as pl
NUM_ROWS = 5_000_000 # adjust to your machine
data = {
"category": np.random.choice(["A","B","C","D","E"], NUM_ROWS),
"value_1": np.random.randn(NUM_ROWS),
"value_2": np.random.randn(NUM_ROWS),
"amount": np.random.randint(1, 1000, NUM_ROWS),
"date": np.random.choice(pd.date_range("2025-01-01", "2026-05-20"), NUM_ROWS)
}
pdf = pd.DataFrame(data)
# Pandas benchmark
t0 = time.perf_counter()
agg_pd = (
pdf.groupby("category")
.agg(mean_val=("value_1", "mean"),
sum_amt=("amount", "sum"),
count=("value_2", "count"))
.sort_values("sum_amt", ascending=False)
)
pandas_time = time.perf_counter() - t0
print("Pandas GroupBy + Agg:", pandas_time, "s")
# Polars benchmark (fair comparison on same data; conversion is excluded from timing)
pl_df = pl.from_pandas(pdf)
t0 = time.perf_counter()
agg_pl = (
    pl_df.group_by("category")
    .agg([
        pl.col("value_1").mean().alias("mean_val"),
        pl.col("amount").sum().alias("sum_amt"),
        pl.col("value_2").count().alias("count")
    ])
    .sort("sum_amt", descending=True)
)
polars_time = time.perf_counter() - t0
print("Polars GroupBy + Agg:", polars_time, "s")
# Speed-up is the ratio of the two elapsed times, not of a timestamp to an elapsed time
speedup = pandas_time / polars_time
print("⚡ Polars speed-up:", round(speedup, 1), "x")
Practical Migration Roadmap
Today – replace the largest pd.read_csv() with pl.scan_csv() and finish with .collect().to_pandas(). I/O becomes ~3× faster without changing downstream logic.
This week – identify the slowest groupby or join and rewrite it using Polars expressions from the syntax map.
Next month – refactor the entire ETL into a single lazy pipeline ending with .collect(), often yielding >10× end‑to‑end speedup.
