Big Data · 15 min read

What’s New in pandas 2.0: Arrow Backend, Copy‑On‑Write, and Performance Improvements

The article reviews pandas 2.0's major upgrades, including an Apache Arrow backend that speeds up CSV reads by over 30×, new Arrow dtypes, a numpy-nullable dtype backend for missing values, a copy-on-write memory model, and optional dependencies for installation, plus benchmark comparisons with ydata-profiling, highlighting the library's enhanced performance, flexibility, and interoperability for data-intensive Python workflows.

Python Programming Learning Circle

In April 2023 the pandas team released version 2.0.0, a milestone that sparked a wave of discussion in the data‑science community because of its extensive new features and performance enhancements.

The most visible change is the integration of an Apache Arrow backend for data storage, which dramatically speeds up I/O operations and reduces memory usage. A benchmark reading a 650 MB Hacker News CSV file takes about 12 seconds with the default engine, while the Arrow engine (engine='pyarrow' with dtype_backend='pyarrow') completes in roughly 0.33 seconds, a more than 35× improvement.

<code>%timeit df = pd.read_csv("data/hn.csv")
# 12 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow', dtype_backend='pyarrow')
# 329 ms ± 65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)</code>

Beyond faster reads, Arrow introduces columnar dtypes that are shared across languages. String operations become markedly quicker; for example, checking whether the "Author" column starts with "phy" drops from 851 ms to 27.9 ms when using the Arrow backend.

<code>%timeit df["Author"].str.startswith('phy')
# 851 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow["Author"].str.startswith('phy')
# 27.9 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)</code>

pandas 2.0 also adds a dtype_backend='numpy_nullable' option that preserves the original dtype (e.g., int64 becomes nullable Int64) while allowing true missing values (pd.NA), avoiding the automatic up-cast to float64 that older versions performed.

<code>df_null = pd.read_csv("data/hn.csv", dtype_backend='numpy_nullable')
points_null = df_null["Points"]
points_null.isna().sum()
# 0
points_null.iloc[0] = None
points_null.head()
# 0    <NA>
# 1      16
# 2       7
# 3       5
# 4       7
# Name: Points, dtype: Int64</code>
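The up-casting difference is easy to reproduce without the CSV file; a minimal sketch with invented values, contrasting the classic NumPy backend against the nullable Int64 dtype:

```python
import pandas as pd

# Classic NumPy backend: introducing a missing value up-casts int64 -> float64.
s_np = pd.Series([16, 7, 5], dtype="int64").reindex([0, 1, 2, 3])
print(s_np.dtype)  # float64

# Nullable backend: the dtype stays integer and the gap becomes pd.NA.
s_null = pd.Series([16, 7, 5], dtype="Int64").reindex([0, 1, 2, 3])
print(s_null.dtype)        # Int64
print(s_null.isna().sum()) # 1
```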

A lazy copy-on-write (CoW) mechanism has been introduced to defer copying of DataFrames and Series until a mutation occurs. When CoW is enabled, chained assignments can never reach the original object; pandas emits a ChainedAssignmentError warning instead of silently modifying (or silently failing to modify) the data, encouraging the use of .loc for safe updates.

<code># Enable CoW
pd.options.mode.copy_on_write = True
# Disable CoW (the default in pandas 2.0)
pd.options.mode.copy_on_write = False</code>
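The behavioral difference is easiest to see on a small DataFrame (values invented for illustration): under CoW, a chained assignment writes to a temporary copy and is discarded, while a single .loc operation updates the original in place.

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"Points": [16, 7, 5]})

# Chained assignment: df[...] returns a new object under CoW, so writing
# through it never reaches df (pandas warns with ChainedAssignmentError):
# df[df["Points"] > 6]["Points"] = 0   # has no effect on df

# Safe update: one indexing operation on df itself.
df.loc[df["Points"] > 6, "Points"] = 0
print(df["Points"].tolist())  # [0, 0, 5]
```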

Installation flexibility is improved through optional dependencies; users can install only the extras they need, for example:

<code>pip install "pandas[postgresql, aws, spss]>=2.0.0"</code>

Benchmarking with ydata‑profiling shows that reading data with the Arrow engine is faster, though the profiling step itself sees only marginal speed differences between pandas 1.5.3 and 2.0.2.

<code># Using pandas 1.5.3 and ydata‑profiling 4.2.0
%timeit df = pd.read_csv("data/hn.csv")
# 10.1 s ± 215 ms per loop
%timeit profile = ProfileReport(df, title="Pandas Profiling Report")
# 4.85 ms ± 77.9 µs per loop
%timeit profile.to_file("report.html")
# 18.5 ms ± 2.02 ms per loop

# Using pandas 2.0.2 and ydata‑profiling 4.3.1
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow')
# 3.27 s ± 38.1 ms per loop
%timeit profile_arrow = ProfileReport(df_arrow, title="Pandas Profiling Report")
# 5.24 ms ± 448 µs per loop
%timeit profile_arrow.to_file("report.html")
# 19 ms ± 1.87 ms per loop</code>

In summary, pandas 2.0 delivers substantial performance gains through the Arrow backend and copy‑on‑write, greater flexibility via optional dependencies and nullable dtypes, and improved interoperability with other Arrow‑compatible tools, making it a valuable upgrade for both novice and experienced data practitioners.

Tags: performance, Python, data analysis, copy-on-write, pandas, Apache Arrow
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
