Introduction to cuDF: GPU‑Accelerated DataFrames and Dask Integration
This article introduces cuDF, a Python GPU DataFrame library with a pandas‑like API, compares it to pandas, explains when to use cuDF versus Dask‑cuDF for single‑GPU or multi‑GPU workloads, and provides practical code examples for common data operations.
cuDF is a Python GPU DataFrame library built on the Apache Arrow columnar memory format, offering a pandas‑like API for loading, joining, aggregating, filtering, and other data operations.
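As a rough illustration of what "pandas-like" means in practice, the sketch below runs the same filter and aggregation code against whichever library is importable. The alias `xdf` and the fallback to pandas are assumptions made here so the snippet also runs on a machine without a GPU; they are not part of cuDF itself.

```python
# Use cuDF when it can be imported, otherwise fall back to pandas.
# The broad except is deliberate: importing cudf can also fail when
# no usable CUDA driver is present, not only when it is not installed.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

df = xdf.DataFrame({"key": [0, 1, 0, 1], "val": [10, 20, 30, 40]})

# Filtering and aggregating use the same calls in both libraries.
filtered = df[df["val"] > 10]
totals = filtered.groupby("key")["val"].sum().sort_index()

# cuDF objects live in GPU memory; to_pandas() copies them to the host.
if hasattr(totals, "to_pandas"):
    totals = totals.to_pandas()

print(totals.to_dict())
```

The only backend-specific step is the final copy back to host memory, which is needed on the cuDF path before the result can be used as an ordinary pandas object.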
It integrates with Dask via Dask‑cuDF, allowing Dask DataFrame partitions to be processed on GPUs, enabling scalable multi‑GPU workloads.
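A minimal sketch of that integration, assuming `cudf` and `dask_cudf` are installed and a GPU is available: a cuDF DataFrame is split into partitions with `dask_cudf.from_cudf`, and a reduction is evaluated lazily with `.compute()`. The guard flag `HAVE_GPU_STACK` is a name introduced here for illustration so the snippet degrades gracefully elsewhere.

```python
# Guarded import: skip the GPU work when the stack is unavailable.
try:
    import cudf
    import dask_cudf
    HAVE_GPU_STACK = True
except Exception:
    HAVE_GPU_STACK = False

total = None
if HAVE_GPU_STACK:
    df = cudf.DataFrame(
        {"a": list(range(20)), "b": list(reversed(range(20)))}
    )
    # Split the single-GPU DataFrame into two Dask partitions.
    ddf = dask_cudf.from_cudf(df, npartitions=2)
    # Operations are lazy until .compute() is called.
    total = int(ddf["a"].sum().compute())
    print(ddf.npartitions, total)
else:
    print("cudf/dask_cudf not available; skipping the GPU example")
```

On a real multi-GPU machine the partitions would typically be spread across workers by a Dask cluster; this sketch leaves the scheduler at its default.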
Key differences between cuDF and pandas include supported operations, data types, handling of missing values, iteration restrictions, result ordering, floating‑point determinism, column name uniqueness, lack of a generic object dtype, and limitations of the .apply() function.
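Two of these differences, missing-value handling and iteration, can be sketched as follows. The pandas half runs anywhere; the cuDF half is guarded and only runs when cuDF can be imported. The behaviour described in the comments for cuDF reflects its documented design and should be checked against the installed version.

```python
import pandas as pd

# pandas: a None in an integer list forces the column to float64 with NaN.
pd_series = pd.Series([1, 2, None])
print(pd_series.dtype)

try:
    import cudf
except Exception:
    cudf = None  # no GPU stack; skip the comparison

if cudf is not None:
    # cuDF: nulls are supported for every dtype, so the column is
    # expected to keep an integer dtype with a null entry.
    gpu_series = cudf.Series([1, 2, None])
    print(gpu_series.dtype)

    # Iterating over a cuDF Series is expected to raise TypeError;
    # copy the data to the host explicitly instead.
    try:
        for _ in gpu_series:
            pass
    except TypeError as exc:
        print("iteration not supported:", exc)
    print(gpu_series.to_arrow().to_pylist())
```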
Use cuDF when the dataset fits within a single GPU’s memory; use Dask‑cuDF for workloads that exceed a single GPU’s capacity or require distributed processing across multiple GPUs.
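The rule of thumb above can be written as a small helper. This is only a heuristic sketch: the function name `choose_backend` and the `headroom` factor are hypothetical, introduced here on the assumption that joins, sorts, and other intermediate results need memory beyond the raw dataset size.

```python
def choose_backend(dataset_bytes: int, gpu_memory_bytes: int,
                   headroom: float = 0.5) -> str:
    """Suggest "cudf" or "dask_cudf" from a rough memory estimate.

    headroom is the fraction of GPU memory the dataset may occupy;
    0.5 is an illustrative default, not an official recommendation.
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in the interval (0, 1]")
    budget = gpu_memory_bytes * headroom
    return "cudf" if dataset_bytes <= budget else "dask_cudf"


GIB = 1024 ** 3
small = choose_backend(4 * GIB, 16 * GIB)   # fits within the budget
large = choose_backend(40 * GIB, 16 * GIB)  # exceeds a single GPU
print(small, large)
```

An in-memory size estimate for an existing DataFrame can be obtained with `df.memory_usage(deep=True).sum()`, which both pandas and cuDF provide.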
The example code below creates a cuDF Series and DataFrame, converts a pandas DataFrame to cuDF, and then views the first rows, sorts by a column, selects columns, slices rows by label and by position, filters with a Boolean mask, and runs a groupby aggregation.
import pandas as pd
import cudf
# Creating a cudf.Series
s = cudf.Series([1, 2, 3, None, 4])
# Creating a cudf.DataFrame
df = cudf.DataFrame(
{
"a": list(range(20)),
"b": list(reversed(range(20))),
"c": list(range(20)),
}
)
# Converting a pandas DataFrame to a cudf.DataFrame
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
gdf
# Viewing the top rows of a GPU dataframe.
df.head(2)
# Sorting by values.
df.sort_values(by="b")
# Selecting a single column
df["a"]
# Selecting rows with index labels 2 through 5 (inclusive) from columns 'a' and 'b'.
df.loc[2:5, ["a", "b"]]
# Selecting via integers and integer slices, like numpy/pandas.
df.iloc[0:3, 0:2]
# Selecting rows in a DataFrame or Series by direct Boolean indexing.
df[df.b > 15]
# Adding a grouping key, then applying a different aggregation to each column.
df["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df.groupby("agg_col1").agg({"a": "max", "b": "mean", "c": "sum"})