Introduction to cuDF: GPU‑Accelerated DataFrames and Dask Integration

This article introduces cuDF, a Python GPU DataFrame library with a pandas‑like API, compares it to pandas, explains when to use cuDF versus Dask‑cuDF for single‑GPU or multi‑GPU workloads, and provides practical code examples for common data operations.

Python Programming Learning Circle

Aug 9, 2024

cuDF is a Python GPU DataFrame library built on the Apache Arrow columnar memory format, offering a pandas‑like API for loading, joining, aggregating, filtering, and other data operations.

cuDF integrates with Dask through Dask‑cuDF, which processes each Dask DataFrame partition as a cuDF DataFrame on a GPU so that a workload can scale across multiple GPUs.

Key differences between cuDF and pandas include supported operations, data types, handling of missing values, iteration restrictions, result ordering, floating‑point determinism, column name uniqueness, lack of a generic object dtype, and limitations of the .apply() function.

Use cuDF when the dataset fits within a single GPU’s memory; use Dask‑cuDF for workloads that exceed a single GPU’s capacity or require distributed processing across multiple GPUs.
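That rule of thumb can be written down as a small helper. The function choose_backend and its headroom parameter are hypothetical and not part of cuDF or Dask‑cuDF. The 0.5 default is an assumption reflecting that joins, sorts, and other operations need working memory beyond the data itself, so it should be tuned to the actual workload.

```python
def choose_backend(dataset_bytes, gpu_memory_bytes, headroom=0.5):
    """Suggest "cudf" or "dask_cudf" from the dataset size and one GPU's memory.

    headroom is the fraction of GPU memory the dataset may occupy; the rest
    is left for intermediate results. The default is an assumed value.
    """
    if dataset_bytes < 0 or gpu_memory_bytes <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in the interval (0, 1]")

    budget = gpu_memory_bytes * headroom
    return "cudf" if dataset_bytes <= budget else "dask_cudf"


GIB = 2**30
print(choose_backend(4 * GIB, 16 * GIB))   # dataset fits comfortably on one GPU
print(choose_backend(12 * GIB, 16 * GIB))  # dataset exceeds the assumed budget
```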

The following example creates a cuDF Series and DataFrame, converts a pandas DataFrame to cuDF, and then views the first rows, sorts by a column, selects columns, slices rows, filters with a Boolean mask, and aggregates with groupby.

import pandas as pd
import cudf

# Creating a cudf.Series
s = cudf.Series([1, 2, 3, None, 4])

# Creating a cudf.DataFrame
df = cudf.DataFrame(
    {
        "a": list(range(20)),
        "b": list(reversed(range(20))),
        "c": list(range(20)),
    }
)

# Converting a pandas DataFrame to a cudf.DataFrame.
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
gdf

# Viewing the top rows of a GPU dataframe.
df.head(2)

# Sorting by values.
df.sort_values(by="b")

# Selecting a single column
df["a"]

# Selecting rows with index labels 2 through 5 (inclusive) from columns "a" and "b".
df.loc[2:5, ["a", "b"]]

# Selecting via integers and integer slices, like numpy/pandas.
df.iloc[0:3, 0:2]

# Selecting rows in a DataFrame or Series by direct Boolean indexing.
df[df.b > 15]

# Adding a key column, then grouping by it and applying a different aggregation to each column.
df["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df.groupby("agg_col1").agg({"a": "max", "b": "mean", "c": "sum"})
