CuPy vs NumPy: Achieving Over 10× Speedup with GPU Acceleration
This article explains how replacing NumPy with the GPU-compatible CuPy library can dramatically accelerate array computations. It walks through installation prerequisites, demonstrates benchmark scripts that show more than ten-fold speedups, and discusses how data types and hybrid CPU-GPU workflows affect performance in large-scale data processing.
NumPy
NumPy is a C‑based Python library that provides fast numerical operations for multi‑dimensional arrays and matrices, together with a rich set of mathematical functions.
CuPy
CuPy is an open‑source library developed by Preferred Networks that offers a NumPy‑compatible API but executes computations on NVIDIA GPUs via CUDA. It is intended as a drop‑in replacement for NumPy, enabling parallel GPU execution with minimal code changes for speed‑critical scientific, data‑analysis, machine‑learning, deep‑learning, and image‑processing tasks.
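As a rough illustration of what "drop-in replacement" means in practice (a minimal sketch added here, not taken from the original article), the same NumPy-style expression can run on the GPU simply by switching the module prefix:
import numpy as np
import cupy as cp

# The same call works with either module; only the namespace changes.
x_cpu = np.arange(10_000_000, dtype=np.float32)
x_gpu = cp.arange(10_000_000, dtype=cp.float32)

print(np.sqrt(x_cpu).sum())   # computed on the CPU
print(cp.sqrt(x_gpu).sum())   # computed on the GPU; result is a 0-d CuPy array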
Prerequisites
NVIDIA GPU (verify with nvidia-smi)
GPU compute capability ≥ 3.0 (verify with nvidia-smi --query-gpu=compute_cap --format=csv)
On Windows, install WSL with Ubuntu Linux (wsl --install)
Install the latest NVIDIA driver for the OS
Install Miniconda inside WSL
Install the appropriate CUDA Toolkit
CuPy installation
# Create test environment
conda create -n cupy_test python=3.11 -y
# Activate it
conda activate cupy_test
# Install libraries
conda install -c conda-forge cupy jupyter numpy pandas matplotlib -y
# Launch Jupyter Notebook
jupyter notebook
Running NumPy and CuPy repeatedly and recording the best times gives a reasonably fair comparison, although the first CuPy call incurs a small one-time overhead.
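One way to account for that first-call overhead (a hedged sketch added here, not part of the original benchmark scripts) is to do a warm-up call, run each measurement several times, and keep the best result:
import cupy as cp
from timeit import default_timer as timer

def best_time(fn, repeats=5):
    # Run fn several times and keep the best wall-clock time.
    times = []
    for _ in range(repeats):
        start = timer()
        fn()
        cp.cuda.Stream.null.synchronize()  # wait for pending GPU work before stopping the clock
        times.append(timer() - start)
    return min(times)

a = cp.ones(10_000_000, dtype=cp.float32)
_ = (a + 1).sum()                          # warm-up: triggers CUDA initialization and kernel compilation
print("best of 5:", best_time(lambda: a + 1))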
Example 1: Simple array arithmetic
This example creates large one-dimensional arrays and adds a constant to each element using both NumPy (CPU) and CuPy (GPU), first with an explicit Python loop and then with a vectorized operation. Looping element by element over a CuPy array is very slow, because every indexed update triggers a separate GPU operation and host-device synchronization, so the GPU loop is run on an array 1,000 times smaller (300,000 elements instead of 300 million). The vectorized operations run on the full 300-million-element arrays on both devices, and there the GPU is roughly twice as fast as the CPU.
import numpy as np
import cupy as cp
from timeit import default_timer as timer

def func1(a):            # CPU: element-by-element loop
    for i in range(len(a)):
        a[i] += 1

def func2(a):            # GPU: element-by-element loop (one GPU operation per element)
    for i in range(len(a)):
        a[i] += 2

def func3(a):            # CPU: vectorized add
    a += 3

def func4(a):            # GPU: vectorized add (single kernel launch)
    a += 4

if __name__ == "__main__":
    n1 = 300000000
    a1 = np.ones(n1, dtype=np.float64)
    n2 = 300000                          # deliberately 1,000x smaller for the GPU loop
    a2 = cp.ones(n2, dtype=cp.float64)
    n3 = 300000000
    a3 = np.ones(n3, dtype=np.float64)
    n4 = 300000000
    a4 = cp.ones(n4, dtype=cp.float64)

    start = timer(); func1(a1); print("without GPU/for loop:", timer()-start)
    start = timer(); func2(a2); cp.cuda.Stream.null.synchronize(); print("with GPU:/for loop", timer()-start)
    start = timer(); func3(a3); print("without GPU:vectorization", timer()-start)
    start = timer(); func4(a4); cp.cuda.Stream.null.synchronize(); print("with GPU:vectorization", timer()-start)
Output:
without GPU/for loop: 25.486853414004145
with GPU:/for loop 4.358431388995086
without GPU:vectorization 0.13804959499998404
with GPU:vectorization 0.07079174599994076
Even though the GPU loop operates on an array 1,000 times smaller, it still takes about 4.4 s versus 25.5 s for the CPU loop over the full array, so element-by-element looping on the GPU is dramatically slower per element. The vectorized GPU operation, by contrast, finishes in roughly half the time of the CPU vectorization.
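CuPy also ships a helper for exactly this kind of measurement. The following sketch is an addition (not part of the original script) and assumes a reasonably recent CuPy release in which cupyx.profiler.benchmark is available; it times the vectorized GPU add with warm-up, repetition, and proper device synchronization handled automatically:
import cupy as cp
from cupyx.profiler import benchmark

a = cp.ones(300_000_000, dtype=cp.float64)

def vector_add(x):
    x += 4

# benchmark() warms up, repeats the call, and reports CPU and GPU times separately.
print(benchmark(vector_add, (a,), n_repeat=10))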
Example 2: Large matrix multiplication
Both NumPy and CuPy perform a 10 000 × 10 000 matrix multiplication. With the default float64 dtype, CuPy is actually slower here, because most consumer GPUs have far lower 64-bit floating-point throughput than 32-bit. Converting the arrays to float32 yields a dramatic speedup.
# NumPy version (float64)
import numpy as np
from timeit import default_timer as timer

np.random.seed(0)
A = np.random.uniform(1.0, 100.0, size=(10000, 10000))
B = np.random.uniform(1.0, 100.0, size=(10000, 10000))
start = timer(); C = np.matmul(A, B); print("without GPU:", timer()-start)

# CuPy version (float64)
import cupy as cp
from timeit import default_timer as timer

A = cp.random.uniform(1.0, 100.0, size=(10000, 10000))
B = cp.random.uniform(1.0, 100.0, size=(10000, 10000))
start = timer(); C = cp.matmul(A, B); cp.cuda.Stream.null.synchronize(); print("with GPU:", timer()-start)
Results (float64): NumPy ≈ 3.21 s, CuPy ≈ 3.99 s (CuPy slower).
After converting to float32:
# NumPy float32
A = np.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(np.float32)
B = np.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(np.float32)
start = timer(); C = np.matmul(A, B); print("without GPU:", timer()-start)

# CuPy float32
A = cp.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(cp.float32)
B = cp.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(cp.float32)
start = timer(); C = cp.matmul(A, B); cp.cuda.Stream.null.synchronize(); print("with GPU:", timer()-start)
Results (float32): NumPy ≈ 1.83 s, CuPy ≈ 0.14 s, more than a ten-fold speedup.
The author notes that if a GPU only supports 32‑bit registers, using it for 64‑bit data may provide little advantage, and future GPUs may adopt 64‑bit registers.
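To see what a given card actually offers, its properties can be queried from CuPy itself. This is a small added sketch (not from the original article) that assumes device 0 is the GPU of interest; it reports the same compute capability checked in the prerequisites:
import cupy as cp

dev = cp.cuda.Device(0)
props = cp.cuda.runtime.getDeviceProperties(0)

print("GPU:", props["name"].decode())                    # device name
print("Compute capability:", dev.compute_capability)     # e.g. '75' means 7.5
print("Total memory (GiB):", props["totalGlobalMem"] / 1024**3)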
Example 3: Hybrid CPU‑GPU workflow for CSV data
The task is to read a large CSV (~12 million rows), compute per‑date statistics, and plot them. The CPU version uses Pandas, NumPy and Matplotlib. The GPU version replaces the heavy numeric arrays with CuPy while still using NumPy for date handling and Matplotlib for plotting, requiring a copy back to CPU before visualization.
# CPU version (simplified)
import pandas as pd, numpy as np, matplotlib.pyplot as plt, datetime
from timeit import default_timer as timer
df = pd.read_csv('/mnt/d/test/D202.csv')
start = timer()
# date conversion, aggregation, etc.
... (code omitted) ...
plt.show()
print("Finished with CPU at", timer()-start) # GPU version (simplified)
import pandas as pd, numpy as np, cupy as cp, matplotlib.pyplot as plt, datetime
from timeit import default_timer as timer
df = pd.read_csv('/mnt/d/test/D202.csv')
start = timer()
usage = cp.array(df['USAGE'].values)
# date handling stays in NumPy
... (aggregation loop using cp.max, cp.min, cp.mean) ...
# copy results back to CPU for plotting
max_usage = cp.asnumpy(max_usage)
min_usage = cp.asnumpy(min_usage)
mean_usage = cp.asnumpy(mean_usage)
... (Matplotlib plotting) ...
print("Finished with GPU at", timer()-start)Performance: the GPU implementation is about 70 % faster than the CPU one, showing that the extra data‑copy steps are worthwhile for this workload.
Conclusion
CuPy provides a convenient, NumPy‑compatible path to leverage GPU acceleration. Significant speedups are observed for parallelizable operations, especially when using 32‑bit data types and avoiding frequent CPU‑GPU data transfers. Benefits vary with workload size, datatype, and GPU memory architecture, so thorough testing is essential before committing to a GPU‑based implementation.
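When evaluating whether a GPU pays off for a given workload, it can help to keep a single numeric code path that runs on either backend. The sketch below is an addition (not part of the summarized article) and uses cupy.get_array_module to dispatch between NumPy and CuPy:
import numpy as np
import cupy as cp

def normalize(x):
    # Works on either a NumPy or a CuPy array.
    xp = cp.get_array_module(x)   # returns the numpy or cupy module, depending on x
    return (x - xp.mean(x)) / xp.std(x)

cpu_out = normalize(np.random.rand(1_000_000))
gpu_out = cp.asnumpy(normalize(cp.random.rand(1_000_000)))
print(cpu_out.mean(), gpu_out.mean())   # both close to zero
The same function can then be benchmarked on both devices with the timing pattern shown earlier before committing to a GPU-based implementation.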