CuPy vs NumPy: Achieving Over 10× Speedup with GPU Acceleration
This article explains how replacing NumPy with the GPU-compatible CuPy library can dramatically accelerate array computations. It walks through installation prerequisites, demonstrates benchmark scripts that show more than ten-fold speedups, and discusses how data types and hybrid CPU-GPU workflows affect performance in large-scale data processing.
NumPy
NumPy is a C‑based Python library that provides fast numerical operations for multi‑dimensional arrays and matrices, together with a rich set of mathematical functions.
CuPy
CuPy is an open‑source library developed by Preferred Networks that offers a NumPy‑compatible API but executes computations on NVIDIA GPUs via CUDA. It is intended as a drop‑in replacement for NumPy, enabling parallel GPU execution with minimal code changes for speed‑critical scientific, data‑analysis, machine‑learning, deep‑learning, and image‑processing tasks.
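As a rough illustration of what "drop-in replacement" means in practice (a minimal sketch added here, not taken from the original article), the same NumPy-style expression can run on the GPU simply by switching the module prefix:
import numpy as np
import cupy as cp

# The same call works with either module; only the namespace changes.
x_cpu = np.arange(10_000_000, dtype=np.float32)
x_gpu = cp.arange(10_000_000, dtype=cp.float32)

print(np.sqrt(x_cpu).sum())   # computed on the CPU
print(cp.sqrt(x_gpu).sum())   # computed on the GPU; result is a 0-d CuPy array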
Prerequisites
NVIDIA GPU (verify with nvidia-smi)
GPU compute capability ≥ 3.0 (verify with nvidia-smi --query-gpu=compute_cap --format=csv)
On Windows, install WSL with Ubuntu Linux (wsl --install)
Install the latest NVIDIA driver for the OS
Install Miniconda inside WSL
Install the appropriate CUDA Toolkit
CuPy installation
# Create test environment
conda create -n cupy_test python=3.11 -y
# Activate it
conda activate cupy_test
# Install libraries
conda install -c conda-forge cupy jupyter numpy pandas matplotlib -y
# Launch Jupyter Notebook
jupyter notebook
Running NumPy and CuPy repeatedly and recording the best times gives a reasonably fair comparison, although the first CuPy call incurs a small one-time overhead.
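One way to account for that first-call overhead (a hedged sketch added here, not part of the original benchmark scripts) is to do a warm-up call, run each measurement several times, and keep the best result:
import cupy as cp
from timeit import default_timer as timer

def best_time(fn, repeats=5):
    # Run fn several times and keep the best wall-clock time.
    times = []
    for _ in range(repeats):
        start = timer()
        fn()
        cp.cuda.Stream.null.synchronize()  # wait for pending GPU work before stopping the clock
        times.append(timer() - start)
    return min(times)

a = cp.ones(10_000_000, dtype=cp.float32)
_ = (a + 1).sum()                          # warm-up: triggers CUDA initialization and kernel compilation
print("best of 5:", best_time(lambda: a + 1))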
Example 1: Simple array arithmetic
This example creates large one-dimensional arrays and adds a constant to each element using both NumPy (CPU) and CuPy (GPU), first with an explicit Python loop and then with a vectorized operation. Looping element by element over a CuPy array is very slow, because every indexed update triggers a separate GPU operation and host-device synchronization, so the GPU loop is run on an array 1,000 times smaller (300,000 elements instead of 300 million). The vectorized operations run on the full 300-million-element arrays on both devices, and there the GPU is roughly twice as fast as the CPU.
import numpy as np
import cupy as cp
from timeit import default_timer as timer

def func1(a):            # CPU: element-by-element loop
    for i in range(len(a)):
        a[i] += 1

def func2(a):            # GPU: element-by-element loop (one GPU operation per element)
    for i in range(len(a)):
        a[i] += 2

def func3(a):            # CPU: vectorized add
    a += 3

def func4(a):            # GPU: vectorized add (single kernel launch)
    a += 4

if __name__ == "__main__":
    n1 = 300000000
    a1 = np.ones(n1, dtype=np.float64)
    n2 = 300000                          # deliberately 1,000x smaller for the GPU loop
    a2 = cp.ones(n2, dtype=cp.float64)
    n3 = 300000000
    a3 = np.ones(n3, dtype=np.float64)
    n4 = 300000000
    a4 = cp.ones(n4, dtype=cp.float64)

    start = timer(); func1(a1); print("without GPU/for loop:", timer()-start)
    start = timer(); func2(a2); cp.cuda.Stream.null.synchronize(); print("with GPU:/for loop", timer()-start)
    start = timer(); func3(a3); print("without GPU:vectorization", timer()-start)
    start = timer(); func4(a4); cp.cuda.Stream.null.synchronize(); print("with GPU:vectorization", timer()-start)
Output:
without GPU/for loop: 25.486853414004145
with GPU:/for loop 4.358431388995086
without GPU:vectorization 0.13804959499998404
with GPU:vectorization 0.07079174599994076
Even though the GPU loop operates on an array 1,000 times smaller, it still takes about 4.4 s versus 25.5 s for the CPU loop over the full array, so element-by-element looping on the GPU is dramatically slower per element. The vectorized GPU operation, by contrast, finishes in roughly half the time of the CPU vectorization.
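CuPy also ships a helper for exactly this kind of measurement. The following sketch is an addition (not part of the original script) and assumes a reasonably recent CuPy release in which cupyx.profiler.benchmark is available; it times the vectorized GPU add with warm-up, repetition, and proper device synchronization handled automatically:
import cupy as cp
from cupyx.profiler import benchmark

a = cp.ones(300_000_000, dtype=cp.float64)

def vector_add(x):
    x += 4

# benchmark() warms up, repeats the call, and reports CPU and GPU times separately.
print(benchmark(vector_add, (a,), n_repeat=10))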
Example 2: Large matrix multiplication
Both NumPy and CuPy perform a 10 000 × 10 000 matrix multiplication. With the default float64 dtype, CuPy is actually slower here, because most consumer GPUs have far lower 64-bit floating-point throughput than 32-bit. Converting the arrays to float32 yields a dramatic speedup.
# NumPy version (float64)
import numpy as np
from timeit import default_timer as timer

np.random.seed(0)
A = np.random.uniform(1.0, 100.0, size=(10000, 10000))
B = np.random.uniform(1.0, 100.0, size=(10000, 10000))
start = timer(); C = np.matmul(A, B); print("without GPU:", timer()-start)

# CuPy version (float64)
import cupy as cp
from timeit import default_timer as timer

A = cp.random.uniform(1.0, 100.0, size=(10000, 10000))
B = cp.random.uniform(1.0, 100.0, size=(10000, 10000))
start = timer(); C = cp.matmul(A, B); cp.cuda.Stream.null.synchronize(); print("with GPU:", timer()-start)
Results (float64): NumPy ≈ 3.21 s, CuPy ≈ 3.99 s (CuPy slower).
After converting to float32:
# NumPy float32
A = np.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(np.float32)
B = np.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(np.float32)
start = timer(); C = np.matmul(A, B); print("without GPU:", timer()-start)

# CuPy float32
A = cp.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(cp.float32)
B = cp.random.uniform(1.0, 100.0, size=(10000, 10000)).astype(cp.float32)
start = timer(); C = cp.matmul(A, B); cp.cuda.Stream.null.synchronize(); print("with GPU:", timer()-start)
Results (float32): NumPy ≈ 1.83 s, CuPy ≈ 0.14 s, more than a ten-fold speedup.
The author notes that if a GPU only supports 32‑bit registers, using it for 64‑bit data may provide little advantage, and future GPUs may adopt 64‑bit registers.
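To see what a given card actually offers, its properties can be queried from CuPy itself. This is a small added sketch (not from the original article) that assumes device 0 is the GPU of interest; it reports the same compute capability checked in the prerequisites:
import cupy as cp

dev = cp.cuda.Device(0)
props = cp.cuda.runtime.getDeviceProperties(0)

print("GPU:", props["name"].decode())                    # device name
print("Compute capability:", dev.compute_capability)     # e.g. '75' means 7.5
print("Total memory (GiB):", props["totalGlobalMem"] / 1024**3)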
Example 3: Hybrid CPU‑GPU workflow for CSV data
The task is to read a large CSV (~12 million rows), compute per‑date statistics, and plot them. The CPU version uses Pandas, NumPy and Matplotlib. The GPU version replaces the heavy numeric arrays with CuPy while still using NumPy for date handling and Matplotlib for plotting, requiring a copy back to CPU before visualization.
# CPU version (simplified)
import pandas as pd, numpy as np, matplotlib.pyplot as plt, datetime
from timeit import default_timer as timer
df = pd.read_csv('/mnt/d/test/D202.csv')
start = timer()
# date conversion, aggregation, etc.
... (code omitted) ...
plt.show()
print("Finished with CPU at", timer()-start) # GPU version (simplified)
import pandas as pd, numpy as np, cupy as cp, matplotlib.pyplot as plt, datetime
from timeit import default_timer as timer
df = pd.read_csv('/mnt/d/test/D202.csv')
start = timer()
usage = cp.array(df['USAGE'].values)
# date handling stays in NumPy
... (aggregation loop using cp.max, cp.min, cp.mean) ...
# copy results back to CPU for plotting
max_usage = cp.asnumpy(max_usage)
min_usage = cp.asnumpy(min_usage)
mean_usage = cp.asnumpy(mean_usage)
... (Matplotlib plotting) ...
print("Finished with GPU at", timer()-start)Performance: the GPU implementation is about 70 % faster than the CPU one, showing that the extra data‑copy steps are worthwhile for this workload.
Conclusion
CuPy provides a convenient, NumPy‑compatible path to leverage GPU acceleration. Significant speedups are observed for parallelizable operations, especially when using 32‑bit data types and avoiding frequent CPU‑GPU data transfers. Benefits vary with workload size, datatype, and GPU memory architecture, so thorough testing is essential before committing to a GPU‑based implementation.
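When evaluating whether a GPU pays off for a given workload, it can help to keep a single numeric code path that runs on either backend. The sketch below is an addition (not part of the summarized article) and uses cupy.get_array_module to dispatch between NumPy and CuPy:
import numpy as np
import cupy as cp

def normalize(x):
    # Works on either a NumPy or a CuPy array.
    xp = cp.get_array_module(x)   # returns the numpy or cupy module, depending on x
    return (x - xp.mean(x)) / xp.std(x)

cpu_out = normalize(np.random.rand(1_000_000))
gpu_out = cp.asnumpy(normalize(cp.random.rand(1_000_000)))
print(cpu_out.mean(), gpu_out.mean())   # both close to zero
The same function can then be benchmarked on both devices with the timing pattern shown earlier before committing to a GPU-based implementation.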