How Mars Supercharges Numpy, Pandas, and Scikit‑Learn with Parallel and GPU Acceleration

This article explains how the Mars framework enables parallel and distributed execution of core Python data‑science libraries—Numpy, Pandas, and Scikit‑Learn—while integrating with RAPIDS for GPU acceleration, and demonstrates its performance advantages through code examples and benchmark results.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How Mars Supercharges Numpy, Pandas, and Scikit‑Learn with Parallel and GPU Acceleration

In the data‑science world, Python dominates with libraries such as Numpy, Pandas, and Scikit‑Learn. Mars, created by the MaxCompute team, allows these libraries to run in parallel and distributed mode and can be combined with the RAPIDS platform to leverage GPU acceleration.

Numpy

Numpy provides the fundamental multi‑dimensional array structure for numerical computing. A Monte Carlo method for estimating π illustrates its simplicity:

import numpy as np
N = 10 ** 7  # 10 million points
data = np.random.uniform(-1, 1, size=(N, 2))
inside = (np.sqrt((data ** 2).sum(axis=1)) < 1).sum()
pi = 4 * inside / N
print('pi: %.5f' % pi)

The example shows how a few lines of Numpy code can perform large‑scale numerical calculations efficiently.

Pandas

Pandas offers a powerful DataFrame API for data analysis. Using the Movielens 20M dataset, a simple group‑by aggregation demonstrates its capabilities:

import pandas as pd
ratings = pd.read_csv('ml-20m/ratings.csv')
ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})

scikit‑learn

scikit‑learn provides high‑level machine‑learning interfaces. The following K‑Nearest Neighbors example shows a typical workflow:

import pandas as pd
from sklearn.neighbors import NearestNeighbors
df = pd.read_csv('data.csv')
nn = NearestNeighbors(n_neighbors=10)
nn.fit(df)
neighbors = nn.kneighbors(df)

Mars: Parallel and Distributed Accelerator

Traditional data‑science libraries often cannot exploit multi‑core CPUs, GPUs, or distributed resources, and their imperative style can lead to high memory consumption. Mars aims to be simple (reuse existing Numpy/Pandas/scikit‑Learn code), avoid reinventing wheels, combine declarative and imperative paradigms, and be robust for production use.

Mars Tensor (Numpy accelerator)

import mars.tensor as mt
N = 10 ** 10
data = mt.random.uniform(-1, 1, size=(N, 2))
inside = (mt.sqrt((data ** 2).sum(axis=1)) < 1).sum()
pi = (4 * inside / N).execute()
print('pi: %.5f' % pi)

Running the same Monte Carlo test with 10 billion points reduces memory usage dramatically and finishes in about 3 minutes 44 seconds, far faster than a pure Numpy implementation.

Mars DataFrame (Pandas accelerator)

import mars.dataframe as md
ratings = md.read_csv('ml-20m/ratings.csv')
ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']}).execute()

Mars Learn (scikit‑learn accelerator)

import mars.dataframe as md
from mars.learn.neighbors import NearestNeighbors
df = md.read_csv('data.csv')
nn = NearestNeighbors(n_neighbors=10)
nn.fit(df)
neighbors = nn.kneighbors(df).fetch()

RAPIDS Integration

RAPIDS provides GPU‑accelerated equivalents: CuPy for Numpy, cuDF for Pandas, and cuML for scikit‑learn. Simple import replacement enables massive speedups.

import cupy as cp
N = 10 ** 7
data = cp.random.uniform(-1, 1, size=(N, 2))
inside = (cp.sqrt((data ** 2).sum(axis=1)) < 1).sum()
pi = 4 * inside / N
print('pi: %.5f' % pi)

On the same hardware, the GPU version reduces execution time from 757 ms to 36 ms (≈20× faster).

Performance Tests

Benchmarks were run on a 500 GB, 96‑core machine equipped with an NVIDIA V100 GPU. Group‑by and join workloads of 500 MB, 5 GB, and 20 GB were compared between Pandas, DASK, and Mars (both CPU and GPU). Mars scales almost linearly across multiple nodes and outperforms Pandas, especially when data size exceeds memory limits; GPU‑enabled Mars further reduces runtime.

Conclusion

RAPIDS brings Python data‑science to the GPU, while Mars adds parallel and distributed execution with low memory overhead. Their combination provides a powerful stack for large‑scale data analytics and machine‑learning tasks.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PythonGPU Accelerationparallel computingMarspandasNumPyscikit-learndistributed data science
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.