Big Data 6 min read

Accelerating Python Function Execution with Multiprocessing: Achieving a 30× Speedup on Large Datasets

This article explains how to use Python's multiprocessing module to parallelize a custom preprocessing function on a 537k‑record dataset, demonstrating roughly a thirty‑fold reduction in execution time compared with single‑process execution.

Python Programming Learning Circle

Aug 28, 2021

Accelerating Python Function Execution with Multiprocessing: Achieving a 30× Speedup on Large Datasets

Introduction

Python is a popular programming language, especially in the data‑science community, but its dynamic nature and interpreted execution make it slower than compiled languages such as C.

Although C can run 10–100 times faster, Python offers much faster development speed, which is more valuable for data‑science research where rapid prototyping outweighs raw performance.

This article shows how to use the multiprocessing module to run a custom Python function in parallel and compares the runtime metrics.

Multiprocessing Basics

On a single‑core CPU, multiple tasks must be time‑sliced, causing frequent context switches. Multi‑core CPUs allow true parallel execution, which is called parallel processing.

Why It Matters

Data cleaning, feature engineering, and exploration are essential steps before feeding data into machine‑learning models; these steps become time‑consuming for large datasets.

Parallel processing is an effective way to improve Python program performance, and the multiprocessing module enables execution across multiple CPU cores.

Implementation

We use the Pool class from the multiprocessing module to apply a custom preprocess() function to different chunks of the dataset, achieving data parallelism.

import multiprocessing
from functools import partial
from QuoraTextPreprocessing import preprocess

BUCKET_SIZE = 50000

def run_process(df, start):
    df = df[start:start+BUCKET_SIZE]
    print(start, "to ",start+BUCKET_SIZE)
    temp = df["question"].apply(preprocess)

chunks  = [x for x in range(0,df.shape[0], BUCKET_SIZE)]  
pool = multiprocessing.Pool()
func = partial(run_process, df)
temp = pool.map(func,chunks)
pool.close()
pool.join()

The dataset contains 537,361 text questions; with a bucket size of 50,000 it is split into 11 chunks that can be processed in parallel.

Benchmark

The tests were run on a machine with 64 GB RAM and 10 CPU cores.

Results show that parallel processing speeds up the execution of the preprocess() function by nearly 30× compared with a single‑process run.

Conclusion

The article demonstrates that adding a few lines of multiprocessing code can reduce the execution time of a 537k‑record dataset by almost thirty times, making parallel processing a recommended approach for handling large data sets efficiently.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Big Data Benchmark parallel processing Multiprocessing

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.