Accelerating Python Function Execution with Multiprocessing: Achieving a 30× Speedup on Large Datasets
This article explains how to use Python's multiprocessing module to parallelize a custom preprocessing function on a 537k‑record dataset, demonstrating roughly a thirty‑fold reduction in execution time compared with single‑process execution.
Introduction
Python is a popular programming language, especially in the data‑science community, but its dynamic nature and interpreted execution make it slower than compiled languages such as C.
Although C can run 10–100 times faster, Python offers much faster development speed, which is more valuable for data‑science research where rapid prototyping outweighs raw performance.
This article shows how to use the multiprocessing module to run a custom Python function in parallel and compares the runtime metrics.
Multiprocessing Basics
On a single‑core CPU, multiple tasks must be time‑sliced, causing frequent context switches. Multi‑core CPUs allow true parallel execution, which is called parallel processing.
Why It Matters
Data cleaning, feature engineering, and exploration are essential steps before feeding data into machine‑learning models; these steps become time‑consuming for large datasets.
Parallel processing is an effective way to improve Python program performance, and the multiprocessing module enables execution across multiple CPU cores.
Implementation
We use the Pool class from the multiprocessing module to apply a custom preprocess() function to different chunks of the dataset, achieving data parallelism.
import multiprocessing
from functools import partial
from QuoraTextPreprocessing import preprocess
BUCKET_SIZE = 50000
def run_process(df, start):
df = df[start:start+BUCKET_SIZE]
print(start, "to ",start+BUCKET_SIZE)
temp = df["question"].apply(preprocess)
chunks = [x for x in range(0,df.shape[0], BUCKET_SIZE)]
pool = multiprocessing.Pool()
func = partial(run_process, df)
temp = pool.map(func,chunks)
pool.close()
pool.join()The dataset contains 537,361 text questions; with a bucket size of 50,000 it is split into 11 chunks that can be processed in parallel.
Benchmark
The tests were run on a machine with 64 GB RAM and 10 CPU cores.
Results show that parallel processing speeds up the execution of the preprocess() function by nearly 30× compared with a single‑process run.
Conclusion
The article demonstrates that adding a few lines of multiprocessing code can reduce the execution time of a 537k‑record dataset by almost thirty times, making parallel processing a recommended approach for handling large data sets efficiently.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
