
Accelerating Python Data Preprocessing with Multiprocessing in Three Lines of Code

This article demonstrates how to use Python's concurrent.futures module to parallelize image resizing, turning a single‑process script into a multi‑core solution with just three additional lines of code, achieving a roughly seven‑fold speed‑up on a six‑core CPU.

Python Programming Learning Circle

Python is the preferred language for machine learning, but its default single‑process execution can waste CPU cores during data preprocessing, especially on multi‑core machines.

By leveraging the concurrent.futures module from Python's standard library, you can convert a standard script into a parallel version with only three extra lines.

Standard Method

The typical approach reads each image, resizes it, and repeats this in a loop, as shown below:

<code>import glob

import cv2

for image_filename in glob.glob("*.jpg"):
    # read each image from disk, then resize it to 600x600 pixels
    img = cv2.imread(image_filename)
    img = cv2.resize(img, (600, 600))
</code>

Running this on a folder of 1,000 JPEG files on an Intel i7‑8700K (6 cores) took about 8 seconds.
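Timings like this are easy to reproduce with a small harness. The sketch below times a stand‑in CPU‑bound function instead of the cv2 calls, so it runs without OpenCV installed; the function body and item count are placeholder assumptions, not the article's benchmark:

```python
import time

def process(item):
    # stand-in for the per-image work (cv2.imread + cv2.resize)
    return sum(i * i for i in range(10_000))

items = list(range(100))  # stands in for the list of image filenames

start = time.perf_counter()
for item in items:
    process(item)
elapsed = time.perf_counter() - start
print(f"processed {len(items)} items in {elapsed:.3f}s")
```

Swapping `process` for the real image function and `items` for `glob.glob("*.jpg")` gives a baseline number to compare the parallel version against.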

Faster Method

Using a process pool, the workload is split across all CPU cores, dramatically reducing execution time. The key code is:

<code>import concurrent.futures
import glob

import cv2

def load_and_resize(image_filename):
    # worker function: defined at module top level so it can be
    # pickled and sent to the worker processes
    img = cv2.imread(image_filename)
    img = cv2.resize(img, (600, 600))

with concurrent.futures.ProcessPoolExecutor() as executor:
    # the pool defaults to one worker per CPU core; map() fans the
    # filenames out across the workers
    image_files = glob.glob("*.jpg")
    executor.map(load_and_resize, image_files)
</code>

When executed, the same 1,000‑image task completed in roughly 1.14 seconds, about a seven‑fold improvement.
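The same three‑line pattern applies to any independent per‑item task. The sketch below swaps the image work for trivial arithmetic (`square` and `parallel_map` are illustrative names, not from the article) so it runs without OpenCV; the `__main__` guard keeps worker processes from re‑executing the pool setup on platforms that spawn rather than fork:

```python
import concurrent.futures

def square(n):
    # worker must be a top-level function so it can be pickled
    return n * n

def parallel_map(func, items):
    # the three added lines: build a pool sized to the CPU count
    # and let map() distribute the items across worker processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return list(executor.map(func, items))

if __name__ == "__main__":
    print(parallel_map(square, range(5)))  # [0, 1, 4, 9, 16]
```

Note that `executor.map` returns results in the order of its inputs, even though the items may finish out of order.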

Note that parallel pools introduce overhead, so speed‑ups may vary, and the method is best suited for tasks that can be processed independently without requiring a specific order.

Additionally, the data types passed to worker processes must be pickle‑able (e.g., numbers, strings, lists, dictionaries, top‑level functions, and classes), as outlined in Python's documentation.
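The pickling rule can be checked directly: a top‑level function serializes by name, while a lambda does not. This is a stdlib‑only sketch (the exact exception type varies slightly across Python versions, so it only checks that one is raised):

```python
import pickle

def top_level(x):
    # defined at module top level, so pickle can find it by name
    return x + 1

serialized = pickle.dumps(top_level)  # succeeds

raised = False
try:
    pickle.dumps(lambda x: x + 1)  # no importable name, so this fails
except Exception:
    raised = True
print(raised)  # True
```

This is why `load_and_resize` must be defined at module level rather than inline inside the `with` block.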

Source: Towards Data Science article by George Seif.

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
