Accelerating Python Data Preprocessing with Multiprocessing in Three Lines of Code
This article demonstrates how to use Python's concurrent.futures module to parallelize image resizing, turning a single‑process script into a multi‑core solution with just three additional lines of code, achieving up to a six‑fold speed‑up on typical CPUs.
Python is a popular language for machine learning, but a script runs in a single process by default, so data preprocessing on a multi‑core machine leaves most cores idle.
With ProcessPoolExecutor from the standard‑library concurrent.futures module, you can convert such a script into a parallel version with only three extra lines.
Standard Method
The typical approach reads each image, resizes it, and repeats this in a loop, as shown below:
<code>import glob
import os
import cv2

# Process every JPEG in the current directory, one at a time
for image_filename in glob.glob("*.jpg"):
    # Read the image and resize it to 600x600 pixels
    img = cv2.imread(image_filename)
    img = cv2.resize(img, (600, 600))
</code>Running this on a folder of 1,000 JPEG files on an Intel i7‑8700K (6 cores) took about 8 seconds.
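A timing like the one quoted here can be taken with a small wrapper around `time.perf_counter`. The sketch below is only an illustration: `timed` and `busy_work` are hypothetical names, and `busy_work` is a stand-in for the image loop so the example runs without OpenCV.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def busy_work(n):
    # Stand-in for the image loop (assumption): sum of squares below n
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    total, seconds = timed(busy_work, 1_000_000)
    print(f"busy_work took {seconds:.3f} s")
```

Wrapping the serial loop and the parallel version in the same helper gives comparable numbers, though results will differ from machine to machine.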
Faster Method
Using a process pool, the workload is split across all CPU cores, dramatically reducing execution time. The key code is:
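<code>import glob
import os
import cv2
import concurrent.futures

def load_and_resize(image_filename):
    # Runs in a worker process: read one image and resize it to 600x600
    img = cv2.imread(image_filename)
    img = cv2.resize(img, (600, 600))

# The guard keeps worker processes from re-running the pool setup on
# platforms that start workers by re-importing this module (e.g. Windows)
if __name__ == "__main__":
    # By default the pool size is the number of processors on the machine
    with concurrent.futures.ProcessPoolExecutor() as executor:
        image_files = glob.glob("*.jpg")
        # Apply load_and_resize to every file across the worker processes
        executor.map(load_and_resize, image_files)
</code>When executed, the same 1,000‑image task completed in roughly 1.14 seconds, a near‑6× improvement.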
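In the script above the worker returns nothing, so each resized image is discarded when its call finishes. If the parent process needs the results, `executor.map` yields each worker's return value in the same order as the inputs. A minimal sketch, assuming a hypothetical `name_length` worker as a stand-in for the image function so that it runs without OpenCV:

```python
import concurrent.futures

def name_length(filename):
    # Stand-in worker (assumption): return a small, picklable result
    return filename, len(filename)

def run_parallel(filenames, max_workers=None):
    # list() consumes the lazy iterator returned by executor.map;
    # results come back in the same order as the inputs
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(name_length, filenames))

if __name__ == "__main__":
    for name, length in run_parallel(["a.jpg", "holiday.jpg", "cat.jpg"]):
        print(name, length)
```

Return values travel back to the parent through pickling, so returning full image arrays adds to the overhead. Returning a small summary, or writing the output to disk inside the worker, is likely to be cheaper.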
Note that a process pool adds overhead for starting workers and passing data between processes, so the speed‑up varies with the workload. The method is best suited to tasks that can be processed independently of one another and in any order.
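When order does not matter, results can be handled as each task finishes instead of in input order. The sketch below uses `submit` together with `concurrent.futures.as_completed` for this. The names `square` and `squares_unordered` are hypothetical stand-ins, not part of the original article.

```python
import concurrent.futures

def square(n):
    # Stand-in for an independent unit of work (assumption)
    return n * n

def squares_unordered(numbers):
    results = {}
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Map each future back to its input so results can be labelled
        future_to_input = {executor.submit(square, n): n for n in numbers}
        # as_completed yields futures in the order they finish
        for future in concurrent.futures.as_completed(future_to_input):
            results[future_to_input[future]] = future.result()
    return results

if __name__ == "__main__":
    print(squares_unordered([3, 1, 2]))
```

Keying the results by input keeps them identifiable even though completion order is not guaranteed.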
Additionally, anything passed to or returned from a worker process must be picklable (e.g., numbers, strings, lists and dictionaries of picklable items, and functions and classes defined at the top level of a module), as outlined in Python's documentation.
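One way to check this ahead of time is to try pickling the object directly. A minimal sketch, assuming a hypothetical `is_picklable` helper (it is not a standard‑library function):

```python
import math
import pickle

def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

if __name__ == "__main__":
    candidates = {
        "int": 42,
        "dict of lists": {"sizes": [600, 600]},
        "module-level function": math.sqrt,
        "lambda": lambda x: x * 2,
        "generator": (i for i in range(3)),
    }
    for label, obj in candidates.items():
        print(f"{label}: {is_picklable(obj)}")
```

Lambdas and generators fail the check, which is why the worker passed to the pool should be a named function defined at module level.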
Source: Towards Data Science article by George Seif.