Effective Python Parallelism with Thread Pools and the map() Function

This article critiques traditional Python threading tutorials and demonstrates how to replace verbose thread‑pool code with concise map‑based parallelism using multiprocessing and multiprocessing.dummy, providing practical examples, performance measurements, and guidelines for choosing pool sizes for I/O‑ and CPU‑bound tasks.

Python Programming Learning Circle

Apr 2, 2021

Python's reputation for difficult parallelism often stems from poor teaching rather than technical limitations; many tutorials focus on heavyweight thread‑pool patterns that are verbose and error‑prone.

Traditional examples typically involve creating a class, a queue, and explicit worker management. For contrast, the thumbnail script below shows the concise, pool-based style this article works toward:

import os
from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    # Lazily yield the full path of every file whose name contains 'jpeg'.
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
    im.thumbnail(SIZE, Image.LANCZOS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath('11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    # exist_ok avoids a crash when the script is run a second time.
    os.makedirs(os.path.join(folder, SAVE_DIRECTORY), exist_ok=True)
    images = get_image_paths(folder)
    pool = Pool()  # one worker process per CPU core by default
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

A more realistic thread‑pool example (adapted from IBM) uses an explicit Consumer class, a Producer function, and a shared queue:

#Example2.py
'''A more realistic thread pool example'''
import queue
import threading
import time
import urllib.request

class Consumer(threading.Thread):
    def __init__(self, task_queue):
        threading.Thread.__init__(self)
        self._queue = task_queue

    def run(self):
        while True:
            content = self._queue.get()
            # The string 'quit' is the sentinel that stops this worker.
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib.request.urlopen(content)
        print('Bye byes!')

def Producer():
    urls = ['http://www.python.org', 'http://www.yahoo.com',
            'http://www.scala.org', 'http://www.google.com']
    task_queue = queue.Queue()
    worker_threads = build_worker_pool(task_queue, 4)
    start_time = time.time()
    # Queue the work, then one 'quit' sentinel per worker.
    for url in urls:
        task_queue.put(url)
    for worker in worker_threads:
        task_queue.put('quit')
    for worker in worker_threads:
        worker.join()
    print('Done! Time taken: {}'.format(time.time() - start_time))

def build_worker_pool(task_queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(task_queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

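The producer/consumer example is lengthy and requires manual thread management. A simpler approach starts from map, a function borrowed from functional languages that applies a function to every element of a sequence. The built‑in map runs serially; the pool classes introduced below provide a map method with the same shape that distributes the calls across workers.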

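Using map with urllib.request.urlopen (urllib2.urlopen in Python 2) looks like this:

import urllib.request

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
# In Python 3, map is lazy; list() forces every request to run.
results = list(map(urllib.request.urlopen, urls))

Which is equivalent to the explicit loop:

results = []
for url in urls:
    results.append(urllib.request.urlopen(url))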
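The equivalence can be checked without touching the network. The following is a minimal Python 3 sketch that substitutes the built-in len for urlopen; the words list is an illustrative assumption, not data from the article.

```python
# Illustrative data; any sequence would do.
words = ['thread', 'pool', 'map']

# map applies the function to each element, in order.
# list() materializes the lazy iterator that Python 3's map returns.
mapped = list(map(len, words))

# The same computation written as an explicit loop.
looped = []
for word in words:
    looped.append(len(word))

print(mapped == looped)
```

Both forms produce the same list; map simply hides the loop and the accumulator.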

The multiprocessing module provides a thread‑based clone called multiprocessing.dummy. Importing the pools is straightforward:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
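Because multiprocessing.dummy mirrors the multiprocessing API, a thread pool exposes the same map method as a process pool. A minimal Python 3 sketch, assuming a hypothetical square function as the workload:

```python
from multiprocessing.dummy import Pool as ThreadPool

def square(n):
    """Hypothetical workload standing in for real work."""
    return n * n

# The with-block shuts the pool down on exit; map has already
# returned every result by then.
with ThreadPool(3) as pool:
    # Results come back in input order, whichever thread finishes first.
    squares = pool.map(square, range(5))

print(squares)
```

Swapping the import for `from multiprocessing import Pool` would run the same calls in worker processes instead, provided the function is defined at module level so it can be pickled.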

Creating a thread pool is a single line:

pool = ThreadPool()

The pool size (the processes argument) defaults to the number of CPU cores on the machine, but for I/O‑bound work you should experiment to find the optimal size.

# Example of setting pool size
pool = ThreadPool(4)  # Sets the pool size to 4
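One way to run that experiment is to time the same batch of tasks at several pool sizes. The sketch below is an assumption-laden stand-in: fake_download and time_pool are hypothetical names, and time.sleep substitutes for a real network call so the example runs offline.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fake_download(_):
    """Stand-in for a network call; sleeping releases the GIL like blocking I/O."""
    time.sleep(0.05)

def time_pool(size, n_tasks=8):
    """Return the wall-clock seconds needed to run n_tasks on `size` threads."""
    start = time.perf_counter()
    with ThreadPool(size) as pool:
        pool.map(fake_download, range(n_tasks))
    return time.perf_counter() - start

# Time the same workload at several candidate pool sizes.
timings = {size: time_pool(size) for size in (1, 2, 4, 8)}
for size, seconds in timings.items():
    print(f'{size} threads: {seconds:.2f}s')
```

With a real workload, replace fake_download with the actual I/O call and pick the smallest size after which the timings stop improving.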

Rewriting the earlier IBM example with a thread pool reduces the code to four essential lines:

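import urllib.request
from multiprocessing.dummy import Pool as ThreadPool

urls = ['http://www.python.org', 'http://www.python.org/about/', ...]
pool = ThreadPool(4)  # four worker threads
# Fetch the URLs on the worker threads; results keep the order of urls.
results = pool.map(urllib.request.urlopen, urls)
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to exit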
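One practical caveat: pool.map re-raises an exception thrown inside a worker call, so a single unreachable URL means no results are returned for the batch. A minimal sketch of one possible workaround wraps the function so each call returns a (result, error) pair. The names safe and flaky_fetch are hypothetical, and flaky_fetch simulates a fetcher so the example runs offline.

```python
from multiprocessing.dummy import Pool as ThreadPool

def safe(func):
    """Wrap func so it returns (result, error) instead of raising."""
    def wrapper(arg):
        try:
            return func(arg), None
        except Exception as exc:  # record the failure, keep the batch going
            return None, exc
    return wrapper

def flaky_fetch(url):
    """Hypothetical fetcher that fails for URLs containing 'bad'."""
    if 'bad' in url:
        raise ValueError(f'cannot fetch {url}')
    return f'<html from {url}>'

urls = ['http://example.com/a', 'http://example.com/bad', 'http://example.com/b']
with ThreadPool(2) as pool:
    outcomes = pool.map(safe(flaky_fetch), urls)

for url, (body, error) in zip(urls, outcomes):
    print(url, 'failed' if error else 'ok')
```

The same wrapper could be applied to a real fetch function; the successful results survive even when some calls fail.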

Performance measurements on the author’s machine show dramatic speedups, with diminishing returns beyond eight threads:

# Single thread: 14.4 seconds
# 4‑thread pool: 3.1 seconds
# 8‑thread pool: 1.4 seconds
# 13‑thread pool: 1.3 seconds
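To put those timings on a common scale, the speedup of each configuration is the single-thread time divided by the pooled time. A small sketch using only the figures reported above:

```python
# Timings reported above, in seconds, keyed by thread count.
timings = {1: 14.4, 4: 3.1, 8: 1.4, 13: 1.3}
baseline = timings[1]

# Speedup = single-thread time / pooled time, rounded to one decimal.
speedups = {n: round(baseline / t, 1) for n, t in timings.items()}
print(speedups)  # {1: 1.0, 4: 4.6, 8: 10.3, 13: 11.1}
```

Going from 8 to 13 threads moves the speedup only from about 10.3x to 11.1x, which is where adding threads stops paying off on this workload.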

Another real‑world case processes thousands of images to create thumbnails. The single‑process version takes 27.9 seconds for 6000 images:

# Single‑process version (excerpt)
import os
from PIL import Image

# ... same get_image_paths and create_thumbnail as above ...
if __name__ == '__main__':
    folder = os.path.abspath('...')
    # exist_ok avoids a crash when the script is run a second time.
    os.makedirs(os.path.join(folder, SAVE_DIRECTORY), exist_ok=True)
    images = get_image_paths(folder)
    # Process the images one at a time in the current process.
    for image in images:
        create_thumbnail(image)

Replacing the loop with pool.map reduces the runtime to about 5.6 seconds:

# Parallel version using map
import os
from multiprocessing import Pool
from PIL import Image

# ... definitions unchanged ...
if __name__ == '__main__':
    folder = os.path.abspath('...')
    # exist_ok avoids a crash when the script is run a second time.
    os.makedirs(os.path.join(folder, SAVE_DIRECTORY), exist_ok=True)
    images = get_image_paths(folder)
    pool = Pool()  # one worker process per CPU core by default
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

These examples illustrate that, for many everyday scripting tasks, a concise map call combined with the appropriate pool (process‑based for CPU‑bound, thread‑based for I/O‑bound) yields clean, debuggable, and high‑performance parallel code.
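That guideline can be folded into a small helper. The sketch below is one possible arrangement rather than anything from the original article: parallel_map and count_divisors are hypothetical names, and the cpu_bound flag is an assumed way of choosing between the two pool types.

```python
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def parallel_map(func, items, cpu_bound=False, workers=None):
    """Map func over items with a process pool (CPU-bound) or a thread pool.

    Hypothetical helper. With cpu_bound=True, func must be a picklable
    top-level function because it is sent to worker processes.
    workers=None lets the pool default to the number of CPU cores.
    """
    make_pool = Pool if cpu_bound else ThreadPool
    with make_pool(workers) as pool:
        return pool.map(func, items)

def count_divisors(n):
    """Toy CPU-heavy workload: count the divisors of n by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

if __name__ == '__main__':
    numbers = [10, 12, 28]
    # Process pool for CPU-bound work, thread pool otherwise.
    print(parallel_map(count_divisors, numbers, cpu_bound=True))
    print(parallel_map(count_divisors, numbers))
```

The `if __name__ == '__main__':` guard matters for the process-based branch, since platforms that spawn worker processes re-import the main module.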
