Boost Python Scripts with map() Parallelism: From Threads to ThreadPools
Python's traditional multithreading tutorials often overcomplicate simple tasks. A parallel map() removes most of that ceremony: the thread-based Pool in multiprocessing.dummy suits I/O-bound scripts, and the process-based Pool in multiprocessing suits CPU-bound ones. Either way, the code shrinks from dozens of lines to a few while still achieving a significant speedup.
Python has a reputation for being difficult to parallelize. Beyond technical issues like thread implementation and the GIL, the main problem is often misleading teaching that focuses on heavyweight class‑based and queue‑based examples that are not useful for everyday scripting.
Traditional examples
Typical "Python multithreading" tutorials show class and queue based code that looks more like Java.
These examples are verbose and error-prone for routine tasks.
The problem…
You need a template class, a queue to pass objects, and methods on both ends of the channel (sometimes another queue for bidirectional communication or result collection).
More workers, more problems
Following that logic, you would need a thread‑pool for workers. Below is a classic IBM tutorial example that uses multithreading for web crawling.
#Example2.py
'''
A more realistic thread pool example
'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            # The string 'quit' is the poison pill that stops the worker
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com',
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

While functional, this code needs many lines just to define the worker class, track the threads, and join them all at the end, and that bookkeeping is where deadlocks tend to creep in.
Why not try map?
The map function, borrowed from functional languages such as Lisp, applies a function to each element of a sequence. Because each call is independent of the others, it is also a concise route to parallelizing Python programs.
import urllib2

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

This is equivalent to:

results = []
for url in urls:
    results.append(urllib2.urlopen(url))

Using map handles sequence iteration, argument passing, and result collection in one step.
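To make the equivalence concrete without touching the network, here is a small sketch written for Python 3 (the listings in this article are Python 2). The `square` function and the sample numbers are hypothetical stand-ins for `urllib2.urlopen` and the URL list. One difference worth knowing: in Python 3, `map` returns a lazy iterator, so the sketch wraps it in `list()` to collect the results.

```python
def square(n):
    """Hypothetical stand-in for any one-argument function, such as a URL fetcher."""
    return n * n

numbers = [1, 2, 3, 4]

# map applies the function to every element; list() collects the lazy iterator
mapped = list(map(square, numbers))

# The explicit loop that map replaces
looped = []
for n in numbers:
    looped.append(square(n))

print(mapped == looped)
```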
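Two libraries provide a parallel map implementation: multiprocessing and its lesser-known sibling multiprocessing.dummy. The latter is a thread-based clone of the multiprocessing API. Because threads can overlap their waiting time even under Python's GIL, it suits I/O-bound tasks, while the process-based multiprocessing is the better fit for CPU-bound work.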
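One way to see that multiprocessing.dummy really runs on threads is to ask each call where it executed. The following is a minimal sketch for Python 3; `where_am_i` is a hypothetical helper introduced only for this illustration.

```python
import os
import threading
from multiprocessing.dummy import Pool as ThreadPool

def where_am_i(_):
    """Report the process id and thread name that ran this call."""
    return os.getpid(), threading.current_thread().name

pool = ThreadPool(4)
reports = pool.map(where_am_i, range(8))
pool.close()
pool.join()

pids = {pid for pid, _ in reports}
thread_names = {name for _, name in reports}

# All calls share the parent's process id; only the worker thread differs
print(pids == {os.getpid()})
print(sorted(thread_names))
```

With the process-based `multiprocessing.Pool`, the same check would be expected to report process ids different from the parent's, since each worker is a separate process.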
Hands‑on
Import the parallel map libraries:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Instantiate a pool (default size equals the number of CPU cores):

pool = ThreadPool()

For network-bound tasks you may want to set the pool size manually:

pool = ThreadPool(4)  # Sets the pool size to 4

Too many threads can cause overhead that outweighs the work; experiment to find the optimal size.
Rewriting the earlier example with a pool reduces the core code to four lines:
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/',
    # etc..
]

# Make the pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads and return the results
results = pool.map(urllib2.urlopen, urls)
# Close the pool and wait for the work to finish
pool.close()
pool.join()

Benchmarking shows dramatic speedups:
# Single thread: 14.4 Seconds
# 4 Pool: 3.1 Seconds
# 8 Pool: 1.4 Seconds
# 13 Pool: 1.3 Seconds

Beyond a pool size of about nine, gains diminish on the author's machine.
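A benchmark of this kind can be sketched as follows. This is written for Python 3 and is not the author's benchmark script: `fake_fetch`, `time_pool`, and the example.com URLs are hypothetical, and a short `time.sleep` stands in for network latency so the sketch runs offline.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fake_fetch(url):
    """Hypothetical stand-in for a network call: wait briefly, return the URL length."""
    time.sleep(0.05)
    return len(url)

def time_pool(urls, size):
    """Map fake_fetch over urls with `size` threads; return (results, seconds)."""
    pool = ThreadPool(size)
    start = time.perf_counter()
    results = pool.map(fake_fetch, urls)
    elapsed = time.perf_counter() - start
    pool.close()
    pool.join()
    return results, elapsed

# Placeholder URLs; nothing is actually requested
urls = ['http://example.com/page/%d' % i for i in range(16)]

for size in (1, 4, 8):
    results, elapsed = time_pool(urls, size)
    print('%2d Pool: %.2f Seconds' % (size, elapsed))
```

To time real requests in Python 3, `fake_fetch` could be swapped for `urllib.request.urlopen`. The absolute numbers will depend on the network and the machine, so the point of diminishing returns is something to measure rather than assume.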
Another real‑world example
Generating thumbnails for thousands of images is a CPU‑intensive task that benefits from parallelism.
Single‑process version
import os
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    # Lazily yield the full path of every file whose name contains 'jpeg'
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    # Shrink the image and save the result under SAVE_DIRECTORY
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath('11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)

This version processes 6,000 images in about 27.9 seconds.
Parallel version using map
import os
from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    # Lazily yield the full path of every file whose name contains 'jpeg'
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    # Shrink the image and save the result under SAVE_DIRECTORY
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath('11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    # A process pool replaces the for loop of the single-process version
    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

The parallel version finishes in about 5.6 seconds, a clear improvement with only a few code changes.
Thus, with a single line using map and an appropriate pool, Python scripts can be parallelized efficiently, simplifying debugging and avoiding deadlocks.
