Boost Python Performance: Multithreading, Multiprocessing, and Best Practices
This article explains how Python’s GIL affects concurrency and when to use multithreading versus multiprocessing, then offers practical tips on efficient inter‑process communication, iteration, string handling, sorting, file I/O, and the standard library that can noticeably improve script performance.
1. Multithreading and Multiprocessing
In standard CPython builds, threads cannot run CPU‑bound Python code in parallel because the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. Threads are still useful for I/O‑bound work, while separate processes each have their own interpreter and GIL and can therefore run on multiple cores.
Multithreading: Use when the program is limited by I/O (e.g., downloading files, network requests, disk reads). The GIL is released during blocking I/O, allowing other threads to run.
Multiprocessing: Use for CPU‑bound tasks that benefit from true parallelism. Each process has its own interpreter, GIL, and memory space, so processes do not contend for a single lock.
Proper use can reduce minutes‑long tasks to seconds, but misuse adds overhead without performance gains.
Best Practices
Prefer external libraries (NumPy, SciPy, PyTorch) for heavy numeric work; many of their operations release the GIL while running compiled C/Fortran/CUDA code, so explicit multiprocessing is often unnecessary.
Use concurrent.futures.ThreadPoolExecutor for I/O‑bound tasks and concurrent.futures.ProcessPoolExecutor for CPU‑bound tasks.
Limit the number of processes to the number of CPU cores (use os.cpu_count()).
import concurrent.futures
import time

def download_data(url):
    # Simulated I/O-bound task: the sleep stands in for a network wait.
    print(f"Starting download from {url}...")
    time.sleep(2)
    print(f"Finished download from {url}.")
    return f"Data from {url}"

def calculate_prime(number):
    # CPU-bound task: trial division up to the square root.
    print(f"Calculating prime for {number}...")
    is_prime = number > 1 and all(number % i for i in range(2, int(number**0.5) + 1))
    print(f"Finished calculation for {number}.")
    return is_prime

if __name__ == "__main__":  # required for process pools on platforms that spawn workers
    urls = ["http://url.com/1", "http://url.com/2", "http://url.com/3"]
    print("\n--- Using ThreadPoolExecutor (I/O bound) ---")
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        results = list(executor.map(download_data, urls))
    print(f"ThreadPool results: {results}")

    numbers = [10000003, 10000007, 10000009]
    print("\n--- Using ProcessPoolExecutor (CPU bound) ---")
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
        results = list(executor.map(calculate_prime, numbers))
    print(f"ProcessPool results: {results}")
2. Efficient Inter‑Process Communication (IPC)
While multiprocessing provides true parallelism, communication between processes can become a bottleneck if large objects are passed or messages are sent frequently.
Passing large objects: Serializing and deserializing big data structures (e.g., large lists, NumPy arrays) can negate the benefits of parallelism.
Frequent small messages: Repeatedly sending tiny chunks adds overhead.
Optimizing these factors is crucial for scalable multiprocessing applications.
Best Practices
Use multiprocessing.Queue or Pipe for small messages.
Use multiprocessing.shared_memory (Python 3.8+) for truly large data shared across processes.
Use multiprocessing.Manager for shared data structures when appropriate.
Batch data before sending through queues or pipes.
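The shared-memory tip above can be sketched as follows. This is a minimal illustration, not a full multiprocessing program: both helper functions (create_block and sum_block are hypothetical names) run in one process here, and attaching by name stands in for what a worker process would do. It assumes Python 3.8+ and uses only the standard library.

```python
import struct
from multiprocessing import shared_memory

def create_block(values):
    """Copy the values into a new shared-memory block as C doubles."""
    fmt = f"{len(values)}d"
    shm = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
    struct.pack_into(fmt, shm.buf, 0, *values)
    return shm

def sum_block(name, count):
    """Attach to an existing block by name, as a worker process would."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return sum(struct.unpack_from(f"{count}d", shm.buf, 0))
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()

values = [1.5, 2.5, 3.0]
block = create_block(values)
try:
    # Only the short block name and the element count need to be sent
    # to a worker; the payload itself is never pickled.
    total = sum_block(block.name, len(values))
finally:
    block.close()
    block.unlink()  # free the block once every user is done with it

print(total)
```

In a real program the name and count would travel through a Queue or as task arguments, which keeps each message small regardless of how large the shared data is.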
import multiprocessing
import numpy as np

def process_chunk(chunk_id, data_chunk):
    # Each worker receives one large chunk, so the data crosses the
    # process boundary in a few big messages rather than many small ones.
    result = np.sum(data_chunk) * 2
    return f"Chunk {chunk_id} processed, sum doubled: {result}"

if __name__ == '__main__':
    large_array = np.random.rand(1_000_000)
    chunk_size = len(large_array) // 4
    chunks = [large_array[i:i + chunk_size] for i in range(0, len(large_array), chunk_size)]
    print("\n--- Using multiprocessing.Pool with batched chunks ---")
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.starmap(process_chunk, enumerate(chunks))
    print("All chunks processed.")
3. Loops, Generators, and Efficient Iteration
Python offers highly optimized built‑in functions and iteration patterns. Ignoring them can severely degrade performance.
Best Practices
Direct iteration: Prefer for item in my_list over for i in range(len(my_list)) to avoid repeated indexing.
Use built‑in functions: sum(), min(), max(), any(), all(), zip(), enumerate() are usually faster than manual loops.
Generators for large or infinite data: Use generator expressions or yield to avoid MemoryError and reduce memory usage.
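A short sketch of the three iteration tips above. The sample data and the squares() helper are made up for illustration.

```python
def squares(limit):
    """Yield squares lazily instead of building a full list in memory."""
    for n in range(limit):
        yield n * n

names = ["ada", "grace", "linus"]
scores = [90, 85, 70]

# Direct iteration with zip() and enumerate() instead of range(len(...)).
ranked = [
    f"{pos}. {name}: {score}"
    for pos, (name, score) in enumerate(zip(names, scores), start=1)
]

# Built-ins consume the generator one item at a time.
total = sum(squares(1_000))
has_big = any(sq > 900_000 for sq in squares(1_000))  # stops at the first match

print(ranked)
print(total, has_big)
```

Because squares() is a generator, neither sum() nor any() ever holds all thousand values at once, and the same code keeps working if the limit grows by several orders of magnitude.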
4. Proper String Concatenation
Using + or += inside loops can create many temporary strings because strings are immutable, leading to repeated memory allocation and copying. Instead, collect the pieces and combine them once with str.join(), for example "".join(parts).
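list_of_strs = ["alpha", "beta", "gamma"]  # example input

# Not recommended
long_string = ""
for s in list_of_strs:  # avoid naming the loop variable str, which shadows the built-in
    long_string += s

# Recommended
long_string = "".join(list_of_strs)
5. Efficient Sorting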
Built‑in sorted() and list.sort() use Timsort (or Powersort in Python 3.11+), which is highly optimized for real‑world data. Use the key argument for custom objects.
Best Practices
Sorting custom objects: Use a lambda or operator.attrgetter as the key.
DSU (decorate‑sort‑undecorate) pattern: The key argument already calls the key function only once per element; when the key is expensive and reused across several sorts, pre‑compute the keys and sort the decorated pairs.
Partial sorting: Use heapq.nsmallest() or heapq.nlargest() instead of sorting the entire list when only a few top elements are needed.
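The attrgetter and partial-sorting tips above might look like this in practice. The Employee record and its data are hypothetical, chosen only to illustrate the calls.

```python
import heapq
from collections import namedtuple
from operator import attrgetter

Employee = namedtuple("Employee", ["name", "salary"])  # hypothetical record type

staff = [
    Employee("Ann", 72_000),
    Employee("Raj", 95_000),
    Employee("Mei", 88_000),
    Employee("Tom", 61_000),
]

# Full sort with attrgetter as the key function.
by_salary = sorted(staff, key=attrgetter("salary"))

# Partial sort: only the two highest salaries are needed,
# so the whole list does not have to be ordered.
top_two = heapq.nlargest(2, staff, key=attrgetter("salary"))
lowest = heapq.nsmallest(1, staff, key=attrgetter("salary"))

print([e.name for e in by_salary])
print([e.name for e in top_two], lowest[0].name)
```

For a handful of results out of a long list, the heapq functions avoid ordering every element; when most of the list is needed, a plain sorted() call is usually the simpler choice.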
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"Person('{self.name}', {self.age})"

people = [Person('Alice', 30), Person('Bob', 25), Person('Charlie', 35)]
people.sort(key=lambda p: p.age)  # sort in place by age
print(people)
6. File I/O
Efficient file handling is essential for large data files.
Best Practices
Always use with open() as f to ensure proper closure.
Read small files with f.read(); for large files, iterate line‑by‑line or use generators.
Batch writes by collecting data in a list and writing once with f.write(''.join(lines)) or writing in large blocks.
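The file-handling practices above can be sketched as follows. The file name, record contents, and helper names are illustrative assumptions; a temporary directory keeps the example self-contained.

```python
import os
import tempfile

def write_batched(path, records):
    """Build all lines first, then hand them to the file in a single write."""
    lines = [f"{record}\n" for record in records]
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(lines))

def count_matching_lines(path, needle):
    """Stream the file line by line so only one line is in memory at a time."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if needle in line)

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "events.log")
    write_batched(log_path, ["ok 1", "error 2", "ok 3", "error 4"])
    error_count = count_matching_lines(log_path, "error")

print(error_count)
```

Iterating over the file object reads lazily, so count_matching_lines() behaves the same on a four-line file and on one too large to fit in memory.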
7. Leveraging Standard Library Modules
The collections and itertools modules provide highly optimized containers and iteration utilities that often outperform generic types.
Best Practices
collections.deque for O(1) appends/pops at both ends.
collections.Counter for fast frequency counting.
collections.defaultdict to avoid KeyError and simplify code.
itertools.chain, cycle, permutations, combinations, groupby, islice for efficient iteration patterns.
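A brief sketch of the three collections containers listed above, using a made-up sentence as input.

```python
from collections import Counter, defaultdict, deque

words = "the quick brown fox jumps over the lazy dog the end".split()

# Counter: frequency counting in one call.
counts = Counter(words)
most_common = counts.most_common(1)

# defaultdict: group words by length without checking for missing keys.
by_length = defaultdict(list)
for word in words:
    by_length[len(word)].append(word)

# deque: O(1) operations at both ends; maxlen keeps only the newest items.
queue = deque([1, 2, 3])
queue.appendleft(0)
last = queue.pop()
recent = deque(words, maxlen=3)

print(most_common, by_length[5], list(queue), last, list(recent))
```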
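from itertools import chain, cycle, permutations

list1 = [1, 2, 3]
list2 = [4, 5, 6]
# chain: iterate over several iterables as one, without building a combined list
for item in chain(list1, list2):
    print(item, end=' ')
print()

# cycle: repeat a sequence endlessly; take only as many items as needed
colors = cycle(['red', 'green', 'blue'])
for _ in range(5):
    print(next(colors))

# permutations: every ordered pair of distinct letters
for p in permutations('ABC', 2):
    print(''.join(p))
8. Choosing the Right Data Structure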
Selecting appropriate structures dramatically reduces algorithmic complexity and runtime.
Best Practices
Use set for fast membership tests and duplicate removal.
Use dict for key‑value lookups.
Use np.array() for large numeric datasets.
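To make the membership-test tip concrete, here is a rough timing sketch. The sizes and repeat counts are arbitrary, and absolute timings will vary by machine; only the relative gap matters.

```python
import timeit

items = list(range(100_000))
as_set = set(items)
target = 99_999  # last element: the worst case for a list scan

# A list checks elements one by one; a set uses a hash lookup.
list_time = timeit.timeit(lambda: target in items, number=200)
set_time = timeit.timeit(lambda: target in as_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")

# Duplicate removal: set() drops order, dict.fromkeys() keeps first-seen order.
raw = [3, 1, 3, 2, 1]
unique_sorted = sorted(set(raw))
unique_ordered = list(dict.fromkeys(raw))
print(unique_sorted, unique_ordered)
```

Converting a list to a set has its own cost, so the conversion pays off mainly when the same collection is tested for membership many times.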
9. Avoid Dot‑Lookup Overhead
Calling a function through its module name incurs an extra attribute lookup on every call. In hot loops, importing the function directly removes that lookup; outside hot loops the difference is usually negligible.
# Not recommended
import math
a = math.sqrt(50)

# Recommended
from math import sqrt
a = sqrt(50)
10. Avoid Global Variables
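Name lookup follows the LEGB rule (local, enclosing, global, built‑in). In CPython, local variables are accessed by index in a fixed array, while globals require a dictionary lookup, so global access is slower. In tight loops or frequently called functions, prefer locals to reduce lookup cost.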
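One way to measure the difference is sketched below. FACTOR and both function names are illustrative, and the timings depend on the interpreter version and machine, so treat the printed numbers as indicative only.

```python
import timeit

FACTOR = 3  # module-level (global) value

def scale_with_global(n):
    total = 0
    for i in range(n):
        total += i * FACTOR  # global lookup on every iteration
    return total

def scale_with_local(n):
    factor = FACTOR  # bind the global to a local name once
    total = 0
    for i in range(n):
        total += i * factor  # local lookup on every iteration
    return total

global_time = timeit.timeit(lambda: scale_with_global(100_000), number=20)
local_time = timeit.timeit(lambda: scale_with_local(100_000), number=20)
print(f"global: {global_time:.4f}s  local: {local_time:.4f}s")
```

Both functions return the same result; the gap is typically modest, so this optimization is mainly worth applying to inner loops that a profiler has already flagged.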
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.
