How CPython 3.13’s Free‑Threading Boosts Parallel Performance (and What It Means for Your Code)
The article examines CPython 3.13’s new free‑threading mode, its impact on the Global Interpreter Lock, benchmark results using a PageRank example, and practical multithreaded and multiprocessing implementations to show how performance can dramatically improve on modern multicore CPUs.
CPython 3.13 was released two weeks ago as the most performance‑focused version to date. Its key performance‑related changes are:
Free‑threading mode that can run without the Global Interpreter Lock (GIL)
A brand‑new Just‑In‑Time (JIT) compiler
Bundled mimalloc memory allocator
This article concentrates on the free‑threading mode, showing how to exploit it and how it affects Python application performance when measured with CodSpeed.
Free‑Threading in CPython
Free‑threading is an experimental feature in Python 3.13 that allows CPython to run without the GIL. The GIL is a mutex that prevents multiple threads from executing Python bytecode simultaneously, simplifying memory management but becoming a major bottleneck on modern multicore processors.
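Before benchmarking, it helps to confirm which kind of interpreter is actually running. The following is a minimal sketch, not taken from the original article: the helper name `gil_status` is hypothetical, and it relies on `sysconfig.get_config_var("Py_GIL_DISABLED")` and on `sys._is_gil_enabled()`, which exists only from Python 3.13 onward (hence the `getattr` fallback).

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this build is free-threaded and whether the GIL is active.

    Hypothetical helper for illustration; not part of the article's benchmark code.
    """
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; older versions always use the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    print(gil_status())
```

On a free-threaded build, the GIL can still be switched back on at startup (for example through the `PYTHON_GIL` environment variable), so the two values are reported separately.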
Multiprocessing as a Traditional Work‑around
Historically, developers used the multiprocessing module, which spawns separate processes. This approach incurs significant memory overhead, inter‑process communication costs, and slower startup times.
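The communication cost comes largely from serialisation: arguments and results that cross a process boundary are pickled and rebuilt as copies, whereas threads see the very same object. The sketch below only simulates the process boundary with a pickle round trip (an assumption made to keep the example small and deterministic); the helper names are hypothetical.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def crosses_process_boundary(obj):
    """Simulate handing an argument to another process: pickle, then unpickle."""
    return pickle.loads(pickle.dumps(obj))


def seen_by_thread(obj):
    """Hand the object to a worker thread and return what the thread received."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(lambda received: received, obj).result()


data = list(range(100_000))
copied = crosses_process_boundary(data)
shared = seen_by_thread(data)

print(copied is data)  # False: the "other process" works on a rebuilt copy
print(shared is data)  # True: the thread works on the original object
print(len(pickle.dumps(data)))  # bytes that would have to be transferred
```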
Real‑World Impact: PageRank Example
PageRank is a compute‑intensive, matrix‑heavy algorithm that benefits greatly from parallelisation. In CPython 3.12 and earlier, a naïve multithreaded implementation is throttled by the GIL, while a multiprocessing version suffers from the overheads mentioned above.
Single‑Threaded Reference Implementation
import numpy as np

DAMPING = 0.85  # assumed value; the excerpt uses DAMPING without defining it

def pagerank_single(matrix: np.ndarray, num_iterations: int) -> np.ndarray:
    """Single-threaded PageRank implementation"""
    size = matrix.shape[0]
    # Start from a uniform distribution
    scores = np.ones(size) / size
    for _ in range(num_iterations):
        new_scores = np.zeros(size)
        for i in range(size):
            # Nodes that link to node i
            incoming = np.where(matrix[:, i])[0]
            for j in incoming:
                # Node j splits its score evenly across its outgoing links
                new_scores[i] += scores[j] / np.sum(matrix[j])
        # Apply the damping factor
        scores = (1 - DAMPING) / size + DAMPING * new_scores
    return scores

The two nested loops over i and j are the most compute‑intensive part. Parallelising the outer loop, which computes each node's contributions from its incoming nodes, yields the greatest speed‑up.
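Read off the code above, each iteration applies the following update. The notation is introduced here for illustration: $M$ is the adjacency matrix, $N$ its size, $d$ the damping factor (`DAMPING`), and $s^{(t)}$ the score vector after $t$ iterations.

```latex
s_i^{(t+1)} \;=\; \frac{1-d}{N} \;+\; d \sum_{j \,:\, M_{ji} \neq 0} \frac{s_j^{(t)}}{\sum_{k} M_{jk}},
\qquad s_i^{(0)} = \frac{1}{N}
```

The inner sum runs over the nodes $j$ that link to $i$, and the denominator is the out-degree of $j$. Each $s_i^{(t+1)}$ depends only on the previous vector $s^{(t)}$, which is why the loop over $i$ can be split across workers.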
Multithreaded Implementation
We split the matrix into chunks and let each thread process a chunk, updating a shared new_scores array under a lock.
# Split the node range into roughly one chunk per thread
chunk_size = size // num_threads
chunks = [(i, min(i + chunk_size, size)) for i in range(0, size, chunk_size)]

def _thread_worker(matrix: np.ndarray, scores: np.ndarray, new_scores: np.ndarray,
                   start_idx: int, end_idx: int, lock: threading.Lock):
    size = matrix.shape[0]
    # Accumulate into a private array so the hot loop needs no locking
    local_scores = np.zeros(size)
    for i in range(start_idx, end_idx):
        incoming = np.where(matrix[:, i])[0]
        for j in incoming:
            local_scores[i] += scores[j] / np.sum(matrix[j])
    # Merge into the shared array once per chunk
    with lock:
        new_scores += local_scores

# Body of each PageRank iteration
new_scores = np.zeros(size)
lock = threading.Lock()
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    # Consume the iterator so an exception raised in a worker is re-raised here
    list(executor.map(
        lambda args: _thread_worker(*args),
        [(matrix, scores, new_scores, s, e, lock) for s, e in chunks]
    ))
new_scores = (1 - DAMPING) / size + DAMPING * new_scores
scores = new_scores

The lock protects new_scores from race conditions. Each worker takes it only once, to merge its finished chunk, so contention stays low and in practice this section scales well.
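One detail of the chunking expression is worth checking by hand: when the size is not a multiple of the worker count, it produces one extra, smaller chunk. The sketch below wraps that expression in a hypothetical helper, `make_chunks`, and adds a `max(1, ...)` guard against a zero step (an addition here, not in the article). `make_balanced_chunks` is an alternative suggested for comparison, not the author's method.

```python
def make_chunks(size: int, num_workers: int) -> list[tuple[int, int]]:
    """Chunking as in the article, plus a guard so the range step is never 0."""
    chunk_size = max(1, size // num_workers)
    return [(i, min(i + chunk_size, size)) for i in range(0, size, chunk_size)]


def make_balanced_chunks(size: int, num_workers: int) -> list[tuple[int, int]]:
    """Alternative: at most num_workers chunks whose lengths differ by at most 1."""
    base, extra = divmod(size, num_workers)
    chunks, start = [], 0
    for k in range(num_workers):
        end = start + base + (1 if k < extra else 0)
        if end > start:  # skip empty chunks when size < num_workers
            chunks.append((start, end))
        start = end
    return chunks


print(make_chunks(8, 4))            # [(0, 2), (2, 4), (4, 6), (6, 8)]
print(make_chunks(10, 3))           # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(make_balanced_chunks(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

With 8 workers and the graph sizes used in the benchmark, the extra chunk is small relative to the others, so the simpler expression is a reasonable choice there.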
Multiprocessing Implementation
Because processes cannot share memory directly, each worker returns a local score array that is later summed in the main process.
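The worker function itself is not shown in this excerpt. As a rough sketch of the return-and-sum pattern just described, the example below uses plain Python lists instead of NumPy arrays and runs the worker sequentially; `process_chunk`, `combine`, and the task tuple layout are assumptions for illustration, not the article's `_process_chunk`.

```python
def process_chunk(matrix, scores, start_idx, end_idx):
    """Return a full-length list holding only this chunk's contributions."""
    size = len(matrix)
    local_scores = [0.0] * size
    for i in range(start_idx, end_idx):
        for j in range(size):
            if matrix[j][i]:  # node j links to node i
                local_scores[i] += scores[j] / sum(matrix[j])
    return local_scores


def combine(chunk_results):
    """Element-wise sum of the per-chunk lists (NumPy arrays could use sum())."""
    return [sum(column) for column in zip(*chunk_results)]


# Tiny 3-node graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
matrix = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]
scores = [1 / 3] * 3
tasks = [(matrix, scores, 0, 2), (matrix, scores, 2, 3)]

# With a process pool, this line would become pool.starmap(process_chunk, tasks).
chunk_results = [process_chunk(*task) for task in tasks]
new_scores = combine(chunk_results)
print(new_scores)
```

Every task tuple carries the matrix and the current scores, which is exactly the data that has to be serialised and sent to each worker process.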
with multiprocessing.Pool(processes=num_processes) as pool:
    chunk_results = pool.starmap(_process_chunk, chunks)
# Combine results
new_scores = sum(chunk_results)
new_scores = (1 - DAMPING) / size + DAMPING * new_scores
scores = new_scores

Measuring Performance
We generate reproducible test graphs and benchmark the three implementations with pytest‑codspeed.
from functools import partial

import numpy as np
import pytest

def create_test_graph(size: int) -> np.ndarray:
    # Fixed seed so every run benchmarks the same graphs
    np.random.seed(0)
    # Sparse random adjacency matrix with about 5 outgoing links per node
    matrix = np.random.choice([0, 1], size=(size, size), p=[1 - 5/size, 5/size])
    # Give every node with no outgoing link one random target
    zero_outdegree = ~matrix.any(axis=1)
    zero_indices = np.where(zero_outdegree)[0]
    if len(zero_indices) > 0:
        random_targets = np.random.randint(0, size, size=len(zero_indices))
        matrix[zero_indices, random_targets] = 1
    return matrix

@pytest.mark.parametrize(
    "pagerank",
    [pagerank_single,
     partial(pagerank_multiprocess, num_processes=8),
     partial(pagerank_multithread, num_threads=8)],
    ids=["single", "8-processes", "8-threads"]
)
@pytest.mark.parametrize(
    "graph",
    [create_test_graph(100), create_test_graph(1000), create_test_graph(2000)],
    ids=["XS", "L", "XL"]
)
def test_pagerank(benchmark, pagerank, graph):
    benchmark(pagerank, graph, num_iterations=10)

A GitHub Actions workflow runs these benchmarks on CodSpeed macro runners (ARM64, 16 cores, 32 GB RAM) for Python 3.12, 3.13, and the free‑threading build (3.13t) with the GIL both enabled and disabled.
on:
  push:
jobs:
  codspeed:
    runs-on: codspeed-macro
    strategy:
      matrix:
        python-version: ["3.12", "3.13"]
        include:
          - { python-version: "3.13t", gil: "1" }
          - { python-version: "3.13t", gil: "0" }
    env:
      UV_PYTHON: ${{ matrix.python-version }}
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
      - name: Install CPython & dependencies
        run: uv sync --all-extras
      - name: Run benchmarks
        uses: CodSpeedHQ/action@v3
        env:
          PYTHON_GIL: ${{ matrix.gil }}
        with:
          run: uv run pytest --codspeed --codspeed-max-time 10 -vs src/tests.py

Results show that without the new build options, Python 3.12 and 3.13 perform similarly, while multiprocessing is slower than the single‑threaded version due to inter‑process communication overhead. The free‑threading build without GIL delivers the best performance, confirming that the GIL no longer limits parallel execution.
However, the free‑threading build also slows down other implementations because it disables the adaptive interpreter, a penalty expected to diminish in Python 3.14 when a thread‑safe adaptive interpreter is introduced.
Overall, CPython 3.13’s free‑threading mode can dramatically improve parallel workloads, offering a compelling alternative to multiprocessing. It remains experimental and not yet production‑ready, but it points toward a promising future for multithreaded Python code.
Postscript
The benchmark does not include sub‑interpreters, which gained a per‑interpreter GIL in Python 3.12 and offer another way to run code in parallel. In many cases sub‑interpreters are slower due to data‑sharing and communication costs, but they could become a viable alternative once those issues are resolved.
Future articles will cover JIT and mimalloc performance.
21CTO
