How CPython 3.13’s Free‑Threading Boosts Parallel Performance (and What It Means for Your Code)

The article examines CPython 3.13’s new free‑threading mode, its impact on the Global Interpreter Lock, benchmark results using a PageRank example, and practical multithreaded and multiprocessing implementations to show how performance can dramatically improve on modern multicore CPUs.

21CTO

Nov 13, 2024

CPython 3.13, released on October 7, 2024, is the most performance-focused version to date. Its key performance-related changes are:

Free‑threading mode that can run without the Global Interpreter Lock (GIL)

A brand‑new Just‑In‑Time (JIT) compiler

Bundled mimalloc memory allocator

This article concentrates on the free‑threading mode, showing how to exploit it and how it affects Python application performance when measured with CodSpeed.

Free‑Threading in CPython

Free‑threading is an experimental feature in Python 3.13 that allows CPython to run without the GIL. The GIL is a mutex that prevents multiple threads from executing Python bytecode simultaneously, simplifying memory management but becoming a major bottleneck on modern multicore processors.
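Before benchmarking, it helps to confirm which interpreter is actually running. The sketch below reports whether the build supports free-threading and whether the GIL is currently active. It relies on `sys._is_gil_enabled()`, which exists only from Python 3.13 onward, so the fallback for older versions is an assumption that the GIL is always on. The helper name `gil_status` is made up for this example.

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this build supports free-threading and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds (e.g. python3.13t), otherwise 0 or None
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in 3.13; earlier versions always run with the GIL
    is_gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
    return {
        "version": sys.version.split()[0],
        "free_threaded_build": free_threaded_build,
        "gil_enabled": bool(is_gil_enabled()),
    }


print(gil_status())
```

On a free-threaded build the GIL can still be switched back on at startup with the `PYTHON_GIL=1` environment variable or the `-X gil=1` option, which is why the two fields are reported separately.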

Multiprocessing as a Traditional Work‑around

Historically, developers used the multiprocessing module, which spawns separate processes. This approach incurs significant memory overhead, inter‑process communication costs, and slower startup times.
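One source of the inter-process communication cost is serialization: arguments and results passed through a `multiprocessing.Pool` are pickled on one side and unpickled on the other, whereas threads simply share references. The following sketch measures that round trip in isolation. The helper name `pickle_round_trip` and the payload size are illustrative choices, not taken from the benchmark.

```python
import pickle
import time


def pickle_round_trip(obj):
    """Serialize and deserialize obj; return (copy, size in bytes, seconds taken)."""
    start = time.perf_counter()
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    return restored, len(blob), time.perf_counter() - start


# Hypothetical payload: one million floats standing in for a large score vector
payload = [float(i) for i in range(1_000_000)]
_, size, seconds = pickle_round_trip(payload)
print(f"serialized size: {size / 1e6:.1f} MB, round trip: {seconds * 1e3:.1f} ms")
```

In a process pool this cost is paid for every task sent and every result returned, on top of process start-up.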

Real‑World Impact: PageRank Example

PageRank is a compute‑intensive, matrix‑heavy algorithm that benefits greatly from parallelisation. In CPython 3.12 and earlier, a naïve multithreaded implementation is throttled by the GIL, while a multiprocessing version suffers from the overheads mentioned above.
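To observe the throttling on your own machine, a minimal sketch like the one below can help. It runs the same CPU-bound pure-Python function once per thread and reports the wall-clock time. The workload (`count_primes`) and the helper `timed_run` are invented for this illustration and are unrelated to the PageRank benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit: int) -> int:
    """CPU-bound pure-Python work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_run(num_threads: int, limit: int = 20_000) -> float:
    """Run count_primes once per thread and return the wall-clock time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        results = list(executor.map(count_primes, [limit] * num_threads))
    assert len(set(results)) == 1  # every thread did identical work
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} thread(s): {timed_run(n):.2f} s")
```

Each thread does the same amount of work, so with the GIL the total time should grow roughly in proportion to the thread count, while on a free-threaded build with the GIL disabled it should stay closer to flat as long as cores are available. Actual numbers depend on the machine.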

Single‑Threaded Reference Implementation

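import numpy as np

# Damping factor. The value is not shown in this excerpt; 0.85 is the
# conventional choice and is assumed here.
DAMPING = 0.85

def pagerank_single(matrix: np.ndarray, num_iterations: int) -> np.ndarray:
    """Single-threaded PageRank implementation"""
    size = matrix.shape[0]
    # Start from a uniform distribution over all nodes
    scores = np.ones(size) / size
    for _ in range(num_iterations):
        new_scores = np.zeros(size)
        for i in range(size):
            # Nodes j that link to node i (non-zero entries in column i)
            incoming = np.where(matrix[:, i])[0]
            for j in incoming:
                # Node j splits its score evenly across its outgoing links
                new_scores[i] += scores[j] / np.sum(matrix[j])
        # Apply the damping factor
        scores = (1 - DAMPING) / size + DAMPING * new_scores
    return scores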

Almost all of the work happens in the nested loops that accumulate the contributions from incoming nodes. Parallelising the outer loop over i yields the greatest speed-up, because each node's new score can be computed independently of the others.
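Written as a formula, one iteration of the code above performs the update below. The symbols are introduced here for illustration: A is the adjacency matrix (`matrix`), N is the number of nodes (`size`), d is the damping factor (`DAMPING`), and s^(t) is the score vector after t iterations.

```latex
s_i^{(t+1)} = \frac{1 - d}{N} + d \sum_{j \,:\, A_{ji} \neq 0} \frac{s_j^{(t)}}{\sum_{k} A_{jk}},
\qquad s_i^{(0)} = \frac{1}{N}
```

The sum over j corresponds to the inner loop over `incoming`, and the denominator is the out-degree computed by `np.sum(matrix[j])`. Each s_i^(t+1) depends only on the previous iteration's scores, which is what makes the loop over i safe to split across workers.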

Multithreaded Implementation

We split the range of nodes into chunks and let each thread process one chunk. A thread accumulates its contributions in a local array and then adds that array to the shared new_scores array under a lock.

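import concurrent.futures
import threading

import numpy as np

def _thread_worker(matrix: np.ndarray, scores: np.ndarray, new_scores: np.ndarray,
                   start_idx: int, end_idx: int, lock: threading.Lock) -> None:
    size = matrix.shape[0]
    # Accumulate into a thread-local array so the lock is taken only once
    local_scores = np.zeros(size)
    for i in range(start_idx, end_idx):
        incoming = np.where(matrix[:, i])[0]
        for j in incoming:
            local_scores[i] += scores[j] / np.sum(matrix[j])
    with lock:
        new_scores += local_scores

def pagerank_multithread(matrix: np.ndarray, num_iterations: int, num_threads: int) -> np.ndarray:
    """Multithreaded PageRank; DAMPING is the module-level damping factor."""
    size = matrix.shape[0]
    scores = np.ones(size) / size
    # max(1, ...) avoids a zero step when there are fewer nodes than threads
    chunk_size = max(1, size // num_threads)
    chunks = [(i, min(i + chunk_size, size)) for i in range(0, size, chunk_size)]
    lock = threading.Lock()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        for _ in range(num_iterations):
            new_scores = np.zeros(size)
            # list() drains the iterator, so worker exceptions are re-raised here
            list(executor.map(
                lambda args: _thread_worker(*args),
                [(matrix, scores, new_scores, s, e, lock) for s, e in chunks],
            ))
            scores = (1 - DAMPING) / size + DAMPING * new_scores
    return scores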

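The lock protects new_scores from race conditions. Each thread acquires it only once per iteration, after finishing its chunk, so contention stays low and this section scales well in practice.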
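The way the range is split deserves a closer look. Splitting with floor division can leave a small remainder chunk, so 100 nodes over 8 workers gives 9 chunks of at most 12 nodes, and a graph with fewer nodes than workers would produce a chunk size of zero. The sketch below uses ceiling division instead. The helper name `make_chunks` is an assumption for this example and does not appear in the benchmark code.

```python
def make_chunks(size: int, num_workers: int) -> list[tuple[int, int]]:
    """Split range(size) into at most num_workers contiguous (start, end) chunks."""
    if size <= 0 or num_workers <= 0:
        return []
    # Ceiling division: never more chunks than workers, never a zero step
    chunk_size = -(-size // num_workers)
    return [(start, min(start + chunk_size, size)) for start in range(0, size, chunk_size)]


print(make_chunks(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(make_chunks(3, 8))   # [(0, 1), (1, 2), (2, 3)]
```

Keeping the number of chunks at or below the number of workers means no worker has to pick up a leftover chunk after the others have finished.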

Multiprocessing Implementation

Because processes cannot share memory directly, each worker returns a local score array that is later summed in the main process.

with multiprocessing.Pool(processes=num_processes) as pool:
    # Each entry of chunks holds the arguments for one _process_chunk call
    chunk_results = pool.starmap(_process_chunk, chunks)
    # Combine results
    new_scores = sum(chunk_results)
    new_scores = (1 - DAMPING) / size + DAMPING * new_scores
    scores = new_scores

Measuring Performance

We generate reproducible test graphs and benchmark the three implementations with pytest‑codspeed.

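def create_test_graph(size: int) -> np.ndarray:
    """Build a reproducible random adjacency matrix with about 5 outgoing links per node."""
    # Fixed seed so every run benchmarks the same graph
    np.random.seed(0)
    # Each possible edge exists with probability 5/size
    matrix = np.random.choice([0, 1], size=(size, size), p=[1 - 5/size, 5/size])
    # Give nodes without outgoing links one random link, so the out-degree
    # used as a divisor in PageRank is never zero
    zero_outdegree = ~matrix.any(axis=1)
    zero_indices = np.where(zero_outdegree)[0]
    if len(zero_indices) > 0:
        random_targets = np.random.randint(0, size, size=len(zero_indices))
        matrix[zero_indices, random_targets] = 1
    return matrix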

from functools import partial

import pytest

# Every implementation is benchmarked against every graph size (3 x 3 runs)
@pytest.mark.parametrize(
    "pagerank",
    [pagerank_single,
     partial(pagerank_multiprocess, num_processes=8),
     partial(pagerank_multithread, num_threads=8)],
    ids=["single", "8-processes", "8-threads"]
)
@pytest.mark.parametrize(
    "graph",
    [create_test_graph(100), create_test_graph(1000), create_test_graph(2000)],
    ids=["XS", "L", "XL"]
)
def test_pagerank(benchmark, pagerank, graph):
    # benchmark is the fixture provided by pytest-codspeed
    benchmark(pagerank, graph, num_iterations=10)

A GitHub Actions workflow runs these benchmarks on CodSpeed macro runners (ARM64, 16 cores, 32 GB RAM) for Python 3.12, 3.13, and the free‑threading builds (3.13t) with both GIL enabled and disabled.

on:
  push:
jobs:
  codspeed:
    runs-on: codspeed-macro
    strategy:
      matrix:
        python-version: ["3.12", "3.13"]
        include:
          - { python-version: "3.13t", gil: "1" }
          - { python-version: "3.13t", gil: "0" }
    env:
      UV_PYTHON: ${{ matrix.python-version }}
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
      - name: Install CPython & dependencies
        run: uv sync --all-extras
      - name: Run benchmarks
        uses: CodSpeedHQ/action@v3
        env:
          PYTHON_GIL: ${{ matrix.gil }}
        with:
          run: uv run pytest --codspeed --codspeed-max-time 10 -vs src/tests.py

Results show that without the new build options, Python 3.12 and 3.13 perform similarly, while multiprocessing is slower than the single‑threaded version due to inter‑process communication overhead. The free‑threading build without GIL delivers the best performance, confirming that the GIL no longer limits parallel execution.

However, the free-threading build also slows down the other implementations, because in 3.13 it disables the specializing adaptive interpreter. This penalty is expected to diminish in Python 3.14, when a thread-safe adaptive interpreter is introduced.

Overall, CPython 3.13’s free‑threading mode can dramatically improve parallel workloads, offering a compelling alternative to multiprocessing. It remains experimental and not yet production‑ready, but it points toward a promising future for multithreaded Python code.

Postscript

The benchmark does not include sub-interpreters, which since Python 3.12 can each run with their own GIL and so offer another way to execute Python code in parallel. In many cases sub-interpreters are slower due to data-sharing and communication costs, but they could become a viable alternative once those issues are resolved.

Future articles will cover JIT and mimalloc performance.
