Mojo vs Python: Does the New AI Language Really Deliver 36,000× Speedup?

This article examines Modular's new Mojo language and its claim of massive performance gains over Python for AI workloads. It presents benchmark code and results, discusses the language's origins, investor interest, and current beta status, and concludes that the 36,000× claim, while impressive, is overstated.

21CTO

Modular Mojo is a new programming language designed for AI developers, promising to combine Python’s ease of use with C‑level performance.

The developers claim that Mojo can achieve more than a 36,000‑fold speedup over Python on a matrix‑multiplication workload.

Mojo was not part of Modular AI’s original roadmap; it emerged as a side project while founders Chris Lattner and Tim Davis were building a unified ML/AI infrastructure platform.

The "36,000×" claim is based on comparing two different scripts: a Python script that multiplies a 128×128 matrix at 0.00215 GFLOP/s, and a Mojo script that performs a vectorized, parallel 512×512 multiplication at 79.636 GFLOP/s.
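The arithmetic behind the headline figure can be checked from the two throughput numbers quoted in the preceding paragraph. The sketch below is illustrative only: the helper name `matmul_flops` is an assumption introduced here, and it uses the same 2·M·N·K operation count as the benchmark script.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """Operation count for an (m x k) @ (k x n) product: one multiply and one add per inner step."""
    return 2 * m * n * k

# Throughput figures as quoted in the article
python_gflops = 0.00215   # pure-Python script, 128x128 matrices
mojo_gflops = 79.636      # vectorized, parallel Mojo script, 512x512 matrices

# Ratio of the two throughput figures (the source of the headline number)
ratio = mojo_gflops / python_gflops

# How much more arithmetic the 512x512 problem contains than the 128x128 one
work_ratio = matmul_flops(512, 512, 512) / matmul_flops(128, 128, 128)

print(f"throughput ratio: {ratio:,.0f}x")
print(f"work ratio between the two problem sizes: {work_ratio:.0f}x")
```

The throughput ratio comes out slightly above 36,000, but the two sides of the division were measured on problems that differ in size by a factor of 64, so the comparison is not like-for-like.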

Investors have shown strong interest, with Monitor reportedly committing $100 million to Modular.

Mojo does not threaten Python; it enhances Python’s capabilities and gives Python programmers “superpowers.”

Mojo is designed as a superset of Python and is expected to run on any Linux platform. The SDK will not be released until September, but documentation and a matrix‑multiplication example are already available.

Community members have reproduced parts of the benchmark. Below is the original Python matrix‑multiplication function:

def matmul_python(C, A, B):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

The full Python benchmark script used for comparison:

import numpy as np
from timeit import timeit

# Requires matmul_python (defined earlier) to be in scope.

class Matrix:
    # Thin wrapper so that C[m, n] indexes into a list of rows
    def __init__(self, value, rows, cols):
        self.value = value
        self.rows = rows
        self.cols = cols

    def __getitem__(self, idxs):
        return self.value[idxs[0]][idxs[1]]

    def __setitem__(self, idxs, value):
        self.value[idxs[0]][idxs[1]] = value

def benchmark_matmul_python(M, N, K):
    A = Matrix(list(np.random.rand(M, K)), M, K)
    B = Matrix(list(np.random.rand(K, N)), K, N)
    C = Matrix(list(np.zeros((M, N))), M, N)
    # Average the wall-clock time over two runs
    secs = timeit(lambda: matmul_python(C, A, B), number=2) / 2
    # A matrix product performs 2*M*N*K floating-point operations
    gflops = ((2 * M * N * K) / secs) / 1e9
    print(gflops, "GFLOP/s")
    return gflops

# The function returns a plain Python float, which has no .to_float64() method
python_gflops = benchmark_matmul_python(128, 128, 128)
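Before trusting a throughput number, it helps to confirm that the triple loop actually computes a matrix product. The following is a minimal sketch, not part of the original benchmark: it applies the same m-k-n loop order directly to NumPy arrays instead of the `Matrix` wrapper, and the names `matmul_naive` and `gflops` as well as the 16×16 size are assumptions chosen so the check runs quickly.

```python
import numpy as np
from timeit import timeit

def matmul_naive(C, A, B):
    """Accumulate A @ B into C using the same m-k-n loop order as matmul_python."""
    M, K = A.shape
    N = B.shape[1]
    for m in range(M):
        for k in range(K):
            for n in range(N):
                C[m, n] += A[m, k] * B[k, n]

def gflops(M, N, K, secs):
    """Throughput in GFLOP/s, counting 2*M*N*K operations per product."""
    return (2 * M * N * K) / secs / 1e9

rng = np.random.default_rng(0)
M = N = K = 16  # deliberately small; the article's runs use 128
A = rng.random((M, K))
B = rng.random((K, N))

# Correctness: the loop result should agree with NumPy's built-in product
C = np.zeros((M, N))
matmul_naive(C, A, B)
matches = np.allclose(C, A @ B)

# Timing: use a fresh zero matrix per run so results do not accumulate
def run_once():
    matmul_naive(np.zeros((M, N)), A, B)

secs = timeit(run_once, number=2) / 2
print("matches NumPy:", matches)
print(gflops(M, N, K, secs), "GFLOP/s")
```

The absolute GFLOP/s value depends heavily on the machine and on the matrix size, so it should not be compared directly with the figures quoted in the article.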

The Mojo script imports several language‑specific modules and defines analogous data structures and a benchmark function. It calls matmul_untyped, whose definition is not reproduced in this article:

from benchmark import Benchmark
from sys.intrinsics import strided_load
from utils.list import VariadicList
from math import div_ceil, min
from memory import memset_zero
from memory.unsafe import DTypePointer
from random import rand, random_float64
from sys.info import simdwidthof
fn matrix_getitem(self: object, i: object) raises -> object:
    return self.value[i]
fn matrix_setitem(self: object, i: object, value: object) raises -> object:
    self.value[i] = value
    return None
fn matrix_append(self: object, value: object) raises -> object:
    self.value.append(value)
    return None
fn matrix_init(rows: Int, cols: Int) raises -> object:
    let value = object([])
    return object(
        Attr("value", value), Attr("__getitem__", matrix_getitem), Attr("__setitem__", matrix_setitem), 
        Attr("rows", rows), Attr("cols", cols), Attr("append", matrix_append),
    )

def benchmark_matmul_untyped(M: Int, N: Int, K: Int, python_gflops: Float64):
    C = matrix_init(M, N)
    A = matrix_init(M, K)
    B = matrix_init(K, N)
    for i in range(M):
        c_row = object([])
        b_row = object([])
        a_row = object([])
        for j in range(N):
            c_row.append(0.0)
            b_row.append(random_float64(-5, 5))
            a_row.append(random_float64(-5, 5))
        C.append(c_row)
        B.append(b_row)
        A.append(a_row)
    @parameter
    fn test_fn():
        try:
            _ = matmul_untyped(C, A, B)
        except:
            pass
    let secs = Float64(Benchmark().run[test_fn]()) / 1_000_000_000
    _ = (A, B, C)
    let gflops = ((2*M*N*K)/secs) / 1e9
    let speedup : Float64 = gflops / python_gflops
    print(gflops, "GFLOP/s, a", speedup.value, "x speedup over Python")
benchmark_matmul_untyped(128, 128, 128, python_gflops)

The Mojo benchmark produced the following output:

0.029258 GFLOP/s, a 17.501798 x speedup over Python

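When using the same 128×128 matrix size, Mojo was about 17.5× faster than Python, significantly less than the advertised 36,000×. The likely reason is that the advertised figure comes from the vectorized, parallel version of the Mojo code, whereas this untyped benchmark uses neither vectorization nor multithreading.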
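The reported output can be sanity-checked with a little arithmetic. This sketch only rearranges the numbers printed by the Mojo benchmark; the variable names are assumptions introduced for illustration.

```python
# Values taken from the reported benchmark output
mojo_untyped_gflops = 0.029258
reported_speedup = 17.501798
advertised_speedup = 36_000

# Python baseline implied by the printed speedup (speedup = mojo / python)
implied_python_gflops = mojo_untyped_gflops / reported_speedup

# Factor still separating the reproduced result from the advertised one
remaining_gap = advertised_speedup / reported_speedup

print(f"implied Python baseline: {implied_python_gflops:.5f} GFLOP/s")
print(f"advertised / reproduced speedup: {remaining_gap:,.0f}x")
```

The implied baseline of roughly 0.00167 GFLOP/s is lower than the 0.00215 GFLOP/s quoted for the headline comparison, which suggests the two Python measurements come from different runs. The reproduced speedup falls short of the advertised one by a factor of about two thousand.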

The language is still in beta, with a GA release expected in the coming weeks.

Official Mojo website: https://www.modular.com/mojo
