Mojo vs Python: Does the New AI Language Really Deliver 36,000× Speedup?
The article examines Modular's new Mojo language, its claim of massive performance gains over Python for AI workloads, presents benchmark code and results, discusses its origins, investment interest, and current beta status, concluding that while impressive, the 36,000× claim is overstated.
Modular Mojo is a new programming language designed for AI developers, promising to combine Python’s ease of use with C‑level performance.
The developers claim that Mojo can achieve more than a 36,000‑fold speedup over Python on a matrix‑multiplication workload.
Mojo was not part of Modular AI’s original roadmap; it emerged as a side project while founders Chris Lattner and Tim Davis were building a unified ML/AI infrastructure platform.
The "36,000×" claim comes from comparing two different scripts: a Python script that multiplies 128×128 matrices at 0.00215 GFLOP/s, and a Mojo script that performs a vectorized, parallel multiplication of 512×512 matrices at 79.636 GFLOP/s.
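The headline figure is simply the ratio of the two throughput numbers quoted above. The short Python check below uses only those two numbers and the two matrix sizes; it shows the ratio landing just above 36,000× and makes the difference in problem size explicit:

```python
# Throughput figures quoted in the article (GFLOP/s).
python_gflops = 0.00215   # pure-Python loop, 128x128 matrices
mojo_gflops = 79.636      # vectorized, parallel Mojo, 512x512 matrices

speedup = mojo_gflops / python_gflops
print(f"claimed speedup: {speedup:,.0f}x")

# The runs also differ in size: a 512x512 matmul performs
# (512/128)**3 = 64 times as many floating-point operations as a 128x128 one.
work_ratio = (512 / 128) ** 3
print(f"work ratio (512 vs 128): {work_ratio:.0f}x")
```

GFLOP/s already normalizes for the amount of work, so the division itself is sound; what the ratio mixes is two different programs running on two different problem sizes.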
Investors have shown strong interest, with Monitor reportedly committing $100 million to Modular.
Mojo does not threaten Python; it enhances Python’s capabilities and gives Python programmers “superpowers.”
Mojo is designed as a superset of Python, and its SDK is expected to run on Linux. The SDK will not be released until September, although documentation and a matrix‑multiplication example are already available.
Community members have reproduced parts of the benchmark. Below is the original Python matrix‑multiplication function:
```python
def matmul_python(C, A, B):
    # Naive triple loop: accumulates A @ B into C, one scalar at a time.
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]
```

The full Python benchmark script used for comparison:
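```python
import numpy as np
from timeit import timeit

# Thin wrapper so that C[m, n] indexes a list of rows.
class Matrix:
    def __init__(self, value, rows, cols):
        self.value = value
        self.rows = rows
        self.cols = cols

    def __getitem__(self, idxs):
        return self.value[idxs[0]][idxs[1]]

    def __setitem__(self, idxs, value):
        self.value[idxs[0]][idxs[1]] = value

def benchmark_matmul_python(M, N, K):
    # Random inputs and a zero-initialized output; matmul_python is defined above.
    A = Matrix(list(np.random.rand(M, K)), M, K)
    B = Matrix(list(np.random.rand(K, N)), K, N)
    C = Matrix(list(np.zeros((M, N))), M, N)
    # Average wall-clock time over two runs.
    secs = timeit(lambda: matmul_python(C, A, B), number=2) / 2
    # A matmul performs 2*M*N*K floating-point operations (one multiply and one add per term).
    gflops = ((2 * M * N * K) / secs) / 1e9
    print(gflops, "GFLOP/s")
    return gflops

# The original line is executed from Mojo and reads
#   python_gflops = benchmark_matmul_python(128, 128, 128).to_float64()
# where .to_float64() converts the returned Python float to a Mojo Float64.
# In plain Python the float is used directly.
python_gflops = benchmark_matmul_python(128, 128, 128)
```

The Mojo script imports several language‑specific modules and defines analogous data structures and a benchmark function: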
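```mojo
# Early Mojo preview syntax (e.g. `let`, `Benchmark`); later Mojo releases changed these APIs.
# Several of these imports are not used by the untyped benchmark below.
from benchmark import Benchmark
from sys.intrinsics import strided_load
from utils.list import VariadicList
from math import div_ceil, min
from memory import memset_zero
from memory.unsafe import DTypePointer
from random import rand, random_float64
from sys.info import simdwidthof

# Assumed definition (the article calls matmul_untyped but never shows it):
# the matmul_python loop nest pasted unchanged into a dynamically typed Mojo `def`.
def matmul_untyped(C, A, B):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

# Methods for a dynamically typed "Matrix" built from Mojo's `object`.
fn matrix_getitem(self: object, i: object) raises -> object:
    return self.value[i]

fn matrix_setitem(self: object, i: object, value: object) raises -> object:
    self.value[i] = value
    return None

fn matrix_append(self: object, value: object) raises -> object:
    self.value.append(value)
    return None

fn matrix_init(rows: Int, cols: Int) raises -> object:
    let value = object([])
    return object(
        Attr("value", value), Attr("__getitem__", matrix_getitem), Attr("__setitem__", matrix_setitem),
        Attr("rows", rows), Attr("cols", cols), Attr("append", matrix_append),
    )

def benchmark_matmul_untyped(M: Int, N: Int, K: Int, python_gflops: Float64):
    C = matrix_init(M, N)
    A = matrix_init(M, K)
    B = matrix_init(K, N)
    # Fill row by row: C with zeros, A and B with random values in [-5, 5].
    # Every row gets N entries, so this setup assumes square matrices (M = N = K).
    for i in range(M):
        c_row = object([])
        b_row = object([])
        a_row = object([])
        for j in range(N):
            c_row.append(0.0)
            b_row.append(random_float64(-5, 5))
            a_row.append(random_float64(-5, 5))
        C.append(c_row)
        B.append(b_row)
        A.append(a_row)

    # Closure handed to the benchmark runner; errors raised by the untyped code are swallowed.
    @parameter
    fn test_fn():
        try:
            _ = matmul_untyped(C, A, B)
        except:
            pass

    # The runner's result is divided by 1e9 to convert nanoseconds to seconds.
    let secs = Float64(Benchmark().run[test_fn]()) / 1_000_000_000
    _ = (A, B, C)  # keep the matrices alive until timing has finished
    let gflops = ((2*M*N*K)/secs) / 1e9
    let speedup : Float64 = gflops / python_gflops
    print(gflops, "GFLOP/s, a", speedup.value, "x speedup over Python")

benchmark_matmul_untyped(128, 128, 128, python_gflops)
```

The Mojo benchmark produced the following output: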
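```
0.029258 GFLOP/s, a 17.501798 x speedup over Python
```

With the same 128×128 matrix size, Mojo was about 17.5× faster than Python. That is far below the advertised 36,000×, most likely because the headline figure comes from the typed, vectorized, multithreaded version running on 512×512 matrices, whereas this untyped port uses none of those optimizations.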
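The reported speedup can also be read backwards. Dividing the Mojo throughput by the speedup recovers the Python baseline measured in that run, and dividing the 36,000× headline by 17.5 shows how much of the advertised gain the untyped port leaves unexplained. The inputs are only the printed output above and the article's 36,000× figure; the rest is plain arithmetic:

```python
# Values printed by the untyped Mojo benchmark.
mojo_untyped_gflops = 0.029258
untyped_speedup = 17.501798

# Python baseline implied by that run (GFLOP/s).
implied_python_gflops = mojo_untyped_gflops / untyped_speedup
print(f"implied Python baseline: {implied_python_gflops:.5f} GFLOP/s")

# Factor of the headline claim that the untyped port does not account for.
headline_speedup = 36_000
remaining_factor = headline_speedup / untyped_speedup
print(f"remaining factor: {remaining_factor:,.0f}x")
```

The implied baseline (about 0.0017 GFLOP/s) is in the same range as the 0.00215 GFLOP/s quoted earlier, as one would expect from a different run; the remaining factor of roughly 2,000× has to come from the optimized 512×512 version rather than from simply running Python code under Mojo.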
Mojo is still in beta, with a general-availability (GA) release expected in the coming weeks.
Official Mojo website: https://www.modular.com/mojo
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.