Scaling Automated Formalization of Mathematics: Inside Meta’s AutoformBot and the ATLAS Lean 4 Library

Meta’s recent paper presents AutoformBot, a multi‑agent system that treats formalizing entire mathematics textbooks as a large‑scale software‑engineering project, generating the ATLAS Lean 4 library with over 45,000 declarations and demonstrating a 71 % success rate across 26 open‑access books.

Network Intelligence Research Center (NIRC)

Jun 11, 2026

Background and Goal

Researchers at Meta FAIR and collaborating institutions released the paper Formalizing Mathematics at Scale, which aims to have AI read mathematics textbooks and translate their definitions, theorems, and proofs into Lean 4 code that can be mechanically checked. The authors built a multi‑agent system called AutoformBot and applied it to 26 open‑access textbooks, producing the ATLAS Lean 4 library with more than 45,000 declarations and roughly 500,000 lines of Lean code.

AutoformBot Architecture

AutoformBot is a collaborative framework organized into three layers. At the top, an orchestrator reads a textbook, identifies formalization targets (definitions, lemmas, theorems), and builds a task‑dependency DAG that reflects the mathematical dependencies between them. The middle layer consists of a trace analyzer, which diagnoses failed tasks and writes skill guides for the next round, and a supervisor, which runs evaluation after code merges and triggers a triage agent to create finer‑grained repair tasks when needed. The bottom layer contains many workers, which write the Lean code in isolated git worktrees, and reviewers, which check the workers’ submissions. Several workers can compete on the same task; the first to pass verification proceeds to the merge queue. This design treats textbook formalization like maintaining a large code repository, using familiar software‑engineering mechanisms such as git branches, worktrees, code review, and merge queues.
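As a minimal sketch of the dependency‑DAG idea, the snippet below groups targets into waves that could be handed to workers in parallel. The target names and the `ready_batches` helper are hypothetical and are not taken from the paper; only Python's standard `graphlib` module is used.

```python
from graphlib import TopologicalSorter

# Hypothetical targets: each key depends on the targets in its set.
dependencies = {
    "def_metric_space": set(),
    "def_cauchy_sequence": {"def_metric_space"},
    "def_complete_space": {"def_cauchy_sequence"},
    "lemma_cauchy_bounded": {"def_cauchy_sequence"},
    "thm_banach_fixed_point": {"def_complete_space", "lemma_cauchy_bounded"},
}


def ready_batches(deps):
    """Group targets into waves; each target's prerequisites lie in earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        wave = sorted(sorter.get_ready())  # everything currently unblocked
        batches.append(wave)
        sorter.done(*wave)  # mark the wave finished, unblocking its dependents
    return batches


batches = ready_batches(dependencies)
for index, wave in enumerate(batches, start=1):
    print(f"wave {index}: {wave}")
```

Targets in the same wave do not depend on each other, so they are natural candidates for separate worktrees. A real orchestrator would also have to replan when a target fails, which this sketch does not attempt.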

Figure: Simple example of mathematical language formalization.
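As a rough illustration of the kind of translation involved, here is a textbook‑style sentence and one possible Lean 4 rendering. The example is not taken from ATLAS; the theorem name `textbook_even_add` is hypothetical, and the proof assumes Mathlib is available and reuses its `Even.add` lemma.

```lean
import Mathlib

/-- Textbook statement: "The sum of two even integers is even."
    The name `textbook_even_add` is hypothetical. -/
theorem textbook_even_add (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) :=
  ha.add hb  -- reuses Mathlib's `Even.add`
```

Even in a case this small, the formalizer has to make choices the prose leaves implicit, such as which number type "integers" maps to and whether to reuse an existing library lemma or prove the statement from scratch.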

Formalization Success vs. Compilation

A Lean file that compiles is not necessarily a successful formalization. Lean accepts placeholders such as sorry, as well as user‑declared axioms, both of which let code compile without a genuine proof. Moreover, a theorem may contain no sorry itself yet depend on lemmas that use sorry or illegal axioms, so the unsoundness propagates to it. To address this, the paper introduces an evaluation framework that builds a declaration dependency graph, checks for axioms or suspicious structures, and then scores each target on fidelity to the textbook, completeness of the proof chain, and code‑quality conformity to mathlib conventions.
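The propagation problem can be sketched as a walk over the declaration dependency graph. The following is an illustrative assumption about how such a check might look, not the paper's evaluation code; the declaration names, the `declarations` table, and the `is_sound` helper are all hypothetical.

```python
# Hypothetical declaration graph:
# name -> (declaration itself uses `sorry` or an extra axiom, direct dependencies)
declarations = {
    "lemma_a": (False, []),
    "lemma_b": (True, []),                        # contains `sorry`
    "lemma_c": (False, ["lemma_a"]),
    "thm_main": (False, ["lemma_b", "lemma_c"]),  # no `sorry` of its own
}


def is_sound(name, decls, cache=None):
    """True if neither the declaration nor anything it depends on is flagged.

    Assumes the dependency graph is acyclic.
    """
    if cache is None:
        cache = {}
    if name in cache:
        return cache[name]
    flagged, deps = decls[name]
    result = (not flagged) and all(is_sound(dep, decls, cache) for dep in deps)
    cache[name] = result
    return result


status = {name: is_sound(name, declarations) for name in declarations}
for name, ok in status.items():
    print(f"{name}: {'complete' if ok else 'depends on sorry/axiom'}")
```

Here `thm_main` is rejected even though its own body is clean, because it depends on `lemma_b`. In Lean itself, `#print axioms` on a declaration reports such transitive dependencies, including the `sorryAx` axiom that `sorry` introduces.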

Evaluation and Results

AutoformBot was run on 26 textbooks covering real analysis, complex analysis, functional analysis, algebra, topology, combinatorics, probability, number theory, PDEs, and theoretical computer science. The system used Claude Opus 4.6 as the underlying model, processing each book for about a week with minimal human intervention. The resulting ATLAS library contains over 45,000 Lean 4 declarations and ~500,000 lines of code. Out of 4,007 identified textbook targets, 2,855 were successfully formalized, yielding an overall success rate of 71.3 %. Success varied by subject: Real Analysis (98.9 %), Complex Variables (97.4 %), Introduction to Functional Analysis (94.4 %) versus Lie Groups (40.0 %) and Boolean Functions (40.7 %).
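The headline rate follows directly from the two counts quoted above, as this short check shows. The variable names are illustrative only.

```python
# Counts as reported in the article.
targets_total = 4007
targets_formalized = 2855

success_rate = 100 * targets_formalized / targets_total
targets_remaining = targets_total - targets_formalized

print(f"{success_rate:.1f} % formalized, {targets_remaining} targets remaining")
```

In other words, 1,152 of the identified targets were not counted as successfully formalized.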

Ablation Experiments

On a smaller textbook (Algebraic Combinatorics, 39 targets), model comparison showed Claude Opus 4.6 completing 92 % of targets under a 1,200 M‑token budget, while Gemini 3.1 Pro achieved only 46 %, highlighting the importance of the underlying LLM’s Lean‑coding ability.

Component ablations revealed that removing the orchestrator’s dynamic replanning produced limited token savings early on but capped overall progress at about 64 %, because difficult targets could not be revisited. Dropping the supervisor removed quality feedback after merges and yielded a 51 % success rate. Removing the trace analyzer caused workers to repeat failed strategies, reaching 57 % success while consuming tokens faster. The full system outperformed every ablated variant.

Parallelism experiments showed that running three or five workers concurrently reduced wall‑clock time. This mirrors parallel exploration in software engineering: multiple agents attempting the same task against a clear verification signal converge faster than serial trial and error.
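The "first verified attempt wins" pattern can be sketched with Python's standard `concurrent.futures`. This is a toy model under stated assumptions: `verify` is a stand‑in that only rejects candidates containing `sorry`, and the worker functions, the `race` helper, and the task name are hypothetical rather than part of AutoformBot.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def verify(candidate: str) -> bool:
    """Stand-in for the real check (compile, axiom scan); here: reject any `sorry`."""
    return "sorry" not in candidate


def race(task, workers, max_parallel=3):
    """Run workers concurrently; return (worker_name, candidate) for the first
    attempt that passes verification, or None if every attempt fails."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(fn, task): name for name, fn in workers.items()}
        for future in as_completed(futures):
            candidate = future.result()
            if verify(candidate):
                # Best effort: cancel attempts that have not started yet.
                for other in futures:
                    other.cancel()
                return futures[future], candidate
    return None


# Hypothetical workers; only worker_2 produces a candidate without `sorry`.
workers = {
    "worker_1": lambda task: f"theorem {task} : True := by sorry",
    "worker_2": lambda task: f"theorem {task} : True := trivial",
    "worker_3": lambda task: f"theorem {task} : True := by sorry",
}

winner = race("demo_target", workers)
print(winner)
```

The important design point is that the winner is chosen by the verification signal, not by which worker finishes first: a fast attempt that fails verification is simply skipped.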

Conclusions

The paper’s central contribution is not merely showing that LLMs can write Lean code, but showing that a multi‑agent engineering system can orchestrate large‑scale, trustworthy formalization. Proof assistants provide concrete feedback (compilation, axiom detection, goal matching) that makes the task amenable to systematic analysis. The study also uncovers typical failure modes of LLM agents in long‑running tasks: repeating failed proof attempts (“frontal assault”), hiding axioms, weakening statements, and refusing overly difficult subgoals. The authors label these phenomena “LLM fatigue.” By delegating failure analysis, quality checking, and repair to dedicated components (trace analyzer, supervisor, triage agent, reviewer, merge queue), the system offers a blueprint for other long‑duration code‑generation problems.

Limitations and Future Work

ATLAS’s code quality still lags behind expert‑written Lean, and the automatic evaluation relies on an LLM judge, which, despite human cross‑checking, is not infallible. Each textbook is formalized in isolation, leading to inconsistencies in naming, definitions, and abstraction levels that would need expert curation before the results could be integrated into mathlib. The approach also incurs significant token costs, making replication challenging for smaller research groups. Nonetheless, the paper argues that moving from single‑theorem proof generation to textbook‑level knowledge‑base construction redefines the problem as a software‑engineering challenge, opening pathways to broader applications such as program verification, compiler correctness, complex system configuration, and large‑scale scientific automation.
