Scaling Automated Formalization of Mathematics: Inside Meta’s AutoformBot and the ATLAS Lean 4 Library
Meta’s recent paper presents AutoformBot, a multi‑agent system that treats formalizing entire mathematics textbooks as a large‑scale software‑engineering project, generating the ATLAS Lean 4 library with over 45,000 declarations and demonstrating a 71 % success rate across 26 open‑access books.
Background and Goal
Meta FAIR and other institutions released the paper "Formalizing Mathematics at Scale", which aims to enable AI to read mathematics textbooks and translate definitions, theorems, and proofs into Lean 4 code that can be mechanically checked. The authors built a multi‑agent system called AutoformBot and applied it to 26 open‑access textbooks, producing the ATLAS Lean 4 library with more than 45,000 declarations and roughly 500,000 lines of Lean code.
AutoformBot Architecture
AutoformBot is a collaborative framework organized into three layers. The top‑level orchestrator reads a textbook, identifies formalization targets (definitions, lemmas, theorems), and builds a task‑dependency DAG that reflects mathematical dependencies. The middle layer consists of a trace analyzer, which diagnoses failed tasks and writes skill guides for the next round, and a supervisor, which runs evaluation after code merges and triggers a triage agent to create finer‑grained repair tasks when needed. The bottom layer contains many workers that write the Lean code in isolated git worktrees, and reviewers that check the workers’ submissions. Workers can compete on the same task; the first to pass verification proceeds to the merge queue. This design treats textbook formalization as maintaining a large code repository, using familiar software‑engineering mechanisms such as git branches, worktrees, code review, and merge queues.
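To make the task‑dependency DAG concrete, here is a minimal Python sketch of how formalization targets could be grouped into waves that workers may attempt in parallel. The target names and the schedule_in_waves helper are hypothetical illustrations, not taken from the paper; only the idea of ordering targets by mathematical dependency comes from the description above.

```python
from graphlib import TopologicalSorter

# Hypothetical targets: each name maps to the set of targets it depends on.
dependencies = {
    "def_metric_space": set(),
    "def_cauchy_sequence": {"def_metric_space"},
    "def_complete_space": {"def_cauchy_sequence"},
    "lemma_cauchy_bounded": {"def_cauchy_sequence"},
    "thm_banach_fixed_point": {"def_complete_space", "lemma_cauchy_bounded"},
}


def schedule_in_waves(deps):
    """Group targets into waves whose prerequisites are all already done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # targets with no pending prerequisites
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = schedule_in_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Targets within one wave do not depend on each other, so they could be handed to separate workers at the same time. A real orchestrator would also have to handle failed targets and replanning, which this sketch leaves out.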
Formalization Success vs. Compilation
A Lean file that compiles is not necessarily a successful formalization. Lean allows placeholders such as sorry, as well as user‑declared axioms, which let code compile without a genuine proof. Moreover, a theorem may contain no sorry itself yet depend on lemmas that use sorry or disallowed axioms, so the unsoundness propagates to everything built on them. To address this, the paper introduces an evaluation framework that builds a declaration dependency graph, checks for axioms or suspicious structures, and then scores each target on fidelity to the textbook, completeness of the proof chain, and code‑quality conformity to mathlib conventions.
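The propagation problem can be seen in a small Lean 4 sketch. The theorem names helper and uses_helper are hypothetical and are not taken from ATLAS; #print axioms is a standard Lean 4 command, but whether the paper's evaluation framework uses it in this way is an assumption.

```lean
-- A lemma whose proof is only a placeholder; Lean accepts it with a warning.
theorem helper (a b : Nat) : a + b = b + a := by
  sorry

-- This theorem contains no `sorry` of its own, but it relies on `helper`.
theorem uses_helper (a b : Nat) : b + a = a + b :=
  (helper a b).symm

-- Lists the axioms the proof depends on, transitively.
-- Here the list includes `sorryAx`, inherited from `helper`.
#print axioms uses_helper
```

Both declarations compile, so a check that only looks at compilation, or only searches the text of uses_helper for sorry, would count it as proved. Following dependencies transitively is what exposes the gap.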
Evaluation and Results
AutoformBot was run on 26 textbooks covering real analysis, complex analysis, functional analysis, algebra, topology, combinatorics, probability, number theory, PDEs, and theoretical computer science. The system used Claude Opus 4.6 as the underlying model, processing each book for about a week with minimal human intervention. The resulting ATLAS library contains over 45,000 Lean 4 declarations and ~500,000 lines of code. Out of 4,007 identified textbook targets, 2,855 were successfully formalized, yielding an overall success rate of 71.3 %. Success varied by subject: Real Analysis (98.9 %), Complex Variables (97.4 %), Introduction to Functional Analysis (94.4 %) versus Lie Groups (40.0 %) and Boolean Functions (40.7 %).
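As a quick arithmetic check of the headline figure, the following Python snippet recomputes the overall success rate from the two counts reported above. The variable names are illustrative only.

```python
# Counts reported in the article.
targets_total = 4007
targets_formalized = 2855

targets_remaining = targets_total - targets_formalized
success_rate = 100 * targets_formalized / targets_total

print(f"formalized: {targets_formalized} of {targets_total}")
print(f"not formalized: {targets_remaining}")
print(f"success rate: {success_rate:.1f} %")
```

The result rounds to 71.3 %, consistent with the stated overall rate, and leaves 1,152 targets that were not successfully formalized.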
Ablation Experiments
On a smaller textbook (Algebraic Combinatorics, 39 targets), model comparison showed Claude Opus 4.6 completing 92 % of targets under a 1,200 M‑token budget, while Gemini 3.1 Pro achieved only 46 %, highlighting the importance of the underlying LLM’s Lean‑coding ability.
Component ablations showed that removing the orchestrator’s dynamic replanning yielded only limited early token savings and capped overall progress at ~64 %, because difficult targets could not be revisited. Dropping the supervisor removed quality feedback after merges and yielded a 51 % success rate. Removing the trace analyzer caused workers to repeat failed strategies, reaching 57 % success while consuming tokens faster. The full system consistently outperformed each ablated variant.
Parallelism experiments showed that running three or five workers concurrently reduced wall‑clock time, mirroring parallel exploration in software engineering: multiple agents attempting the same task against a clear verification signal converge faster than serial trial‑and‑error.
Conclusions
The key contribution is not merely that LLMs can write Lean code, but that a multi‑agent engineering system can orchestrate large‑scale, trustworthy formalization. Proof assistants provide concrete feedback (compilation, axiom detection, goal matching) that makes the task amenable to systematic analysis. The study also uncovers typical failure modes of LLM agents in long‑running tasks: repeating failed proof attempts ("frontal assault"), hiding axioms, weakening statements, or refusing overly difficult subgoals. The authors label these phenomena “LLM fatigue.” By delegating failure analysis, quality checking, and repair to dedicated components (trace analyzer, supervisor, triage agent, reviewer, merge queue), the system offers a blueprint for other long‑duration code‑generation problems.
Limitations and Future Work
ATLAS’s code quality still lags behind expert‑written Lean, and the automatic evaluation relies on an LLM judge, which, despite human cross‑checking, is not infallible. Each textbook is formalized in isolation, leading to inconsistencies in naming, definitions, and abstraction levels that would need expert curation before integration into mathlib. The approach also incurs significant token costs, making replication challenging for smaller research groups. Nonetheless, the paper argues that moving from single‑theorem proof generation to textbook‑level knowledge‑base construction redefines the problem as a software‑engineering challenge, opening pathways for broader applications such as program verification, compiler correctness, complex system configuration, and large‑scale scientific automation.
Related Links
https://arxiv.org/abs/2605.29955v1
https://github.com/facebookresearch/atlas-lean
https://github.com/facebookresearch/autoform-bot
Network Intelligence Research Center (NIRC)
NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.