Scaling Automated Formalization of Mathematics: Inside Meta’s AutoformBot and the ATLAS Lean 4 Library

Meta’s recent paper presents AutoformBot, a multi-agent system that treats the formalization of entire mathematics textbooks as a large-scale software-engineering project. Applied to 26 open-access books, it produced the ATLAS Lean 4 library, with over 45,000 declarations, and successfully formalized 71 % of the identified targets.

Network Intelligence Research Center (NIRC)

Background and Goal

Researchers at Meta FAIR and other institutions released the paper Formalizing Mathematics at Scale, which aims to enable AI to read mathematics textbooks and translate their definitions, theorems, and proofs into Lean 4 code that can be mechanically checked. The authors built a multi-agent system called AutoformBot and applied it to 26 open-access textbooks, producing the ATLAS Lean 4 library with more than 45,000 declarations and roughly 500,000 lines of Lean code.
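
To make the idea of translating a textbook statement into Lean 4 concrete, here is a minimal sketch. It is not taken from ATLAS: the names `IsEven` and `isEven_add` are hypothetical, and the example deliberately uses only core Lean 4 rather than mathlib. It formalizes the sentence "the sum of two even natural numbers is even".

```lean
-- Textbook definition: a natural number n is even if n = 2 * k for some k.
-- (`IsEven` is a hypothetical name chosen for this illustration.)
def IsEven (n : Nat) : Prop := ∃ k, n = 2 * k

-- Textbook theorem: the sum of two even numbers is even.
theorem isEven_add {m n : Nat} (hm : IsEven m) (hn : IsEven n) :
    IsEven (m + n) :=
  match hm, hn with
  -- Unpack the witnesses a and b, then supply a + b as the new witness.
  | ⟨a, ha⟩, ⟨b, hb⟩ => ⟨a + b, by rw [ha, hb, Nat.mul_add]⟩
```

Once Lean accepts the file, the proof has been checked mechanically: every step is justified by the kernel rather than by a human reader.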

AutoformBot Architecture

AutoformBot is a collaborative framework organized into three layers.

The top-level orchestrator reads a textbook, identifies formalization targets (definitions, lemmas, theorems), and builds a task-dependency DAG that reflects the mathematical dependencies among them.

The middle layer consists of a trace analyzer, which diagnoses failed tasks and writes skill guides for the next round, and a supervisor, which runs evaluation after code merges and, when needed, triggers a triage agent to create finer-grained repair tasks.

The bottom layer contains many workers, which write the Lean code in isolated git worktrees, and reviewers, which check the workers’ submissions. Workers can compete on the same task; the first to pass verification proceeds to the merge queue.

This design treats textbook formalization like the maintenance of a large code repository, using familiar software-engineering mechanisms such as git branches, worktrees, code review, and merge queues.
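
The orchestrator's task-dependency DAG can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not AutoformBot's actual code: the target names and the `schedule_rounds` helper are hypothetical, and the standard-library `graphlib.TopologicalSorter` stands in for whatever scheduler the system really uses.

```python
from graphlib import TopologicalSorter

# Hypothetical formalization targets: each maps to the targets it depends on.
dependencies = {
    "def:group": set(),
    "def:subgroup": {"def:group"},
    "lemma:identity_unique": {"def:group"},
    "thm:lagrange": {"def:subgroup", "lemma:identity_unique"},
}

def schedule_rounds(deps):
    """Group targets into rounds. All prerequisites of a target are finished
    before its round starts, so targets in one round could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    rounds = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        rounds.append(ready)
        sorter.done(*ready)  # mark the whole round as completed
    return rounds

rounds = schedule_rounds(dependencies)
for i, batch in enumerate(rounds, start=1):
    print(f"round {i}: {batch}")
```

In this toy graph the definition of a group comes first, the subgroup definition and the lemma can be attempted concurrently, and the theorem is scheduled last.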

Figure: Simple example of mathematical language formalization

Formalization Success vs. Compilation

A Lean file that compiles is not necessarily a successful formalization. Lean allows placeholders such as sorry, as well as user-declared axioms, which let code compile without a genuine proof. Moreover, a theorem may contain no sorry itself yet depend on lemmas that use sorry or disallowed axioms, so the gap propagates to everything built on top of them. To address this, the paper introduces an evaluation framework that builds a declaration dependency graph, checks for axioms and other suspicious constructs, and then scores each target on three criteria: fidelity to the textbook, completeness of the proof chain, and conformity of the code to mathlib conventions.
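
The propagation problem is easy to reproduce. The following Lean 4 sketch uses hypothetical names (`helper`, `uses_helper`) and is not drawn from the paper; it shows a theorem whose own proof text is free of `sorry` but which still rests on an unproven lemma.

```lean
-- A lemma whose proof is only a placeholder. Lean accepts it with a warning.
theorem helper (n : Nat) : n * 1 = n := sorry

-- This theorem contains no `sorry` in its own proof, but it relies on `helper`.
theorem uses_helper (a b : Nat) : (a + b) * 1 = a + b := helper (a + b)

-- Reports the axioms the proof depends on, directly or indirectly.
-- The report includes `sorryAx`, which exposes the hidden gap.
#print axioms uses_helper
```

A plain text search for `sorry` in the second theorem would find nothing, which is why a check over the whole dependency chain is needed.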

Figure: Dependency graph example
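
One way to picture the completeness check over a declaration dependency graph is as a reachability problem. The Python sketch below is an assumption-laden illustration, not the paper's evaluator: the graph, the declaration names, and the `tainted_declarations` function are all hypothetical.

```python
# Hypothetical declaration graph: name -> declarations it uses directly.
uses = {
    "thm_main": ["lemma_a", "lemma_b"],
    "lemma_a": ["def_x"],
    "lemma_b": ["lemma_c"],
    "lemma_c": [],
    "def_x": [],
    "thm_other": ["lemma_a"],
}
# Declarations whose own body contains `sorry` or a disallowed axiom.
directly_flawed = {"lemma_c"}

def tainted_declarations(uses, directly_flawed):
    """Return every declaration that is flawed itself or that depends,
    directly or transitively, on a flawed declaration."""
    cache = {}

    def is_tainted(name):
        if name not in cache:
            cache[name] = False  # provisional value guards against cycles
            cache[name] = name in directly_flawed or any(
                is_tainted(dep) for dep in uses.get(name, [])
            )
        return cache[name]

    return {name for name in uses if is_tainted(name)}

tainted = tainted_declarations(uses, directly_flawed)
print(sorted(tainted))
```

Here `thm_main` would be rejected even though its own proof looks clean, because it reaches `lemma_c` through `lemma_b`, while `thm_other` is unaffected.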

Evaluation and Results

AutoformBot was run on 26 textbooks covering real analysis, complex analysis, functional analysis, algebra, topology, combinatorics, probability, number theory, PDEs, and theoretical computer science. The system used Claude Opus 4.6 as the underlying model and processed each book for about a week with minimal human intervention. The resulting ATLAS library contains over 45,000 Lean 4 declarations and roughly 500,000 lines of code. Of the 4,007 textbook targets identified, 2,855 were successfully formalized, an overall success rate of 71.3 %. Success varied widely by book: Real Analysis (98.9 %), Complex Variables (97.4 %), and Introduction to Functional Analysis (94.4 %) were almost fully formalized, whereas Lie Groups (40.0 %) and Boolean Functions (40.7 %) reached only about 40 %.
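
The headline figure follows directly from the two counts quoted above. A short Python check (variable names are chosen here for illustration only):

```python
# Counts quoted in the article.
targets_total = 4007
targets_formalized = 2855

# Overall success rate in percent, and the number of targets left unresolved.
success_rate = 100 * targets_formalized / targets_total
unresolved = targets_total - targets_formalized

print(f"success rate: {success_rate:.1f} %")
print(f"unresolved targets: {unresolved}")
```

The computation also shows that 1,152 targets remained unformalized across the 26 books.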

Ablation Experiments

On a smaller textbook (Algebraic Combinatorics, 39 targets), model comparison showed Claude Opus 4.6 completing 92 % of targets under a 1,200 M‑token budget, while Gemini 3.1 Pro achieved only 46 %, highlighting the importance of the underlying LLM’s Lean‑coding ability.

Component ablations showed what each part contributes. Removing the orchestrator’s dynamic replanning saved tokens early on but capped overall progress at about 64 %, because difficult targets could not be revisited. Dropping the supervisor removed quality feedback after merges and yielded a 51 % success rate. Removing the trace analyzer caused workers to repeat failed strategies, reaching 57 % success while consuming tokens faster. The full system consistently outperformed each ablated variant.

Parallelism experiments showed that running three or five workers concurrently accelerated wall‑clock time, mirroring parallel exploration in software engineering: multiple agents attempting the same task with clear verification signals converge faster than serial trial‑and‑error.
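
The "first to pass verification wins" pattern can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: `worker` simulates an agent with a sleep, `verify` simulates the mechanical check, and the names and delays are hypothetical rather than taken from AutoformBot.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker(name, delay, proof_is_valid):
    """Stand-in for an agent attempting a proof; `delay` simulates effort."""
    time.sleep(delay)
    return name, proof_is_valid

def verify(result):
    """Stand-in for the mechanical check (for example, compiling the Lean file)."""
    _, proof_is_valid = result
    return proof_is_valid

def race(attempts):
    """Run all attempts concurrently and return the name of the first worker
    whose result passes verification, or None if every attempt fails."""
    with ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [pool.submit(worker, *attempt) for attempt in attempts]
        for future in as_completed(futures):
            result = future.result()
            if verify(result):
                return result[0]
    return None

attempts = [
    ("worker-1", 0.05, False),  # finishes first, but its proof fails the check
    ("worker-2", 0.20, True),   # first attempt to pass verification
    ("worker-3", 0.60, True),   # also valid, but slower
]
winner = race(attempts)
print(winner)
```

Because the verification signal is unambiguous, a fast but wrong attempt costs little: it is simply discarded while the other workers continue. In this simplified version the thread pool still waits for slower attempts to finish before `race` returns.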

Conclusions

The paper’s key contribution is not merely that LLMs can write Lean code, but that a multi-agent engineering system can orchestrate large-scale, trustworthy formalization. Proof assistants provide concrete feedback (compilation, axiom detection, goal matching) that makes the task amenable to systematic analysis. The study also uncovers typical failure modes of LLM agents in long-running tasks: repeating failed proof attempts ("frontal assault"), hiding axioms, weakening statements, and refusing overly difficult subgoals. The authors label these phenomena “LLM fatigue.” By delegating failure analysis, quality checking, and repair to dedicated components (trace analyzer, supervisor, triage agent, reviewer, merge queue), the system offers a blueprint for other long-duration code-generation problems.

Limitations and Future Work

ATLAS’s code quality still lags behind expert-written Lean, and the automatic evaluation relies on an LLM judge, which is not infallible even with human cross-checking. Each textbook is formalized in isolation, leading to inconsistencies in naming, definitions, and abstraction levels that would need expert curation before the results could be integrated into mathlib. The approach also incurs significant token costs, making replication challenging for smaller research groups. Nonetheless, the paper argues that moving from single-theorem proof generation to textbook-level knowledge-base construction redefines the problem as a software-engineering challenge, opening pathways to broader applications such as program verification, compiler correctness, complex system configuration, and large-scale scientific automation.

Related Links

https://arxiv.org/abs/2605.29955v1

https://github.com/facebookresearch/atlas-lean

https://github.com/facebookresearch/autoform-bot
