OpenMythos: Rebuilding Claude Mythos with Recursive Transformers and MoE

OpenMythos is an open‑source PyTorch reimplementation of Anthropic's Claude Mythos. It builds a mixture‑of‑experts (MoE) routed recurrent Transformer, introduces Recursive Depth Transformers, Multi‑head Latent Attention, and several stability mechanisms, and demonstrates parameter‑efficient scaling backed by empirical studies.


Anthropic announced Claude Mythos, a powerful yet unreleased large model; a 22‑year‑old developer reverse‑engineered it and released OpenMythos, an open‑source PyTorch implementation built from first principles.

The architecture instantiates a mixture‑of‑experts (MoE) routed recurrent Transformer, using weight sharing and cross‑expert conditional computation to achieve iterative depth.

The author hypothesizes that recursively applying a fixed‑parameter block together with sparse expert activation can improve the efficiency‑performance trade‑off and give rise to multi‑step reasoning. This leads to the definition of a Recursive Depth Transformer (RDT), a class of recurrent Transformers where a fixed weight set is applied across T cycles in a single forward pass.
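
A minimal sketch of the RDT idea follows, assuming illustrative names and sizes (`SharedBlock`, `T`, and the dimensions are placeholders, not the repository's actual API):

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One Transformer block whose weights are reused at every cycle."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(h, h, h, need_weights=False)
        h = self.norm1(h + a)
        return self.norm2(h + self.ffn(h))

class RecursiveDepthTransformer(nn.Module):
    """A fixed weight set applied across T cycles in a single forward pass."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, T: int = 16):
        super().__init__()
        self.block = SharedBlock(d_model, n_heads)  # one parameter set, reused
        self.T = T

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h = e
        for _ in range(self.T):  # depth comes from compute, not parameters
            h = self.block(h)    # the per-step injection of e is shown later
        return h
```

Sharing one block across all cycles is what makes the parameter count independent of the executed depth.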

Inference happens entirely in a continuous latent space with no intermediate token outputs, distinguishing it from Chain‑of‑Thought approaches; this formulation has been formally analyzed by Saunshi et al. (2025) and COCONUT (2024).

The recurrent block runs a shared TransformerBlock for up to T=16 iterations. Each step injects the frozen input encoding e via a stable linear time‑invariant (LTI) update rule. The block's feed‑forward network is a MoE layer following DeepSeekMoE's design: many fine‑grained routed experts, of which each token activates a sparse top‑K subset, plus a few always‑active shared experts.
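
A sketch of what such a MoE FFN could look like under the DeepSeekMoE recipe just described; the expert counts, sizes, and the deliberately naive routing loops are assumptions for clarity, not the repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small, fine-grained FFN expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class MoEFFN(nn.Module):
    """DeepSeekMoE-style layer: shared experts always on, routed experts top-K."""
    def __init__(self, d_model=512, d_hidden=256, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (batch, seq, d_model)
        out = sum(expert(x) for expert in self.shared)  # always-active experts
        # Router scores depend on the hidden state, so inside the recurrent
        # loop they are re-evaluated at every cycle.
        scores = self.router(x)                         # (batch, seq, n_routed)
        w, idx = scores.topk(self.top_k, dim=-1)        # sparse top-K per token
        w = F.softmax(w, dim=-1)
        # Naive gather loop for readability; real MoE kernels batch by expert.
        for k in range(self.top_k):
            for j, expert in enumerate(self.routed):
                mask = idx[..., k] == j                 # tokens routed to expert j
                if mask.any():
                    out[mask] = out[mask] + w[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```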

Crucially, the router selects a different expert subset at each depth, so each iteration performs a distinct computation. MoE supplies breadth across domains, while recurrence supplies depth of reasoning.

The full architecture is Prelude → Recurrent Block → Coda. Prelude and Coda are standard Transformer layers executed once; the recurrent block is the computational core. Attention defaults to Multi‑head Latent Attention (MLA, introduced in DeepSeek‑V2), which compresses keys and values into low‑rank latent vectors, reducing KV‑cache memory by 10‑20× at production scale.
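
The latent‑KV idea can be sketched as below; this is a simplified stand‑in for MLA with illustrative dimensions and without DeepSeek‑V2's decoupled RoPE path:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """MLA-flavoured attention: cache one small latent c_kv instead of full K/V."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # compression; only this is cached
        self.k_up = nn.Linear(d_latent, d_model)     # re-expanded at attention time
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, S, _ = x.shape
        c_kv = self.kv_down(x)                       # (B, S, d_latent): the KV cache
        q, k, v = self.q_proj(x), self.k_up(c_kv), self.v_up(c_kv)

        def heads(t):  # (B, S, d_model) -> (B, n_heads, S, d_head)
            return t.view(B, S, self.n_heads, self.d_head).transpose(1, 2)

        a = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        return self.out(a.transpose(1, 2).reshape(B, S, -1))
```

With these toy sizes, caching c_kv instead of full keys and values shrinks the per‑token cache from 2·d_model = 1024 values to d_latent = 64, a 16× reduction, consistent with the order of magnitude quoted above.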

Three mechanisms stabilize the recurrence (minimal sketches of each follow the list):

LTI constraint injection (ensuring spectral radius ρ(A) < 1);

Adaptive Computation Time (ACT) for dynamic per‑position stopping;

Depth‑wise LoRA adapters that give each iteration its own expressive behavior at negligible parameter cost.
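
A sketch of the first mechanism, using one standard way to enforce the bound (an assumption here): since ρ(A) ≤ ‖A‖₂, normalizing the spectral norm and scaling by γ < 1 guarantees a contraction.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class StableInjection(nn.Module):
    """LTI update h_{t+1} = gamma*A(h_t) + B(e) with rho(A) < 1 by construction."""
    def __init__(self, d_model=512, gamma=0.9):
        super().__init__()
        # spectral_norm keeps ||A||_2 ~= 1 via power iteration (an estimate);
        # multiplying by gamma < 1 then bounds the spectral radius below 1.
        self.A = spectral_norm(nn.Linear(d_model, d_model, bias=False))
        self.B = nn.Linear(d_model, d_model, bias=False)
        self.gamma = gamma

    def forward(self, h, e):
        # e is the frozen input encoding, re-injected at every cycle
        return self.gamma * self.A(h) + self.B(e)
```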
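The second mechanism, ACT (Graves, 2016), can be sketched as a per‑position halting head; the remainder bookkeeping is simplified to a clamp here, so treat it as an illustration rather than the repository's exact rule:

```python
import torch
import torch.nn as nn

class ACTLoop(nn.Module):
    """Iterate a shared step until each position's halting mass reaches 1 - eps."""
    def __init__(self, step: nn.Module, d_model: int, max_T: int = 16, eps: float = 0.01):
        super().__init__()
        self.step, self.max_T, self.eps = step, max_T, eps
        self.halt = nn.Linear(d_model, 1)  # per-position halting probability

    def forward(self, h):
        B, S, _ = h.shape
        cum_p = h.new_zeros(B, S)      # accumulated halting probability
        out = torch.zeros_like(h)      # halting-weighted mixture of states
        for _ in range(self.max_T):
            h = self.step(h)           # step is e.g. the shared recurrent block
            p = torch.sigmoid(self.halt(h)).squeeze(-1)  # (B, S)
            active = (cum_p < 1 - self.eps).float()
            p = torch.minimum(p, 1 - cum_p) * active     # clamp to remaining mass
            out = out + p.unsqueeze(-1) * h
            cum_p = cum_p + p
            if bool((cum_p >= 1 - self.eps).all()):
                break                  # every position has halted early
        return out
```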
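The third mechanism, depth‑wise LoRA, can be sketched as one rank‑r adapter pair per iteration layered on a shared weight; names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class DepthwiseLoRALinear(nn.Module):
    """Shared weight W plus a tiny rank-r delta chosen by the iteration index t."""
    def __init__(self, d_in=512, d_out=512, T=16, r=8):
        super().__init__()
        self.W = nn.Linear(d_in, d_out)                  # shared across all depths
        self.A = nn.Parameter(torch.randn(T, r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(T, d_out, r))  # zero init: starts as plain W

    def forward(self, x, t: int):
        delta = x @ self.A[t].T @ self.B[t].T            # depth-specific rank-r update
        return self.W(x) + delta
```

The adapters do add parameters, but a rank‑r pair is a small fraction of a full model's weights, which is the sense in which the overhead is negligible.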

Regarding parameter efficiency, a k‑layer model run for L cycles attains the quality of a k·L‑layer standard Transformer while using only k‑layer parameters. Empirically, Parcae, Prairie et al. (2026) show that a 770 M‑parameter RDT matches a 1.3 B‑parameter standard model on the same training data. The key insight is that inference depth is a function of compute, not of parameter count.
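
A back‑of‑the‑envelope calculation makes the trade‑off concrete; the 12·d² per‑layer estimate and the sizes below are illustrative assumptions, not the cited configuration:

```python
def layer_params(d_model: int) -> int:
    # Rough standard-Transformer layer: ~4*d^2 (attention) + ~8*d^2 (FFN)
    return 12 * d_model ** 2

d = 2048
k, L = 8, 4                                       # k shared layers, L cycles

rdt_params = k * layer_params(d)                  # parameters actually stored
effective_depth = k * L                           # depth actually executed
dense_params = effective_depth * layer_params(d)  # standard model of equal depth

print(f"RDT stores  {rdt_params / 1e6:.0f}M params and runs depth {effective_depth}")
print(f"Dense needs {dense_params / 1e6:.0f}M params for the same depth")
```

Any such count ignores embeddings and the Prelude/Coda layers, but the 1/L ratio between stored and executed depth is the point.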

This reframes scaling debates: the critical dimension is inference‑time recurrence depth rather than training‑time model size.

OpenMythos contributions:

Full open‑source, configurable PyTorch implementation of the RDT hypothesis, including the MoE FFN and Multi‑head Latent Attention.

LTI‑stable recurrence injection integrated as a first‑class training primitive.

Depth‑wise LoRA adapters that differentiate behavior across iterations with negligible parameter overhead.

Reproducible research baseline for studying dynamic, scalable recurrent Transformers and inference depth.

Links:

https://x.com/KyeGomezB/status/2045659150340723107
https://github.com/kyegomez/OpenMythos
Tags: MoE, PyTorch, AI Architecture, Claude Mythos, OpenMythos, Recursive Transformer
Written by PaperAgent: daily updates analyzing cutting‑edge AI research papers.