OpenMythos: Rebuilding Claude Mythos with Recursive Transformers and MoE
OpenMythos is an open‑source PyTorch reimplementation of Anthropic's Claude Mythos. It uses a mixture‑of‑experts (MoE) routed recurrent Transformer, introduces Recursive Depth Transformers, Multi‑Latent Attention, and several stability mechanisms, and demonstrates parameter‑efficient scaling backed by empirical studies.
Anthropic announced Claude Mythos, a powerful yet unreleased large model; a 22‑year‑old developer reverse‑engineered it and released OpenMythos, an open‑source PyTorch implementation built from first principles.
The architecture instantiates a mixture‑of‑experts (MoE) routed recurrent Transformer, using weight sharing and cross‑expert conditional computation to achieve iterative depth.
The author hypothesizes that recursively applying a fixed‑parameter block together with sparse expert activation can improve the efficiency‑performance trade‑off and give rise to multi‑step reasoning. This leads to the definition of a Recursive Depth Transformer (RDT), a class of recurrent Transformers where a fixed weight set is applied across T cycles in a single forward pass.
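The core RDT idea can be sketched in a few lines of PyTorch: one parameter set, applied for T cycles in a single forward pass. This is a minimal illustration, not the repository's actual code; `SharedBlock` here is a stand-in FFN-only block, and the class names are assumptions.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """Stand-in for the shared TransformerBlock (FFN-only for brevity)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.norm(h + self.ff(h))  # residual update with shared weights

class RecursiveDepth(nn.Module):
    """One weight set reused across T cycles in a single forward pass."""
    def __init__(self, d_model: int, cycles: int = 16):
        super().__init__()
        self.block = SharedBlock(d_model)  # the only parameter set
        self.cycles = cycles

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h = e
        for _ in range(self.cycles):  # depth grows with compute, not parameters
            h = self.block(h)
        return h

x = torch.randn(2, 8, 64)
out = RecursiveDepth(64, cycles=16)(x)
print(out.shape)  # torch.Size([2, 8, 64])
```

Note that the parameter count is independent of `cycles`: a 16-cycle model stores exactly the same weights as a 1-cycle model.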
Inference happens entirely in a continuous latent space with no intermediate token outputs, distinguishing it from Chain‑of‑Thought approaches; this formulation has been formally analyzed by Saunshi et al. (2025) and COCONUT (2024).
The recurrent block runs a shared TransformerBlock for up to T=16 iterations. Each step injects the frozen encoding e via a stable linear time‑invariant (LTI) update rule. The block’s feed‑forward network is a MoE layer following DeepSeekMoE’s design: many fine‑grained routed experts, of which each token activates a sparse top‑K subset, plus a few always‑active shared experts.
Crucially, the router selects a different expert subset at each depth, so each iteration performs a distinct computation. MoE supplies breadth across domains, while recurrence supplies depth of reasoning.
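A DeepSeekMoE-style FFN combines routed and shared experts roughly as below. This is a hedged sketch, not the repo's implementation: the names (`MoEFFN`, `n_routed`, `n_shared`, `top_k`) are illustrative, and it densely evaluates all experts for clarity where a real implementation dispatches tokens sparsely. Because the router reads the current hidden state, its top-K choice generally changes as that state evolves across recurrence depths.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Sketch: top-K routed experts plus always-active shared experts."""
    def __init__(self, d: int, n_routed: int = 8, n_shared: int = 2, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d, n_routed, bias=False)
        self.routed = nn.ModuleList(nn.Linear(d, d) for _ in range(n_routed))
        self.shared = nn.ModuleList(nn.Linear(d, d) for _ in range(n_shared))
        self.top_k = top_k

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (tokens, d)
        w, idx = self.router(h).topk(self.top_k, dim=-1)  # sparse expert choice
        w = torch.softmax(w, dim=-1)
        out = sum(e(h) for e in self.shared)              # always-active experts
        # Dense evaluation for clarity; real MoE dispatches only selected tokens.
        all_out = torch.stack([e(h) for e in self.routed], dim=1)  # (tokens, E, d)
        picked = all_out.gather(1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        return out + (w.unsqueeze(-1) * picked).sum(dim=1)

h = torch.randn(4, 16)
moe = MoEFFN(16)
print(moe(h).shape)  # torch.Size([4, 16])
```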
The full architecture is Prelude → Recurrent Block → Coda. Prelude and Coda are standard Transformer layers executed once; the recurrent block is the computational core. Attention defaults to Multi‑Latent Attention (DeepSeek‑V2), which compresses the KV cache into low‑rank latent vectors, reducing KV memory by 10‑20× at production scale.
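The KV-compression idea behind Multi-Latent Attention can be shown with toy dimensions: cache one low-rank latent per token instead of full K and V, and up-project at attention time. The projection names (`kv_down`, `k_up`, `v_up`) and sizes below are assumptions for illustration, not values from OpenMythos or DeepSeek-V2.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 256, 32, 8, 32

kv_down = nn.Linear(d_model, d_latent, bias=False)        # produces the cached latent
k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstructs K on the fly
v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstructs V on the fly

x = torch.randn(10, d_model)   # 10 tokens
c = kv_down(x)                 # the only tensor kept in the KV cache: (10, 32)
k, v = k_up(c), v_up(c)        # full-size K and V, rebuilt at attention time

full_cache = 2 * 10 * n_heads * d_head  # plain per-token K + V caching
mla_cache = 10 * d_latent               # latent-only caching
print(full_cache / mla_cache)           # 16.0x smaller cache in this toy setup
```

With these toy sizes the cache shrinks 16×, which is the same order as the 10–20× figure quoted above; the real ratio depends on the chosen latent dimension.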
Three mechanisms stabilize the recurrence:
LTI constraint injection (ensuring spectral radius ρ(A) < 1);
Adaptive Computation Time (ACT) for dynamic per‑position stopping;
Depth‑wise LoRA adapters that give each iteration its own expressive behavior at negligible parameter cost.
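The first of these mechanisms rests on a standard linear-systems fact: the update h ← A·h + B·e stays bounded for any number of iterations when the spectral radius ρ(A) < 1. A minimal sketch, assuming the constraint is enforced by rescaling A by its largest singular value (one of several ways such a constraint could be implemented):

```python
import torch

torch.manual_seed(0)
d = 16
A = torch.randn(d, d)
sigma = torch.linalg.matrix_norm(A, ord=2)  # largest singular value
A = 0.9 * A / sigma                         # now rho(A) <= ||A||_2 = 0.9 < 1
B = torch.randn(d, d) * 0.1

e = torch.randn(d)         # frozen encoding injected at every step
h = torch.zeros(d)
for _ in range(100):       # far more steps than T=16
    h = A @ h + B @ e      # contractive map: iterates converge, never blow up
print(torch.isfinite(h).all().item())  # True
```

Because the map is contractive, the iterates approach the fixed point (I − A)⁻¹ B e rather than diverging, which is what makes deep recurrence trainable.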
Regarding parameter efficiency, the claim is that a k‑layer model run for L cycles can attain the quality of a k·L‑layer standard Transformer while storing only k layers' worth of parameters. Empirically, Parcae, Prairie et al. (2026) show that a 770 M‑parameter RDT matches a 1.3 B‑parameter standard model on the same training data. The key insight is that inference depth is a function of compute, not of parameter count.
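Back-of-envelope arithmetic makes the efficiency claim concrete. The dimensions and layer counts below are illustrative, not measured from OpenMythos, and the per-layer count ignores embeddings and norms.

```python
# Rough per-layer parameter count: FFN (two d_model x d_ff matrices)
# plus four d_model x d_model attention projections (Q, K, V, O).
d_model, d_ff = 1024, 4096
per_layer = 2 * d_model * d_ff + 4 * d_model * d_model

k, L = 4, 8
rdt_params = k * per_layer            # weights stored once, reused for L cycles
standard_params = k * L * per_layer   # a plain k*L-layer Transformer

print(standard_params // rdt_params)  # 8: same effective depth, 1/8 the weights
```

At matched effective depth the stored-parameter ratio is exactly L, which is why recurrence depth, not layer count, becomes the scaling knob.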
This reframes scaling debates: the critical dimension is inference‑time recurrence depth rather than training‑time model size.
OpenMythos contributions:
Full open‑source, configurable PyTorch implementation of the RDT hypothesis, including MoE FFN and Multi‑Latent Attention.
LTI‑stable recurrence injection integrated as a first‑class training primitive.
Depth‑wise LoRA adapters that differentiate behavior across iterations at negligible parameter cost.
Reproducible research baseline for studying dynamic, scalable recurrent Transformers and inference depth.
Repository links:
https://x.com/KyeGomezB/status/2045659150340723107
https://github.com/kyegomez/OpenMythos
