How a 22‑Year‑Old Reverse‑Engineered Mythos into OpenMythos Using MoE and DeepSeek‑Inspired Attention

OpenMythos re‑creates the Claude Mythos architecture as a Recurrent‑Depth Transformer with MoE routing, matching the performance of a substantially larger standard Transformer with roughly 40 % fewer parameters, and demonstrates systematic generalization and depth extrapolation through looped inference in latent space.

Core Design of the Recurrent‑Depth Transformer (RDT)

The same weight set can be executed up to 16 times.

Each pass follows a different expert path.

Inference runs entirely in the hidden‑state latent space, producing an answer only after the final loop.
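
A minimal sketch of this looped design is shown below, assuming a single shared block re‑applied in latent space with decoding deferred to the last iteration. All class names, layer sizes, and parameter values are illustrative assumptions, not taken from the OpenMythos repository.

```python
import torch
import torch.nn as nn


class SharedBlock(nn.Module):
    """One Transformer block whose weights are reused on every loop (illustrative)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h):
        q = self.norm1(h)
        attn_out, _ = self.attn(q, q, q)
        h = h + attn_out
        return h + self.ffn(self.norm2(h))


class RecurrentDepthTransformer(nn.Module):
    """Embed once, loop the same weights in latent space, decode once at the end."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, max_loops: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = SharedBlock(d_model, n_heads)   # a single shared weight set
        self.head = nn.Linear(d_model, vocab_size)
        self.max_loops = max_loops

    def forward(self, tokens, num_loops=None):
        num_loops = num_loops or self.max_loops
        h = self.embed(tokens)
        for _ in range(num_loops):    # the same parameters executed up to 16 times
            h = self.block(h)         # intermediate work stays in the latent state
        return self.head(h)           # the answer is decoded only after the final loop


# Toy usage: two short sequences, full 16-loop budget.
model = RecurrentDepthTransformer(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 8)), num_loops=16)
```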

Mixture‑of‑Experts routing

The MoE component follows the DeepSeekMoE design: a large number of fine‑grained routing experts combined with a small pool of always‑online shared experts, providing breadth of domain knowledge across loops.
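
A hedged sketch of that routing layout follows: a few always‑online shared experts whose output every token receives, plus a larger pool of small routed experts of which each token activates only a top‑k subset. The expert counts, hidden sizes, and top‑k value below are placeholders, not the OpenMythos configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert; fine-grained experts keep the hidden size deliberately small."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)


class DeepSeekStyleMoE(nn.Module):
    """Shared experts applied to every token; routed experts chosen top-k per token."""

    def __init__(self, d_model=256, n_routed=32, n_shared=2, top_k=4, d_hidden=128):
        super().__init__()
        self.shared = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)              # always-online shared experts
        scores = F.softmax(self.router(x), dim=-1)        # per-token routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick k fine-grained experts per token
        for k in range(self.top_k):
            for e_id in range(len(self.routed)):
                mask = idx[..., k] == e_id                # tokens routed to expert e_id in slot k
                if mask.any():
                    out[mask] = out[mask] + weights[..., k][mask].unsqueeze(-1) * self.routed[e_id](x[mask])
        return x + out                                    # residual connection around the MoE layer


# Usage: one MoE layer call; a different expert subset can fire on every recurrent loop.
layer = DeepSeekStyleMoE()
y = layer(torch.randn(2, 8, 256))
```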

LTI stable loop injection

Stability is ensured by the Linear Time‑Invariant (LTI) stable loop injection described in the Parcae paper (UCSD & Together AI), which prevents the recurrent iterations from diverging.
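
The article does not spell out the injection mechanism, so the following is only one plausible reading of a linear time‑invariant update, not the Parcae formulation: re‑inject the original input embedding through a fixed map at every loop, and keep the recurrent map contractive (largest singular value below 1) so repeated application cannot blow up.

```python
import torch
import torch.nn as nn


class LTIStableLoopInjection(nn.Module):
    """Illustrative LTI update h_{k+1} = gamma * A(h_k) + B(e), with A kept contractive.

    ASSUMPTION: this is a generic stable-recurrence sketch, not the exact
    mechanism described in the Parcae paper.
    """

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Spectral normalization bounds the largest singular value of A near 1;
        # scaling by gamma < 1 makes the recurrent map strictly contractive.
        self.A = nn.utils.parametrizations.spectral_norm(nn.Linear(d_model, d_model, bias=False))
        self.B = nn.Linear(d_model, d_model, bias=False)
        self.gamma = 0.9

    def forward(self, h, e):
        # h: current latent state; e: the original input embedding, re-injected each loop.
        return self.gamma * self.A(h) + self.B(e)


# Usage inside the recurrent loop: the injected embedding anchors every iteration,
# and the contraction keeps the latent state bounded however many loops are run.
inj = LTIStableLoopInjection()
e = torch.randn(2, 8, 256)
h = torch.zeros_like(e)
for _ in range(16):
    h = inj(h, e)
```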

Empirical results

Experiments show a 770 M‑parameter RDT matching the performance of a standard 1.3 B‑parameter Transformer, a reduction of roughly 40 % in parameter count with no loss of accuracy.

Systematic generalization

Reproducing a study from Ohio State University, the RDT correctly answered queries involving knowledge combinations never seen during training, whereas a conventional Transformer failed, demonstrating systematic generalization.
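
As a concrete (and entirely invented) illustration of what an unseen combination looks like, the toy split below gives the model each atomic fact during training while the two‑hop composition only appears at test time; it is not the Ohio State benchmark itself.

```python
# Atomic facts, each seen individually during training (toy, invented examples).
train_facts = [
    ("Alice", "born_in", "Lyon"),
    ("Lyon", "located_in", "France"),
]

# The composed two-hop query is never shown during training; answering it
# requires combining the two training facts at inference time.
test_query = ("Alice", "born_in_country_of", "?")   # expected answer: "France"
```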

Depth extrapolation

When trained on 20‑step reasoning chains and tested on 30‑step chains, the RDT handled the longer chains by adding extra loops, while the standard Transformer collapsed, indicating effective depth extrapolation.
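
In practice this kind of extrapolation only requires raising the loop count at inference, since every loop reuses the same weights. The snippet below reuses the RecurrentDepthTransformer sketch from earlier; the loop counts are illustrative placeholders, not the article's exact settings.

```python
import torch

# Reuses the illustrative RecurrentDepthTransformer defined above, not the OpenMythos code.
model = RecurrentDepthTransformer(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 8))

logits_in_distribution = model(tokens, num_loops=16)   # loop budget used during training
logits_extrapolated = model(tokens, num_loops=24)      # longer chains: just run more loops,
                                                       # with no new parameters introduced
```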

Implications

These findings suggest that the bottleneck for large language models lies in knowledge composition rather than raw parameter count, and that future scaling may prioritize deeper inference loops over ever‑larger models.

Resources

GitHub: https://github.com/kyegomez/OpenMythos#the-central-hypothesis
Reference 1: https://x.com/KyeGomezB/status/2045660378844024994
Reference 2: https://arxiv.org/abs/2604.07822
Reference 3: https://arxiv.org/abs/2604.12946
Tags: Mixture of Experts, scaling laws, AI Architecture, OpenMythos, Looped Language Models, Recurrent‑Depth Transformer