Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos Where 770M Parameters Match a 1.3B Transformer



Anthropic has never published a technical paper on Claude Mythos. That has not stopped the research community from theorizing. A new open-source project called OpenMythos, released on GitHub by Kye Gomez, attempts something ambitious: a first-principles theoretical reconstruction of what the Claude Mythos architecture might actually be, built entirely in PyTorch and grounded in peer-reviewed research.

The project is not a leaked model, a fine-tune, or a distillation. It is a hypothesis rendered in code — and the hypothesis is specific enough to be falsifiable, which is what makes it interesting.

The Main Claim: Claude Mythos Is a Recurrent-Depth Transformer

OpenMythos proposes that Claude Mythos belongs to a class of architectures called Recurrent-Depth Transformers (RDTs), also referred to in the literature as Looped Transformers. The concept is meaningfully different from standard transformer stacks.

In a conventional transformer — GPT, LLaMA, Mistral — the model passes input through a sequence of unique layers, one after another, each with its own independent weights. More capability generally means more layers and more parameters. In a Recurrent-Depth Transformer, a fixed set of weights is applied iteratively across T loop steps within a single forward pass. The same weights run multiple times. Reasoning depth is not a function of how many parameters are stored, but of how many iterations are run at inference time.
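The contrast can be sketched in a few lines of PyTorch. Everything here (the tiny linear layer standing in for a full transformer block, the names, the sizes) is illustrative, not OpenMythos's actual code:

```python
import torch
import torch.nn as nn

# A weight-tied "layer as loop": one block's parameters, applied T times.
block = nn.Linear(8, 8)  # stands in for a full transformer block

def looped_forward(x: torch.Tensor, T: int) -> torch.Tensor:
    h = x
    for _ in range(T):               # the SAME weights run at every step
        h = torch.tanh(block(h))
    return h

x = torch.randn(2, 8)
shallow = looped_forward(x, T=4)     # 4 "effective layers"
deep = looped_forward(x, T=16)       # 16 "effective layers", zero extra parameters
```

Both calls use exactly the same 72 parameters; only the iteration count at inference time differs.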

Think of it less like reading a book and more like refining a draft: the model returns to the same computational block again and again, improving its internal representation with each pass.

How the Architecture Is Structured

OpenMythos instantiates this as a three-part structure: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that run exactly once. The Recurrent Block is the computational core, looped up to T=16 times.

At each loop step t, the hidden state is updated using the following rule:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Here, h_t is the hidden state after loop iteration t, and e is the encoded input from the Prelude, re-injected at every step. The re-injection is deliberate: without it, the hidden state would drift away from the original input signal across deep loops. The learned matrices A and B govern how much of the previous hidden state and the encoded input carry forward at each step.
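A minimal rendering of the update rule, with the Prelude's output e held fixed and re-injected each step. The matrices A and B are modeled as bias-free linear maps, and a small MLP stands in for the inner transformer block; all names and sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

d = 16  # hidden size (illustrative)

A = nn.Linear(d, d, bias=False)   # weights the previous hidden state
B = nn.Linear(d, d, bias=False)   # weights the re-injected input encoding
block = nn.Sequential(nn.Linear(2 * d, d), nn.GELU())  # stand-in for Transformer(h, e)

def recurrent_step(h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    # h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
    return A(h) + B(e) + block(torch.cat([h, e], dim=-1))

e = torch.randn(2, d)   # encoded input from the Prelude, fixed across steps
h = torch.zeros(2, d)   # initial hidden state
for _ in range(16):     # T = 16 loop iterations
    h = recurrent_step(h, e)  # e is re-injected at every step
```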

The FFN inside the Recurrent Block is not a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts (MoE) layer following the design introduced in DeepSeekMoE: a large pool of fine-grained routed experts, with only a sparse top-K subset activated per token, alongside a small set of always-active shared experts that absorb common cross-domain patterns. Crucially, the router selects distinct expert subsets at each loop depth, meaning each iteration is computationally distinct despite sharing the same base weights. MoE provides domain breadth; looping provides reasoning depth.
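A naive per-token sketch of the routing pattern described above. The depth embedding used to make routing depth-dependent is one plausible mechanism, an assumption rather than OpenMythos's confirmed design; all sizes are illustrative:

```python
import torch
import torch.nn as nn

d, n_experts, k, T = 16, 8, 2, 16  # illustrative sizes, not OpenMythos defaults

experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])  # routed pool
shared = nn.Linear(d, d)             # always-active shared expert
router = nn.Linear(d, n_experts)     # token-level routing scores
depth_emb = nn.Embedding(T, d)       # conditions routing on loop depth (assumption)

def moe_ffn(h: torch.Tensor, depth: int) -> torch.Tensor:
    # Depth-conditioned routing: the same token can pick different experts
    # at different loop iterations.
    logits = router(h + depth_emb(torch.tensor(depth)))
    weights, idx = torch.topk(torch.softmax(logits, dim=-1), k, dim=-1)
    rows = []
    for t in range(h.shape[0]):      # per-token sparse dispatch (naive, for clarity)
        y = shared(h[t])             # shared expert fires for every token
        for j in range(k):           # only the top-k routed experts fire
            y = y + weights[t, j] * experts[int(idx[t, j])](h[t])
        rows.append(y)
    return torch.stack(rows)

h = torch.randn(4, d)
out = moe_ffn(h, depth=3)
```

Production MoE layers batch this dispatch rather than looping per token, but the sparsity pattern is the same: every token touches the shared expert plus exactly k of the routed pool.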

Attention defaults to Multi-head Latent Attention (MLA) from DeepSeek-V2, which caches a compressed low-rank KV latent rather than full key/value tensors, yielding a 10–20× reduction in KV-cache memory at production scale.
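The caching trick can be sketched as a low-rank bottleneck. The projection names and the 512/32 sizes are illustrative; real MLA also handles per-head structure and decoupled positional keys, which this sketch omits:

```python
import torch
import torch.nn as nn

d, r = 512, 32  # model width and compressed KV latent width (illustrative sizes)

down = nn.Linear(d, r, bias=False)   # compress hidden state into the cached latent
up_k = nn.Linear(r, d, bias=False)   # reconstruct keys from the latent
up_v = nn.Linear(r, d, bias=False)   # reconstruct values from the latent

h = torch.randn(128, d)              # 128 cached sequence positions
latent = down(h)                     # only this (128 x 32) tensor goes in the cache
k, v = up_k(latent), up_v(latent)    # expanded on the fly when attention runs

# A full KV cache stores 2*d floats per position; the latent stores r.
ratio = (2 * d) / r                  # 32x at these toy sizes
```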

Reasoning in Continuous Latent Space

One of the most important properties of this architecture is that reasoning occurs entirely in continuous latent space. There is no intermediate token emission between loop steps — the model does not produce text mid-thought and then re-read it. This is structurally distinct from chain-of-thought prompting, where reasoning is externalized as token sequences, and has been formally analyzed in both Saunshi et al. (2025) and COCONUT (2024).

Saunshi et al. (2025) show that each loop iteration in an RDT is functionally equivalent to one step of chain-of-thought, operating over real-valued vectors rather than discrete tokens. Continuous latent thoughts can also encode multiple alternative next steps simultaneously, enabling something closer to breadth-first search over the reasoning space within a single forward pass.

This also explains a concrete capability advantage. A standard transformer trained on 5-hop reasoning chains fails when tested on 10-hop chains at inference time — it has no mechanism to extend its depth beyond what it saw during training. A Recurrent-Depth Transformer handles this naturally: running more inference-time loops extends the reasoning chain without any retraining. Harder problems receive more compute; simpler ones exit early.

Solving the Stability Problem

Training looped models has historically been brittle. The hidden state h_t can grow unboundedly across iterations, a failure mode called residual explosion. OpenMythos addresses this using a Linear Time-Invariant (LTI) injection constraint borrowed from the Parcae architecture (Prairie et al., 2026): the spectral radius of A, denoted ρ(A), is enforced to be less than 1 by construction, guaranteeing stability regardless of learning rate or gradient noise.
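One simple way to enforce ρ(A) < 1 by construction is to rescale the raw weight matrix to a spectral norm below 1, since the spectral radius never exceeds the spectral norm. This is a sketch of the constraint, not necessarily Parcae's exact parameterization:

```python
import torch

torch.manual_seed(0)  # reproducible sketch

def stable_A(W: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    # rho(A) <= largest singular value of A, so rescaling W to spectral
    # norm gamma < 1 guarantees rho(A) < 1 by construction.
    sigma_max = torch.linalg.matrix_norm(W, ord=2)
    return W * (gamma / (sigma_max + 1e-8))

W = torch.randn(16, 16)          # raw, unconstrained weights
A = stable_A(W)
rho = torch.abs(torch.linalg.eigvals(A)).max()  # spectral radius, < 1

# Consequence: repeated application of A contracts rather than explodes.
h = torch.randn(16)
for _ in range(100):
    h = A @ h
```

Because the constraint holds by construction, it survives any optimizer update: the stability guarantee does not depend on the learning rate or the gradients.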

A second failure mode also exists at the other extreme: beyond a certain loop depth, excessive recurrence degrades predictions — the hidden state drifts past the solution and into noise. This is the ‘overthinking’ problem. Adaptive Computation Time (ACT) halting addresses it with a learned scalar per position that dynamically decides when to stop looping. Positions that are harder to process receive more computation; tokens that have already converged halt early.
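A minimal ACT-style loop, sketched under the assumption that halting works by accumulating a per-position sigmoid score until it crosses a threshold (the classic Graves-style formulation also redistributes leftover probability mass, which this sketch omits):

```python
import torch
import torch.nn as nn

d, T_max = 16, 16
halt_head = nn.Linear(d, 1)  # learned per-position halting scalar (illustrative)

def act_loop(h0: torch.Tensor, step_fn, threshold: float = 0.99):
    h = h0
    cum_halt = torch.zeros(h.shape[0])               # accumulated halt probability
    active = torch.ones(h.shape[0], dtype=torch.bool)
    steps = torch.zeros(h.shape[0], dtype=torch.long)
    for _ in range(T_max):
        if not active.any():
            break                                     # every position has converged
        h_new = step_fn(h)
        h = torch.where(active.unsqueeze(-1), h_new, h)  # halted rows are frozen
        p = torch.sigmoid(halt_head(h)).squeeze(-1)   # halting score per position
        cum_halt = cum_halt + p * active              # only active rows accumulate
        steps = steps + active.long()
        active = active & (cum_halt < threshold)      # stop once mass crosses threshold
    return h, steps

h_final, steps = act_loop(torch.randn(4, d), step_fn=torch.tanh)
```

Each position records its own step count, so hard positions can run to the loop ceiling while easy ones exit after a step or two.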

Finally, Depth-Wise LoRA adapters introduce a small rank-r adaptation matrix at each iteration depth, giving each loop step slightly distinct behavior without adding substantial parameters — bridging the gap between pure weight-tying and fully distinct layers.
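A sketch of the idea: one shared weight-tied projection plus a rank-r adapter pair per loop depth. Names and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

d, r, T = 16, 4, 16  # hidden size, LoRA rank, max loop depth (illustrative)

shared = nn.Linear(d, d)  # one weight-tied projection reused at every loop step
# One rank-r adapter pair per depth: a tiny per-iteration parameter cost.
lora_down = nn.ModuleList([nn.Linear(d, r, bias=False) for _ in range(T)])
lora_up = nn.ModuleList([nn.Linear(r, d, bias=False) for _ in range(T)])

def depthwise_forward(h: torch.Tensor, t: int) -> torch.Tensor:
    # Shared weights plus a depth-specific low-rank correction.
    return shared(h) + lora_up[t](lora_down[t](h))

out = depthwise_forward(torch.randn(3, d), t=5)
```

At these sizes each depth adds 2·d·r = 128 parameters, versus 272 for a fully distinct linear layer, which is exactly the bridge between weight-tying and distinct layers that the text describes.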

Why Parameter Efficiency Matters

The Parcae paper (Prairie et al., 2026) provides empirical grounding for the efficiency claim. At 770M parameters, an RDT matches a 1.3B standard transformer trained on identical data: roughly 60% of the parameters for equivalent downstream quality. Optimal recurrence and optimal token count both follow power laws with consistent exponents across scales, establishing the first predictable scaling laws for looped training.

The implication is significant: reasoning depth scales with inference-time compute, not stored parameter count. This reframes one of the dominant assumptions in the scaling debate. The relevant axis may not be parameter count at training, but loop depth at inference.

What OpenMythos Contributes

OpenMythos provides four concrete research artifacts: a fully configurable PyTorch implementation of the RDT hypothesis with a MoE FFN and Multi-head Latent Attention; LTI-stable recurrent injection integrated as a first-class training primitive; depth-wise LoRA adapters enabling per-iteration behavioral differentiation; and a reproducible research baseline for studying looped-transformer dynamics and inference-time reasoning depth.

Whether or not Mythos is actually an RDT, OpenMythos gives the research community something concrete and runnable — an implementation of an architecture class the literature increasingly suggests is underexplored, and one that may represent a fundamentally different path to capable AI than simply training bigger models.
