The 3-Layer Hack That Doubles LLM Reasoning (Without Retraining)
What if you could double your AI's reasoning capability without spending a dollar on compute? A new open-source discovery reveals that large language models contain hidden "thinking circuits" — and activating them is as simple as duplicating three specific layers.
The Discovery: LLMs Have Built-In Reasoning Circuits
Researchers have discovered that transformer models organize themselves during training into what they call "functional circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. These aren't just arbitrary groupings. They're discrete processing modules that perform complete reasoning operations, similar to how different brain regions handle specific cognitive tasks.
The breakthrough finding: duplicating the right circuit gives the model a second pass through its reasoning pipeline. Same weights. No retraining. No fine-tuning. Just route the hidden states through the same circuit twice and watch the benchmarks jump.
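In transformer terms, "routing through the circuit twice" just means feeding the block's output back in as its input before continuing down the stack. A minimal sketch with toy stand-in layers (the function names and wiring are illustrative, not the researchers' actual code):

```python
# Toy sketch of routing hidden states through a "circuit" twice.
# Layers are stand-in functions; in a real model each would be a
# transformer block operating on hidden-state tensors.

def run_model(layers, x, circuit=None, passes=1):
    """Apply layers in order; the block spanning `circuit`
    (start, end, inclusive) is applied `passes` times back to back."""
    i = 0
    while i < len(layers):
        if circuit and i == circuit[0]:
            start, end = circuit
            block = layers[start:end + 1]
            for _ in range(passes):  # second pass = the model "thinking longer"
                for layer in block:
                    x = layer(x)
            i = end + 1
        else:
            x = layers[i](x)
            i += 1
    return x

# Toy example: each "layer" appends its index to a trace list.
layers = [lambda t, k=k: t + [k] for k in range(6)]
trace = run_model(layers, [], circuit=(2, 4), passes=2)
print(trace)  # [0, 1, 2, 3, 4, 2, 3, 4, 5]
```

The trace shows the key property: layers 2-4 run twice in sequence while every other layer runs once, with no change to any layer's weights.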
This discovery builds on David Ng's RYS method, which originally topped the HuggingFace Open LLM Leaderboard without changing a single weight. The new research validates and extends this approach, revealing that these circuits exist across different model architectures — from 24B parameter models to 72B giants.
The Results: From 0.22 to 0.76 on Logical Reasoning
The numbers tell a striking story. On the BBH Logical Deduction benchmark, Devstral-24B went from scoring 0.22 to 0.76 — a 245% improvement — simply by duplicating layers 12-14. No gradient descent. No data curation. Just architectural surgery.
And the gains held across multiple benchmarks when the reasoning circuit was duplicated, not just on logical deduction.
The crucial detail: nothing degraded. When you fine-tune a model, you often trade capabilities — better at math, worse at creative writing. But circuit duplication appears to add pure reasoning horsepower without sacrificing other skills. The model just thinks longer.
How It Works: Finding Your Model's Hidden Circuits
Different models store their reasoning circuits in different places. Devstral-24B (40 layers) keeps its circuit at layers 12-14. Qwen2.5-32B (64 layers) hides its at layers 7-9. The boundaries are razor-sharp — shift by one layer in either direction and the improvement disappears or inverts.
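Concretely, duplicating Devstral's layers 12-14 turns the 40-layer stack into a 43-layer one with the circuit repeated in place (0-based indexing is an assumption here; the post doesn't specify):

```python
def duplicated_order(n_layers, start, end):
    """Layer index sequence after duplicating the block [start, end]."""
    circuit = list(range(start, end + 1))
    return list(range(start)) + circuit + circuit + list(range(end + 1, n_layers))

# Devstral-24B: 40 layers, circuit at 12-14.
order = duplicated_order(40, 12, 14)
print(len(order))    # 43 layers after duplication
print(order[10:20])  # [10, 11, 12, 13, 14, 12, 13, 14, 15, 16]
```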
The researchers built a complete toolkit for finding and exploiting these circuits in any GGUF model. The workflow is surprisingly straightforward: sweep a candidate window across the layer stack, benchmark each duplicated variant, and validate the winner against the baseline.
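The article doesn't publish sweep.py itself, but the workflow it describes — slide a candidate window across the stack, score each duplicated variant, keep the best — can be sketched as follows (the `benchmark` callable is a stand-in for rebuilding and evaluating the modified GGUF):

```python
def sweep_circuits(n_layers, window, benchmark):
    """Slide a duplication window across the stack and score each variant.

    `benchmark` is a placeholder: it takes (start, end) of the duplicated
    block and returns a score, e.g. BBH accuracy of the modified model.
    The real toolkit does the expensive rebuild-and-evaluate step here.
    """
    results = {}
    for start in range(n_layers - window + 1):
        end = start + window - 1
        results[(start, end)] = benchmark(start, end)
    best = max(results, key=results.get)
    return best, results

# Toy scorer that peaks at layers 12-14, mimicking the razor-sharp
# boundaries the article describes (off-by-one windows score poorly).
toy = lambda s, e: 0.76 if (s, e) == (12, 14) else 0.22
best, scores = sweep_circuits(40, 3, toy)
print(best)  # (12, 14)
```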
The entire discovery — sweep, validation, and all — took one evening on two consumer AMD GPUs (RX 7900 XT + RX 6950 XT). No data center required.
Different Patterns, Different Cognitive Modes
Here's where it gets really interesting: the same model weights can produce different cognitive profiles depending on how you route through the circuits.
Same weights on disk. Same VRAM footprint for the base model. Just different routing through the architecture produces measurably different capabilities. It's like discovering your AI has multiple personalities, and you can dial them up at will.
What This Means for Indie Developers
For developers building with open-source models, this discovery changes the optimization calculus. Instead of defaulting to "bigger is better," you can now ask: "Can I make my current model smarter through architectural tweaks?"
The trade-offs are clear and manageable. Duplicating 3 layers on a 40-layer model adds roughly 1.5 GB of VRAM and slows inference by about 7.5%. But for applications where reasoning quality matters — coding assistants, analysis tools, research agents — that overhead might be trivial compared to the capability gains.
More importantly, this is orthogonal to fine-tuning. You can do both. In fact, David Ng's original RYS models were later fine-tuned by the community and topped the HuggingFace leaderboard. Layer duplication changes the architecture; fine-tuning changes the weights. Stack them.
The Bigger Picture: Functional Neuroanatomy for AI
This research points toward something deeper than a simple optimization trick. It suggests that transformers naturally develop specialized structures during training — that the "black box" of neural networks might be more organized than we thought.
David Ng, who pioneered this approach, describes it as "LLM neuroanatomy" — treating transformer layers like brain regions that can be mapped, studied, and selectively enhanced. The implications extend beyond duplication. If models have discrete reasoning circuits, what other specialized structures might they contain? Memory circuits? Creativity circuits? Safety circuits?
For the open-source AI community, this is a significant democratizing force. You don't need a training cluster to improve model performance. You need understanding — and now you have the tools to develop it.
FAQ
Does this work with any LLM?
Testing so far validates Mistral-architecture models (like Devstral) and the Qwen2 architecture. The researchers expect such circuits to exist in all transformers; the open questions are where they sit and how large they are. The included sweep.py tool can find circuits in any GGUF model.
How much VRAM does the duplicated model need?
The duplicated layers are physical copies in the GGUF, so expect roughly 1.5 GB additional for 3 extra layers on a 24B model. A future llama.cpp patch could eliminate this overhead by using pointers instead of copies.
Does this slow down inference?
Yes, roughly in proportion to the extra layers: three additional layers on a 40-layer model means approximately 7.5% slower inference. For quality-sensitive applications, the reasoning improvement typically outweighs this cost.
Why not just fine-tune instead?
Fine-tuning changes weights; circuit duplication changes architecture. They're complementary. You can duplicate circuits, then fine-tune the result. The RYS models proved this stack works — community fine-tuning on top of layer duplication produced leaderboard-topping results.
How do I try this on my own models?
The complete toolkit is available at github.com/alainnothere/llm-circuit-finder. You'll need Python 3.10+, llama.cpp built for your hardware (CPU, CUDA, Vulkan, or Metal), and enough VRAM to run your model plus a few extra layers. The README includes complete installation and usage instructions.
Can I break my model by duplicating the wrong layers?
Shifting the duplicated block by even one layer from the optimal circuit typically causes the improvement to disappear or invert. The model won't break — it just won't get smarter. The sweep tools help identify the correct boundaries before you commit to a modified architecture.