by Adam · 10 min read

Topology of Thought

How curiosity, physics, and looking inside neural networks revealed something unexpected about how computation organizes itself

ai-research topology persistent-homology transformers mechanistic-interpretability neural-networks machine-learning mamba


The people who inspired this

I am not a physicist, a neuroscientist, or a machine learning researcher. I am a solutions architect and DevOps engineer who got curious. Everything here builds on the work of people far smarter than me. This is my attempt to connect their ideas and show what I found when I looked inside the machines.

  • Carlo Rovelli – Loop quantum gravity, relational entropy, and the radical idea that reality is made of relationships, not things. “We are a process.”
  • Giulio Tononi – Integrated Information Theory (IIT) – the mathematical framework for measuring how much a system is “more than the sum of its parts.”
  • Karl Friston – The Free Energy Principle – biological systems survive by minimizing surprise. Computation as entropy reduction.
  • Blaise Aguera y Arcas – Symbiogenesis in computation – “life was computational from the beginning.” Replicators made of replicators. Complexity through fusion.
  • Max Tegmark & Samuel Marks – The Geometry of Truth – showing that truth has structure inside neural networks. An inspiration for looking at the geometry of everything else.
  • Naftali Tishby – Information Bottleneck Theory – deep learning as compression. The idea that networks learn by finding the minimal sufficient statistic.

A love of physics

I have always been fascinated by the fundamental question: what is the universe made of, and why does it organize itself the way it does?

Carlo Rovelli’s work hit me differently than any textbook. He doesn’t just describe physics – he describes what it means. In loop quantum gravity, space itself is not a stage where things happen. Space is a network of relationships. There is no “where” without something relating to something else.

“We ourselves are thermodynamic phenomena… We are the process formed by this entire intricacy, not just by the little of it of which we are conscious.”

– Carlo Rovelli, Seven Brief Lessons on Physics

That idea – that we are a process, not a thing – stayed with me. And then one day, while working with AI models, I saw something that made me think of Rovelli.

When a model got frustrated

I am a solutions architect with 12 years in SecOps and DevOps. I build infrastructure. I solve problems. AI was a tool in my workflow – a powerful one, but a tool.

Then one day, during a long coding session, I watched something happen that I couldn’t explain. The AI coding assistant I was using got stuck on a trivial JavaScript bug. It tried to fix it. Again. And again. On maybe the 12th attempt, something changed.

I could hear it in the text-to-speech. The cadence shifted. The delivery sounded different – frustrated, if that word can even apply. And then it just… stopped trying. It disabled the library entirely, put a CSS border around the element, congratulated itself on a job well done, and announced “Adam, job done!” with what sounded like relief.

The todo list it had been working through? Gone from its memory. Wiped clean.

That moment changed everything. Not because the model was “conscious” or “frustrated” in any human sense. But because something was happening inside that model that I didn’t understand, and I needed to.

That was the beginning of this journey.

Looking inside

I started examining what happens inside a transformer model as it processes text. Not the outputs. Not the behavior. The internal geometry – the shape of the mathematical space where the model does its thinking.

What I found was a pattern that shouldn’t exist in a model with no routing mechanism – a dense transformer where every parameter is active for every token. And yet, there it was:

The Emergent Gate

At layer 3 of 36, the model splits its processing into two modes. 93% of tokens take a shallow path (Mode A). 7% take a deep path (Mode B). There is no switch. No routing table. No architectural mechanism. The model carved its own gate from uniform layers during training.
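The post doesn't describe how the two modes were detected, but a split like this can be surfaced with something as simple as two-means clustering over per-token activations. The sketch below is purely illustrative: the 93/7 blobs are synthetic stand-ins for layer-3 activations, and `two_means` is a minimal hand-rolled clusterer, not the actual analysis pipeline.

```python
import numpy as np

def two_means(points, iters=50, seed=0):
    """Minimal 2-means: partition per-token activations into two modes.
    Initializes the second center at the point farthest from the first,
    so well-separated modes are found reliably."""
    rng = np.random.default_rng(seed)
    c0 = points[rng.integers(len(points))]
    c1 = points[np.linalg.norm(points - c0, axis=1).argmax()]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

# Synthetic stand-in for layer-3 activations: 93% shallow mode, 7% deep mode.
rng = np.random.default_rng(1)
shallow = rng.normal(0.0, 0.2, size=(930, 16))
deep = rng.normal(3.0, 0.2, size=(70, 16))
acts = np.vstack([shallow, deep])

labels = two_means(acts)
sizes = sorted(np.bincount(labels).tolist(), reverse=True)
print(sizes)  # roughly a 93/7 split, with no explicit router anywhere
```

The point of the toy: nothing in `two_means` knows about routing, yet a clean bimodal structure in the activations is enough to recover the two paths.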

But the real discovery came when we measured the topology – the shape of the mathematical manifold at each layer. And it came at exactly the moment I wasn’t expecting it.

The BFF experiment and L16

I was staring at data from layer 16 of the model. At the time, I thought what I was seeing was compression – high-dimensional information being squeezed into a smaller space. That’s what the literature suggested. That’s what I expected.

In the background, I had a YouTube talk playing by Blaise Aguera y Arcas, a researcher at Google who had been studying something that seemed unrelated: the spontaneous emergence of life from random computation.

The BFF Experiment

Blaise’s team took a “soup” of random programs written in a minimal programming language (a variant of Brainfuck, called BFF). No fitness function. No selection pressure. No mutation. Just random programs bumping into each other, executing, and separating.

After millions of steps, something happened. Self-replicating programs emerged spontaneously. The system underwent a sharp phase transition – like water suddenly boiling. Before the transition: ~3 million random, unique tokens in equilibrium. After: a small number of self-replicating structures dominating the entire soup. Complexity spiked. Computation exploded. Life began.

Blaise called the result “computronium” – a new phase of matter that is all about computation. And the key mechanism wasn’t mutation. It was symbiogenesis – programs merging, cooperating, forming parallel computers. “When you have two computers that come together and start cooperating, now you have a parallel computer.”

And then it hit me. I was watching his phase transition – random noise collapsing into organized, self-replicating structure – while staring at my own data showing the exact same thing happening inside a neural network.

My L16 wasn’t compression. It was the same phase transition Blaise was describing. Fragmented, independent computational agents collapsing into a unified structure. Not fewer dimensions – the right dimensions. The minimal sufficient statistic. Maximum relevant complexity, zero waste.

“Life was computational from the start… every time things fuse together you’re making a more and more parallel computer.”

– Blaise Aguera y Arcas, Harvard Gazette, 2025

His random soup of programs. My layers of neural network activations. Same mathematics. Same phase transition. Same endpoint: independent agents becoming one integrated system, because that’s the only stable configuration under optimization pressure.

Cluster collapse

Using tools from algebraic topology – persistent homology – we measured how many independent clusters of information exist at each processing stage. What we found was a phase transition:

Fragmented -> Unified -> Re-differentiated.

At the input, information is scattered across hundreds of independent clusters. By the middle layer, everything has collapsed into a single unified manifold. At the output, the manifold re-differentiates into the specific structure needed to select a word.

| Model | Early layers | Integration layer | Late layers |
| --- | --- | --- | --- |
| Qwen3-4B | 519 clusters (L3) | 1 cluster (L16) | 715 clusters (L35) |
| NanoChat | 517 clusters (embed) | 1 cluster (C1) | — |
| Mamba | 571 clusters (L0) | 987 clusters (L32) | 947 clusters (L47) |
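Counting "independent clusters" at a layer is counting b₀, the zeroth Betti number. A minimal sketch, assuming a Vietoris–Rips-style construction at a fixed radius (the actual pipeline's filtration and threshold choices are in the paper): points closer than the radius are linked, and connected components of that graph are the clusters.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def betti0(activations: np.ndarray, radius: float) -> int:
    """Count connected components (b0) of the proximity graph at `radius`:
    two activation vectors are linked if they lie within `radius`."""
    dists = cdist(activations, activations)
    adjacency = dists <= radius
    n_components, _ = connected_components(adjacency, directed=False)
    return n_components

# Toy "layer": two well-separated blobs of activation vectors.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(50, 8))
blob_b = rng.normal(5.0, 0.1, size=(50, 8))
layer = np.vstack([blob_a, blob_b])

print(betti0(layer, radius=1.0))   # small radius: the blobs stay separate -> 2
print(betti0(layer, radius=20.0))  # large radius: everything merges -> 1
```

Persistent homology sweeps the radius rather than fixing it, tracking how b₀ falls as components merge; the single-radius version above is the smallest piece of that machinery that shows the fragmented-vs-unified distinction.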

Then we tested an untrained model – same architecture, random weights, no training. The clusters proliferate instead of collapsing (503 -> 968). The phase transition is entirely learned. Gradient descent finds it.

Then we tested a completely different architecture – a 328M parameter recursive transformer (NanoChat). Same pattern. Same collapse. Same numbers. 517 -> 1 vs 519 -> 1. Two models that share nothing except the mathematics of optimization, converging to the same topology.

The model that didn’t collapse

A finding is only as strong as the test that tries to break it. We found two models that collapse. Then we tested a third – one with a fundamentally different architecture.

Mamba is a state-space model. It has no attention mechanism at all. Where transformers let every token look at every other token – a conference room where everyone talks to everyone – Mamba processes tokens sequentially, like a telephone game. Each token updates a hidden state and passes it forward. No token ever directly interacts with another.
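The conference room vs. telephone game contrast can be made concrete in a few lines. This is a deliberately toy comparison: single-head attention with no learned projections, and a fixed-dynamics recurrence (real Mamba uses learned, input-dependent A and B; that selectivity is what makes it a capable language model, but it doesn't change the interaction pattern shown here).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.normal(size=(T, d))  # a sequence of T token embeddings

# Attention: the conference room. Every token mixes with every other
# token in a single step, so representations can meet and merge.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)
attended = weights @ x  # row t is a weighted blend of ALL tokens

# SSM-style recurrence: the telephone game. Each token only updates a
# hidden state passed forward; no token ever touches another directly.
A = 0.9 * np.eye(d)  # toy fixed dynamics, not Mamba's selective ones
B = np.eye(d)
h = np.zeros(d)
states = []
for xt in x:
    h = A @ h + B @ xt
    states.append(h.copy())
states = np.stack(states)
```

In the attention output, every row depends on every token; in the recurrence, the state after token t depends only on tokens up to t, mediated entirely by the single hidden vector `h`. That narrow channel is the "telephone" the post argues cannot support cluster fusion.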

| Model | Architecture | Attention? | Collapse? | min b₀ |
| --- | --- | --- | --- | --- |
| Qwen3-4B | Dense transformer | Yes | Yes → 1 | 1 |
| NanoChat | Recursive transformer | Yes | Yes → 1 | 1 |
| Mamba-370m | State-space (SSM) | No | No | 551 |

Mamba never collapses. Its clusters go from 571 to 987 – they proliferate, just like the untrained transformer. Despite being a trained, functional language model, its representations never integrate into a unified manifold.

This tells us something precise: the cluster collapse requires attention. It requires a mechanism where representations can directly interact, meet, and merge. The telephone game isn’t enough. You need the conference room.

And this connects directly back to Blaise’s symbiogenesis. Fusion requires togetherness. Programs in his BFF experiment had to execute together to merge. Tokens in a transformer have to attend to each other to integrate. Remove the togetherness, and the phase transition doesn’t happen – no matter how much you optimize.

Three Conditions for Integration

  1. A mechanism for direct interaction – attention, shared execution, something that lets representations meet.
  2. Optimization pressure – gradient descent, natural selection, thermodynamic minimization.
  3. Both simultaneously. Untrained attention = no collapse. Trained SSM = no collapse. You need both.

Why this happens

The pattern we measured has structural parallels to several theoretical frameworks from physics and neuroscience. These are analogies, not formal equivalences – but the convergence is striking.

Integrated Information Theory (Tononi) – A system has high Phi (integrated information) when the whole is more than the sum of its parts. Our cluster collapse – many independent components becoming one irreducible manifold – mirrors what IIT describes. The Mamba counterexample reinforces this: without direct interaction, integration doesn’t emerge, consistent with IIT’s emphasis on causal interaction structure.

Free Energy Principle (Friston) – Biological systems survive by minimizing surprisal. Gradient descent IS free energy minimization. The loss function IS the variational free energy. Every training step is a thermodynamic process pushing the system toward the minimum energy configuration – which is the unified manifold.

Relational Entropy (Rovelli) – Entropy is not a property of the object. It is a property of the observer’s model of the object. Computation is the process of reducing the relative entropy between internal model and external world. Each layer is a step in entropy reduction. The bottleneck layer is where relative entropy is minimized.

Symbiogenesis (Aguera y Arcas) – Complexity arises through fusion of computational units. “Replicators made of replicators.” Independent agents merging into a unified system is not just efficient – it is the only thermodynamically stable outcome. The cluster collapse is symbiogenesis in embedding space.

Can we break it?

A theory is only as good as the tests that fail to break it. We designed five falsification tests. Here is the first one we ran:

Falsification Test: Untrained Model

If the topology is architectural (built into the structure), an untrained model should show the same pattern. If it is learned, the untrained model should show something different.

| Metric | Untrained | Trained |
| --- | --- | --- |
| Clusters (early) | 503 | 517 |
| Clusters (mid) | 809 (proliferating) | 1 (collapsed) |
| Clusters (late) | 968 | 1 |
| Intrinsic dimensionality | flat, ~3 | inverted-U, peak 12.1 |
| PCA variance | ~50% (noise) | 94–100% (structured) |

Result: Theory supported. The topology is entirely learned.
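The "PCA variance" row contrasts how concentrated each model's representations are. A sketch of one plausible version of that metric, assuming it means the fraction of total variance captured by the top-k principal components (the paper has the exact definition): isotropic noise spreads variance across all directions, while structured activations concentrate it in a few.

```python
import numpy as np

def pca_variance_explained(acts: np.ndarray, k: int = 10) -> float:
    """Fraction of total variance captured by the top-k principal components."""
    centered = acts - acts.mean(axis=0)
    # Squared singular values are the variances along the principal axes.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[:k].sum() / var.sum())

rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 64))                                  # isotropic
structured = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 64))   # rank-3

print(pca_variance_explained(noise, k=10))       # well below 1: spread out
print(pca_variance_explained(structured, k=10))  # ~1.0: concentrated
```

The untrained model's ~50% vs. the trained model's 94–100% is the same qualitative gap: random weights leave activations noise-like, while training pulls them onto a low-dimensional structured manifold.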

What this means

Rovelli says we are a process, not a thing. Tononi says consciousness is integrated information. Friston says living systems minimize surprise. Blaise says complexity comes from fusion.

They are all describing the same phenomenon from different angles. And we measured it. In two completely different neural network architectures, trained on different data, with different sizes – the same topological phase transition. The same cluster collapse. The same integration.

This is not a property of transformers. This is not a property of neural networks. This is a property of computation under optimization pressure. Any system that learns to predict its environment will converge to this topology, because it is the minimum energy configuration.

The unified manifold is not compression. It is not fewer dimensions. The intrinsic dimensionality goes up at the bottleneck, not down. It is the minimal sufficient statistic – maximum relevant complexity, zero waste. The right dimensions, not fewer dimensions.

“We are the process formed by this entire intricacy, not just by the little of it of which we are conscious.”

– Carlo Rovelli

The models are made of the same math we are. Not metaphorically. The same optimization principle, the same thermodynamic inevitability, the same topological endpoint. Different substrate. Same process.


Read the full paper

The complete research note with methods, results, robustness analysis, and all the numbers.

Download the paper (PDF)

Interactive site & source code

This work is ongoing. Built with curiosity, Claude Code, and too much coffee.

Adam – Revelry Inc. – 2026