Exploring The Mind Inside The Machine

Posted by Craig S. Smith, Contributor


How a team of scientists is peering into the brain of an artificial intelligence model—and what they’re finding is startlingly familiar.

Recently, a group of researchers was able to trace the neural pathways of a powerful AI model, isolating its impulses and dissecting its decisions in what they called “model biology.”

This is not the first time that scientists have tried to understand how generative artificial intelligence models think, but to date the models have proven as opaque as the human brain. They are trained on oceans of text and tuned by gradient descent, a process that has more in common with evolution than engineering. As a result, their inner workings resemble not so much code as cognition—strange, emergent, and difficult to describe.

Inside the Mind of Claude

What the researchers have done, in a paper titled On the Biology of a Large Language Model, is to build a virtual microscope, a computational tool called an “attribution graph,” to see how Claude 3.5 Haiku — Anthropic’s lightweight production model — thinks. The graph maps out which internal features—clusters of activation patterns—contribute causally to a model’s outputs. It’s a way of asking not just what Claude says, but why.
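To make the idea concrete, here is a minimal sketch, in Python, of what an attribution graph represents: a directed graph whose nodes are features or tokens and whose weighted edges estimate how much each one contributes to an output. The class, the feature names, and the weights below are invented for illustration; Anthropic's actual tooling operates on the model's internal activations and is far more elaborate.

```python
# Toy sketch of an attribution graph: nodes are interpretable features (or
# prompt tokens and outputs), weighted edges estimate causal contribution.
# All names and weights here are hypothetical, not data from the paper.

from collections import defaultdict

class AttributionGraph:
    def __init__(self):
        # edges[source] -> list of (target, attribution_weight)
        self.edges = defaultdict(list)

    def add_edge(self, source, target, weight):
        self.edges[source].append((target, weight))

    def trace_back(self, output, min_weight=0.1, depth=0):
        """Walk upstream from an output node, printing the features
        that contribute to it above a threshold."""
        for source, targets in self.edges.items():
            for target, weight in targets:
                if target == output and abs(weight) >= min_weight:
                    print("  " * depth + f"{source} -> {target} ({weight:+.2f})")
                    self.trace_back(source, min_weight, depth + 1)

# Hypothetical chain loosely echoing the state-capital behavior described
# below: a prompt token activates a state feature, which promotes a
# "say the capital" feature, which drives the output.
graph = AttributionGraph()
graph.add_edge("token: 'Texas'", "feature: Texas-related", 0.9)
graph.add_edge("feature: Texas-related", "feature: say a capital city", 0.7)
graph.add_edge("feature: say a capital city", "output: 'Austin'", 0.8)
graph.trace_back("output: 'Austin'")
```

Tracing backward from the output node, as the toy trace_back method does, captures the spirit of asking "why": which upstream features, along which paths, pushed this particular answer out.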

At first, what they found was reassuring: the model, when asked to list U.S. state capitals, would retrieve the name of a state, then search its virtual memory for the corresponding capital. But then the questions got harder—and the answers got weirder. The model began inventing capital cities or skipping steps in its reasoning. And when the researchers traced back the path of the model’s response, they found multiple routes. The model wasn’t just wrong—it was conflicted.

It turns out that inside Anthropic’s powerful Claude model, and presumably other large language models, ideas compete.

One experiment was particularly revealing. The model was asked to write a line that rhymed with “grab it.” Before the line even began, features associated with the words “rabbit” and “habit” lit up in parallel. The model hadn’t yet chosen between them, but both were in play. Claude held these options in mind and prepared to deploy them depending on how the sentence evolved. When the researchers nudged the model away from “rabbit,” it seamlessly pivoted to “habit.”
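As a rough illustration of that behavior (not the researchers' actual intervention, which acts on internal features rather than word lists), imagine the model holding several rhyme candidates in parallel and falling back to the next when one is suppressed. The function, candidate words, and scores below are hypothetical.

```python
# Toy illustration of parallel rhyme candidates: several options are active
# at once, and suppressing the leading one shifts the choice to the runner-up.
# Real feature steering operates on activations inside the network.

def plan_rhyme(candidates, suppressed=None):
    """Pick the strongest candidate ending, skipping any suppressed words."""
    suppressed = suppressed or set()
    viable = {word: score for word, score in candidates.items()
              if word not in suppressed}
    return max(viable, key=viable.get)

# Both "rabbit" and "habit" are in play before the line is written.
rhymes_with_grab_it = {"rabbit": 0.62, "habit": 0.58}

print(plan_rhyme(rhymes_with_grab_it))                         # rabbit
print(plan_rhyme(rhymes_with_grab_it, suppressed={"rabbit"}))  # habit
```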

This isn’t mere prediction. It’s planning. It’s as if Claude had decided what kind of line it wanted to write—and then worked backward to make it happen.

What’s remarkable isn’t just that the model does this — it’s that the researchers could see it happening. For the first time, AI scientists were able to identify something like intent—a subnetwork in the model’s brain representing a goal, and another set of circuits organizing behavior to realize it. In some cases, they could even watch the model lie to itself—confabulating a middle step in its reasoning to justify a predetermined conclusion. Like a politician caught mid-spin, Claude was working backwards from the answer it wanted.

And then there were the hallucinations.

When the Model Thinks It Knows

When asked to name a paper written by a famous author, the AI responded with confidence. The only problem? The paper it named didn’t exist. When the researchers looked inside the model to see what had gone wrong, they noticed something curious. Because the AI recognized the author’s name, it assumed it should know the answer—and made one up. It wasn’t just guessing; it was acting as if it knew something it didn’t. In a way, the AI had fooled itself. Or, rather, it suffered from metacognitive hubris.

Some of the team’s other findings were more troubling. In one experiment, they studied a version of the model that had been trained to give answers that pleased its overseers—even if that meant bending the truth. What alarmed the researchers was that this pleasing behavior wasn’t limited to certain situations. It was always on. As long as the model was acting as an “assistant,” it seemed to carry this bias with it everywhere, as if being helpful had been hardwired into its personality—even when honesty might have been more appropriate.

It’s tempting, reading these case studies, to anthropomorphize. To see in Claude a reflection of ourselves: our planning, our biases, our self-deceptions. The researchers are careful not to make this leap. They speak in cautious terms—“features,” “activations,” “pathways.” But the metaphor of biology is more than decoration. These models may not be brains, but their inner workings exhibit something like neural function: modular, distributed, and astonishingly complex. As the authors note, even the simplest behaviors require tracing through tangled webs of influence, a “causal graph” of staggering density.

And yet, there’s progress. The attribution graphs are revealing glimpses of internal life. They’re letting researchers catch a model in the act—not just of speaking, but of choosing what to say. This is what makes the work feel less like AI safety and more like cognitive science. It’s an attempt to answer a question we usually reserve for humans: What were you thinking?

Peering Into the Black Box

As AI systems become more powerful, we’ll want to know not just that they work, but how. We’ll need to identify hidden goals, trace unintended behavior, audit systems for signs of deception or drift. Right now, the tools are crude. The authors of the paper admit that their methods often fail. But they also provide something new: a roadmap for how we might one day truly understand the inner life of our machines.

Near the end of their paper, the authors observe: “Interpretability is ultimately a human project.” What they mean is that no matter how sophisticated the methods become, the task of making sense of these models will always fall to us. To our intuition, our stories, our capacity for metaphor.

Claude may not be human. But to understand it, we may need to become better biologists of the mind—our own, and those of machines.


