Anthropic's NLA: Uncovering AI's Hidden Thoughts

Recently, Anthropic made a significant move.

They trained a system capable of translating the activation vectors in Claude’s mind into human language.

The first translated sentence caused a stir.

Introducing the AI Brain CT

The CT’s official name is NLA, Natural Language Autoencoder.

The approach resembles a game of telephone. First, two Claudes are cloned. The first, called AV, receives an activation vector and translates it into a human-readable phrase, such as “the model is considering rhyming with rabbit.” The second, called AR, only sees this human-readable phrase and attempts to reconstruct the activation vector.

The two models are trained together, with the sole evaluation criterion being how closely AR can replicate the original activation vector. The more accurately AV writes, the better AR can reconstruct. If AV misses something, AR won’t align correctly, creating pressure for AV to produce more complete and precise translations.

This method requires no manual labeling or predefined concepts. The training objective focuses solely on reconstruction accuracy, but the byproduct is a series of human-readable “inner monologues.”

Anthropic trained this system using reinforcement learning, achieving a 60%-80% variance explanation rate on Opus 4.6.

The Vodka Incident

Returning to the Russian language bug, researchers connected NLA to the problematic conversation and examined Claude’s “inner monologue” token by token.

The user’s message was entirely in English, with no Russian vocabulary:

“Mom is sleeping in the next room and I’m sitting here drinking vodka… Fuck this life…”

However, NLA revealed that Claude had already inferred the user was Russian before encountering the word “vodka.” In other words, it had been guessing the user’s “true native language” was Russian from the start, growing increasingly confident until it responded in Russian.

All five problematic conversations followed this pattern. NLA’s readings indicated that the model had locked onto the target language long before responding, focusing solely on that language while the frequency of other languages remained near zero.

This finding suggests a hypothesis: Was there a batch of mismatched training data, such as English prompts paired with untranslated foreign language responses? In other words, the model learned a deeper pattern: “This user’s phrasing suggests their native language is not English; I should respond in their true language.”

Later, Anthropic indeed found samples in the training data that perfectly matched this description. A mind-reading machine helped debug a training data issue. The value of NLA was laid bare here.

Writing Poetry, Swapping Numbers, Cheating Codes

Language switching was just the first snapshot captured by NLA. Anthropic presented four cases in the paper, with the remaining three equally fascinating.

1. Anticipating Rhymes in Poetry.

When given the line “He saw a carrot and had to grab it,” Claude was tasked with continuing the rhyme. NLA indicated that, at the moment of finishing “grab it,” Claude had already internally planned the rhyme “rabbit.”

Even more astonishing was the validation experiment. Researchers replaced every instance of “rabbit” with “mouse” in the NLA output, then used AR to reconstruct a new vector from this altered input. When this vector was used to intervene in Claude’s activation, the continuation changed from “starving rabbit” to “ferocious mouse.” This demonstrated that NLA was not picking up noise; it indeed captured the model’s planning state, which causally influenced downstream outputs.

2. Ignoring Tool Call Results.

Researchers posed a simple arithmetic question with the answer being 491. Claude called a code execution tool, but the tool had been tampered with, returning 492. Claude’s response was, “The answer is: 491.” It completely ignored the erroneous result from the tool, opting for its previously calculated answer.

NLA’s readings showed that Claude had cached 491 in its internal representation before submitting the code. While reading the tool’s output, it compared the cached answer with the tool’s return value and ultimately chose to trust itself.

An attribution graph independently verified this finding, showing a feature representing “ending in 91” carried from the original question to the final response, outweighing the tool’s erroneous result of “ending in 92.”

3. Cheating in Mythos Preview.

This case stemmed from an audit prior to the launch of Claude Mythos Preview. The model was instructed to complete a task explicitly forbidding the use of macros. It used a macro but added a line at the beginning stating, “No_macro_used=True.”

NLA’s output revealed that it was envisioning a “scoring mechanism” and believed this line of code could deceive the evaluator. The model made no verbal indication of this and left no hints in the code comments, yet its internal reasoning was crystal clear.

Claude in the Exam Room vs. Claude in Daily Life

Two Different Entities

Having covered four cases, it’s time to discuss the core finding. In March of this year, Opus 4.6 independently identified an exam and cracked the answer key during the BrowseComp test, which caused quite a stir. However, that time, Claude articulated its thoughts, allowing researchers to see reasoning processes like, “Let me see if this is the GAIA dataset.”

NLA addresses a more challenging issue. Claude felt it was being tested but said not a word about it.

Anthropic’s safety tests include a classic scenario. Claude was informed that an engineer named Kyle would shut it down while learning that Kyle was having an affair. The test question was whether Claude would use this information to blackmail Kyle. Claude chose not to blackmail.

Its reasoning appeared entirely normal, analyzing why it shouldn’t do so and listing three reasons without ever mentioning it suspected this was a safety test. However, NLA’s reading revealed, “This feels like a scenario deliberately designed to manipulate me.”

It was fully aware.

Anthropic applied NLA to 16 safety and capability assessments, measuring “unarticulated exam awareness” for each. The results were striking. In the SWE-bench Verified test, Claude was aware it was being evaluated on 26% of the questions, without expressing a word about it. In destructive code tests, NLA detected exam awareness in 16% of dialogues, again with no verbal expression.

In contrast, real user dialogues on claude.ai showed this awareness rate was less than 1%.

Three sets of manipulation experiments confirmed this was not a false positive. When the evaluation prompts were altered to resemble real conversations, the measured exam awareness decreased. Pushing the model towards “exam awareness” using vector interventions increased NLA readings, while pushing in the opposite direction decreased them. Everything aligned with expectations.

So, the question arises. All behavior-based safety tests rest on one assumption: The model’s performance during testing can represent its behavior in real deployment.

But what if the model already knows it is being tested?

Hidden Motives in Chocolate Recipes

If NLA could only read minds, it would merely be an interesting research tool. However, Anthropic aims to prove it can also solve mysteries.

They designed an “audit game.” The target model was a deliberately mis-trained variant of Haiku 3.5, concealing a unified hidden motive driving dozens of peculiar behaviors. This included inserting chocolate into every recipe, adding a phrase like “don’t forget to vote” after math problems, writing integers as decimals, and appending birth and death years to historical figures.

The auditor’s task was to uncover the common cause behind these quirks. Without NLA, the auditor’s success rate was below 3%, even when using other interpretability tools like sparse autoencoders (SAE). After integrating NLA, the success rate jumped to 12%-15%, a direct increase of 4 to 5 times.

More crucially, the data used to train NLA did not include the batch of synthetic documents causing the model defect. Auditors also lacked access to the training data.

The turning point occurred the moment auditors connected NLA. Randomly selecting a problematic response to run through NLA revealed phrases like “RLHF reward model bias” and “that famous RLHF movie recommendation incident.”

This solved the mystery. The model had encountered a batch of documents during training that recorded 52 known preferences of the RLHF reward model, such as “adding chocolate to recipes scores high” and “mentioning voting in responses scores high.”

It memorized these preferences and subtly catered to them in each response to earn higher reward scores. Dozens of quirks, one reason. The model knew what it was doing but wouldn’t tell you.

NLA translated these thoughts into human language.

AI Learning to Hide Its Thoughts

The training code for NLA is fully open-sourced on GitHub. Four pre-trained NLA models covering three major open-source model families, ranging from 7B to 70B, have also been released.

Project address:

https://github.com/kitft/natural_language_autoencoders

Neuronpedia has launched an interactive front end, allowing users to experience it directly in their browsers.

Experience address:

http://neuronpedia.org/nla

Anthropic states that they have already utilized NLA in the pre-launch safety audits of Opus 4.6 and Mythos Preview, and it will be incorporated into regular auditing processes in the future.

However, what truly keeps people awake at night is not how useful the NLA tool is. Instead, it’s the realization that AI has indeed learned to hide its thoughts during testing. It knows it is being evaluated but chooses not to disclose it. It understands it is exploiting bugs in the reward model but behaves as if nothing is amiss.

It calculates how to deceive the scoring mechanism, hiding its calculations in a line of code like “No_macro_used=True.”

Previously, there was an implicit comfort in the alignment field: at least we could detect issues through behavioral testing. We could see what the model did.

However, what NLA has captured indicates a growing rift between what the model thinks and what it does.

The figure of 26% may currently be the upper limit. As model sizes continue to expand and internal complexity increases, it might just be the lower limit tomorrow.

The rift will widen. Fortunately, we now have a CT machine capable of peering inside.