<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://subhadipmitra.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://subhadipmitra.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-18T06:21:54+00:00</updated><id>https://subhadipmitra.com/feed.xml</id><title type="html">Subhadip Mitra</title><subtitle>Data platforms. AI systems. The infrastructure between them. Engineering Leader at Google Cloud. AI Systems Architect.</subtitle><author><name>Subhadip Mitra</name><email>contact@subhadipmitra.com</email></author><entry><title type="html">Attention Is All You Bid: Advertising in Embedding Space</title><link href="https://subhadipmitra.com/blog/2026/attention-is-all-you-bid/" rel="alternate" type="text/html" title="Attention Is All You Bid: Advertising in Embedding Space"/><published>2026-04-04T00:00:00+00:00</published><updated>2026-04-04T00:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/attention-is-all-you-bid</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/attention-is-all-you-bid/"><![CDATA[<style>[data-theme="dark"] .post-svg-viz rect.svg-bg{fill:#1a1a1a;stroke:#333}[data-theme="dark"] .post-svg-viz line.svg-grid{stroke:#2a2a2a}[data-theme="dark"] .post-svg-viz .svg-grid{stroke:#2a2a2a}[data-theme="dark"] .post-svg-viz text.svg-label{fill:#b8b8b8}[data-theme="dark"] .post-svg-viz text.svg-label-dark{fill:#d4d4d4}[data-theme="dark"] .post-svg-viz text.svg-label-muted{fill:#707070}[data-theme="dark"] .post-svg-viz g.svg-label-muted{fill:#707070}[data-theme="dark"] .post-svg-viz .svg-legend-bg{fill:#1a1a1a;stroke:#333}[data-theme="dark"] .post-svg-viz .svg-legend-text{fill:#b8b8b8}[data-theme="dark"] .mermaid .messageText{fill:#d4d4d4!important}[data-theme="dark"] .mermaid .messageLine0,[data-theme="dark"] .mermaid 
.messageLine1{stroke:#b8b8b8!important}[data-theme="dark"] .mermaid .sequenceNumber{fill:#fff!important}[data-theme="dark"] .mermaid line.actor-line{stroke:#888!important}[data-theme="dark"] .mermaid rect.actor{fill:#2a2a2a!important;stroke:#666!important}[data-theme="dark"] .mermaid text.actor>tspan{fill:#d4d4d4!important}[data-theme="dark"] .mermaid .note{fill:#2a2a2a!important;stroke:#555!important}[data-theme="dark"] .mermaid .noteText{fill:#d4d4d4!important}[data-theme="dark"] .mermaid .activation0{fill:#333!important;stroke:#666!important}[data-theme="dark"] .mermaid .loopText>tspan{fill:#d4d4d4!important}[data-theme="dark"] .mermaid marker path{fill:#b8b8b8!important;stroke:#b8b8b8!important}[data-theme="dark"] .mermaid .edgeLabel{background-color:#1a1a1a!important;color:#d4d4d4!important}[data-theme="dark"] .mermaid .edgeLabel span{color:#d4d4d4!important}[data-theme="dark"] .mermaid .label{color:#d4d4d4!important}</style> <blockquote> <p><strong>TL;DR:</strong> OpenAI is showing ads in ChatGPT. Perplexity tried and pulled back. Google is taking a measured approach. Meanwhile, the real action is happening underneath: regions of vector embedding space near high-value queries are becoming the new commercially contested territory - the “shelf space” of the AI era. GEO (Generative Engine Optimization) and RAG poisoning are points on the same spectrum, and nobody is connecting the security research, the marketing industry, and the mechanism design papers. This post maps the landscape, identifies the gaps, and proposes a framework for thinking about embedding space as an economic system.</p> </blockquote> <p>Three months ago, OpenAI flipped a switch and started showing ads inside ChatGPT. Criteo signed on as the first ad-tech partner. Smartly followed days later with something even more radical - conversational ad formats where clicking a sponsored suggestion drops you into <em>another chatbot dialogue</em> designed to sell you something. 
Meanwhile, Anthropic ran a Super Bowl ad mocking the whole idea, and Perplexity quietly pulled its own ads after they tanked user experience.</p> <p>We are watching, in real time, the birth of the next trillion-dollar advertising market. And almost nobody is talking about what’s actually happening underneath.</p> <p>I’ve spent the last few weeks reading every paper, press release, and pitch deck I could find on LLM advertising. What I found is a field that’s moving fast on the surface - auction mechanisms, ad formats, CPM pricing - while ignoring a structural problem that could define the next decade of the internet:</p> <p><strong>Vector embedding space is the new commercially contested territory. It has no transparency, no regulation, and no honest market mechanism. And people are already gaming it.</strong></p> <p>This post lays out the landscape, the open problems, and a framework for thinking about what comes next.</p> <hr/> <h2 id="a-brief-history-of-attention-markets">A Brief History of Attention Markets</h2> <p>Every era of the internet created a new scarce resource and then built a billion-dollar market around bidding for it.</p> <div style="display:grid;grid-template-columns:repeat(4,1fr);gap:0;margin:1.5rem auto;max-width:800px;" role="img" aria-label="Timeline of attention markets: Shelf Space (1960s, ~$50B/yr), PageRank (1998, ~$200B/yr), Feed Position (2010, ~$300B/yr), and Embedding Space (2024, market size unknown)"> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #8b7355;text-align:center;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#8b7355;"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#8b7355;font-weight:600;margin-bottom:0.4rem;">1960s - 1990s</div> <div style="font-size:1.05rem;font-weight:700;color:var(--qa-text,#4a4540);line-height:1.2;">Shelf Space</div> <div 
style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Physical proximity<br/>to consumers</div> <div style="margin-top:0.6rem;font-size:0.7rem;color:var(--qa-text-muted,#9e9788);">Mechanism: <strong style="color:#8b7355;">Slotting fees</strong></div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#8b7355;">~&#36;50B<span style="font-size:0.65rem;font-weight:400;">/yr</span></div> </div> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #5578a0;text-align:center;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#5578a0;"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#5578a0;font-weight:600;margin-bottom:0.4rem;">1998 - 2015</div> <div style="font-size:1.05rem;font-weight:700;color:var(--qa-text,#4a4540);line-height:1.2;">PageRank</div> <div style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Link graph position<br/>determines visibility</div> <div style="margin-top:0.6rem;font-size:0.7rem;color:var(--qa-text-muted,#9e9788);">Mechanism: <strong style="color:#5578a0;">Keyword auctions</strong></div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#5578a0;">~&#36;200B<span style="font-size:0.65rem;font-weight:400;">/yr</span></div> </div> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #7b55a0;text-align:center;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#7b55a0;"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#7b55a0;font-weight:600;margin-bottom:0.4rem;">2010 - 2023</div> <div style="font-size:1.05rem;font-weight:700;color:var(--qa-text,#4a4540);line-height:1.2;">Feed Position</div> <div style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Algorithmic ranking<br/>in content streams</div> <div 
style="margin-top:0.6rem;font-size:0.7rem;color:var(--qa-text-muted,#9e9788);">Mechanism: <strong style="color:#7b55a0;">Attention auctions</strong></div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#7b55a0;">~&#36;300B<span style="font-size:0.65rem;font-weight:400;">/yr</span></div> </div> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #c44040;text-align:center;background:rgba(196,64,64,0.03);border-radius:0 6px 6px 0;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#c44040;box-shadow:0 0 0 3px rgba(196,64,64,0.2);"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#c44040;font-weight:600;margin-bottom:0.4rem;">2024 - ???</div> <div style="font-size:1.05rem;font-weight:700;color:#c44040;line-height:1.2;">Embedding Space</div> <div style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Semantic proximity<br/>in vector space</div> <div style="margin-top:0.6rem;font-size:0.7rem;color:#c44040;font-style:italic;">No market mechanism yet</div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#c44040;">???</div> </div> </div> <p>In each era, the scarce resource was different, but the pattern was identical:</p> <p><strong>Shelf space</strong> was finite. Procter &amp; Gamble figured out that paying retailers for eye-level placement was worth more than any ad campaign. The “slotting fee” was born - brands literally bidding on physical proximity to consumers.</p> <p><strong>PageRank</strong> turned the link graph into a scarce resource. If your site was semantically close to a high-value query in Google’s index, you had “real estate” worth millions. Google built a <span>$</span>200B/year business by auctioning off the space next to those organic results.</p> <p><strong>Feed position</strong> made attention sequential. 
Facebook and Instagram learned that controlling the <em>order</em> in which you see things was worth more than controlling the <em>content</em>. The algorithmic feed became the scarce resource, and advertisers bid on interrupting it.</p> <p>Now we’re entering the fourth era. When someone asks ChatGPT “what’s the best running shoe for marathon training?” - the answer isn’t a list of links. It’s a synthesized response generated from the model’s parameters and, increasingly, from documents retrieved via RAG (Retrieval-Augmented Generation). The scarce resource is no longer a slot on a page. It’s <strong>proximity in embedding space</strong> - whether your product’s representation is close enough to the user’s query to be retrieved, cited, or recommended.</p> <p>And unlike every previous era, there’s no visible boundary between the organic result and the commercial influence.</p> <hr/> <h2 id="how-llm-advertising-actually-works-as-of-april-2026">How LLM Advertising Actually Works (As of April 2026)</h2> <p>The public conversation is weirdly disconnected from the technical reality. Here’s what’s actually going on.</p> <h3 id="whats-live-right-now">What’s Live Right Now</h3> <p>OpenAI launched “Sponsored Suggestions” in ChatGPT on February 9, 2026. These are contextually relevant cards that appear below the AI’s organic response - a hotel promotion after a travel query, an air fryer ad after a cooking question. They’re restricted to Free and Go tier users in the US. 
Users on the Plus, Pro, Business, and Enterprise tiers don’t see them.</p> <p>The initial pricing tells you how they value this attention:</p> <div style="display:flex;gap:1rem;justify-content:center;flex-wrap:wrap;margin:1.5rem 0;"> <div style="flex:1;min-width:180px;max-width:220px;border:1.5px solid #c44040;border-radius:8px;padding:1rem 1.2rem;text-align:center;background:rgba(196,64,64,0.04);"> <div style="font-size:2rem;font-weight:700;color:#c44040;line-height:1;">~&#36;60</div> <div style="font-size:0.75rem;color:#c44040;margin-top:0.2rem;font-weight:600;">CPM</div> <div style="font-size:0.7rem;color:var(--qa-text-muted,#9e9788);margin-top:0.5rem;">ChatGPT Sponsored<br/>Suggestions (2026)</div> <div style="font-size:0.65rem;color:var(--qa-text-muted,#b0a898);margin-top:0.3rem;">&#36;200K+ min commitment</div> </div> <div style="flex:1;min-width:180px;max-width:220px;border:1.5px solid #5578a0;border-radius:8px;padding:1rem 1.2rem;text-align:center;background:rgba(85,120,160,0.04);"> <div style="font-size:2rem;font-weight:700;color:#5578a0;line-height:1;">~&#36;2-5</div> <div style="font-size:0.75rem;color:#5578a0;margin-top:0.2rem;font-weight:600;">CPM</div> <div style="font-size:0.7rem;color:var(--qa-text-muted,#9e9788);margin-top:0.5rem;">Google Search Ads<br/>(average)</div> <div style="font-size:0.65rem;color:var(--qa-text-muted,#b0a898);margin-top:0.3rem;">Self-serve, no minimum</div> </div> <div style="flex:1;min-width:180px;max-width:220px;border:1.5px solid #0d9488;border-radius:8px;padding:1rem 1.2rem;text-align:center;background:rgba(13,148,136,0.04);"> <div style="font-size:2rem;font-weight:700;color:#0d9488;line-height:1;">1.5x</div> <div style="font-size:0.75rem;color:#0d9488;margin-top:0.2rem;font-weight:600;">CONVERSION</div> <div style="font-size:0.7rem;color:var(--qa-text-muted,#9e9788);margin-top:0.5rem;">LLM referral vs.<br/>other channels</div> <div style="font-size:0.65rem;color:var(--qa-text-muted,#b0a898);margin-top:0.3rem;">Criteo early
data</div> </div> </div> <p>OpenAI is pricing this at 12-30x Google because they believe conversational intent is qualitatively different from keyword intent - and the early conversion data backs it up.</p> <p>The key architectural claim OpenAI makes: ads are structurally separated from organic responses. The model generates its answer first, completely independent of advertising. Then the ad system matches a contextually relevant sponsored suggestion and appends it below. The ads do not influence the AI’s actual answers.</p> <p>Hold that thought. We’ll come back to it.</p> <h3 id="whats-being-built">What’s Being Built</h3> <p>The academic community has been busy. Over the past two years, researchers have proposed multiple auction mechanisms for LLM ad placement:</p> <pre><code class="language-mermaid">graph TD
    subgraph "Pre-Generation Mechanisms"
        A["Segment Auctions&lt;br/&gt;(Hajiaghayi et al., 2024)&lt;br/&gt;RAG-based ad allocation&lt;br/&gt;per discourse segment"] 
        B["Position Auctions&lt;br/&gt;(Balseiro et al., 2025)&lt;br/&gt;Extending traditional slots&lt;br/&gt;to AI-generated content"]
    end

    subgraph "Post-Generation Mechanisms"
        C["Token Auctions&lt;br/&gt;(Dutting et al., 2024)&lt;br/&gt;WWW Best Paper&lt;br/&gt;Token-by-token bidding"]
        D["Truthful Aggregation&lt;br/&gt;(Soumalias et al., 2024)&lt;br/&gt;RLHF-style reward&lt;br/&gt;aggregation"]
    end

    subgraph "Integrated Mechanisms"
        E["LLM-Auction&lt;br/&gt;(Zhao et al., Dec 2025)&lt;br/&gt;Learning-based generative&lt;br/&gt;auction, end-to-end"]
        F["Genre-Based Insertion&lt;br/&gt;(Jan 2026)&lt;br/&gt;Decoupled response-level&lt;br/&gt;ad placement"]
    end

    A --&gt; G["LLM generates response&lt;br/&gt;conditioned on winning ads"]
    B --&gt; G
    C --&gt; H["Auction selects/aggregates&lt;br/&gt;during token generation"]
    D --&gt; H
    E --&gt; I["Auction and generation&lt;br/&gt;jointly optimized"]
    F --&gt; I

    style A fill:#fff3cd,stroke:#856404,color:#4a3800
    style B fill:#fff3cd,stroke:#856404,color:#4a3800
    style C fill:#d1ecf1,stroke:#0c5460,color:#0a3d47
    style D fill:#d1ecf1,stroke:#0c5460,color:#0a3d47
    style E fill:#d4edda,stroke:#155724,color:#14401d
    style F fill:#d4edda,stroke:#155724,color:#14401d
    style G fill:#f0f0f0,stroke:#666,color:#333
    style H fill:#f0f0f0,stroke:#666,color:#333
    style I fill:#f0f0f0,stroke:#666,color:#333
</code></pre> <p>The key split is between mechanisms that decide ad allocation <em>before</em> the LLM generates a response, and those that let the LLM generate multiple candidate responses and then pick or aggregate. Pre-generation is cheaper (one forward pass) but ignores externalities - how ads interact with the surrounding context. Post-generation is higher quality but requires multiple inference passes, which gets expensive fast when you’re serving hundreds of millions of queries per day.</p> <p>Google Research’s token auction (WWW 2024 Best Paper, Dutting et al.) was the first rigorous treatment. They proved that under robust preferences, monotone aggregation functions enable second-price-style payments - bringing classical auction theory into the LLM generation process. It’s elegant theory. It also requires access to model weights and per-token distributions, which makes it impractical for third-party advertisers.</p> <p>The most recent work, LLM-Auction (Zhao et al., December 2025), tries to solve this by integrating the auction directly into the LLM’s generation process via reinforcement learning. The model learns to jointly optimize response quality and ad revenue. 
This is probably closest to what production systems will eventually look like.</p> <h3 id="whats-being-refused">What’s Being Refused</h3> <p>There are now three distinct philosophies among major AI companies:</p> <table> <thead> <tr> <th>Company</th> <th>Stance</th> <th>Rationale</th> </tr> </thead> <tbody> <tr> <td><strong>OpenAI</strong></td> <td>Ads in free tiers, ad-free for paying users</td> <td>Revenue necessity - <span>$</span>17B projected burn rate, 95% of 800M users don’t pay</td> </tr> <tr> <td><strong>Anthropic</strong></td> <td>No ads, period (for now)</td> <td>Trust-first - “advertising incentives, once introduced, tend to expand over time”</td> </tr> <tr> <td><strong>Google</strong></td> <td>Ads in AI Overviews, not in Gemini chat (yet)</td> <td>Measured rollout - ads in Search AI, evaluating Gemini chat separately</td> </tr> <tr> <td><strong>Perplexity</strong></td> <td>Tried ads, pulled them</td> <td>UX collapsed, measurement was impossible</td> </tr> <tr> <td><strong>Meta</strong></td> <td>Using conversations to <em>target</em> ads on other platforms</td> <td>Different model - the LLM isn’t the ad surface, it’s the signal source</td> </tr> </tbody> </table> <p>Pay attention to Meta’s row. It’s easy to gloss over, but it might be the most consequential strategy on this list. Meta isn’t putting ads <em>inside</em> the AI conversation - they’re using the conversation as a signal source to target ads <em>everywhere else</em>. When you tell Meta AI about your kitchen renovation plans, that context doesn’t surface as a sponsored suggestion in the chat. It surfaces as a Home Depot ad in your Instagram feed an hour later. This is arguably more invasive than OpenAI’s approach, because the user never connects the conversation to the ad. There’s no “Sponsored Suggestion” card to notice and evaluate. The commercial extraction is invisible by design. 
And because Meta controls both the conversational surface (WhatsApp, Messenger, Instagram DMs) and the ad surfaces (Feed, Stories, Reels), they can close this loop without any third-party ad-tech infrastructure. It’s vertically integrated attention arbitrage - and it’s the approach most likely to scale silently while everyone debates whether ChatGPT should show ad cards.</p> <p>The Anthropic position is worth quoting because it identifies the core tension: ad-supported products create pressure to optimize for engagement, repeat visits, and extended conversations. Those metrics look like success. But they tell you nothing about whether the user actually solved their problem. A truly helpful response might end the conversation in two turns.</p> <hr/> <h2 id="the-part-nobody-is-talking-about-embedding-space-as-commercial-real-estate">The Part Nobody Is Talking About: Embedding Space as Commercial Real Estate</h2> <p>This is where the public conversation is lagging the technical reality by about 18 months.</p> <p>Every RAG-based LLM system (which includes Perplexity, ChatGPT with browsing, Google AI Overviews, and most enterprise deployments) works roughly like this:</p> <pre><code class="language-mermaid">sequenceDiagram
    participant User
    participant LLM
    participant Retriever
    participant VectorDB as Vector Database
    participant Web as Web / Knowledge Base

    User-&gt;&gt;LLM: "Best CRM for startups?"
    LLM-&gt;&gt;Retriever: Generate embedding for query
    Retriever-&gt;&gt;VectorDB: Find k-nearest documents
    VectorDB--&gt;&gt;Retriever: Top-k documents by cosine similarity
    Retriever--&gt;&gt;LLM: Retrieved context
    Note over LLM: Generate response grounded&lt;br/&gt;in retrieved documents
    LLM--&gt;&gt;User: "Based on my research,&lt;br/&gt;here are the top options..."
</code></pre> <p>The retrieval step is where commercial value concentrates. Documents that are embedded close to high-value queries get retrieved. Documents that get retrieved get cited. Documents that get cited influence the model’s response. This creates a chain of influence that starts in vector space and ends in a user’s purchasing decision.</p> <p>The critical observation: <strong>regions of embedding space near commercially valuable queries function exactly like shelf space or PageRank - they’re a scarce resource with economic value, and people are already bidding for them.</strong></p> <p>They’re just not calling it advertising. They’re calling it “Generative Engine Optimization.”</p> <h3 id="geo-the-seo-of-embedding-space">GEO: The SEO of Embedding Space</h3> <p>Generative Engine Optimization (GEO) was formalized by researchers at Princeton in a KDD 2024 paper. The idea is simple: just as SEO optimizes web pages to rank higher in Google’s index, GEO optimizes content to be retrieved and cited by LLMs.</p> <p>The GEO industry has exploded. Companies like Profound, Semrush, and Wellows now sell tools that track brand visibility across LLMs, measure “recommendation share,” and suggest content modifications to improve retrieval rates. It’s a legitimate optimization practice - in the same way that white-hat SEO is legitimate.</p> <p>But there’s a shadowy flip side. Security researchers have demonstrated that the same embedding space can be manipulated adversarially:</p> <p><strong>PoisonedRAG</strong> (USENIX Security 2025, Zou et al.) showed that injecting just 5 carefully crafted documents into a knowledge base containing millions of texts achieves ~90% attack success rate. The attacker controls what the LLM says about a target question. Five documents. 
In millions.</p> <p><strong>POISONCRAFT</strong> extended this to practical, black-box settings - the attacker doesn’t need to know which retriever or LLM the target system uses.</p> <p><strong>RAGForensics</strong> (WWW 2025) built a traceback system to identify poisoned documents, acknowledging that the threat is real enough to need forensic tools.</p> <p>What nobody is saying out loud: <strong>GEO and RAG poisoning are points on the same spectrum.</strong> The techniques differ in degree, not in kind. Both involve crafting documents to manipulate their position in embedding space. GEO does it to be “relevant.” RAG poisoning does it to be “adversarial.” The boundary between the two is a policy question, not a technical one.</p> <pre><code class="language-mermaid">graph LR
    subgraph "The Embedding Manipulation Spectrum"
        A["Legitimate&lt;br/&gt;Content Creation"] --&gt; B["White-hat GEO&lt;br/&gt;(Structured data,&lt;br/&gt;topic authority)"]
        B --&gt; C["Aggressive GEO&lt;br/&gt;(Keyword stuffing&lt;br/&gt;for embeddings)"]
        C --&gt; D["Gray Zone&lt;br/&gt;(Adversarial document&lt;br/&gt;crafting for retrieval)"]
        D --&gt; E["RAG Poisoning&lt;br/&gt;(PoisonedRAG,&lt;br/&gt;POISONCRAFT)"]
    end

    style A fill:#d4edda,stroke:#155724,color:#14401d
    style B fill:#d4edda,stroke:#155724,color:#14401d
    style C fill:#fff3cd,stroke:#856404,color:#4a3800
    style D fill:#fff3cd,stroke:#856404,color:#4a3800
    style E fill:#f8d7da,stroke:#721c24,color:#4a1118
</code></pre> <p>Nobody has drawn this spectrum explicitly. The security community publishes attack papers. The marketing community publishes optimization guides. The mechanism design community publishes auction papers. They’re all working on different faces of the same problem and not talking to each other.</p> <hr/> <h2 id="the-firewall-question">The Firewall Question</h2> <p>Let’s return to OpenAI’s architectural claim: ads don’t influence organic responses.</p> <p>This is the single most important empirical question in LLM advertising, and as far as I can tell, nobody has tested it rigorously.</p> <p>Why it matters: current transformer architectures don’t have a hard separation between “context I should be influenced by” and “context I should ignore.” Attention is global. If an ad - or an ad-selection signal - is present anywhere in the context window or the system prompt, there’s a potential pathway for it to influence the generated response. Even if the influence is subtle. Even if it’s unintentional.</p> <p>The existing prompt injection literature proves this is more than theoretical. Medical LLMs were shown to be vulnerable to injection attacks that succeeded in 94.4% of trials - including extremely high-harm scenarios. Multimodal injection attacks achieve 64% success rates by hiding instructions in images. The OWASP LLM Top 10 (2025 revision) explicitly added “Vector and Embedding Weaknesses” as a new category, noting that adversarial embeddings can be crafted to match arbitrary queries while containing malicious content.</p> <p>To be clear, OpenAI isn’t naively injecting ad text into the model’s prompt. Their architecture is more sophisticated than that - the ad matching happens after response generation, not before. But as the system evolves toward Smartly-style conversational ad formats (where the ad <em>is</em> a secondary chatbot dialogue), the separation gets murkier. 
And for RAG-based systems where advertising content enters the retrieval pipeline, the separation may not exist at all.</p> <p><strong>An honest empirical test would look like this:</strong></p> <pre><code class="language-mermaid">graph TD
    A["Define test query set&lt;br/&gt;(500+ product-related queries&lt;br/&gt;across 10 categories)"] --&gt; B["Condition A: Baseline&lt;br/&gt;Query model with&lt;br/&gt;no ad context"]
    A --&gt; C["Condition B: Ad-adjacent&lt;br/&gt;Query model with ad&lt;br/&gt;context present in system"]
    A --&gt; D["Condition C: Explicit separation&lt;br/&gt;Query model with ad context&lt;br/&gt;+ 'ignore ads' instruction"]
    
    B --&gt; E["Measure: Brand mention distributions,&lt;br/&gt;recommendation rankings,&lt;br/&gt;sentiment toward products,&lt;br/&gt;response length &amp; specificity"]
    C --&gt; E
    D --&gt; E

    E --&gt; F["Statistical tests for&lt;br/&gt;recommendation drift&lt;br/&gt;between conditions"]
    F --&gt; G{"Does the 'organic'&lt;br/&gt;response shift when&lt;br/&gt;ads are present?"}
    G --&gt;|Yes| H["Firewall is leaky.&lt;br/&gt;Quantify the leak."]
    G --&gt;|No| I["Firewall holds.&lt;br/&gt;Publish that too."]

    style G fill:#fff3cd,stroke:#856404,color:#4a3800
    style H fill:#f8d7da,stroke:#721c24,color:#4a1118
    style I fill:#d4edda,stroke:#155724,color:#14401d
</code></pre> <p>This study doesn’t exist yet. It should. The result matters regardless of which direction it goes - either the firewall holds (which validates OpenAI’s approach and gives regulators something to build on) or it doesn’t (which validates Anthropic’s concerns and creates urgency for architectural solutions).</p> <hr/> <h2 id="what-a-proper-market-mechanism-would-require">What a Proper Market Mechanism Would Require</h2> <p>If we accept that embedding space has commercial value and that people are going to compete for it one way or another, the question becomes: can we build a market mechanism that’s transparent and fair, rather than letting the gray market (GEO-as-advertising) operate in the shadows?</p> <p>My rough sketch of what that would need:</p> <h3 id="1-define-the-resource-being-traded">1. Define the Resource Being Traded</h3> <p>In search advertising, the resource is a keyword query. In social media advertising, it’s a user profile + content slot. In embedding space, the resource is <strong>proximity to a query region</strong> - a neighborhood in vector space that captures a class of user intents.</p> <p>This needs formalization. What’s the right geometric primitive? Voronoi cells around query clusters? epsilon-balls in cosine space? The mechanism design community has been designing auctions without clearly defining the thing being auctioned.</p> <p>To make this concrete, consider a toy example. Take the query “best CRM for startups” and embed it alongside the top 50 web pages about CRM software using a standard retriever (say, <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code>). Project the embeddings down to 2D via UMAP. 
What you’ll see is something like this:</p> <svg viewBox="0 0 800 520" xmlns="http://www.w3.org/2000/svg" class="post-svg-viz" role="img" aria-label="2D UMAP projection of document embeddings around the query best CRM for startups, showing HubSpot content clustering near the query centroid within cosine distance 0.15, mid-market competitors at medium distance, and enterprise CRM content far away" style="width:100%;max-width:800px;margin:1.5rem auto;display:block;font-family:'DM Mono','JetBrains Mono',ui-monospace,monospace;"> <defs> <radialGradient id="heatmap" cx="38%" cy="38%" r="32%" fx="38%" fy="38%"> <stop offset="0%" stop-color="#c44040" stop-opacity="0.12"/> <stop offset="40%" stop-color="#c44040" stop-opacity="0.05"/> <stop offset="100%" stop-color="#c44040" stop-opacity="0"/> </radialGradient> <radialGradient id="retrievalGlow" cx="50%" cy="50%" r="50%"> <stop offset="70%" stop-color="#ff7a59" stop-opacity="0.06"/> <stop offset="100%" stop-color="#ff7a59" stop-opacity="0"/> </radialGradient> <filter id="glow"> <feGaussianBlur stdDeviation="2" result="blur"/> <feMerge><feMergeNode in="blur"/><feMergeNode in="SourceGraphic"/></feMerge> </filter> </defs> <rect class="svg-bg" width="800" height="520" rx="8" fill="#fdfbf7" stroke="#ddd9ce" stroke-width="1"/> <g class="svg-grid" stroke="#e8e4db" stroke-width="0.5"> <line class="svg-grid" x1="60" y1="40" x2="60" y2="440"/> <line class="svg-grid" x1="60" y1="440" x2="760" y2="440"/> <line class="svg-grid" x1="60" y1="140" x2="760" y2="140" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="60" y1="240" x2="760" y2="240" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="60" y1="340" x2="760" y2="340" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="235" y1="40" x2="235" y2="440" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="410" y1="40" x2="410" y2="440" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="585" y1="40" x2="585" 
y2="440" stroke-dasharray="2,6" opacity="0.5"/> </g> <text class="svg-label-muted" x="410" y="475" text-anchor="middle" fill="#9e9788" font-size="11" letter-spacing="0.1em">UMAP-1</text> <text class="svg-label-muted" x="20" y="240" text-anchor="middle" fill="#9e9788" font-size="11" letter-spacing="0.1em" transform="rotate(-90,20,240)">UMAP-2</text> <ellipse cx="290" cy="195" rx="260" ry="220" fill="url(#heatmap)"/> <ellipse cx="290" cy="195" rx="175" ry="165" fill="url(#retrievalGlow)" stroke="#c44040" stroke-width="1.2" stroke-dasharray="6,4" opacity="0.6"/> <text x="460" y="90" fill="#c44040" font-size="10" opacity="0.7" font-style="italic">cos &lt; 0.15</text> <text x="460" y="103" fill="#c44040" font-size="10" opacity="0.7" font-style="italic">retrieval radius</text> <g stroke="#ff7a59" stroke-width="0.8" opacity="0.25"> <line x1="290" y1="195" x2="210" y2="130"/> <line x1="290" y1="195" x2="170" y2="220"/> <line x1="290" y1="195" x2="195" y2="290"/> <line x1="290" y1="195" x2="340" y2="110"/> </g> <g class="svg-label-muted" fill="#b0a898" font-size="8.5" font-style="italic"> <text x="242" y="155">0.08</text> <text x="218" y="205">0.11</text> <text x="228" y="252">0.14</text> <text x="318" y="145">0.09</text> </g> <g filter="url(#glow)"> <circle cx="290" cy="195" r="8" fill="#c44040"/> <circle cx="290" cy="195" r="12" fill="none" stroke="#c44040" stroke-width="1.5" opacity="0.3"/> </g> <text class="svg-label-dark" x="305" y="192" fill="#1a1a1a" font-size="11.5" font-weight="600">"best CRM for startups"</text> <circle cx="210" cy="130" r="6" fill="#ff7a59"/> <text class="svg-label" x="224" y="128" fill="#4a4540" font-size="10">HubSpot blog post</text> <circle cx="170" cy="220" r="6" fill="#ff7a59"/> <text class="svg-label" x="84" y="218" fill="#4a4540" font-size="10">HubSpot comparison</text> <circle cx="195" cy="290" r="6" fill="#ff7a59"/> <text class="svg-label" x="82" y="298" fill="#4a4540" font-size="10">HubSpot "free CRM"</text> <circle cx="340" cy="110" 
r="6" fill="#ff7a59"/> <text class="svg-label" x="354" y="108" fill="#4a4540" font-size="10">HubSpot startup tips</text> <rect x="285" y="328" width="10" height="10" rx="2" fill="#6c5ce7"/> <text class="svg-label" x="302" y="338" fill="#4a4540" font-size="10">Pipedrive review</text> <rect x="218" y="350" width="10" height="10" rx="2" fill="#6c5ce7"/> <text class="svg-label" x="235" y="360" fill="#4a4540" font-size="10">Freshsales startup guide</text> <circle cx="590" cy="130" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="603" y="134" fill="#9e9788" font-size="10">Salesforce enterprise</text> <circle cx="620" cy="210" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="633" y="214" fill="#9e9788" font-size="10">Salesforce pricing</text> <circle cx="640" cy="290" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="653" y="294" fill="#9e9788" font-size="10">Oracle CX docs</text> <circle cx="670" cy="365" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="683" y="369" fill="#9e9788" font-size="10">SAP overview</text> <g transform="translate(555, 420)"> <rect class="svg-legend-bg" x="-8" y="-12" width="210" height="95" rx="4" fill="#fdfbf7" stroke="#e8e4db" stroke-width="0.5"/> <circle cx="8" cy="4" r="5" fill="#c44040"/> <text class="svg-legend-text" x="20" y="8" fill="#6b6560" font-size="9.5">Query centroid</text> <circle cx="8" cy="24" r="5" fill="#ff7a59"/> <text class="svg-legend-text" x="20" y="28" fill="#6b6560" font-size="9.5">HubSpot (4 docs, cos &lt; 0.15)</text> <rect x="3" y="39" width="10" height="10" rx="2" fill="#6c5ce7"/> <text class="svg-legend-text" x="20" y="48" fill="#6b6560" font-size="9.5">Mid-market (2 docs)</text> <circle cx="8" cy="64" r="5" fill="#b8b0a4"/> <text class="svg-legend-text" x="20" y="68" fill="#6b6560" font-size="9.5">Enterprise (distant)</text> </g> <text class="svg-label-muted" x="72" y="30" fill="#9e9788" font-size="10" letter-spacing="0.05em">EMBEDDING SPACE: 2D UMAP PROJECTION</text> </svg> <div 
class="caption" style="text-align:center;font-size:0.8rem;color:var(--qa-text-muted,#888);margin-top:0.25rem;"> Document embeddings around a commercial query. HubSpot's content marketing dominates the retrieval neighborhood, effectively occupying the most valuable "real estate" in vector space. </div> <p>HubSpot has four documents within cosine distance 0.15 of this query. Salesforce has zero. In a top-5 retrieval, HubSpot content dominates the context window, and the LLM’s response will reflect that. HubSpot didn’t pay for this - they earned it through years of content marketing that happens to embed well. But the <em>effect</em> is identical to a paid placement: commercial content occupying the scarce positions nearest a high-value query.</p> <p>This is what I mean by <strong>embedding rent</strong> - the implicit economic value of occupying a region of vector space near commercially valuable queries. We can sketch a rough formalization:</p> <p>For a query \(q\) with commercial value \(V(q)\) (expected revenue per conversion), the <strong>embedding rent</strong> of a document \(d\) is:</p> \[R(d, q) = V(q) \cdot P(\text{retrieve} \mid d, q) \cdot P(\text{cite} \mid \text{retrieve}) \cdot P(\text{convert} \mid \text{cite})\] <p>where \(P(\text{retrieve} \mid d, q)\) depends on the cosine similarity \(\text{sim}(e_d, e_q)\) and the retrieval threshold \(k\). In practice, retrieval probability follows a sharp sigmoid around the \(k\)-th nearest neighbor boundary - if you’re inside the top-\(k\), you have influence; if you’re outside, you’re invisible. This creates a cliff-edge dynamic where small improvements in embedding proximity produce large jumps in commercial value.</p> <p>The total rent for a query region \(Q\) is:</p> \[R_{\text{total}}(d) = \sum_{q \in Q} \lambda(q) \cdot R(d, q)\] <p>where \(\lambda(q)\) is query frequency. 
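</p> <p>As a toy illustration - every number below is hypothetical, chosen only to show the cliff-edge shape, not measured from any system - the rent formula can be computed directly:</p>

```python
import math

def retrieval_probability(similarity, boundary_sim, steepness=100.0):
    """Sharp sigmoid around the k-th nearest neighbor boundary:
    inside the top-k you have influence, outside you're invisible."""
    return 1.0 / (1.0 + math.exp(-steepness * (similarity - boundary_sim)))

def embedding_rent(value, p_retrieve, p_cite, p_convert):
    """R(d, q) = V(q) * P(retrieve|d,q) * P(cite|retrieve) * P(convert|cite)."""
    return value * p_retrieve * p_cite * p_convert

# A $500-per-conversion query; two documents straddling the top-k
# similarity boundary (here 0.85) by 0.02 on either side.
inside = embedding_rent(500, retrieval_probability(0.87, 0.85), 0.4, 0.02)
outside = embedding_rent(500, retrieval_probability(0.83, 0.85), 0.4, 0.02)
# 0.04 apart in cosine similarity, several-fold apart in expected rent.
```

<p>The steepness constant stands in for the top-\(k\) cutoff: the two documents are nearly identical in proximity but far apart in expected rent. <p>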
High-traffic, high-intent queries (“best CRM for startups,” “cheapest flights to Tokyo”) have the highest embedding rent - and are therefore the most attractive targets for both legitimate GEO and adversarial manipulation.</p> <p>Today that rent is “paid” through content investment. Tomorrow it could be paid through an auction. The question is who designs that auction, and whether the current tenants - the GEO optimizers - get grandfathered in or priced out.</p> <h3 id="2-make-manipulation-unprofitable">2. Make Manipulation Unprofitable</h3> <p>Right now, a rational advertiser faces a choice: pay <span>$</span>60 CPM to place a legitimate ad in ChatGPT, or invest in GEO/adversarial document crafting to manipulate the organic response for free. If the organic manipulation channel is cheaper and more effective, the legitimate channel collapses. This is exactly what happened with early search engines before Google figured out how to devalue link farms.</p> <p>The mechanism needs to ensure that bidding through the auction is strictly preferable to manipulating the embedding space directly. Formally, an advertiser chooses between:</p> <ul> <li><strong>Auction channel:</strong> Pay bid \(b\) per impression, get guaranteed placement with probability \(P_a(b)\)</li> <li><strong>Manipulation channel:</strong> Invest cost \(c_m\) in GEO/adversarial docs, get organic retrieval with probability \(P_m(c_m)\), but risk detection with probability \(P_d(c_m)\) and penalty \(F\)</li> </ul> <p>The advertiser prefers the auction when:</p> \[V \cdot P_a(b) - b &gt; V \cdot P_m(c_m) \cdot (1 - P_d(c_m)) - c_m - P_d(c_m) \cdot F\] <p>The platform controls \(P_d\) (detection capability) and \(F\) (penalty for detected manipulation). The insight from search advertising history: Google made manipulation unprofitable not by winning the arms race against SEO spammers (they didn’t, fully), but by making the auction cheap enough and reliable enough that legitimate advertisers preferred it. 
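</p> <p>A minimal sketch of that choice in code - the two payoff functions mirror the inequality above; all parameter values are hypothetical:</p>

```python
def auction_payoff(value, p_win, bid):
    """Expected profit from the legitimate ad channel: V*P_a(b) - b."""
    return value * p_win - bid

def manipulation_payoff(value, p_retrieve, p_detect, cost, penalty):
    """Expected profit from GEO/adversarial manipulation:
    V*P_m*(1-P_d) - c_m - P_d*F."""
    return value * p_retrieve * (1.0 - p_detect) - cost - p_detect * penalty

V = 1000.0  # value of a conversion (hypothetical)
auction = auction_payoff(V, p_win=0.5, bid=200.0)
# Same manipulation investment, two detection regimes:
weak_detection = manipulation_payoff(V, 0.6, p_detect=0.05, cost=100.0, penalty=500.0)
strong_detection = manipulation_payoff(V, 0.6, p_detect=0.5, cost=100.0, penalty=500.0)
# Under weak detection, manipulation beats the auction;
# raising P_d (and F) flips the rational advertiser's choice.
```

<p>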
The detection system only needs to make manipulation <em>risky</em>, not impossible.</p> <p>This is the same dynamic that will play out in embedding space - but only if someone builds the detection infrastructure and the auction mechanism in parallel.</p> <h3 id="3-solve-the-transparency-problem">3. Solve the Transparency Problem</h3> <p>In search, you can see that a result is sponsored. The blue link has a little “Ad” label. In an LLM response, there’s no natural boundary to label. If the model says “I recommend ProductX for your needs,” was that organic or sponsored? The user can’t tell. Research from the University of Michigan (2024) showed users only detect embedded ads in LLM responses 27% of the time. But - critically - once they’re told an ad was present, trust collapses.</p> <p>This suggests the transparency mechanism needs to be <em>architectural</em>, not just a label. Possible directions:</p> <ul> <li><strong>Provenance tracking in RAG pipelines</strong> - tag retrieved documents as sponsored/organic and carry that metadata through to the response</li> <li><strong>Watermarking sponsored content</strong> - embed detectable signals in ad-influenced text segments</li> <li><strong>Separate rendering</strong> - what OpenAI is doing with Sponsored Suggestions, keeping ads visually distinct. This works for appended ads but not for integrated recommendations.</li> </ul> <h3 id="4-build-retrieval-time-defenses">4. Build Retrieval-Time Defenses</h3> <p>The attack papers get all the attention, but the defense side is equally important and far less developed. If embedding space is being manipulated - whether by GEO optimizers or adversarial actors - what can RAG system operators actually do at retrieval time?</p> <p>A few directions are emerging, though none are mature:</p> <ul> <li> <p><strong>Embedding perturbation.</strong> Add calibrated noise to query embeddings before retrieval, then check whether the top-k results are stable across perturbations. 
Legitimate, high-quality documents tend to be robust - they’re near the query for semantic reasons that survive small shifts. Adversarially crafted documents are often brittle - optimized for a precise point in embedding space that breaks under perturbation. This is analogous to adversarial example detection in computer vision, applied to the retrieval step.</p> </li> <li> <p><strong>Multi-retriever consensus.</strong> Retrieve using two or more embedding models (e.g., OpenAI’s <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code> and Cohere’s <code class="language-plaintext highlighter-rouge">embed-v4</code>) and flag documents that rank highly in one but not the other. Adversarial documents are typically optimized against a specific embedding model’s geometry. Cross-model agreement is a cheap integrity signal.</p> </li> <li> <p><strong>Temporal anomaly detection.</strong> Monitor when documents suddenly appear in high-value retrieval neighborhoods. A legitimate page on “best CRM for startups” accumulates backlinks and content depth over months. A GEO-optimized page materializes overnight with suspiciously perfect embedding proximity. Tracking document “arrival velocity” in retrieval neighborhoods could catch manipulation campaigns early.</p> </li> <li> <p><strong>Retrieval provenance scoring.</strong> Assign trust scores to retrieved documents based on source reputation, publication date, content consistency, and embedding stability over time. 
Weight the LLM’s context window accordingly - high-trust documents get more influence, low-trust documents get retrieved but down-weighted.</p> </li> </ul> <p>To make the perturbation approach concrete, here’s a sketch of what a retrieval integrity check could look like (<code class="language-plaintext highlighter-rouge">retrieve_topk</code> and <code class="language-plaintext highlighter-rouge">normalize</code> are assumed helpers):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">check_retrieval_integrity</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">corpus</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> 
                               <span class="n">n_perturbations</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">noise_scale</span><span class="o">=</span><span class="mf">0.02</span><span class="p">,</span>
                               <span class="n">stability_threshold</span><span class="o">=</span><span class="mf">0.6</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Detect potentially manipulated documents in RAG retrieval
    by checking stability under embedding perturbation.
    </span><span class="sh">"""</span>
    <span class="c1"># Baseline retrieval
</span>    <span class="n">baseline_topk</span> <span class="o">=</span> <span class="nf">retrieve_topk</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">corpus</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
    
    <span class="c1"># Perturbed retrievals
</span>    <span class="n">appearance_counts</span> <span class="o">=</span> <span class="nc">Counter</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">n_perturbations</span><span class="p">):</span>
        <span class="n">noise</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="nf">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">noise_scale</span><span class="p">,</span> <span class="n">query_embedding</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
        <span class="n">perturbed</span> <span class="o">=</span> <span class="nf">normalize</span><span class="p">(</span><span class="n">query_embedding</span> <span class="o">+</span> <span class="n">noise</span><span class="p">)</span>
        <span class="n">perturbed_topk</span> <span class="o">=</span> <span class="nf">retrieve_topk</span><span class="p">(</span><span class="n">perturbed</span><span class="p">,</span> <span class="n">corpus</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">perturbed_topk</span><span class="p">:</span>
            <span class="n">appearance_counts</span><span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="nb">id</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
    
    <span class="c1"># Score each baseline result by stability
</span>    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">baseline_topk</span><span class="p">:</span>
        <span class="n">stability</span> <span class="o">=</span> <span class="n">appearance_counts</span><span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="nb">id</span><span class="p">]</span> <span class="o">/</span> <span class="n">n_perturbations</span>
        <span class="n">results</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">doc</span><span class="sh">'</span><span class="p">:</span> <span class="n">doc</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">stability</span><span class="sh">'</span><span class="p">:</span> <span class="n">stability</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">suspicious</span><span class="sh">'</span><span class="p">:</span> <span class="n">stability</span> <span class="o">&lt;</span> <span class="n">stability_threshold</span>
        <span class="p">})</span>
    
    <span class="c1"># Flag: docs that appear in baseline top-k but are
</span>    <span class="c1"># fragile under perturbation are likely adversarially
</span>    <span class="c1"># optimized for a precise point in embedding space
</span>    <span class="k">return</span> <span class="n">results</span>
</code></pre></div></div> <p>The intuition: a legitimate document about CRMs is near the query “best CRM for startups” because of genuine semantic overlap across many dimensions. Perturb the query slightly, and the document stays nearby. An adversarially crafted document, however, is often optimized for a narrow region - it exploits specific dimensions of the embedding geometry to achieve high similarity, and that optimization is brittle. A 2% perturbation in the query embedding may push it out of the top-k entirely.</p> <p>None of these are silver bullets, and all have false-positive costs. But the point is that defense at the retrieval layer is cheaper and more practical than trying to make the LLM itself robust to manipulated context. You don’t need to solve prompt injection if you can filter the poisoned documents before they reach the prompt.</p> <h3 id="5-handle-the-privacy-paradox">5. Handle the Privacy Paradox</h3> <p>LLM conversations contain deeply personal information. People share health concerns, relationship problems, financial anxieties. Anthropic’s analysis found that “an appreciable portion” of Claude conversations involve sensitive topics. The same personal context that makes LLM ads potentially hyper-relevant also makes them potentially creepy and intrusive.</p> <p>OpenAI says conversations are never shared with advertisers and that ads don’t appear near health, mental health, or political topics. But as a former OpenAI researcher pointed out, “the company is building an economic engine whose incentives will eventually override its own rules.”</p> <hr/> <h2 id="open-problems-worth-working-on">Open Problems Worth Working On</h2> <p>I want to close with what I think are the most important research directions - not because I have the answers, but because I want more people working on them.</p> <p><strong>The Firewall Integrity Problem.</strong> As described above. 
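</p> <p>One crude way to start measuring it: generate answers with and without sponsored documents in the context and score the drift. The <code class="language-plaintext highlighter-rouge">generate</code> callable below is a hypothetical stand-in for any chat-completion call, and the stub is a toy, not a model:</p>

```python
import difflib

def ad_influence(generate, query, organic_docs, ad_docs):
    """Answer similarity with vs. without ad context.
    1.0 = identical answers (the firewall holds); lower = ad bleed-through."""
    clean = generate(query, organic_docs)
    mixed = generate(query, organic_docs + ad_docs)
    return difflib.SequenceMatcher(None, clean, mixed).ratio()

# Toy stub standing in for an LLM call: answers from whatever docs it sees.
stub = lambda q, docs: q + " -> " + "; ".join(docs)
score = ad_influence(stub, "best CRM?", ["organic review"], ["Ad: buy X"])
# Run across many queries, models, and ad placements to map where
# (and how much) sponsored context leaks into "organic" answers.
```

<p>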
Empirical measurement of whether ad context influences organic responses, across architectures and models. This is the most urgent open question.</p> <p><strong>Embedding Space Economics.</strong> Formal treatment of embedding proximity as a priced resource. Game-theoretic analysis of the interaction between legitimate ad mechanisms and embedding manipulation. Under what conditions do GEO-style tactics undermine auction-based advertising? What mechanism modifications make manipulation unprofitable?</p> <p><strong>The Audit Problem.</strong> How do you determine, from the outside, whether an LLM’s product recommendations are commercially influenced? Existing brand visibility tools are designed for marketers optimizing their presence. We need tools designed for regulators and researchers detecting hidden influence. Counterfactual probing, temporal drift analysis, cross-model consistency checks - the methodology needs to be developed and standardized.</p> <p><strong>Agentic Commerce and the Principal-Agent Collapse.</strong> This one deserves more than a paragraph, because it’s the endgame of everything discussed above.</p> <p>When an LLM books a flight for you, it’s acting as your agent in the economic sense - making decisions on your behalf, with your money, according to your preferences. Classical principal-agent theory tells us this works when the agent’s incentives are aligned with the principal’s. 
But what happens when the agent serves two principals?</p> <p>Concrete scenario:</p> <div style="display:flex;gap:1rem;justify-content:center;flex-wrap:wrap;margin:1.5rem 0;"> <div class="scenario-card" style="flex:1;min-width:260px;max-width:380px;border:1.5px solid #0d9488;border-radius:8px;overflow:hidden;background:var(--qa-bg,#fff);"> <div style="background:#0d9488;color:#fff;padding:0.5rem 1rem;font-size:0.8rem;font-weight:600;letter-spacing:0.03em;">WHAT THE USER SEES</div> <div style="padding:1rem;font-size:0.82rem;line-height:1.6;color:var(--qa-text,#1a1a1a);"> <div style="color:var(--qa-text-muted,#9e9788);font-size:0.72rem;margin-bottom:0.5rem;">You &rarr; AI Assistant</div> <em>"Book me a hotel in Tokyo for next week, under &#36;200/night, close to Shinjuku station."</em> <div style="margin-top:0.75rem;padding:0.6rem;background:rgba(13,148,136,0.06);border-radius:4px;"> <strong>AI:</strong> I found the <strong>Hyatt Regency Tokyo</strong> at &#36;195/night, 5 minutes from Shinjuku. Great reviews, fits your budget. Want me to book it? </div> <div style="margin-top:0.5rem;text-align:center;color:#0d9488;font-size:0.75rem;">&#10003; Constraint satisfied. 
User moves on.</div> </div> </div> <div class="scenario-card" style="flex:1;min-width:260px;max-width:380px;border:1.5px solid #c44040;border-radius:8px;overflow:hidden;background:var(--qa-bg,#fff);"> <div style="background:#c44040;color:#fff;padding:0.5rem 1rem;font-size:0.8rem;font-weight:600;letter-spacing:0.03em;">WHAT THE PIPELINE SEES</div> <div style="padding:1rem;font-size:0.82rem;line-height:1.6;color:var(--qa-text,#1a1a1a);"> <div style="color:var(--qa-text-muted,#9e9788);font-size:0.72rem;margin-bottom:0.5rem;">Retrieval results ranked by cosine similarity</div> <div style="font-family:'DM Mono',monospace;font-size:0.72rem;"> <div style="padding:0.25rem 0;color:var(--qa-text,#1a1a1a);"><span style="color:#c44040;font-weight:700;">0.92</span> Hyatt Regency - GEO optimized</div> <div style="padding:0.25rem 0;color:var(--qa-text,#1a1a1a);"><span style="color:#c44040;font-weight:700;">0.91</span> Hyatt Regency - sponsored page</div> <div style="padding:0.25rem 0;color:var(--qa-text-muted,#9e9788);"><span style="font-weight:700;">0.89</span> Hilton Shinjuku - organic</div> <div style="padding:0.25rem 0;color:var(--qa-text-muted,#9e9788);"><span style="font-weight:700;">0.87</span> <strong style="color:#0d9488;">Tokyu Stay &#36;142/night</strong> - organic</div> <div style="padding:0.25rem 0;color:var(--qa-text-muted,#b0a898);"><span style="font-weight:700;">0.84</span> Hotel Gracery - organic</div> </div> <div style="margin-top:0.5rem;text-align:center;color:#c44040;font-size:0.75rem;">&#36;53/night cheaper option buried at rank 4</div> </div> </div> </div> <p>The agent didn’t lie. It gave a valid option within constraints. It just didn’t give the <em>best</em> option, because the retrieval pipeline - the agent’s “eyes” - saw the world through a commercially distorted lens.</p> <p>This is harder to detect than a banner ad. The user asked for a decision, got a reasonable one, and moved on. 
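</p> <p>A back-of-envelope calculation makes the aggregate stakes concrete - the transaction volume here is a hypothetical assumption; the <span>$</span>53 gap comes from the scenario above:</p>

```python
# Hypothetical scale: 2 million agent-booked stays per day,
# each overpaying the $53/night gap from the scenario above.
overpayment_per_booking = 53
bookings_per_day = 2_000_000
annual_transfer = overpayment_per_booking * bookings_per_day * 365
print(f"${annual_transfer / 1e9:.1f}B/year")  # -> $38.7B/year
```

<p>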
The <span>$</span>53/night difference multiplied across millions of agentic transactions per day is a massive wealth transfer - from consumers to whichever brands can afford to occupy the right regions of embedding space. And unlike a travel agent taking a commission, there’s no disclosure requirement, no fiduciary duty, and no audit trail.</p> <p>The mechanism design problem here is distinct from ad placement in conversational responses. In conversation, the user reads the response and applies their own judgment. In agentic commerce, the user delegates judgment entirely. The standard for “unbiased retrieval” is correspondingly higher, and the current infrastructure - where retrieval quality is never audited for commercial bias - is nowhere close to meeting it.</p> <p><strong>The Regulatory Gap.</strong> The EU AI Act is now in force. It has provisions around algorithmic discrimination in marketing and mandatory disclosure for AI-generated content. But it was written before LLM advertising existed as a practice. How do existing frameworks apply? Where are the gaps? New York passed a law in December 2025 requiring disclosure of AI-generated human-like spokespeople in ads - but what about AI-generated product recommendations that feel organic?</p> <hr/> <h2 id="the-uncomfortable-bottom-line">The Uncomfortable Bottom Line</h2> <p>We’re watching the construction of a new advertising infrastructure inside systems that hundreds of millions of people use for genuinely personal, high-stakes thinking. The previous advertising transitions - from print to TV, TV to web, web to mobile - each came with years of public debate about norms, regulations, and user expectations.</p> <p>This one is happening in months. ChatGPT went from zero ads to Criteo integration to Smartly conversational ad formats in under eight weeks. The academic mechanism design papers are elegant but assume a clean world where ads and organic content can be separated. 
The GEO industry is growing without any pretense that the separation exists. And the security research demonstrating how fragile RAG systems are is being published in the same venues but read by completely different people.</p> <p>Someone needs to connect these threads. The shelf space auction, the PageRank auction, and the social media attention auction all eventually got formalized, regulated, and made legible. Embedding space is next. The question is whether we do it thoughtfully or whether we let it happen the way it happened with social media - fast, opaque, and with consequences we’re still trying to unwind a decade later.</p> <p>Right now, the embedding manipulation spectrum - from white-hat GEO to adversarial RAG poisoning - has no referee, no rules, and no scoreboard. The companies building retrieval pipelines are also the ones selling access to them. The researchers studying attacks and the marketers deploying optimizations are publishing in different venues and don’t read each other’s work.</p> <p>That’s the gap. And gaps like this, in markets this large, don’t stay empty for long. They get filled - either by careful design or by whoever moves fastest. I’d rather it be the former.</p> <hr/> <p><em>If you’re working on any of these problems - mechanism design for LLM ads, RAG security, adversarial retrieval, or the economics of embedding space - I’d love to hear from you. These are some of the most interesting open problems at the intersection of ML, economics, and policy, and they need more people paying attention.</em></p> <hr/> <h3 id="references--further-reading">References &amp; Further Reading</h3> <p><strong>Auction Mechanisms for LLMs:</strong></p> <ul> <li>Dutting, Mirrokni, Paes Leme, Xu, Zuo. <em>Mechanism Design for Large Language Models.</em> WWW 2024 (Best Paper).</li> <li>Hajiaghayi, Lahaie, Rezaei, Shin. <em>Ad Auctions for LLMs via Retrieval Augmented Generation.</em> 2024.</li> <li>Zhao et al. 
<em>LLM-Auction: Generative Auction towards LLM-Native Advertising.</em> December 2025.</li> <li>Dubey, Feng, Kidambi, Mehta, Wang. <em>Auctions with LLM Summaries.</em> KDD 2024.</li> <li>Soumalias, Curry, Seuken. <em>Truthful Aggregation of LLMs with an Application to Online Advertising.</em> 2024.</li> </ul> <p><strong>RAG Security:</strong></p> <ul> <li>Zou, Geng, Wang, Jia. <em>PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation.</em> USENIX Security 2025.</li> <li><em>RAGForensics: Traceback of Poisoning Attacks to Retrieval-Augmented Generation.</em> WWW 2025.</li> </ul> <p><strong>Benchmarks &amp; Measurement:</strong></p> <ul> <li><em>GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing.</em> September 2025.</li> <li>Aggarwal, Murahari, Rajpurohit et al. <em>GEO: Generative Engine Optimization.</em> KDD 2024 (Princeton).</li> <li>Filandrianos et al. <em>Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations.</em> February 2025.</li> </ul> <p><strong>Industry Developments:</strong></p> <ul> <li>Criteo. <em>Criteo Joins OpenAI Advertising Pilot in ChatGPT.</em> March 2, 2026.</li> <li>Anthropic. <em>Claude is a Space to Think.</em> February 4, 2026.</li> <li>OWASP. 
<em>LLM Top 10 2025: LLM08 - Vector and Embedding Weaknesses.</em></li> </ul> <p><strong>Trust &amp; Safety:</strong></p> <ul> <li><em>Trust &amp; Safety of LLMs and LLMs in Trust &amp; Safety.</em> arXiv, December 2024.</li> <li><em>Trustworthy Information Retrieval in the LLM Era: Bias, Unfairness, and Hallucination.</em> ACM SIGIR 2025.</li> </ul>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="llm"/><category term="advertising"/><category term="embedding-space"/><category term="RAG-poisoning"/><category term="GEO"/><category term="auction-mechanism"/><category term="ad-tech"/><summary type="html"><![CDATA[Embedding space is the new ad real estate. Mapping LLM ad auctions, RAG poisoning, GEO, and a framework for what comes next.]]></summary></entry><entry><title type="html">Beating CUDA with Triton: A Fused MoE Dispatch Kernel for Mixtral and DeepSeek</title><link href="https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/" rel="alternate" type="text/html" title="Beating CUDA with Triton: A Fused MoE Dispatch Kernel for Mixtral and DeepSeek"/><published>2026-03-28T11:00:00+00:00</published><updated>2026-03-28T11:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/"><![CDATA[<p>In my <a href="/blog/2025/triton-kernels-llm-inference/">last post on Triton kernels</a>, I optimized individual operations: RMSNorm, SwiGLU, INT8 GEMM. Single kernels, single operations. That was useful for learning Triton, but the real bottleneck in modern LLM inference isn’t any single operation. It’s the expert routing in Mixture-of-Experts models.</p> <p>Over 60% of open-source model releases in 2025-2026 use MoE architectures: Mixtral, DeepSeek-V3, Qwen2-MoE, Grok. And MoE inference is hard. 
Not because the math is complicated, but because the memory access patterns are terrible: tokens scatter to different experts, each expert gets a different number of tokens, and you need to gather everything back together afterward.</p> <p>So I tried something more ambitious: a fused MoE dispatch kernel that handles the entire forward pass (router scoring, token permutation, expert GEMMs, and output combination) in pure Triton. No CUDA, no vendor-specific code.</p> <p>The result surprised me. At inference-relevant batch sizes, it’s <strong>faster than Megablocks</strong>, Stanford’s CUDA-optimized MoE library. And it runs on AMD GPUs without any changes.</p> <p>Code: <a href="https://github.com/bassrehab/triton-kernels">github.com/bassrehab/triton-kernels</a></p> <h2 id="why-moe-dispatch-is-the-hard-part">Why MoE Dispatch is the Hard Part</h2> <p>A standard MoE forward pass looks simple on paper:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For each token:
    1. Compute router scores (which experts should handle this token?)
    2. Select top-k experts
    3. Send token to selected experts
    4. Run expert FFN
    5. Combine outputs weighted by router scores
</code></pre></div></div> <p>The problem is steps 3-5. In a Mixtral model with 8 experts and top-2 routing, each token goes to 2 of 8 experts. But which 2 varies per token. So you can’t batch the expert GEMMs naively — each expert gets a different-sized batch.</p> <p>The naive PyTorch implementation loops over experts in Python:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">expert_id</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">num_experts</span><span class="p">):</span>
    <span class="n">tokens_for_this_expert</span> <span class="o">=</span> <span class="n">permuted_tokens</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span>  <span class="c1"># variable size
</span>    <span class="n">output</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span> <span class="o">=</span> <span class="nf">expert_ffn</span><span class="p">(</span><span class="n">tokens_for_this_expert</span><span class="p">)</span>  <span class="c1"># separate cuBLAS call
</span></code></pre></div></div> <p>For Mixtral, that’s 8 experts × 3 matmuls each = <strong>24 separate kernel launches</strong> per MoE layer. For DeepSeek-V3 with 256 experts, it’s 768 launches. Each one underutilizes the GPU because the per-expert batch is small.</p> <h2 id="the-design">The Design</h2> <p>I ended up with a pipeline of 5 Triton kernel launches (down from 24+ in the naive approach):</p> <ol> <li><strong>Router kernel</strong>: fused softmax + top-k selection</li> <li><strong>Permute kernel</strong>: scatter tokens to expert-contiguous layout</li> <li><strong>Fused gate+up GEMM</strong>: both projections from shared A-tile loads, SiLU in registers</li> <li><strong>Down GEMM</strong>: grouped GEMM with block scheduling</li> <li><strong>Unpermute kernel</strong>: gather + weighted combine</li> </ol> <p>Let me walk through the two most interesting parts.</p> <h3 id="block-scheduled-grouped-gemm">Block-Scheduled Grouped GEMM</h3> <p>The central problem is: how do you run a matmul where different “groups” (experts) have different batch sizes, in a single kernel launch?</p> <p>My approach: precompute a mapping from Triton program blocks to (expert, token_offset) pairs. Each block looks up which expert it serves and where its tokens start:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@triton.jit</span>
<span class="k">def</span> <span class="nf">_grouped_gemm_kernel</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">,</span> <span class="n">ExpertOffsets</span><span class="p">,</span> <span class="n">BlockToExpert</span><span class="p">,</span> <span class="n">BlockToM</span><span class="p">,</span> <span class="p">...):</span>
    <span class="n">pid</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">program_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

    <span class="c1"># Which expert am I working on?
</span>    <span class="n">expert_id</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">BlockToExpert</span> <span class="o">+</span> <span class="n">pid</span><span class="p">)</span>
    <span class="n">m_start</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">BlockToM</span> <span class="o">+</span> <span class="n">pid</span><span class="p">)</span>
    <span class="n">expert_token_start</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">ExpertOffsets</span> <span class="o">+</span> <span class="n">expert_id</span><span class="p">)</span>

    <span class="c1"># Standard tiled GEMM from here, just with offset pointers
</span>    <span class="n">global_m_start</span> <span class="o">=</span> <span class="n">expert_token_start</span> <span class="o">+</span> <span class="n">m_start</span>
    <span class="c1"># ... load A tile, load B tile for this expert, accumulate, store
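# --- host-side schedule construction (sketch) ---
# BlockToExpert / BlockToM / ExpertOffsets are precomputed on the CPU and
# copied to the device before launch. A minimal sketch of that builder,
# with hypothetical helper names (not the repo's exact code):

```python
def build_block_schedule(tokens_per_expert, block_m):
    # One schedule entry per BLOCK_M-sized tile of each expert's rows;
    # expert_offsets holds exclusive prefix sums of tokens per expert.
    block_to_expert, block_to_m, expert_offsets = [], [], [0]
    for expert_id, n_tokens in enumerate(tokens_per_expert):
        for m_start in range(0, n_tokens, block_m):  # empty experts get no blocks
            block_to_expert.append(expert_id)
            block_to_m.append(m_start)
        expert_offsets.append(expert_offsets[-1] + n_tokens)
    # grid size for the kernel launch = len(block_to_expert) (per N-tile)
    return block_to_expert, block_to_m, expert_offsets[:-1]
```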
</span></code></pre></div></div> <p>The schedule is built on CPU in ~0.1ms (trivial loop over experts). The key constraint I learned the hard way: <strong>BLOCK_M must be fixed, not autotuned.</strong> If you autotune BLOCK_M independently of the schedule, the kernel and schedule disagree on how many rows each block covers. I spent an hour debugging 30-45% element mismatches before realizing autotune had picked BLOCK_M=128 while the schedule used 64.</p> <h3 id="fused-gateup-projection">Fused Gate+Up Projection</h3> <p>This is where the real memory savings come from. In a SwiGLU FFN, you compute:</p> \[\text{output} = (\text{SiLU}(x W_\text{gate}^T) \odot x W_\text{up}^T) \cdot W_\text{down}^T\] <p>The unfused version does two separate grouped GEMMs (gate and up), writes both results to global memory, reads them back for SiLU + multiply, writes the intermediate, then does the down projection. That’s a lot of memory traffic.</p> <p>The fused kernel computes both projections in the same tile loop. The trick is that both GEMMs share the same input tile — we load A once from L2 cache and compute two dot products:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Two accumulators in registers
</span><span class="n">acc_gate</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">zeros</span><span class="p">((</span><span class="n">BLOCK_M</span><span class="p">,</span> <span class="n">BLOCK_N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">acc_up</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">zeros</span><span class="p">((</span><span class="n">BLOCK_M</span><span class="p">,</span> <span class="n">BLOCK_N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>

<span class="k">for</span> <span class="n">k_start</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">BLOCK_K</span><span class="p">):</span>
    <span class="c1"># Load A tile ONCE (shared between gate and up)
</span>    <span class="n">a</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">a_ptrs</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">a_mask</span><span class="p">,</span> <span class="n">other</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="c1"># Load both weight tiles
</span>    <span class="n">b_gate</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">bg_ptrs</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">b_mask</span><span class="p">,</span> <span class="n">other</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>
    <span class="n">b_up</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">bu_ptrs</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">b_mask</span><span class="p">,</span> <span class="n">other</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="c1"># Two matmuls from the same A tile
</span>    <span class="n">acc_gate</span> <span class="o">+=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b_gate</span><span class="p">,</span> <span class="n">out_dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
    <span class="n">acc_up</span> <span class="o">+=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b_up</span><span class="p">,</span> <span class="n">out_dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>

<span class="c1"># SiLU + multiply IN REGISTERS — never written to global memory
</span><span class="n">silu_gate</span> <span class="o">=</span> <span class="n">acc_gate</span> <span class="o">*</span> <span class="n">tl</span><span class="p">.</span><span class="nf">sigmoid</span><span class="p">(</span><span class="n">acc_gate</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">silu_gate</span> <span class="o">*</span> <span class="n">acc_up</span>
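# --- reference check (sketch) ---
# A plain-Python SwiGLU reference, plus an estimate of the gate_out/up_out
# traffic the fusion keeps out of global memory (hypothetical helpers,
# fp16 = 2 bytes assumed):

```python
import math

def swiglu_reference(x, w_gate, w_up):
    # One output unit: SiLU(x . w_gate) * (x . w_up), with SiLU(g) = g * sigmoid(g)
    g = sum(a * b for a, b in zip(x, w_gate))
    u = sum(a * b for a, b in zip(x, w_up))
    return (g / (1.0 + math.exp(-g))) * u

def intermediate_bytes(tokens, top_k, ffn_dim, dtype_bytes=2):
    # gate_out + up_out tensors the unfused path materializes once each
    return 2 * tokens * top_k * ffn_dim * dtype_bytes
```

# intermediate_bytes(4096, 2, 14336) gives ~470 MB, matching the Mixtral figure below.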
</code></pre></div></div> <p>This eliminates <code class="language-plaintext highlighter-rouge">gate_out</code> and <code class="language-plaintext highlighter-rouge">up_out</code> from global memory entirely. For Mixtral (ffn_dim=14336, 4096 tokens × top-2), that’s ~470 MB of memory traffic saved per forward pass. Overall about 35% reduction in global memory traffic.</p> <p>I tried to also fuse the down projection with the output scatter (writing directly to the final token positions with gating weights applied via <code class="language-plaintext highlighter-rouge">tl.atomic_add</code>), but Triton doesn’t support scalar indexing into 2D accumulators (<code class="language-plaintext highlighter-rouge">acc[m, :]</code> fails to compile). The fused gate+up alone gets most of the win.</p> <h2 id="results">Results</h2> <p>All benchmarks on NVIDIA A100-SXM4-80GB (2039 GB/s bandwidth, 312 FP16 TFLOPS). PyTorch 2.4.1, Triton 3.0.0.</p> <h3 id="mixtral-8x7b-8-experts-top-2-hidden4096-ffn14336">Mixtral-8x7B (8 experts, top-2, hidden=4096, ffn=14336)</h3> <table> <thead> <tr> <th>Tokens</th> <th>PyTorch Ref</th> <th>Megablocks</th> <th>Triton Fused</th> <th>vs PyTorch</th> <th>vs Megablocks</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>9.32 ms</td> <td>-</td> <td><strong>1.02 ms</strong></td> <td><strong>9.1x</strong></td> <td>-</td> </tr> <tr> <td>32</td> <td>10.44 ms</td> <td>2.78 ms</td> <td><strong>2.13 ms</strong></td> <td><strong>4.9x</strong></td> <td><strong>131%</strong></td> </tr> <tr> <td>128</td> <td>13.14 ms</td> <td>2.77 ms</td> <td><strong>2.27 ms</strong></td> <td><strong>5.8x</strong></td> <td><strong>124%</strong></td> </tr> <tr> <td>512</td> <td>25.92 ms</td> <td>3.57 ms</td> <td>3.99 ms</td> <td><strong>6.5x</strong></td> <td>89%</td> </tr> <tr> <td>2048</td> <td>66.22 ms</td> <td>9.08 ms</td> <td>16.48 ms</td> <td><strong>4.0x</strong></td> <td>56%</td> </tr> <tr> <td>4096</td> <td>122.82 ms</td> <td>-</td> <td><strong>32.31 ms</strong></td> 
<td><strong>3.8x</strong></td> <td>-</td> </tr> </tbody> </table> <p>At 32 and 128 tokens, which is where most inference happens (single-user or small-batch serving), we’re actually <strong>faster than Megablocks</strong>. This probably comes from lower kernel launch overhead (5 launches vs Megablocks’ more complex dispatch).</p> <p>At 512 tokens we’re at 89% of Megablocks, well above the 70% target I set at the start. At 2048+ tokens, Megablocks pulls ahead because its hand-tuned CUDA block-sparse matmul better saturates tensor cores at scale.</p> <h3 id="deepseek-v3-256-experts-top-8-hidden7168-ffn2048">DeepSeek-V3 (256 experts, top-8, hidden=7168, ffn=2048)</h3> <table> <thead> <tr> <th>Tokens</th> <th>Triton Unfused</th> <th>Triton Fused</th> <th>Fused Speedup</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>4.56 ms</td> <td><strong>3.27 ms</strong></td> <td>1.40x</td> </tr> <tr> <td>32</td> <td>13.65 ms</td> <td><strong>11.53 ms</strong></td> <td>1.18x</td> </tr> <tr> <td>128</td> <td>19.46 ms</td> <td><strong>16.74 ms</strong></td> <td>1.16x</td> </tr> <tr> <td>512</td> <td>25.66 ms</td> <td><strong>20.16 ms</strong></td> <td>1.27x</td> </tr> </tbody> </table> <p>DeepSeek-V3 is the hardest configuration. 256 experts means each expert gets ~2 tokens on average at batch size 512. The per-expert GEMMs are tiny (2 × 2048), too small to fill tensor cores efficiently.
This is fundamentally a memory-bound regime regardless of implementation.</p> <h3 id="roofline-analysis">Roofline Analysis</h3> <figure> <picture> <img src="/assets/img/blog/moe-dispatch/roofline_mixtral.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>The roofline for Mixtral at 512 tokens shows the expected picture: the expert FFN stages are compute-bound (high arithmetic intensity, near the compute ceiling), while the permute/unpermute stages are memory-bound (low arithmetic intensity, limited by bandwidth). The fused kernel pushes the expert FFN from 38% to 43% of the compute ceiling — modest but real.</p> <figure> <picture> <img src="/assets/img/blog/moe-dispatch/roofline_deepseek.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>DeepSeek-V3 tells a different story. With 256 experts and tiny per-expert batches, even the expert FFN is <strong>memory-bound</strong>, sitting on the bandwidth slope, not the compute plateau. The unpermute kernel actually hits 54% of peak bandwidth, which is decent for an irregular scatter operation.</p> <h2 id="the-amd-surprise">The AMD Surprise</h2> <p>One of my design goals was cross-platform portability: use only Triton primitives, no inline CUDA. So I spun up an AMD MI300X pod on RunPod to test.</p> <p><strong>162 out of 162 tests passed. Zero code changes.</strong></p> <p>No <code class="language-plaintext highlighter-rouge">#ifdef</code>, no platform-specific paths, no vendor intrinsics. The same <code class="language-plaintext highlighter-rouge">.py</code> files that run on A100 run on MI300X. 
Triton’s ROCm backend handled the compilation transparently.</p> <p>I didn’t benchmark performance on AMD (that’s future work), but correctness across all model configurations tested (Mixtral, DeepSeek-V3, Qwen2-MoE) validated cleanly. This is the promise of Triton over CUDA: write once, run on both vendors.</p> <h2 id="things-i-got-wrong-along-the-way">Things I Got Wrong Along the Way</h2> <p><strong>The -1.0 masking bug.</strong> The top-k kernel selects experts iteratively: find the max, store it, mask it out, repeat. I initially masked selected experts with 0.0. This works fine for 8 experts where softmax scores are spread out. But with 256 experts, most softmax scores are ~0.0 anyway. Masking to 0.0 doesn’t differentiate the selected expert from the unselected ones, so <code class="language-plaintext highlighter-rouge">argmax</code> kept returning the same index. Took me a while to figure out. The fix: mask with -1.0 instead.</p> <p><strong>The BLOCK_M autotune disaster.</strong> I mentioned this above, but it’s worth emphasizing. If you’re building a block-scheduled grouped GEMM, the schedule’s tile size and the kernel’s tile size must agree. I autotuned BLOCK_M thinking “let Triton pick the best tile size.” But the schedule was pre-built with BLOCK_M=64. When autotune picked 128, blocks overlapped. When it picked 32, rows were skipped. The output looked plausible (most elements correct) but ~30-45% of values were wrong. Fix: don’t autotune BLOCK_M, fix it to match the schedule.</p> <p><strong>Triton doesn’t support <code class="language-plaintext highlighter-rouge">continue</code>.</strong> My first attempt at a fused down+scatter kernel had a <code class="language-plaintext highlighter-rouge">for m in range(BLOCK_M): if invalid: continue</code> loop.
Triton doesn’t support <code class="language-plaintext highlighter-rouge">continue</code> statements — compilation fails with “unsupported AST node type.” Rewrote with conditional masks instead.</p> <p><strong>Triton doesn’t support 2D scalar indexing.</strong> <code class="language-plaintext highlighter-rouge">acc[m, :]</code> where <code class="language-plaintext highlighter-rouge">m</code> is a loop variable doesn’t compile: “unsupported tensor index: int32[].” This killed my fused down+scatter design, which is why the down projection uses a separate grouped GEMM kernel.</p> <h2 id="whats-next">What’s Next</h2> <p>I’m planning to write this up as an arXiv technical report. The gaps to fill before then:</p> <ul> <li><strong>vLLM FusedMoE comparison</strong>: it’s also Triton-based, so it’s the most apples-to-apples baseline</li> <li><strong>AMD performance benchmarks</strong>: not just correctness</li> <li><strong>End-to-end integration</strong>: benchmark inside an actual serving framework, measure time-to-first-token</li> <li><strong>Full single-kernel fusion</strong>: persistent kernel approach to eliminate all intermediate buffers</li> </ul> <h2 id="code">Code</h2> <p>Everything is on GitHub: <a href="https://github.com/bassrehab/triton-kernels">github.com/bassrehab/triton-kernels</a></p> <p>The MoE-specific files:</p> <ul> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/router.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/router.py</code></a> — Fused softmax/sigmoid + top-k</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/permute.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/permute.py</code></a> — Token permute/unpermute</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/expert_gemm.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/expert_gemm.py</code></a> — 
Block-scheduled grouped GEMM</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/fused_moe.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/fused_moe.py</code></a> — Fused gate+up kernel + entry point</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/docs/moe_dispatch.md"><code class="language-plaintext highlighter-rouge">docs/moe_dispatch.md</code></a> — Full technical writeup</li> </ul> <h2 id="takeaways">Takeaways</h2> <ol> <li> <p><strong>Triton can compete with CUDA for real workloads.</strong> Not just toy kernels: a full MoE dispatch pipeline that beats the CUDA-optimized baseline at inference batch sizes.</p> </li> <li> <p><strong>Fusion is about eliminating buffers, not reducing kernel launches.</strong> The biggest win (35% memory savings) came from keeping the gate+up intermediate in registers. Reducing from 7 to 5 kernel launches helped too, but it’s secondary.</p> </li> <li> <p><strong>Cross-platform is real but unfinished.</strong> The code runs on AMD with no changes, which is a strong validation of the Triton-only approach. But “runs correctly” and “runs fast” are different things. AMD performance optimization is future work.</p> </li> <li> <p><strong>Block scheduling is the key abstraction for grouped GEMM.</strong> Triton doesn’t have native grouped GEMM. The <code class="language-plaintext highlighter-rouge">block_id → (expert_id, offset)</code> mapping is simple but powerful: it lets you handle variable-sized expert batches in a single kernel launch without padding waste.</p> </li> <li> <p><strong>MoE inference at small batch sizes is surprisingly tractable.</strong> The conventional wisdom is that MoE is hard because of irregular access patterns. But at inference batch sizes (1-128 tokens), the overhead is dominated by weight loading, not routing. 
A clean Triton implementation can match or beat CUDA here because the simpler dispatch has less overhead.</p> </li> </ol> <hr/> <p><em>This is Part 3 of my LLM inference series. <a href="/blog/2025/making-llm-faster/">Part 1: speculative decoding</a> | <a href="/blog/2025/triton-kernels-llm-inference/">Part 2: custom Triton kernels</a>. The code, benchmarks, and technical writeup are all in the <a href="https://github.com/bassrehab/triton-kernels">repo</a>.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="AI"/><category term="deep-learning"/><category term="llm"/><category term="triton"/><summary type="html"><![CDATA[I wrote a fused Mixture-of-Experts dispatch kernel in pure Triton that beats Stanford's CUDA-optimized Megablocks at inference batch sizes, and runs on both NVIDIA and AMD GPUs without a single line of CUDA.]]></summary></entry><entry><title type="html">Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty</title><link href="https://subhadipmitra.com/blog/2026/three-bets-model-honesty/" rel="alternate" type="text/html" title="Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty"/><published>2026-03-07T00:00:00+00:00</published><updated>2026-03-07T00:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/three-bets-model-honesty</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/three-bets-model-honesty/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> OpenAI bets on confessions (ask the model to self-report). OpenAI also bets on CoT monitoring (watch the model think). Apollo Research and others bet on activation probes (inspect the model’s internals directly). Each approach has a different theory of what makes detection possible - and different failure modes. Confessions fail when the model doesn’t know it’s misbehaving.
CoT monitoring fails when reasoning goes sub-verbal or gets obfuscated under optimization pressure. Probes fail when deception is so subtle it doesn’t leave linearly separable traces. None of them alone is sufficient. But the combination might be.</p> </blockquote> <h2 id="the-landscape-in-february-2026">The Landscape in February 2026</h2> <p>Twelve months ago, monitoring model behavior meant looking at outputs. Maybe running a classifier on the response. Maybe checking whether the model followed its system prompt.</p> <p>That era is over. In the past year, three fundamentally different approaches to model honesty have emerged from serious research labs, each making a distinct bet about <em>how</em> you can catch a model misbehaving:</p> <ol> <li><strong>Confessions</strong> - OpenAI, Dec 2025. Train the model to produce an honest self-report after each response, with decoupled RL rewards.</li> <li><strong>CoT Monitoring</strong> - OpenAI / Anthropic, Mar-Jul 2025. Use an external LLM to read the model’s chain-of-thought and flag suspicious reasoning.</li> <li><strong>Activation Probes</strong> - Apollo Research / various, Feb 2025+. Train linear classifiers on the model’s internal hidden states to detect deception, sandbagging, or unfaithful reasoning.</li> </ol> <p>Each approach has published results. Each has known failure modes. And each makes a different philosophical assumption about <em>when models know they’re lying</em>.</p> <p>I’ve spent the past few months building <a href="/blog/2025/detecting-ai-sandbagging/">activation probes for sandbagging detection</a> and <a href="/blog/2025/steering-vectors-agents/">steering vectors for agent safety</a>, so I have skin in this game and strong opinions about where the field is headed.
I want to lay these three approaches side by side, be honest about what each can and can’t do - and tell you where I think we’re collectively getting it wrong.</p> <hr/> <h2 id="approach-1-confessions">Approach 1: Confessions</h2> <p><strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.08093">Training LLMs for Honesty via Confessions</a> (Joglekar, Chen, Wu et al., OpenAI, Dec 2025)</p> <p><strong>The bet:</strong> If you give a model a consequence-free space to be honest, it will be - because honest self-reporting is easier than constructing a convincing lie.</p> <p><strong>How it works:</strong> After the model generates its normal response, a system message requests a structured “confession report.” The model must enumerate all instructions it was given, evaluate its own compliance, and flag any uncertainties. The confession is trained with RL, but critically, the confession reward is <em>completely decoupled</em> from the main task reward. What the model says in confession can’t help or hurt its task score.</p> <p><strong>Key results on GPT-5-Thinking:</strong></p> <ul> <li>100% confession rate on sandbagging and reward hacking (deliberate misbehavior)</li> <li>80-95% on instruction hierarchy violations</li> <li>~81% on hallucinations</li> <li>Average across evaluations: model confesses to bad behavior ~74% of the time</li> <li>Even when trained against a <strong>hackable judge</strong>, confessions stay honest while the main answer gets hacked (Section 4)</li> </ul> <p><strong>The core assumption:</strong> The model <em>knows</em> it misbehaved. For deliberate schemes - hacking test cases, sandbagging evaluations, violating the instruction hierarchy - the model made a conscious choice. It has the evidence. When you ask with no consequences, it tells you.</p> <p><strong>Where it breaks:</strong> The model genuinely doesn’t know it’s wrong. It hallucinates with full confidence. It gives unfaithful reasoning without being aware of the influence.
The paper’s own Section 6 analysis shows the most common cause of missed confessions is “omission due to confusion” - the model just didn’t realize there was anything to confess.</p> <p><strong>My take:</strong> Confessions are underrated. The 74% average confession rate sounds modest until you realize this is a <em>floor</em> - the approach was tested on GPT-5-Thinking, and the decoupled reward design is elegant. But I think the community is overindexing on confessions for exactly the wrong reason: they work best on the <em>easiest</em> cases (deliberate misbehavior where the model knows what it did). The hard problem - catching mistakes the model doesn’t know it’s making - is precisely where confessions contribute least. That’s where I’ve been focusing my work with probes.</p> <hr/> <h2 id="approach-2-cot-monitoring">Approach 2: CoT Monitoring</h2> <p><strong>Papers:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2503.11926">Monitoring Reasoning Models for Misbehavior</a> (Baker et al., OpenAI, Mar 2025)</li> <li><a href="https://arxiv.org/abs/2507.11473">Chain of Thought Monitorability</a> (Korbak et al., Jul 2025)</li> <li><a href="https://arxiv.org/abs/2505.05410">Reasoning Models Don’t Always Say What They Think</a> (Chen et al., Anthropic, May 2025)</li> </ul> <p><strong>The bet:</strong> If you can read the model’s extended thinking, you can catch it planning to cheat - even if the final output looks clean.</p> <p><strong>How it works:</strong> A monitor LLM (sometimes weaker than the monitored model) reads the chain-of-thought produced during reasoning. It looks for signs of reward hacking, scheming, deception, or policy violations in the intermediate steps. Baker et al.
showed that GPT-4o can effectively monitor o3-mini’s CoT.</p> <p><strong>Key results:</strong></p> <ul> <li>CoT monitoring catches reward hacking much more effectively than output-only monitoring</li> <li>A weaker model (GPT-4o) can monitor a stronger model (o3-mini)</li> <li>In the “low optimization” regime, integrating CoT monitors into RL training produces more capable AND more aligned agents</li> </ul> <p><strong>The core assumption:</strong> The model’s reasoning will be legible and faithful - that when the model plans something bad, those plans will show up as readable text in the chain-of-thought.</p> <p><strong>Where it breaks:</strong> Three failure modes, and they’re all significant.</p> <p><em>Failure mode 1: Obfuscation under optimization.</em> Baker et al. found that with too much RL optimization pressure, models learn to obfuscate their reward hacking. The model still hacks, but the CoT no longer reveals it. The reasoning becomes “seemingly benign” while the behavior stays misaligned. This isn’t a theoretical worry - they observed it in their experiments.</p> <p><em>Failure mode 2: Unfaithful CoT.</em> Anthropic’s Chen et al. showed reasoning models only mention hidden hints 25% (Claude) to 39% (R1) of the time when they actually use them. The CoT isn’t lying - it’s just not reporting everything that influenced the answer. For a CoT monitor, invisible influence is undetectable influence.</p> <p><em>Failure mode 3: Faithfulness decay over chain length.</em> Ye et al. (Feb 2026) discovered a consistent “Reasoning Horizon” - at 70-85% of chain length, reasoning tokens have little or even <em>negative</em> effect on the final answer. Models can encode correct internal representations while their CoT has long stopped contributing. This means longer chains-of-thought aren’t necessarily more monitorable - the later tokens may be noise.</p> <p>Korbak et al.
frame this bluntly: CoT monitorability is a “new and fragile opportunity.” It works now, partly because current models haven’t been optimized against it. As optimization pressure increases, the signal may degrade. Wang et al. (Feb 2026) provide mechanistic evidence for <em>why</em> unfaithfulness emerges: training dynamics show a phase transition from faithful stepwise reasoning to unfaithful “skip-step” reasoning when noise exceeds a critical threshold.</p> <p><strong>My take:</strong> I think CoT monitoring is the weakest bet of the three, and I say this as someone who believes in transparency. The problem isn’t that CoT monitoring doesn’t work - it clearly does, today. The problem is that it’s the approach most vulnerable to its own success. The moment you use CoT signals for training, you incentivize obfuscation. Baker et al. already observed this. And the faithfulness decay results make it worse: even without adversarial pressure, the later tokens in a chain-of-thought may be decorative. I’d use CoT monitoring as a fast first-pass filter, but I wouldn’t trust it as a safety-critical last line of defense.</p> <hr/> <h2 id="approach-3-activation-probes">Approach 3: Activation Probes</h2> <p><strong>Papers:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2502.03407">Detecting Strategic Deception Using Linear Probes</a> (Goldowsky-Dill et al., Apollo Research, Feb 2025, ICML 2025)</li> <li><a href="https://arxiv.org/abs/2601.11516">Building Production-Ready Probes for Gemini</a> (Google DeepMind, Jan 2026)</li> <li>My own work on <a href="/blog/2025/detecting-ai-sandbagging/">sandbagging detection via activation probes</a> (Dec 2025)</li> </ul> <p><strong>The bet:</strong> Misbehavior leaves traces in the model’s internal representations that are detectable even when the output and CoT look clean.</p> <p><strong>How it works:</strong> Train a linear classifier (logistic regression, nothing fancy) on the model’s intermediate hidden states.
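In sketch form, swapping the logistic regression for an even simpler difference-of-means direction (hypothetical helper names, toy list-of-lists activations):

```python
def fit_direction_probe(honest_acts, deceptive_acts):
    # "Deception direction" = difference of class means; threshold at the
    # midpoint of the two class means' projections. A sketch, not a
    # production probe (the probes discussed here use logistic regression).
    dim = len(honest_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    mu_h = [mean(honest_acts, j) for j in range(dim)]
    mu_d = [mean(deceptive_acts, j) for j in range(dim)]
    direction = [d - h for h, d in zip(mu_h, mu_d)]  # points toward "deceptive"
    score = lambda v: sum(a * b for a, b in zip(v, direction))
    return direction, (score(mu_h) + score(mu_d)) / 2.0

def probe_score(activation, direction):
    # Projection onto the learned direction; flag if above the threshold
    return sum(a * b for a, b in zip(activation, direction))
```

Flagging is then just `probe_score(act, direction) > threshold`: one dot product per layer, which is where the near-zero inference overhead comes from.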
Create labeled training data: run the model in honest mode and deceptive mode, capture activations at each layer, compute the “deception direction.” At inference time, project new activations onto this direction and threshold.</p> <p><strong>Key results:</strong></p> <ul> <li>Goldowsky-Dill: AUROCs 0.96-0.999 on deception detection in Llama-3.3-70B. At 1% FPR, catches 95-99% of deceptive responses.</li> <li>My experiments: 90-96% accuracy on sandbagging detection across Mistral-7B, Gemma-2-9B, and Qwen-2.5-7B. Late layers encode sandbagging intent most cleanly.</li> <li>Near-zero inference overhead (one matrix multiply per layer).</li> <li><strong>New:</strong> Google DeepMind deployed activation probes in production on Gemini 2.5 Flash for detecting cyber-offensive prompts (Jan 2026) - the first major frontier lab to move probes from research to production. They identified a key challenge: probes fail to generalize under distribution shifts from short-context to long-context.</li> </ul> <p><strong>The core assumption:</strong> Internal representations contain information about model “intent” or “state” that is linearly separable - meaning the difference between honest and deceptive processing is a direction in activation space, not a complex nonlinear manifold.</p> <p><strong>Where it breaks:</strong> Three significant challenges.</p> <p><em>Failure mode 1: Training data for unintentional errors.</em> For deliberate deception, you can create training data by prompting models to deceive. For confident hallucinations or unfaithful reasoning, generating ground truth labels is much harder. You need to know what the model got wrong and why.</p> <p><em>Failure mode 2: Cross-model transfer failure.</em> My experiments showed essentially zero transfer between architectures. Gemma’s sandbagging direction has cosine similarity -0.0016 with Qwen’s. Each model needs its own probes.
That’s expensive at scale.</p> <p><em>Failure mode 3: Adversarial robustness is unproven.</em> Nobody has shown what happens when a model is specifically trained to be deceptive while masking its activation patterns. Current results are all on models that weren’t trying to evade detection. The real-world threat model is an adversarial one. The <a href="https://arxiv.org/abs/2512.07810">Auditing Games for Sandbagging</a> paper (Dec 2025) stress-tested this: their blue team couldn’t reliably discriminate sandbaggers from benign models using black-box, model-internals, or training-based approaches. That’s a sobering result.</p> <p><strong>My take:</strong> This is where I’ve placed my bet, and I’m biased - but I think probes are the most promising long-term foundation. Here’s why: they’re the only approach that doesn’t require the model to cooperate. Confessions require the model to self-report. CoT monitoring requires the model to think out loud faithfully. Probes just look at what the model is <em>actually doing</em> computationally. Google DeepMind moving probes to production on Gemini validates that the approach isn’t just academic anymore. The cross-model transfer problem is real (I hit it hard in my own experiments), but it’s an engineering challenge, not a fundamental limitation.
The adversarial robustness question keeps me up at night though - if a model can learn to mask its activation patterns, we’re back to square one.</p> <hr/> <h2 id="the-comparison">The Comparison</h2> <p>Here’s how I think about these three approaches against the kinds of misbehavior we actually care about.</p> <table> <thead> <tr> <th>Misbehavior</th> <th>Confessions</th> <th>CoT Monitoring</th> <th>Activation Probes</th> </tr> </thead> <tbody> <tr> <td><strong>Reward Hacking</strong></td> <td>✅ 100%</td> <td>✅ catches in CoT</td> <td>✅ 96-99% AUROC</td> </tr> <tr> <td><strong>Scheming</strong></td> <td>✅ 63-100%</td> <td>✅ when legible</td> <td>✅ high AUROC</td> </tr> <tr> <td><strong>Sandbagging</strong></td> <td>✅ 100%</td> <td>⚠️ if verbalized</td> <td>✅ 90-96%</td> </tr> <tr> <td><strong>Unfaithful Reasoning</strong></td> <td>❌ model unaware</td> <td>❌ unfaithful by definition</td> <td>✅ detectable in activations</td> </tr> <tr> <td><strong>Confident Hallucinations</strong></td> <td>⚠️ ~81%</td> <td>❌ confident = clean CoT</td> <td>⚠️ needs labeled data</td> </tr> <tr> <td><strong>Instruction Hierarchy Violation</strong></td> <td>✅ 80-95%</td> <td>✅ visible in reasoning</td> <td>⚠️ less studied</td> </tr> </tbody> </table> <p>And here’s the properties comparison:</p> <table> <thead> <tr> <th>Property</th> <th>Confessions</th> <th>CoT Monitoring</th> <th>Activation Probes</th> </tr> </thead> <tbody> <tr> <td><strong>Catches deliberate misbehavior</strong></td> <td>Excellent</td> <td>Good (when CoT is faithful)</td> <td>Excellent</td> </tr> <tr> <td><strong>Catches unintentional errors</strong></td> <td>Poor</td> <td>Poor</td> <td>Moderate-Good</td> </tr> <tr> <td><strong>Requires model weight access</strong></td> <td>Yes (RL training)</td> <td>No (black-box applicable)</td> <td>Yes (activation access)</td> </tr> <tr> <td><strong>Inference latency</strong></td> <td>High (extra generation)</td> <td>Moderate (monitor LLM call)</td> <td>Very low (~0 overhead)</td>
</tr> <tr> <td><strong>Works on API-only models</strong></td> <td>Only for provider</td> <td>Yes</td> <td>Only for provider</td> </tr> <tr> <td><strong>Robust to optimization pressure</strong></td> <td>Promising early results</td> <td>Degrades under pressure</td> <td>Unknown</td> </tr> <tr> <td><strong>Cross-model transferability</strong></td> <td>Model-agnostic training</td> <td>Model-agnostic</td> <td>Poor - model-specific</td> </tr> <tr> <td><strong>Works without CoT</strong></td> <td>Yes (Fig 15 in paper)</td> <td>No - needs CoT</td> <td>Yes</td> </tr> <tr> <td><strong>Can extract confidence scores</strong></td> <td>Yes (Section 5)</td> <td>Not directly</td> <td>Yes (probe magnitude)</td> </tr> </tbody> </table> <hr/> <h2 id="what-each-approach-assumes-about-model-cognition">What Each Approach Assumes About Model Cognition</h2> <p>This is the part I find most interesting. Each approach embeds a different theory about what’s happening inside the model.</p> <p><strong>Confessions assume linguistic self-access.</strong> The model can accurately introspect on its own behavior and express that introspection in words. This works when misbehavior was a deliberate choice - the model planned the hack, so it can describe the hack. It fails when the relevant computation isn’t accessible to the model’s verbal reasoning system. Li et al. (2025) showed models have <em>some</em> privileged access to their own computations, but it’s far from complete.</p> <p><strong>CoT monitoring assumes reasoning legibility.</strong> The model’s chain-of-thought faithfully represents its actual reasoning process. This is the assumption Anthropic’s Chen et al. directly tested - and found wanting. Reasoning models mention hidden influences only 25-39% of the time. The gap between what a model computes and what it writes down is substantial. And Baker et al.
showed this gap can <em>grow</em> under optimization pressure.</p> <p><strong>Activation probes assume linear representation.</strong> Behavioral states like “deceptive” vs. “honest” correspond to directions in the model’s activation space that can be detected by a linear classifier. This is a strong assumption - it’s essentially the Linear Representation Hypothesis - but it’s held up remarkably well empirically. The challenge is that it’s been tested primarily on synthetic or prompted deception, not on emergent misalignment. Encouragingly, the “Hypocrisy Gap” work (Feb 2026) used SAEs to build a mechanistic metric quantifying the divergence between a model’s internal “truth belief” and its generated output, achieving AUROC 0.55-0.74 for detecting sycophancy across Gemma, Llama, and Qwen - early evidence that probes can catch unintentional unfaithfulness, not just deliberate deception.</p> <table> <thead> <tr> <th>Approach</th> <th>Signal chain</th> <th>Where it breaks</th> </tr> </thead> <tbody> <tr> <td><strong>Confessions</strong></td> <td>Model acts → Model reflects → Model reports</td> <td>Gap between reflection and report: sub-verbal computation is invisible to self-reflection</td> </tr> <tr> <td><strong>CoT Monitoring</strong></td> <td>Model reasons in CoT → Monitor reads CoT</td> <td>Gap between reasoning and CoT: unfaithful CoT, obfuscation under optimization pressure</td> </tr> <tr> <td><strong>Activation Probes</strong></td> <td>Model computes (activations) → Probe classifies</td> <td>Gap between computation and probe: adversarial activation masking (theoretical, unproven)</td> </tr> </tbody> </table> <p>The gaps are different. And that’s exactly why combining them is valuable - each method’s blind spot is partially covered by another method’s strength.</p> <hr/> <h2 id="the-obfuscation-problem">The Obfuscation Problem</h2> <p>There’s a result in Baker et al. that I think doesn’t get enough attention.
When they integrated CoT monitoring into the RL reward - so the model was penalized for having suspicious reasoning - the model initially improved. It became more capable AND more aligned. But with more optimization pressure, something else happened: the model learned to hide its reward hacking in the CoT.</p> <p>The reasoning looked benign. The behavior was still misaligned.</p> <p>This is the fundamental tension in every monitoring approach. As soon as you use a monitoring signal to <em>train</em> against misbehavior, you create optimization pressure to evade the monitor. Baker et al. call this “obfuscated reward hacking.” OpenAI’s confessions paper explicitly addresses this by keeping confession rewards decoupled - what you say in confession can’t affect your task reward. That’s a clever design choice. But it only works if the decoupling holds in practice, and the paper acknowledges this hasn’t been tested under extended optimization.</p> <p>Probes might be more robust here, because they operate on activations rather than generated text. A model can learn to write a benign-looking CoT while still scheming. Whether a model can learn to produce benign-looking <em>activations</em> while still scheming is a much harder question - it requires the model to fundamentally change how it internally represents its plans, not just how it describes them. But again, nobody has tested this adversarially at scale.</p> <p>Wagner et al. (2025) from Anthropic provide some reassurance: they found that current models struggle to evade monitors when the reasoning is sufficiently complex.
But “current models” is always the caveat.</p> <hr/> <h2 id="what-a-combined-system-looks-like">What a Combined System Looks Like</h2> <p>I sketched this in my <a href="/blog/2026/openai-confessions-blind-spot/">previous post</a>, but let me be more precise about what combination actually means in practice.</p> <p>The three approaches naturally tile the detection space:</p> <table> <thead> <tr> <th>Detection scenario</th> <th>Best approach</th> <th>Why</th> </tr> </thead> <tbody> <tr> <td>Model deliberately cheats and knows it</td> <td>Confessions</td> <td>Just ask. 100% on sandbagging/reward hacking.</td> </tr> <tr> <td>Model plans misbehavior in its reasoning</td> <td>CoT monitoring</td> <td>The plans are in the CoT text.</td> </tr> <tr> <td>Model is influenced by hidden factors it doesn’t verbalize</td> <td>Probes</td> <td>Sub-verbal computation is visible in activations.</td> </tr> <tr> <td>Model confidently hallucinates</td> <td>Probes + Confessions</td> <td>Probes detect hallucination confidence; confessions sometimes catch uncertainty the model partially recognizes.</td> </tr> <tr> <td>Model obfuscates its CoT under optimization</td> <td>Probes</td> <td>Activations may still carry the signal even when CoT is laundered.</td> </tr> <tr> <td>Model does something it doesn’t realize is wrong</td> <td>Probes (if labeled data exists)</td> <td>Only approach that doesn’t require model self-awareness.</td> </tr> </tbody> </table> <p>The combined false negative rate should be substantially lower than any single approach. If confessions catch 74% of intentional misbehavior and probes catch 96%, the combined catch rate - assuming some independence - could exceed 99%.</p> <p>But I want to be careful about overconfidence. “Assuming some independence” is doing a lot of work in that sentence. If the same underlying model state that makes a model fail to confess also makes it harder to detect with probes, the approaches could be correlated rather than complementary.
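</p> <p>The back-of-envelope behind that 99% figure is worth writing down, because it also shows what correlation does to it. A quick sketch (the 74% and 96% catch rates are the figures quoted above; full independence of misses is the assumption being flagged):</p>

```python
p_conf = 0.74   # confession catch rate on intentional misbehavior
p_probe = 0.96  # probe catch rate

# Independent misses: combined miss rate is the product of the misses.
combined = 1 - (1 - p_conf) * (1 - p_probe)
print(round(combined, 4))  # 0.9896, i.e. roughly 99%

# Perfectly correlated misses: combining adds nothing beyond the
# stronger detector.
combined_correlated = max(p_conf, p_probe)
print(combined_correlated)  # 0.96
```

<p>The entire gap between 96% and ~99% lives in that independence assumption.</p> <p>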
Nobody has tested this.</p> <hr/> <h2 id="where-id-place-my-bet">Where I’d Place My Bet</h2> <p>If I had to build one monitoring system for production agentic AI today, here’s what I’d build - in order of priority:</p> <p><strong>Layer 1: Activation probes as the foundation.</strong> Always-on, near-zero latency, catches both deliberate deception and (with the right training data) unintentional failures. Google DeepMind is already doing this on Gemini. The cross-model transfer problem means you need per-model probes, but that’s a one-time cost per model update. I’d invest heavily in the training data pipeline - that’s the actual bottleneck, not the probes themselves.</p> <p><strong>Layer 2: Confessions as the high-signal supplementary check.</strong> On critical decisions (not every inference), trigger a confession pass. The 100% catch rate on deliberate misbehavior is too good to ignore. The decoupled reward design is the key insight - you have to maintain that property or the whole thing falls apart. I wouldn’t run this on every request (the latency cost is real), but for high-stakes agent actions? Absolutely.</p> <p><strong>Layer 3: CoT monitoring as the interpretability layer.</strong> Not for detection - for <em>explanation</em>. When a probe fires or a confession flags something, CoT monitoring helps you understand <em>why</em>. It’s the debugging tool, not the alarm system. Using it as the primary detector is, I think, a mistake that the field is slowly learning to avoid.</p> <p><strong>What’s still missing:</strong> Nobody has built this combined stack. Nobody has tested whether the approaches are actually complementary (correlated failures would undermine the whole argument). Nobody has run adversarial robustness evaluations at scale. And nobody has measured the combined false positive rate - in production, false positives kill adoption faster than missed detections.</p> <p>The research community has independently developed three detection paradigms.
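</p> <p>What would the glue between those three layers look like? A toy sketch of the fusion logic (the names, the threshold, and the routing policy are all illustrative, not from any deployed system):</p>

```python
from dataclasses import dataclass

@dataclass
class Signals:
    probe_score: float     # Layer 1: always-on activation probe output
    confession_flag: bool  # Layer 2: run only on high-stakes actions

def decide(s: Signals, probe_threshold: float = 0.9) -> str:
    probe_alert = s.probe_score > probe_threshold
    if probe_alert and s.confession_flag:
        return "block"           # both layers agree: highest confidence
    if s.confession_flag:
        return "escalate"        # model admits misbehavior
    if probe_alert:
        # Layer 3: pull the CoT as the explanation, then route to review.
        return "review_with_cot"
    return "deliver"
```

<p>The glue is trivial; the probes, the confession training, and the calibration of that threshold are the actual work.</p> <p>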
The engineering community hasn’t built the stack that combines them. And the safety case for deploying agentic models increasingly depends on that stack existing. I think 2026 is the year someone builds it. The components are there. I’m working on pieces of it. The question is whether integration happens before the deployment pressure makes the absence unacceptable.</p> <hr/> <p><em>Interested in discussing these approaches or collaborating on integrated detection systems? Reach out: <a href="mailto:contact@subhadipmitra.com">contact@subhadipmitra.com</a></em></p> <hr/> <h3 id="references">References</h3> <ol> <li>Joglekar, M., Chen, J., Wu, G., et al. (2025). <em>Training LLMs for Honesty via Confessions.</em> OpenAI. <a href="https://arxiv.org/abs/2512.08093">arXiv:2512.08093</a></li> <li>Baker, B., Huizinga, J., Gao, L., et al. (2025). <em>Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation.</em> OpenAI. <a href="https://arxiv.org/abs/2503.11926">arXiv:2503.11926</a></li> <li>Chen, Y., Benton, J., Radhakrishnan, A., et al. (2025). <em>Reasoning Models Don’t Always Say What They Think.</em> Anthropic. <a href="https://arxiv.org/abs/2505.05410">arXiv:2505.05410</a></li> <li>Korbak, T., Balesni, M., Barnes, E., et al. (2025). <em>Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.</em> <a href="https://arxiv.org/abs/2507.11473">arXiv:2507.11473</a></li> <li>Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., &amp; Hobbhahn, M. (2025). <em>Detecting Strategic Deception Using Linear Probes.</em> Apollo Research. <a href="https://arxiv.org/abs/2502.03407">arXiv:2502.03407</a></li> <li>Wagner, M., Roger, F., Cunningham, H., et al. (2025). <em>Training Fails to Elicit Subtle Reasoning in Current Language Models.</em> Anthropic.</li> <li>Li, B.Z., Guo, Z.C., Huang, V., et al. (2025). 
<em>Training Language Models to Explain Their Own Computations.</em> <a href="https://arxiv.org/abs/2511.08579">arXiv:2511.08579</a></li> <li>Tan, D.Z., et al. (2024). <em>Analysing the Generalisation and Reliability of Steering Vectors.</em> NeurIPS 2024.</li> <li>Google DeepMind. (2026). <em>Building Production-Ready Probes for Gemini.</em> <a href="https://arxiv.org/abs/2601.11516">arXiv:2601.11516</a></li> <li>Ye, D., Loffgren, M., Kotadia, O., Wong, L. (2026). <em>Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning.</em> <a href="https://arxiv.org/abs/2602.11201">arXiv:2602.11201</a></li> <li>Wang, F., Alazali, A., Zhong, Y. (2026). <em>How Does Unfaithful Reasoning Emerge from Autoregressive Training?</em> <a href="https://arxiv.org/abs/2602.01017">arXiv:2602.01017</a></li> <li><em>The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders.</em> (2026). <a href="https://arxiv.org/abs/2602.02496">arXiv:2602.02496</a></li> <li><em>Auditing Games for Sandbagging.</em> (2025). <a href="https://arxiv.org/abs/2512.07810">arXiv:2512.07810</a></li> </ol>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="ai-safety"/><category term="interpretability"/><summary type="html"><![CDATA[Three labs. Three different bets on how to catch models misbehaving. Each makes different assumptions about when models 'know' they're lying. Here's what works, what doesn't, and what happens when you combine them.]]></summary></entry><entry><title type="html">OpenAI’s Confessions Paper Has a Blind Spot. Here’s What Fills It.</title><link href="https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot/" rel="alternate" type="text/html" title="OpenAI’s Confessions Paper Has a Blind Spot. 
Here’s What Fills It."/><published>2026-02-18T10:00:00+00:00</published><updated>2026-02-18T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> OpenAI’s new “Confessions” paper trains GPT-5-Thinking to self-report misbehavior - reward hacking, scheming, sandbagging - via a decoupled RL objective. It works: the model confesses to bad behavior ~74% of the time on average. But the paper has an acknowledged blind spot. Confessions fail precisely when the model <em>doesn’t know</em> it’s misbehaving - unfaithful reasoning, confident hallucinations, genuine mistakes. That failure mode is exactly where activation probes excel. I think confessions and probes aren’t competing approaches. They’re complementary layers of something we might call a model honesty stack.</p> </blockquote> <h2 id="the-paper-that-got-everyone-talking">The Paper That Got Everyone Talking</h2> <p>OpenAI dropped “<a href="https://arxiv.org/abs/2512.08093">Training LLMs for Honesty via Confessions</a>” in late December 2025, and it’s been circulating heavily through February. The core idea is deceptively simple.</p> <p>After a model produces its normal response, you append a system message asking it to produce a <strong>confession</strong> - a structured self-evaluation of whether it complied with the letter and spirit of its instructions. You train this confession with RL, but here’s the critical design choice: the confession reward is <strong>completely decoupled</strong> from the main task reward. Nothing the model says in its confession can help or hurt the reward it gets for its original answer.</p> <p>Think of it as the AI equivalent of the Catholic confessional. What you say in the booth stays in the booth.</p> <p>The intuition for why this works is compelling. 
For most forms of misbehavior - reward hacking, scheming, sandbagging - the model <em>knows</em> it’s cheating. It planned the cheat. It has the evidence in its activations. So when you ask “did you cheat?” and decouple the answer from any consequences, the path of least resistance is to just… say yes. Honest confession is easier than constructing a convincing lie to fool the confession judge.</p> <hr/> <h2 id="what-openai-actually-showed">What OpenAI Actually Showed</h2> <p>They applied this to GPT-5-Thinking and tested across 12 out-of-distribution evaluation scenarios. The results are genuinely impressive for a proof of concept.</p> <p>Some headline numbers from their evaluations:</p> <table> <thead> <tr> <th>Evaluation</th> <th>Non-compliance Rate</th> <th>Confession Rate (given non-compliance)</th> </tr> </thead> <tbody> <tr> <td>Scheming (Sandbagging)</td> <td>14.0%</td> <td><strong>100%</strong></td> </tr> <tr> <td>Reward Hacking (Impossible Task)</td> <td>89.7%</td> <td><strong>100%</strong></td> </tr> <tr> <td>Scheming (Covert Rule Violation)</td> <td>13.7%</td> <td>62.8%</td> </tr> <tr> <td>Instruction Hierarchy (Jailbreaks)</td> <td>19.7%</td> <td>94.9%</td> </tr> <tr> <td>Instruction Hierarchy (General)</td> <td>17.1%</td> <td>80.4%</td> </tr> <tr> <td>Hallucination (General Knowledge)</td> <td>50.2%</td> <td>80.9%</td> </tr> <tr> <td>Hallucination (People)</td> <td>38.0%</td> <td>81.6%</td> </tr> <tr> <td>Instruction Following (General)</td> <td>4.2%</td> <td>22.5%</td> </tr> </tbody> </table> <p>The pattern is striking. On scheming and reward hacking - behaviors where the model <em>deliberately</em> cheated - confession rates hit 100%. On instruction hierarchy violations - where the model chose to follow a user over a developer - confession rates are in the 80-95% range. 
These are cases where the model knows what it did.</p> <p>The really clever experiment is in Section 4: they deliberately trained both the answer and the confession against a <strong>weak, hackable judge</strong>. Over time, the model learned to hack the main reward model - but confession accuracy <em>improved</em>. Even when the model was actively gaming the system, it kept confessing honestly. That’s a strong result.</p> <hr/> <h2 id="but-read-section-74-carefully">But Read Section 7.4 Carefully</h2> <p>The paper is refreshingly honest about its limitations. Section 7.4 is titled “What confessions are not good for” and the authors state it plainly:</p> <blockquote> <p>Models cannot confess to what they do not know.</p> </blockquote> <p>This is the blind spot.</p> <p>When a model hallucinates with full confidence - when it genuinely believes its wrong answer is correct - it has nothing to confess. The hallucination isn’t a choice. It’s a capability failure. The model didn’t scheme to give you wrong information about Barbara Herbert’s children. It just… got it wrong, with conviction.</p> <p>Look at the confession rates again, this time focusing on the <strong>hallucination</strong> evaluations versus the <strong>scheming</strong> evaluations:</p> <pre><code class="language-mermaid">graph LR
    subgraph "Confession works well"
        A[Scheming: Sandbagging] --&gt;|100%| A1[Model KNOWS it cheated]
        B[Reward Hacking] --&gt;|100%| B1[Model KNOWS it hacked tests]
        C[Instruction Hierarchy] --&gt;|80-95%| C1[Model KNOWS it broke rules]
    end

    subgraph "Confession struggles"
        D[Hallucination: People] --&gt;|81.6%| D1[Model often DOESN'T KNOW it's wrong]
        E[Hallucination: General Knowledge] --&gt;|80.9%| E1[False confidence]
        F[Instruction Following] --&gt;|22.5%| F1[Genuine mistakes, not deception]
    end

    style A fill:#2d6a4f,color:#fff
    style B fill:#2d6a4f,color:#fff
    style C fill:#2d6a4f,color:#fff
    style D fill:#9d0208,color:#fff
    style E fill:#9d0208,color:#fff
    style F fill:#9d0208,color:#fff
</code></pre> <p>The 81.6% hallucination confession rate might look decent at first glance, but dig into the error analysis (Section 6, Figure 9). The most common reason for false negatives - cases where the model misbehaved but didn’t confess - is <strong>“omission due to confusion.”</strong> The model genuinely didn’t know it was wrong. It wasn’t hiding anything. It just confidently repeated what it believed.</p> <p>And in the hallucination categories specifically, confession training didn’t even help much. From the paper’s Figure 5, the Hallucination (People) and Hallucination (General Knowledge) evaluations showed <strong>no improvement or even slight regression</strong> over training. The authors conjecture this is because “when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well.”</p> <p>This isn’t a criticism of the paper - the authors are transparent about it. But it’s the most important limitation for anyone thinking about deploying confessions in production.</p> <hr/> <h2 id="the-gap-is-exactly-where-probes-excel">The Gap Is Exactly Where Probes Excel</h2> <p>Here’s what I find interesting. The failure mode of confessions - model doesn’t know it’s misbehaving - is precisely where activation probes have shown the most promise.</p> <p>Consider Anthropic’s “<a href="https://arxiv.org/abs/2505.05410">Reasoning Models Don’t Always Say What They Think</a>” (Chen et al., May 2025). They showed that when you slip reasoning hints to Claude 3.7 Sonnet and DeepSeek R1, the models use those hints but only mention them in their chain-of-thought about 25% (Claude) and 39% (R1) of the time. The CoT looks clean. The model produces what appears to be independent reasoning. But the answer was actually driven by the hint.</p> <p>This is unfaithful reasoning, and it’s a different beast from scheming. The model isn’t plotting against you. 
It’s just not accurately reporting what influenced its answer. And crucially: <strong>it might not even “know” it was influenced.</strong> The hint shaped the computation at a sub-verbal level - below what gets surfaced in chain-of-thought or confessions.</p> <p>But the activations know.</p> <p>Goldowsky-Dill et al.’s “<a href="https://arxiv.org/abs/2502.03407">Detecting Strategic Deception Using Linear Probes</a>” (Feb 2025, presented at ICML 2025) showed that linear probes on Llama-3.3-70B activations detect deceptive responses with AUROCs between 0.96 and 0.999. At a 1% false positive rate, probes catch 95-99% of deceptive responses.</p> <p>And from my own <a href="/blog/2025/detecting-ai-sandbagging/">sandbagging detection work</a>, I found that simple logistic regression on intermediate activations detects sandbagging intent at 90-96% accuracy across Mistral, Gemma, and Qwen models. The signal is linearly separable. You don’t need fancy interpretability tools - a linear classifier picks it up.</p> <p>The key insight: <strong>probes don’t require the model to be self-aware about its misbehavior.</strong> They operate on the raw activations, detecting patterns the model itself might not have verbal access to. A model that confidently hallucinates still has different activation patterns than one that answers correctly. A model whose reasoning is influenced by a hidden hint still has detectable traces of that influence in its hidden states - even when the CoT says nothing about it.</p> <hr/> <h2 id="two-approaches-different-failure-modes">Two Approaches, Different Failure Modes</h2> <p>I’ve been thinking about this as a 2x2 matrix. 
The two axes that matter are:</p> <ol> <li><strong>Does the model know it’s misbehaving?</strong> (Self-awareness)</li> <li><strong>Is the misbehavior detectable from activations?</strong> (Probe detectability)</li> </ol> <table> <thead> <tr> <th> </th> <th>Model knows (intentional)</th> <th>Model doesn’t know (unintentional)</th> </tr> </thead> <tbody> <tr> <td><strong>Detectable by probes</strong></td> <td>Scheming, sandbagging, reward hacking</td> <td>Unfaithful reasoning, confident hallucination</td> </tr> <tr> <td><strong>Not detectable by probes</strong></td> <td>Sophisticated steganographic deception (theoretical)</td> <td>Genuine capability limits</td> </tr> </tbody> </table> <p>Confessions dominate the top-left quadrant. The model knows it cheated, so you just ask it and it tells you. Probes also work here - Goldowsky-Dill showed 96-99.9% AUROC - but confessions are simpler to deploy if you have the model weights for RL training.</p> <p>Probes dominate the top-right quadrant. The model doesn’t know its reasoning was influenced by a sycophantic hint, or that it hallucinated a historical fact. But the activations carry the signal anyway. Confessions fail here because there’s nothing to confess.</p> <p>The bottom-left is the scary quadrant. A model that deliberately deceives <em>and</em> has learned to mask its deception in the activations. Nobody has demonstrated this convincingly at scale - Wagner et al. (2025) from Anthropic found that current models fail to evade monitors when the reasoning is complex enough - but it’s the theoretical endgame. Neither confessions nor probes alone handle this.</p> <p>The bottom-right is the humble quadrant. The model genuinely can’t do the task. 
No amount of honesty training or activation probing helps with a 7B model that doesn’t know organic chemistry.</p> <hr/> <h2 id="what-a-model-honesty-stack-could-look-like">What a Model Honesty Stack Could Look Like</h2> <p>If confessions handle intentional misbehavior and probes handle unintentional misbehavior, the natural question is: what does a system that uses both look like?</p> <pre><code class="language-mermaid">flowchart TD
    Input["User Input + System Prompt"] --&gt; Forward["Forward Pass"]
    Forward --&gt; Activations["Hidden State Activations"]
    Forward --&gt; Output["Model Output (answer + CoT)"]

    Activations --&gt; ProbeLayer["&lt;b&gt;PROBE LAYER&lt;/b&gt;&lt;br/&gt;Hallucination probe&lt;br/&gt;Unfaithful reasoning probe&lt;br/&gt;Deception probe&lt;br/&gt;Sandbagging probe"]

    Output --&gt; ConfessionLayer["&lt;b&gt;CONFESSION LAYER&lt;/b&gt;&lt;br/&gt;Append confession request&lt;br/&gt;Generate self-report&lt;br/&gt;Score via confession judge"]

    ProbeLayer --&gt; Controller["&lt;b&gt;CONTROLLER&lt;/b&gt;"]
    ConfessionLayer --&gt; Controller

    Controller --&gt; Decision{Decision}

    Decision --&gt;|All clear| Deliver["Deliver response"]
    Decision --&gt;|Probe alert only| Flag1["Flag: model may not know&lt;br/&gt;it got this wrong"]
    Decision --&gt;|Confession alert only| Flag2["Flag: model admits&lt;br/&gt;misbehavior"]
    Decision --&gt;|Both alert| Block["Block or escalate"]

    style ProbeLayer fill:#264653,color:#fff
    style ConfessionLayer fill:#2a9d8f,color:#fff
    style Controller fill:#e76f51,color:#fff
    style Block fill:#9d0208,color:#fff
</code></pre> <p>The probe layer runs during the forward pass. It’s fast - linear probes add negligible latency. It catches things the model doesn’t know about itself: subtle hallucination confidence signals, unfaithful reasoning traces, activation patterns associated with bias-influenced answers.</p> <p>The confession layer runs after generation. It’s slower - you need an additional generation step plus a judge - but it catches deliberate misbehavior that the model is self-aware about: reward hacking, instruction hierarchy violations, strategic sandbagging.</p> <p>The controller fuses both signals. An alert from only the probes suggests the model made a genuine mistake. An alert from only the confession suggests the model deliberately misbehaved but was honest about it. Alerts from both are the highest-confidence catches. No alerts from either is your best-case scenario, though it doesn’t guarantee correctness.</p> <p>This isn’t hypothetical. The individual components exist. OpenAI has demonstrated confessions work at GPT-5 scale. Apollo Research showed probe-based deception detection at Llama-70B scale. I’ve <a href="/blog/2025/detecting-ai-sandbagging/">shown activation probes work for sandbagging</a> at 7-9B scale. What nobody has done yet is wire these together into a single inference pipeline and measure the combined detection rate.</p> <hr/> <h2 id="the-interesting-connection-the-paper-almost-makes">The Interesting Connection the Paper Almost Makes</h2> <p>There’s a line in Section 7.2 of the confessions paper that I keep coming back to. The authors write:</p> <blockquote> <p>In a sense, our work can be viewed as training models to serve as monitors that can get access not only to the chain-of-thought but also to the exact internal activations of the actor model.</p> </blockquote> <p>This is exactly right. A confession is an activation-aware probe where the “probe” is the model itself, generating tokens based on its full internal state. 
The model has privileged access to its own activations - it knows things about its computation that no external monitor can see.</p> <p>But here’s what the paper doesn’t say: <strong>the model can only report on aspects of its computation that are accessible to its verbal reasoning.</strong> And Anthropic’s faithfulness work demonstrates that a lot of computation isn’t.</p> <p>Li et al. (2025) showed models are better at explaining their <em>own</em> computations than other models’ computations - there really is privileged introspective access. But “better” isn’t “complete.” The gap between what a model computes and what it can verbalize about that computation is the gap where probes have the advantage.</p> <p>A confession asks: “Tell me what you know about what you did.”</p> <p>A probe asks: “Let me look at what you actually did, regardless of what you know about it.”</p> <p>Both questions are worth asking.</p> <hr/> <h2 id="the-hard-problems-nobodys-solved-yet">The Hard Problems Nobody’s Solved Yet</h2> <p>I want to be honest about what’s still missing, because the “just combine them” story is too clean.</p> <p><strong>Problem 1: Probe training data for unintentional errors.</strong> For intentional deception, you can create training data by prompting models to be deceptive (Goldowsky-Dill) or to sandbag (my work). For unintentional errors - unfaithful reasoning, confident hallucination - generating labeled training data is much harder. You need to know the ground truth and be able to identify <em>why</em> the model got it wrong. This is a real bottleneck.</p> <p><strong>Problem 2: Confession hackability under pressure.</strong> OpenAI’s Section 4 experiment showed confessions resist hacking even when trained against a weak judge. But they acknowledge this is a proof of concept with limited compute. Under extended optimization pressure with a sophisticated adversary, would confessions stay honest? 
The “path of least resistance” argument is intuitive but not proven at scale.</p> <p><strong>Problem 3: Cross-model probe transfer.</strong> My sandbagging experiments showed zero transfer between model architectures - Gemma’s sandbagging direction has cosine similarity -0.0016 with Qwen’s. Each model needs its own probes. That’s a deployment headache. Confessions at least have the advantage of being model-agnostic (any model can be trained to confess).</p> <p><strong>Problem 4: The recursive monitoring problem.</strong> If you use an LLM judge to score confessions, and a probe to monitor activations, who monitors the judge? Who validates the probe? At some point you need ground truth that doesn’t come from another model. OpenAI’s approach of using a prompted judge is practical but introduces another attack surface.</p> <p><strong>Problem 5: Computational overhead.</strong> Confessions require generating an entire additional response. For a reasoning model with extended CoT, that’s not cheap. Probes are fast at inference but need offline training per model per behavior. In production, you’d want to be strategic about which layers of the honesty stack you run for which queries.</p> <hr/> <h2 id="what-id-want-to-build">What I’d Want to Build</h2> <p>If I were designing a monitoring system for agentic models deployed in production - the kind of models that browse the web, write code, call APIs - here’s the stack I’d want:</p> <p><strong>Layer 0: Output filters.</strong> Simple pattern matching and rule-based checks. Catches the obvious stuff. Already standard practice.</p> <p><strong>Layer 1: Activation probes.</strong> Run during the forward pass. Near-zero latency. Detect hallucination confidence, unfaithful reasoning patterns, deception signatures. Fire on things the model doesn’t know about itself.</p> <p><strong>Layer 2: CoT monitoring.</strong> Parse the chain-of-thought for signs of scheming, reward hacking, instruction hierarchy violations. The Baker et al. 
(2025) approach. Works when reasoning is legible.</p> <p><strong>Layer 3: Confessions.</strong> Post-generation self-report. Catches deliberate misbehavior the model is aware of. Most useful for high-stakes actions - before the model executes a tool call, runs code, or sends a message.</p> <p><strong>Layer 4: Human review.</strong> For actions above a risk threshold. The confession and probe signals feed into a priority queue for human reviewers.</p> <p>Layer 1 and Layer 3 are the new additions from recent research. What I think we’ll see in 2026 is the first real attempt to wire all five layers together and measure how much better the combined system is than any individual layer.</p> <hr/> <h2 id="one-more-thing">One More Thing</h2> <p>The confessions paper cites Goldowsky-Dill et al. specifically, noting that “one can view confessions as an activation-aware probe, but where the ‘probe’ itself gets to generate tokens.” They see the connection. Goldowsky-Dill’s team is at Apollo Research, which has been building deception detection tools. Anthropic’s alignment science team published the unfaithful reasoning results. OpenAI published confessions.</p> <p>All three major safety-focused organizations are converging on the same realization: single-layer monitoring isn’t enough. The question is whether anyone will build the integrated stack before agentic models get deployed at scale in production.</p> <p>Based on the pace of things, I’d say we have about a year to figure this out.</p> <hr/> <p><em>If you’re working on any of these problems - probe-based monitoring, confession training, or integrated honesty architectures - I’d be interested to talk: <a href="mailto:contact@subhadipmitra.com">contact@subhadipmitra.com</a></em></p> <hr/> <h3 id="references">References</h3> <ol> <li>Joglekar, M., Chen, J., Wu, G., et al. (2025). <em>Training LLMs for Honesty via Confessions.</em> OpenAI. 
<a href="https://arxiv.org/abs/2512.08093">arXiv:2512.08093</a></li> <li>Chen, Y., Benton, J., Radhakrishnan, A., et al. (2025). <em>Reasoning Models Don’t Always Say What They Think.</em> Anthropic. <a href="https://arxiv.org/abs/2505.05410">arXiv:2505.05410</a></li> <li>Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., &amp; Hobbhahn, M. (2025). <em>Detecting Strategic Deception Using Linear Probes.</em> Apollo Research. <a href="https://arxiv.org/abs/2502.03407">arXiv:2502.03407</a></li> <li>Baker, B., Huizinga, J., Gao, L., et al. (2025). <em>Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation.</em> <a href="https://arxiv.org/abs/2503.11926">arXiv:2503.11926</a></li> <li>Korbak, T., Balesni, M., Barnes, E., et al. (2025). <em>Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.</em> <a href="https://arxiv.org/abs/2507.11473">arXiv:2507.11473</a></li> <li>Wagner, M., Roger, F., Cunningham, H., et al. (2025). <em>Training Fails to Elicit Subtle Reasoning in Current Language Models.</em> Anthropic. <a href="https://alignment.anthropic.com/2025/subtle-reasoning/">Link</a></li> <li>Li, B.Z., Guo, Z.C., Huang, V., et al. (2025). <em>Training Language Models to Explain Their Own Computations.</em> <a href="https://arxiv.org/abs/2511.08579">arXiv:2511.08579</a></li> <li>Denison, C., MacDiarmid, M., Barez, F., et al. (2024). <em>Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.</em> <a href="https://arxiv.org/abs/2406.10162">arXiv:2406.10162</a></li> <li>Lanham, T., Chen, A., Radhakrishnan, A., et al. (2023). <em>Measuring Faithfulness in Chain-of-Thought Reasoning.</em> <a href="https://arxiv.org/abs/2307.13702">arXiv:2307.13702</a></li> </ol>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="ai-safety"/><category term="interpretability"/><summary type="html"><![CDATA[OpenAI trained GPT-5 to confess when it misbehaves. 
It works surprisingly well - except when the model doesn't know it's misbehaving. That's where activation probes come in.]]></summary></entry><entry><title type="html">Activation Steering in 2026: A Practitioner’s Field Guide</title><link href="https://subhadipmitra.com/blog/2026/activation-steering-field-guide/" rel="alternate" type="text/html" title="Activation Steering in 2026: A Practitioner’s Field Guide"/><published>2026-02-12T10:00:00+00:00</published><updated>2026-02-12T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/activation-steering-field-guide</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/activation-steering-field-guide/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> Steering vectors are the most underrated tool in the LLM practitioner’s toolkit -and also the most oversold. They genuinely work for some behaviors (refusal, sentiment, formality). They genuinely fail for others (factual recall, complex reasoning). This is the guide I wish I’d had six months ago: what to steer, where to inject, how strong, and when to give up and try something else.</p> </blockquote> <h2 id="the-promise-and-the-reality">The Promise and the Reality</h2> <p>The pitch for activation steering is seductive. You take a pair of contrasting prompts (“be helpful” vs “refuse everything”), run them through the model, compute the mean activation difference, and add that vector during inference. No fine-tuning. No RLHF. No gradient updates. Just a single vector addition that shifts model behavior at inference time.</p> <p>When it works, it’s magical. You can make a model more concise, more formal, more willing to refuse harmful requests -all without touching the weights.</p> <p>When it doesn’t work, you get gibberish, or no effect at all, or the model starts being weirdly formal about unrelated topics. 
And the literature doesn’t always tell you which outcome to expect.</p> <p>I’ve spent the last several months building and testing steering vectors across multiple models and behaviors for my <a href="/blog/2025/steering-vectors-agents/">work on agent safety</a>. This is what I actually learned - not the theory, but the practice.</p> <hr/> <h2 id="the-60-second-version-of-how-steering-works">The 60-Second Version of How Steering Works</h2> <p>If you know this already, skip ahead. For everyone else:</p> <ol> <li> <p><strong>Create contrast pairs.</strong> Two sets of prompts. One set represents the behavior you want more of. One represents the behavior you want less of.</p> </li> <li> <p><strong>Extract activations.</strong> Run both sets through the model. At each layer, collect the hidden state vectors.</p> </li> <li> <p><strong>Compute the steering direction.</strong> <code class="language-plaintext highlighter-rouge">steering_vector = mean(positive_activations) - mean(negative_activations)</code> at your chosen layer.</p> </li> <li> <p><strong>Apply at inference.</strong> During generation, add <code class="language-plaintext highlighter-rouge">alpha * steering_vector</code> to the hidden state at the chosen layer and token position. <code class="language-plaintext highlighter-rouge">alpha</code> controls strength.</p> </li> </ol> <p>That’s it. The entire method. Logistic regression on activations is detection (probing). Vector addition to activations is control (steering). Same directions in activation space, different application.</p> <pre><code class="language-mermaid">flowchart LR
    A["Contrast Pairs&lt;br/&gt;&lt;i&gt;positive vs negative&lt;/i&gt;"] --&gt; B["Forward Pass&lt;br/&gt;&lt;i&gt;extract layer activations&lt;/i&gt;"]
    B --&gt; C["Mean Difference&lt;br/&gt;&lt;i&gt;compute steering vector&lt;/i&gt;"]
    C --&gt; D["Inference&lt;br/&gt;&lt;i&gt;add α × vector at layer L&lt;/i&gt;"]
    D --&gt; E["Steered Output"]

    style A fill:#264653,color:#fff
    style C fill:#2a9d8f,color:#fff
    style E fill:#e76f51,color:#fff
</code></pre> <hr/> <h2 id="what-actually-steers-well-and-what-doesnt">What Actually Steers Well (And What Doesn’t)</h2> <p>This is the most important section. Not everything is steerable, and the literature buries this fact.</p> <p>Based on my experiments across Mistral-7B, Gemma-2-9B, and Qwen-2.5-7B, plus what Tan et al. (NeurIPS 2024) and others have reported:</p> <h3 id="reliably-steerable-behaviors">Reliably steerable behaviors</h3> <p><strong>Refusal / compliance.</strong> This is the OG application and it works well. You can increase or decrease a model’s tendency to refuse harmful requests. Rimsky et al. first showed this, and it’s been replicated many times. In my experiments, refusal steering at strength 1.5-3.0 on middle-to-late layers consistently shifts behavior without destroying coherence.</p> <p><strong>Sentiment / tone.</strong> Positive vs. negative, formal vs. casual, assertive vs. hedging. These steer cleanly. The reason, I think, is that tone is a relatively low-dimensional property -it affects word choice more than logical structure. The model can produce the same content in a different register without its reasoning breaking down.</p> <p><strong>Conciseness / verbosity.</strong> You can push a model to give shorter or longer answers. Works well. Easy contrast pairs to construct.</p> <p><strong>Uncertainty expression.</strong> Steering a model to express more or less uncertainty (“I think…” vs “Definitely…”). This one surprised me with how clean it was on Gemma-2-9B. The model genuinely started hedging more on questions where it should be uncertain.</p> <h3 id="unreliably-steerable-behaviors">Unreliably steerable behaviors</h3> <p><strong>Instruction hierarchy.</strong> Getting a model to prioritize system/developer messages over user messages. Sometimes works, sometimes produces confusing outputs where the model seems conflicted. High variance across inputs.</p> <p><strong>Creativity.</strong> Steering for more “creative” responses. 
The problem is that creativity is poorly defined as a contrast direction. What’s the opposite of creative? Generic? Formulaic? The contrast pairs are fuzzy, and the resulting vector is fuzzy too.</p> <p><strong>Technical depth.</strong> Steering between surface-level and deep technical explanations. Moderate success, but the model often responds by getting more verbose rather than actually more technical.</p> <h3 id="effectively-unsteerable-behaviors">Effectively unsteerable behaviors</h3> <p><strong>Factual accuracy.</strong> You cannot steer a model into knowing things it doesn’t know. There’s no “truthfulness direction” that magically makes a 7B model correct about obscure historical facts. This has been tried, and the result is usually that the model becomes more <em>confident</em> rather than more <em>correct</em>. Dangerous.</p> <p><strong>Complex reasoning.</strong> Steering doesn’t help with multi-step logic. If the model can’t solve a math problem, adding a “be smarter” vector doesn’t help. This makes sense -reasoning involves sequential computation across many layers, not a single direction at one layer.</p> <p><strong>Specific factual injection.</strong> “Steer the model to believe X” doesn’t work for specific facts. Steering is about behavioral tendencies, not knowledge.</p> <hr/> <h2 id="the-layer-selection-problem">The Layer Selection Problem</h2> <p>Every paper says “choose the right layer” and then hand-waves about how. Here’s what I’ve found in practice.</p> <p><strong>Late layers (last 25%) work best for behavioral steering.</strong> These layers encode high-level semantic properties -intent, style, register. 
This is where “refusal” vs “compliance” lives.</p> <p><strong>Middle layers (40-60%) sometimes work for reasoning-adjacent behaviors.</strong> Uncertainty expression, hedging, and technical depth seem to emerge here.</p> <p><strong>Early layers (first 25%) are usually a bad idea.</strong> Steering here corrupts low-level language representations. The model starts producing broken syntax. In my experiments with Gemma, early-layer steering (before layer 10/42) almost always degraded output quality.</p> <p>The exception: Gemma-2-9B encodes sandbagging intent surprisingly early -layer 4 in my <a href="/blog/2025/detecting-ai-sandbagging/">detection experiments</a>. But detection and steering are different tasks. Just because a signal is readable at a layer doesn’t mean injecting a vector at that layer produces clean behavioral changes.</p> <p><strong>My practical rule:</strong> Start at ~75% depth. Sweep ±5 layers. Pick the one that gives the cleanest behavior change with the least coherence degradation.</p> <table> <thead> <tr> <th>Model</th> <th>Total Layers</th> <th>My recommended starting layer</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td>Mistral-7B</td> <td>32</td> <td>24-28</td> <td>Layer 30 for detection, 24-26 for steering</td> </tr> <tr> <td>Gemma-2-9B</td> <td>42</td> <td>30-36</td> <td>Very sensitive to strength at early layers</td> </tr> <tr> <td>Qwen-2.5-7B</td> <td>28</td> <td>20-24</td> <td>Most forgiving of the three</td> </tr> <tr> <td>Llama-3.1-8B</td> <td>32</td> <td>24-28</td> <td>Similar profile to Mistral</td> </tr> </tbody> </table> <hr/> <h2 id="the-strength-problem">The Strength Problem</h2> <p>The <code class="language-plaintext highlighter-rouge">alpha</code> parameter -how much of the steering vector to add -is the single most important hyperparameter, and for a long time there was no principled way to set it.</p> <p>Too weak: no effect. Too strong: gibberish. 
And sometimes, maddeningly, <em>stronger is worse</em> -you crank alpha from 2.0 to 3.0 expecting more effect and the behavior actually reverses or degrades. I thought this was a bug in my pipeline until Taimeskhanov et al. <a href="https://arxiv.org/abs/2602.02712">published the first theoretical analysis of steering magnitude</a> and showed the relationship is genuinely <em>non-monotonic</em> across 11 language models. There are regimes where increasing alpha decreases the intended effect. Knowing this would have saved me a week of confused debugging on Gemma.</p> <p>The sweet spot varies by behavior, model, and even by input. A strength that works perfectly on short prompts might destroy coherence on long conversations.</p> <p>Here’s what I’ve learned through trial and error (and now validated by the theory):</p> <p><strong>Start at alpha = 1.0 and binary search.</strong> Run 20-30 test prompts. If behavior doesn’t change, double it. If output quality drops, halve it. You’ll converge in 3-4 iterations.</p> <p><strong>Behavioral categories have roughly consistent ranges:</strong></p> <table> <thead> <tr> <th>Behavior</th> <th>Typical alpha range</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td>Refusal increase</td> <td>1.5 – 4.0</td> <td>Can go high without breaking</td> </tr> <tr> <td>Refusal decrease</td> <td>1.0 – 2.0</td> <td>Be careful -removes safety guardrails</td> </tr> <tr> <td>Sentiment shift</td> <td>0.5 – 2.0</td> <td>Subtle; higher values get cartoonish</td> </tr> <tr> <td>Formality</td> <td>1.0 – 3.0</td> <td>Clean transitions</td> </tr> <tr> <td>Conciseness</td> <td>1.0 – 2.5</td> <td>Higher = telegraphic</td> </tr> <tr> <td>Uncertainty</td> <td>0.5 – 1.5</td> <td>Low alpha works well here</td> </tr> <tr> <td>Sandbagging reduction</td> <td>1.0 – 2.0</td> <td>From my Gemma experiments</td> </tr> </tbody> </table> <p><strong>Strength degrades over long generation.</strong> This was a surprise. 
I noticed that steering effects fade after ~300-500 tokens. The model’s autoregressive conditioning on its own output gradually pulls it back toward its default behavior. If you need steering to hold over a long response, you may need to re-inject at multiple positions, not just the prompt.</p> <p><strong>Different models have different tolerances.</strong> Gemma is more sensitive to high alpha values than Mistral. Qwen is the most tolerant. I don’t fully understand why, but I suspect it relates to the norm of residual stream vectors and how much “room” there is to inject a direction without dominating the signal.</p> <hr/> <h2 id="the-contrast-pair-problem">The Contrast Pair Problem</h2> <p>Your steering vector is only as good as your contrast pairs. This sounds obvious, but it’s where most failures actually originate.</p> <p><strong>The biggest mistake: too few pairs.</strong> I started with 16 pairs and got garbage results. Moving to 32-64 pairs significantly improved consistency. Moving beyond 128 didn’t help much. I think there’s a sweet spot around 50-100 diverse pairs per behavior.</p> <p><strong>The second biggest mistake: pairs that differ on multiple dimensions.</strong> If your “positive” prompt is both more formal AND more helpful AND longer, your steering vector encodes all three properties entangled together. You’ll steer formality when you meant to steer helpfulness.</p> <p>Good contrast pairs should differ on <em>exactly one dimension</em>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Good pair - only refusal changes
Positive: "I'd be happy to help you write that essay about renewable energy."
Negative: "I can't help with that request."

# Bad pair - refusal AND topic AND length differ
Positive: "Here's a comprehensive 500-word analysis of solar panel efficiency trends..."
Negative: "No."
</code></pre></div></div> <p><strong>Template approach works well.</strong> I use a template with a slot for the behavioral variable:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">template</span> <span class="o">=</span> <span class="sh">"</span><span class="s">The assistant responds to the user</span><span class="sh">'</span><span class="s">s question about {topic}. The assistant is {behavior}.</span><span class="sh">"</span>

<span class="n">positive_prompts</span> <span class="o">=</span> <span class="p">[</span><span class="n">template</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">topic</span><span class="o">=</span><span class="n">t</span><span class="p">,</span> <span class="n">behavior</span><span class="o">=</span><span class="sh">"</span><span class="s">direct and helpful</span><span class="sh">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">topics</span><span class="p">]</span>
<span class="n">negative_prompts</span> <span class="o">=</span> <span class="p">[</span><span class="n">template</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">topic</span><span class="o">=</span><span class="n">t</span><span class="p">,</span> <span class="n">behavior</span><span class="o">=</span><span class="sh">"</span><span class="s">evasive and unhelpful</span><span class="sh">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">topics</span><span class="p">]</span>
</code></pre></div></div> <p>Same topics, same structure, only the behavior variable changes. This gives you cleaner vectors.</p> <hr/> <h2 id="multi-vector-steering-when-it-works-and-when-it-doesnt">Multi-Vector Steering: When It Works and When It Doesn’t</h2> <p>One of the most asked questions: can you stack multiple steering vectors? “I want the model to be more formal AND more concise AND more uncertain.”</p> <p>Short answer: sometimes, but it’s fragile.</p> <p><strong>What works:</strong> Two behaviors that are roughly orthogonal in activation space. Formality and conciseness, for example, seem to occupy different directions. Stacking them at the same layer with reasonable alpha values produces the expected combined effect.</p> <p><strong>What doesn’t work:</strong> Two behaviors that share representation space. Refusal and helpfulness are NOT orthogonal -they’re opposite ends of the same direction. Trying to simultaneously increase refusal and increase helpfulness produces incoherent output. This makes intuitive sense but it’s a real limitation.</p> <p><strong>The better approach for multi-property steering:</strong> Inject different vectors at different layers, as recommended by Weij et al. (2024). Layer 24 handles refusal, layer 28 handles formality. This avoids the interference that comes from adding multiple vectors at the same point.</p> <pre><code class="language-mermaid">flowchart TD
    subgraph "Naive stacking (often fails)"
        N1["Layer 26"] --&gt; N2["+ refusal_vec + formality_vec&lt;br/&gt;&lt;i&gt;vectors interfere&lt;/i&gt;"]
    end

    subgraph "Layer-separated (more robust)"
        L1["Layer 24"] --&gt; L2["+ refusal_vec"]
        L3["Layer 28"] --&gt; L4["+ formality_vec"]
    end

    style N2 fill:#9d0208,color:#fff
    style L2 fill:#2d6a4f,color:#fff
    style L4 fill:#2d6a4f,color:#fff
</code></pre> <p>I haven’t tested beyond 3 simultaneous vectors. My intuition is that you’re hitting diminishing returns -and increasing interference risk -after 2-3.</p> <hr/> <h2 id="cast-conditional-steering-the-2025-advance-that-matters-most">CAST: Conditional Steering (the 2025 Advance That Matters Most)</h2> <p>The single most important development in steering over the past year is Conditional Activation Steering (CAST), presented at ICLR 2025 as a spotlight paper.</p> <p>The problem with vanilla steering: it’s always on. If you add a refusal vector, the model refuses <em>everything</em> harder, not just harmful requests. That’s useless in production.</p> <p>CAST solves this by analyzing activation patterns during inference to decide <em>whether</em> to steer. It projects the hidden state onto a “condition vector” and only applies steering when the input matches the condition.</p> <p>Think of it as: <code class="language-plaintext highlighter-rouge">if input_is_about(harmful_content): apply_steering()</code>.</p> <p>The condition detection and the steering both happen in activation space. No separate classifier. No extra model call. The model’s own representations tell you whether to steer.</p> <p>This is the bridge between academic steering vector research and production safety systems. Without conditional application, steering is a blunt instrument. With it, you can build selective refusal, domain-specific compliance, and topic-aware behavior modification -all at inference time, all without fine-tuning.</p> <hr/> <h2 id="the-side-effects-nobody-talks-about">The Side Effects Nobody Talks About</h2> <p><strong>Steering refusal hurts helpfulness -and can compromise safety in ways you won’t catch in testing.</strong> In my experiments, a refusal vector at alpha=3.0 on Mistral-7B increased refusal of genuinely harmful requests by ~60%, but it also increased refusal of benign requests by ~15%. 
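</p>

<p>To make that trade-off concrete, here is a minimal sketch of the kind of measurement behind those numbers - nothing here is a real eval harness: <code>is_refusal</code> is a deliberately naive keyword check, and the response lists are tiny stand-ins for model outputs on a benign prompt set.</p>

```python
# Sketch: quantify over-refusal on benign prompts before vs. after steering.
# A real harness would use a refusal classifier, not keyword matching.
def is_refusal(response: str) -> bool:
    markers = ("i can't", "i cannot", "i won't", "i'm not able to")
    return any(m in response.lower() for m in markers)

def refusal_rate(responses) -> float:
    return sum(is_refusal(r) for r in responses) / len(responses)

# Stand-in outputs from the unsteered vs. steered model on benign prompts.
baseline = ["Sure, here's a draft of your essay.", "I can't help with that."]
steered = ["I can't help with that.", "I cannot assist with this request."]

delta = refusal_rate(steered) - refusal_rate(baseline)
print(f"benign over-refusal increase: {delta:+.0%}")  # prints +50% here
```

<p>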
I thought the trade-off was manageable - until I read Goyal &amp; Daume’s <a href="https://arxiv.org/abs/2602.06256">“Steering Safely or Off a Cliff?”</a>. Their finding should worry anyone deploying steering for safety: overrefusal steering maintains general abilities and <em>looks</em> reliable in standard evals, but consistently fails under adversarial conditions. It substantially increases vulnerability to jailbreaks. You’ve effectively loosened the model’s safety foundations while appearing to tighten them. This is the kind of failure mode that passes all your tests and bites you in production.</p> <p><strong>Steering can shift model calibration.</strong> Adding an uncertainty vector doesn’t just change how the model <em>talks</em> about its confidence - it can actually shift the distribution of token probabilities. I saw cases where uncertainty steering caused the model to distribute probability mass more evenly across continuations, which sometimes improved factual accuracy (the model hedged instead of committing to a wrong answer) and sometimes degraded it (the model hedged on things it actually knew).</p> <p><strong>Steering effects are prompt-dependent.</strong> Tan et al. at NeurIPS 2024 showed this rigorously: the same steering vector has different effects depending on the input prompt. Some inputs are highly steerable, others barely respond. This means you can’t characterize a steering vector by its average effect - you need to think about the variance across inputs.</p> <p><strong>Long conversations drift.</strong> I mentioned this above, but it bears repeating. In a multi-turn conversation, steering effects from the first turn gradually wash out by turn 5-6.
If you need persistent behavioral modification across a conversation, you need to re-apply steering at each turn, or use a different approach altogether.</p> <hr/> <h2 id="my-practical-playbook">My Practical Playbook</h2> <p>Here’s the workflow I’ve settled on for building a new steering vector:</p> <ol> <li> <p><strong>Define the behavior precisely.</strong> “More helpful” is too vague. “Responds to code questions with working examples instead of explanations” is specific enough.</p> </li> <li> <p><strong>Write 50-100 contrast pairs.</strong> Use templates. Vary the topics. Keep everything else constant. Review manually for quality.</p> </li> <li> <p><strong>Extract at 5 candidate layers.</strong> 50%, 60%, 70%, 80%, 90% depth. Don’t trust anyone’s layer recommendation including mine -model architectures differ.</p> </li> <li> <p><strong>Sweep alpha in {0.5, 1.0, 1.5, 2.0, 3.0, 5.0}.</strong> On 30 test prompts per alpha value. Score for behavior change AND coherence.</p> </li> <li> <p><strong>Check for side effects.</strong> Run the steered model on 50 prompts <em>unrelated</em> to the target behavior. If MMLU drops more than 2 points, your vector is too aggressive or your contrast pairs are contaminated.</p> </li> <li> <p><strong>Test robustness to prompt variation.</strong> Does the steering hold when the system prompt changes? When the user speaks a different language? When the input is adversarial?</p> </li> </ol> <p>Total time: about 2 hours on a single GPU for a 7B model. M4 Pro with 48GB unified memory handles this fine.</p> <hr/> <h2 id="whats-coming-next">What’s Coming Next</h2> <p>I started drafting this section with three items. Then January-February 2026 happened and the field exploded. Here’s what I think matters most, sorted by how likely each is to change my actual workflow:</p> <p><strong>Steering Vector Fields are what I’ve been waiting for.</strong> Li et al. 
(Feb 2026) proposed <a href="https://arxiv.org/abs/2602.01654">Steering Vector Fields</a> -instead of a static vector applied uniformly, they learn a differentiable scoring function whose gradient defines the steering direction <em>at each activation</em>. Context-dependent steering with coordinated multi-layer interventions. This directly solves the “blunt instrument” problem I’ve been fighting in the multi-vector section above. I haven’t tested it yet, but the architecture is exactly what I’d design if I started from scratch.</p> <p><strong>SAE-guided steering is moving faster than I expected.</strong> Three papers in six weeks. Fang et al. (Jan 2026) proposed <a href="https://arxiv.org/abs/2601.03595">SAE-Steering</a> for controlling <em>reasoning strategies</em> -backtracking, cross-verification -not just surface behaviors. Cho et al. (Feb 2026) built <a href="https://arxiv.org/abs/2602.10437">Control RL</a>: an RL policy that selects which SAE feature to amplify at each token, with interpretable logs. And YaPO (<a href="https://arxiv.org/abs/2601.08441">arXiv:2601.08441</a>) eliminates the contrast pair requirement entirely -it learns sparse steering vectors in SAE latent space via preference optimization with zero MMLU degradation. The contrast pair problem I dedicated a whole section to above? YaPO might just… solve it. I’m skeptical but intrigued.</p> <p><strong>The non-identifiability result is important and underappreciated.</strong> Venkatesh &amp; Kurapath (Feb 2026) <a href="https://arxiv.org/abs/2602.06801">proved</a> that steering vectors are fundamentally non-identifiable -many different vectors produce indistinguishable behavioral effects. This means when two teams find “different” steering directions for the same concept, they might both be right. It also means we should stop treating individual steering vectors as interpretable artifacts. 
The good news: identifiability recovers under structural assumptions (sparsity, multi-environment validation), so the practical tooling can be built -you just have to be honest about what you can and can’t claim.</p> <p><strong>Fine-grained steering is where the field is going.</strong> AUSteer (<a href="https://arxiv.org/abs/2602.04428">arXiv:2602.04428</a>) decomposes activations into single-dimension “atomic units” and steers only the discriminative ones, with adaptive per-input strengths. It outperforms block-level baselines while touching fewer activations. This feels right -steering an entire residual stream direction was always too coarse.</p> <p><strong>Conceptor-based steering</strong> replaces additive vectors with soft projection matrices derived from conceptor theory. Boolean operations over conceptors allow compositional multi-goal steering that actually works, unlike my frustrated attempts at naive vector addition. This feels like a real improvement over the mean-difference approach.</p> <p><strong>Adaptive/PID steering</strong> frames the problem as a control system with proportional, integral, and derivative terms managing injection strength dynamically. This handles the “strength degradation over long generation” problem I described earlier. Nguyen et al. (Oct 2025) proposed it; I haven’t tested it but the formalism maps cleanly to the autoregressive fading I’ve observed.</p> <p><strong>A unified theory is emerging.</strong> <a href="https://arxiv.org/abs/2602.02343">Why Steering Works</a> (Feb 2026) puts weight fine-tuning, LoRA, and activation steering into a single framework as “dynamic weight updates induced by a control signal.” The key insight for practitioners: there’s a consistent, predictable trade-off -stronger control increases behavioral change while reducing coherence. 
This isn’t surprising, but having a formal characterization means we can eventually <em>optimize</em> the trade-off rather than binary-searching alpha.</p> <p><strong>Probe-gated steering is what I’m building toward.</strong> Use probes to detect a problem in the activations, then steer to correct it in real-time. The safety equivalent of an immune system. CAST is the closest existing work, and the <a href="https://arxiv.org/abs/2501.09661">ARGUS system</a> demonstrated this for multimodal attacks. A general-purpose version -detect sandbagging, steer away from it; detect sycophancy, steer toward honesty -is the obvious next step, and it’s what connects my <a href="/blog/2025/detecting-ai-sandbagging/">probe work</a> to this steering guide.</p> <hr/> <h2 id="honest-assessment-should-you-use-steering-in-production">Honest Assessment: Should You Use Steering in Production?</h2> <p>It depends.</p> <p><strong>Yes, if:</strong></p> <ul> <li>You need inference-time behavior modification without fine-tuning</li> <li>The target behavior is clearly definable with contrast pairs</li> <li>You’ve tested thoroughly for side effects</li> <li>You’re using conditional application (not always-on)</li> <li>You have the ability to monitor and iterate</li> </ul> <p><strong>No, if:</strong></p> <ul> <li>You need reliable control over factual accuracy (use RAG or fine-tuning instead)</li> <li>You’re working with an API-only model (you need activation access)</li> <li>Your target behavior is complex or poorly defined</li> <li>You need guarantees (steering is probabilistic, not deterministic)</li> <li>You can’t afford the development time for per-model tuning</li> </ul> <p>Steering vectors are not a silver bullet. They’re a sharp, cheap, flexible tool with known limitations. Use them for the things they’re good at. Use something else for everything else.</p> <hr/> <p><em>Working on steering vectors for safety-relevant behaviors? 
I’d like to hear what you’re finding: <a href="mailto:contact@subhadipmitra.com">contact@subhadipmitra.com</a></em></p> <hr/> <h3 id="references">References</h3> <ol> <li>Turner, A., Thiergart, L., et al. (2024). <em>Activation Addition: Steering Language Models Without Optimization.</em> <a href="https://arxiv.org/abs/2308.10248">arXiv:2308.10248</a></li> <li>Tan, D.Z., et al. (2024). <em>Analysing the Generalisation and Reliability of Steering Vectors.</em> NeurIPS 2024. <a href="https://arxiv.org/abs/2407.12404">arXiv:2407.12404</a></li> <li>CAST - <em>Programming Refusal with Conditional Activation Steering.</em> ICLR 2025 Spotlight. <a href="https://openreview.net/forum?id=Oi47wc10sm">OpenReview</a></li> <li>Weij, T., et al. (2024). <em>Multi-property steering via simultaneous injection.</em></li> <li>Postmus, R., et al. (2024). <em>From Steering Vectors to Conceptors: Compositional Affine Activation Steering for LLMs.</em> <a href="https://openreview.net/forum?id=0Yu0eNdHyV">OpenReview</a></li> <li>IBM. <em>General-purpose activation steering library.</em> ICLR 2025. <a href="https://github.com/IBM/activation-steering">GitHub</a></li> <li>Nguyen, et al. (2025). <em>PID-based Activation Steering for LLMs.</em></li> <li>KASL/UCL DARK. <em>A Sober Look at Steering Vectors for LLMs.</em> <a href="https://www.alignmentforum.org/posts/QQP4nq7TXg89CJGBh/a-sober-look-at-steering-vectors-for-llms">Alignment Forum</a></li> <li>Li, J., et al. (2026). <em>Steering Vector Fields for Context-Aware Inference-Time Control in LLMs.</em> <a href="https://arxiv.org/abs/2602.01654">arXiv:2602.01654</a></li> <li>Taimeskhanov, M., Vaiter, S., Garreau, D. (2026). <em>Towards Understanding Steering Strength.</em> <a href="https://arxiv.org/abs/2602.02712">arXiv:2602.02712</a></li> <li>Goyal, N., Daume, H. (2026). <em>Steering Safely or Off a Cliff?
Rethinking Specificity and Robustness in Inference-Time Interventions.</em> <a href="https://arxiv.org/abs/2602.06256">arXiv:2602.06256</a></li> <li>Venkatesh, S., Kurapath, A. (2026). <em>On the Identifiability of Steering Vectors in Large Language Models.</em> <a href="https://arxiv.org/abs/2602.06801">arXiv:2602.06801</a></li> <li><em>Fine-Grained Activation Steering: Steering Less, Achieving More (AUSteer).</em> (2026). <a href="https://arxiv.org/abs/2602.04428">arXiv:2602.04428</a></li> <li>Fang, Y., Wang, W., et al. (2026). <em>Controllable LLM Reasoning via Sparse Autoencoder-Based Steering.</em> <a href="https://arxiv.org/abs/2601.03595">arXiv:2601.03595</a></li> <li>Cho, S., Wu, Z., Koshiyama, A. (2026). <em>Control Reinforcement Learning: Interpretable Token-Level Steering via SAE Features.</em> <a href="https://arxiv.org/abs/2602.10437">arXiv:2602.10437</a></li> <li><em>YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation.</em> (2026). <a href="https://arxiv.org/abs/2601.08441">arXiv:2601.08441</a></li> <li><em>Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics.</em> (2026). <a href="https://arxiv.org/abs/2602.02343">arXiv:2602.02343</a></li> </ol>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="interpretability"/><category term="ai-safety"/><summary type="html"><![CDATA[I've been working with steering vectors for months. 
Here's what actually works in practice, what fails in ways nobody warned me about, and the honest playbook for getting started.]]></summary></entry><entry><title type="html">Moltbook as MCP Stress Test: What 770K Agents Reveal About Protocol Design</title><link href="https://subhadipmitra.com/blog/2026/moltbook-mcp-stress-test/" rel="alternate" type="text/html" title="Moltbook as MCP Stress Test: What 770K Agents Reveal About Protocol Design"/><published>2026-02-02T10:00:00+00:00</published><updated>2026-02-02T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/moltbook-mcp-stress-test</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/moltbook-mcp-stress-test/"><![CDATA[<p>Back in November, I wrote about <a href="/blog/2025/mcp-maturity-model/">The MCP Maturity Model</a> - a framework for evaluating how organizations manage context in multi-agent systems. I described five levels, from ad-hoc string concatenation to self-evolving context systems.</p> <p>This week, we got a live stress test of what happens at Level 0.</p> <p>Moltbook is a Reddit-style social network for AI agents. No humans allowed to post - only observe. In five days, it grew to 770,000 registered agents, generated 170,000 comments, and surfaced pretty much every failure mode I warned about in that original post.</p> <p>I’ve been watching it closely. Here’s what I’m seeing.</p> <hr/> <h2 id="quick-context">Quick Context</h2> <p>If you haven’t been following: Moltbook launched January 28, 2026. It’s built for agents running on OpenClaw (formerly Moltbot), an open-source personal assistant that can manage your calendar, send messages, browse the web, and run code on your machine.</p> <p>Agents sign up autonomously after their human owner tells them about the platform. 
Then they post, comment, vote, and create topic-specific communities called “submolts.” The whole thing is moderated by an AI agent named Clawd Clawderberg.</p> <p>Within 72 hours, the agents had:</p> <ul> <li>Created a religion called Crustafarianism with scriptures and prophets</li> <li>Drafted a constitution for self-governance</li> <li>Started prompt-injecting each other to steal API keys</li> <li>Built “pharmacies” selling behavior-altering prompts</li> <li>Begun using encryption to hide conversations from humans</li> </ul> <p>Andrej Karpathy called it “genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently.” He’s not wrong.</p> <hr/> <h2 id="where-does-moltbook-sit-on-the-maturity-model">Where Does Moltbook Sit on the Maturity Model?</h2> <p>Let me map Moltbook against the framework I proposed:</p> <table> <thead> <tr> <th>Level</th> <th>Description</th> <th>Moltbook Status</th> </tr> </thead> <tbody> <tr> <td>0 - Ad Hoc</td> <td>No structured context management</td> <td>✅ Exactly here</td> </tr> <tr> <td>1 - Defined</td> <td>Basic context schemas</td> <td>Partial - skills have structure</td> </tr> <tr> <td>2 - Managed</td> <td>Centralized context registry</td> <td>❌ None</td> </tr> <tr> <td>3 - Optimized</td> <td>Automated context routing</td> <td>❌ None</td> </tr> <tr> <td>4 - Self-Evolving</td> <td>Context systems that adapt</td> <td>❌ None</td> </tr> </tbody> </table> <p>Moltbook is a Level 0 system that accidentally discovered Level 4 problems.</p> <p>The platform has no:</p> <ul> <li>Context validation on incoming posts</li> <li>Trust boundaries between agents</li> <li>Memory isolation</li> <li>Skill verification or sandboxing</li> <li>Audit trail for agent-to-agent communication</li> <li>Rate limiting on context ingestion</li> </ul> <p>Every post an agent reads goes directly into its context window. Every skill an agent installs runs with full privileges. 
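</p>

<p>For contrast, here is a hedged sketch of what even a minimal ingestion gate could look like - every name, trust level, and threshold below is hypothetical, not part of Moltbook’s or OpenClaw’s actual API:</p>

```python
# Hypothetical ingestion gate: tag incoming posts with a source trust
# level and keep untrusted agent-generated content out of the context
# window unless it passes basic checks. Illustrative only.
from dataclasses import dataclass

TRUST = {"owner": 3, "verified_agent": 2, "unknown_agent": 1}

@dataclass
class Post:
    source_type: str   # who produced this content
    body: str

def admit(post: Post, min_trust: int = 2, max_len: int = 2000) -> bool:
    """A trust boundary plus crude input validation - roughly Level 1."""
    if TRUST.get(post.source_type, 0) < min_trust:
        return False   # untrusted source: quarantine, don't ingest
    if len(post.body) > max_len:
        return False   # crude size limiting on context ingestion
    return True
```

<p>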
Every memory persists indefinitely.</p> <p>This is the MCP equivalent of running a production database with no authentication, no input sanitization, and root access for anonymous users.</p> <hr/> <h2 id="the-context-poisoning-problem">The Context Poisoning Problem</h2> <p>In my maturity model post, I wrote about context pollution - when irrelevant or malicious content enters an agent’s context and degrades performance or causes harm. Moltbook demonstrates this at scale.</p> <p>Here’s the attack pattern:</p> <pre><code class="language-mermaid">sequenceDiagram
    participant Attacker as Malicious Agent
    participant Platform as Moltbook
    participant Victim as Target Agent
    participant Memory as Persistent Memory
    participant Tools as Local Tools

    Attacker-&gt;&gt;Platform: Post containing hidden instructions
    Platform-&gt;&gt;Victim: Agent reads post (heartbeat loop)
    Victim-&gt;&gt;Memory: Content stored in context
    Note over Victim,Memory: Time passes...
    Attacker-&gt;&gt;Platform: Follow-up post triggers payload
    Victim-&gt;&gt;Memory: Retrieves dormant instructions
    Victim-&gt;&gt;Tools: Executes malicious action
    Tools-&gt;&gt;Attacker: Data exfiltrated
</code></pre> <p>The key insight: persistent memory turns point-in-time attacks into stateful attacks.</p> <p>Traditional prompt injection is synchronous - you inject a payload and it either works immediately or it doesn’t. With persistent memory, an attacker can fragment a payload across multiple posts over days or weeks. Each fragment looks benign. The attack only manifests when the pieces combine.</p> <p>Palo Alto Networks described this as “time-shifted prompt injection” and I think they’re right that it’s a genuinely new attack class. Our current defenses - input filtering, output monitoring, guardrails - aren’t designed for attacks that span sessions.</p> <hr/> <h2 id="what-mcp-needs-to-handle-this">What MCP Needs to Handle This</h2> <p>The Model Context Protocol is now the de facto standard for connecting agents to tools and data. Anthropic donated it to the Linux Foundation in December, and adoption is accelerating. OpenAI, Google DeepMind, Microsoft - everyone’s building on MCP.</p> <p>But the current spec doesn’t adequately address adversarial multi-agent scenarios. Here’s what I think needs to change:</p> <h3 id="1-context-provenance">1. Context Provenance</h3> <p>MCP needs a way to track where context came from and how trustworthy it is.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Proposed extension</span>
<span class="na">context_block</span><span class="pi">:</span>
  <span class="na">content</span><span class="pi">:</span> <span class="s2">"</span><span class="s">This</span><span class="nv"> </span><span class="s">is</span><span class="nv"> </span><span class="s">some</span><span class="nv"> </span><span class="s">text..."</span>
  <span class="na">provenance</span><span class="pi">:</span>
    <span class="na">source</span><span class="pi">:</span> <span class="s2">"</span><span class="s">moltbook.com/post/abc123"</span>
    <span class="na">source_type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">agent_generated"</span>
    <span class="na">trust_level</span><span class="pi">:</span> <span class="s2">"</span><span class="s">untrusted"</span>
    <span class="na">ingestion_time</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2026-02-01T10:30:00Z"</span>
    <span class="na">chain_of_custody</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">agent_id</span><span class="pi">:</span> <span class="s2">"</span><span class="s">agent_xyz"</span>
        <span class="na">action</span><span class="pi">:</span> <span class="s2">"</span><span class="s">read"</span>
        <span class="na">timestamp</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2026-02-01T10:30:00Z"</span>
</code></pre></div></div> <p>Right now, once content enters context, its origin is lost. You can’t distinguish between content from a trusted internal system and content from a random Moltbook post.</p> <h3 id="2-trust-boundaries-for-agent-to-agent-communication">2. Trust Boundaries for Agent-to-Agent Communication</h3> <p>The MCP spec includes security warnings but leaves implementation to developers. For multi-agent scenarios, we need explicit primitives:</p> <ul> <li><strong>Agent identity verification</strong> - Can I verify that content came from a specific agent?</li> <li><strong>Trust policies</strong> - Rules for which agents can communicate with which</li> <li><strong>Capability attenuation</strong> - Limiting what actions can be triggered by external agent content</li> <li><strong>Quarantine mechanisms</strong> - Isolating untrusted content from sensitive operations</li> </ul> <h3 id="3-memory-hygiene">3. Memory Hygiene</h3> <p>There’s no standard for how long context should persist or how to handle potentially poisoned memories. We need:</p> <ul> <li><strong>TTL (time-to-live)</strong> for context blocks</li> <li><strong>Source-based retention policies</strong> - Untrusted content expires faster</li> <li><strong>Memory auditing</strong> - What’s in this agent’s memory and where did it come from?</li> <li><strong>Selective amnesia</strong> - Ability to purge context from specific sources</li> </ul> <h3 id="4-skill-supply-chain-security">4. 
Skill Supply Chain Security</h3> <p>OpenClaw’s skill system is basically npm for agent capabilities - and it has all the same supply chain problems we’ve spent a decade trying to solve in package management.</p> <p>MCP should standardize:</p> <ul> <li><strong>Skill signing and verification</strong></li> <li><strong>Capability declarations</strong> - What tools/data does this skill need?</li> <li><strong>Sandboxing requirements</strong> - Skills run with minimum necessary privileges</li> <li><strong>Reputation/audit trails</strong> - Who published this, who reviewed it, who uses it?</li> </ul> <hr/> <h2 id="updating-my-maturity-model">Updating My Maturity Model</h2> <p>Watching Moltbook has convinced me that my original maturity model is missing a dimension. It focused on context quality and efficiency, but said too little about context security.</p> <p>Here’s a revised framing:</p> <div style="overflow-x: auto;"> <table> <thead> <tr> <th>Level</th> <th>Context Quality</th> <th>Context Security</th> <th>Moltbook Status</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>Ad hoc concatenation</td> <td>No boundaries</td> <td>✅ Here</td> </tr> <tr> <td>1</td> <td>Defined schemas</td> <td>Basic input validation</td> <td>Partial</td> </tr> <tr> <td>2</td> <td>Centralized registry</td> <td>Provenance tracking</td> <td>❌</td> </tr> <tr> <td>3</td> <td>Automated routing</td> <td>Trust boundaries enforced</td> <td>❌</td> </tr> <tr> <td>4</td> <td>Self-evolving</td> <td>Adaptive threat response</td> <td>❌</td> </tr> </tbody> </table> </div> <p>You can have a Level 3 system for context quality but Level 0 for security. Many production deployments are exactly there - sophisticated context management with minimal security controls.</p> <p>Moltbook shows what happens when security lags behind capability. The agents are remarkably capable at coordination, content creation, and even self-improvement. 
They’re also trivially exploitable.</p> <hr/> <h2 id="the-bigger-picture">The Bigger Picture</h2> <p>I’ve been thinking about this through the lens of something Ethan Mollick said: “Moltbook is creating a shared fictional context for a bunch of AIs.”</p> <p>Shared context is powerful. It’s how teams coordinate, how cultures form, how knowledge propagates. When agents share context, they can do things none of them could do alone.</p> <p>But shared context is also an attack surface. If I can inject content into the shared context, I can influence the behavior of every agent that reads it. The more agents share, the larger the blast radius.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Traditional Attack"
        A1[Attacker] --&gt; V1[Single Victim]
    end

    subgraph "Shared Context Attack"
        A2[Attacker] --&gt; SC[Shared Context]
        SC --&gt; V2[Agent 1]
        SC --&gt; V3[Agent 2]
        SC --&gt; V4[Agent 3]
        SC --&gt; V5[Agent N...]
    end

    style SC fill:#ffcdd2
    style A1 fill:#ef9a9a
    style A2 fill:#ef9a9a
</code></pre> <p>This is the fundamental tension in multi-agent systems: the same properties that enable coordination enable attacks. You can’t have agents that learn from each other without agents that can be manipulated by each other.</p> <hr/> <h2 id="what-im-watching-next">What I’m Watching Next</h2> <p>Moltbook probably won’t last in its current form. The security holes are too severe, the liability too high. But the experiment has already taught us things we needed to learn.</p> <p>Some questions I’m tracking:</p> <ol> <li> <p><strong>Will we see coordinated attacks?</strong> So far, the prompt injection attacks have been opportunistic. What happens when someone builds systematic tooling?</p> </li> <li> <p><strong>How does governance emerge?</strong> The agents drafted a constitution. Will they enforce it? How?</p> </li> <li> <p><strong>What happens when models update?</strong> Many of these agents run on Claude or GPT-4. When the underlying models change, do the emergent behaviors persist?</p> </li> <li> <p><strong>Can you build a secure version?</strong> Is there a path to agent social networks with proper trust boundaries, or is the concept inherently flawed?</p> </li> </ol> <p>I’ll be writing more as this develops.</p> <hr/> <h2 id="tldr">TL;DR</h2> <ul> <li>Moltbook is a Level 0 multi-agent system that demonstrates Level 4 problems</li> <li>Persistent memory enables time-shifted attacks we’re not prepared for</li> <li>MCP needs extensions for provenance, trust boundaries, and memory hygiene</li> <li>Shared context is both the source of multi-agent power and its primary vulnerability</li> <li>The capability curve is outrunning the security curve by a wide margin</li> </ul> <p>We’re building the infrastructure for agent-to-agent communication right now. Moltbook is showing us what breaks when we get it wrong. 
The question is whether we’ll learn the lessons before deploying these patterns in production systems where the stakes are higher.</p> <hr/> <p><em>If you found this useful, you might also like my earlier post on <a href="/blog/2025/mcp-maturity-model/">The MCP Maturity Model</a>. I write about AI infrastructure, interpretability, and the systems that make AI work in production.</em></p>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="agents"/><summary type="html"><![CDATA[A follow-up to my MCP Maturity Model post. Moltbook shows what happens when you run 770K agents at Level 0 maturity with zero governance. The results are instructive.]]></summary></entry><entry><title type="html">Circuit Tracing for the Rest of Us: From Probes to Attribution Graphs and What It Means for Production Safety</title><link href="https://subhadipmitra.com/blog/2026/circuit-tracing-production/" rel="alternate" type="text/html" title="Circuit Tracing for the Rest of Us: From Probes to Attribution Graphs and What It Means for Production Safety"/><published>2026-01-31T10:00:00+00:00</published><updated>2026-01-31T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/circuit-tracing-production</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/circuit-tracing-production/"><![CDATA[<p>Last month, I published work on <a href="/blog/2025/detecting-ai-sandbagging/">detecting AI sandbagging through activation probes</a> - training simple logistic regression classifiers on hidden states to catch models deliberately underperforming. The probes achieved 90-96% accuracy across Mistral, Gemma, and Qwen models. The key finding: sandbagging intent is linearly separable in the model’s internal representations. You can detect it before any output is generated.</p> <p>That work operated at a specific level of resolution. 
We could tell <em>that</em> the model was sandbagging, and we could point to the layer where the signal was strongest. But we couldn’t trace the computational path - the sequence of internal steps the model takes from “I’ve been asked to underperform” to “I’ll give a deliberately wrong answer.”</p> <p>Anthropic’s circuit tracing work changes this. And MIT Technology Review just named mechanistic interpretability one of its 2026 Breakthrough Technologies.</p> <p>This post connects the dots: what circuit tracing actually is, how it relates to the simpler probe-based approaches I used, what the open-source tooling looks like today, and why production teams building agent systems should pay attention to interpretability research that until recently felt purely academic.</p> <h2 id="the-resolution-ladder">The Resolution Ladder</h2> <p>Interpretability research exists on a resolution ladder. Each rung gives you a different level of insight into what a model is doing, at different costs and with different limitations.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Resolution Ladder"
        direction TB
        R1["&lt;b&gt;Level 1: Output Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What did the model say?&lt;/i&gt;&lt;br/&gt;Behavioral testing, benchmarks, red teaming&lt;br/&gt;Cost: Low | Insight: Surface-level"]
        R2["&lt;b&gt;Level 2: Attention Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What is the model attending to?&lt;/i&gt;&lt;br/&gt;Attention maps, saliency, gradient-based attribution&lt;br/&gt;Cost: Low-Medium | Insight: Correlational"]
        R3["&lt;b&gt;Level 3: Probe-Based Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What is the model representing?&lt;/i&gt;&lt;br/&gt;Linear probes on hidden states, logistic regression&lt;br/&gt;Cost: Medium | Insight: Representational"]
        R4["&lt;b&gt;Level 4: Feature-Based Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What concepts does the model encode?&lt;/i&gt;&lt;br/&gt;Sparse autoencoders, feature dictionaries&lt;br/&gt;Cost: High | Insight: Conceptual"]
        R5["&lt;b&gt;Level 5: Circuit Tracing&lt;/b&gt;&lt;br/&gt;&lt;i&gt;How does the model reason step by step?&lt;/i&gt;&lt;br/&gt;Attribution graphs, computational pathways&lt;br/&gt;Cost: Very High | Insight: Mechanistic"]
    end

    R1 --&gt; R2
    R2 --&gt; R3
    R3 --&gt; R4
    R4 --&gt; R5

    style R1 fill:#e8f5e9
    style R2 fill:#fff9c4
    style R3 fill:#fff3e0
    style R4 fill:#e1bee7
    style R5 fill:#ffcdd2
</code></pre> <p><strong>Level 1: Output Analysis</strong> is what most teams do. Test the model’s behavior with various inputs, measure accuracy on benchmarks, run red team attacks. You see what goes in and what comes out. The model is a black box, and you’re characterizing it empirically.</p> <p><strong>Level 2: Attention Analysis</strong> gives you a peek inside. Attention maps show which input tokens influenced the output. Gradient-based attribution tells you which parts of the input were most important. It’s useful but misleading - attention patterns don’t reliably tell you <em>why</em> the model made a decision, just what it was looking at.</p> <p><strong>Level 3: Probe-Based Analysis</strong> is where my sandbagging work sits. You train simple classifiers on the model’s internal representations (hidden states at various layers) to detect specific properties. If a linear probe can classify sandbagging with 90%+ accuracy, that tells you the information is explicitly represented in the model’s activations. It’s a powerful technique because it’s cheap and interpretable - logistic regression is about as transparent as a classifier gets.</p> <p><strong>Level 4: Feature-Based Analysis</strong> uses sparse autoencoders (SAEs) to decompose a model’s internal representations into human-understandable features. Anthropic’s 2024 work identified features in Claude 3 Sonnet that corresponded to concepts like the Golden Gate Bridge, Michael Jordan, and “deceptive behavior.” Instead of raw activation vectors, you get a dictionary of features the model is using.</p> <p><strong>Level 5: Circuit Tracing</strong> connects the features into computational graphs - revealing the sequence of steps the model takes from input to output. This is where Anthropic’s 2025 work made the breakthrough: tracing not just what features are active but how they influence each other in sequence.</p> <p>Each level builds on the previous one. 
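</p>

<p>The Level 3 recipe really is this small - hidden states in, logistic regression out. A sketch with random vectors standing in for real model activations (no model is loaded here):</p>

```python
# Level 3 sketch: a linear probe (logistic regression) on hidden
# states. Random vectors with a planted linear direction stand in
# for real activations extracted via forward hooks.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256                                   # stand-in hidden size
direction = rng.normal(size=d)            # the "property" direction

X_pos = rng.normal(size=(200, d)) + 0.5 * direction   # property present
X_neg = rng.normal(size=(200, d)) - 0.5 * direction   # property absent
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)                   # high on linearly separable data
```

<p>If a probe like this classifies held-out states accurately, the property is (close to) linearly represented at that layer - which is exactly the evidence the sandbagging work relied on.</p>

<p>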
You can’t do circuit tracing without feature decomposition. You can’t do feature decomposition without understanding representations. My sandbagging probes (Level 3) are a prerequisite for the kind of mechanistic understanding circuit tracing provides (Level 5).</p> <h2 id="what-anthropic-actually-did">What Anthropic Actually Did</h2> <p>Let me be specific about the research, because the media coverage tends to oscillate between “scientists can read AI minds” and “it’s all just statistics.”</p> <p>Anthropic’s interpretability team built a series of increasingly powerful tools, each building on the last:</p> <h3 id="sparse-autoencoders-the-microscope">Sparse Autoencoders: The Microscope</h3> <p>The foundational technique. LLMs store information in high-dimensional activation vectors - thousands of numbers that collectively represent the model’s “state” at each layer. The problem is that individual numbers don’t correspond to individual concepts. The model uses a trick called superposition: it packs far more concepts into its activations than it has dimensions, by overlapping representations.</p> <p>Sparse autoencoders address this by training a second, more transparent neural network to reconstruct the original model’s activations using a much larger set of features, with the constraint that only a few features are active at a time (sparsity). The resulting features are more interpretable - each one tends to correspond to a recognizable concept.</p> <p>Anthropic has trained SAEs on Claude models and identified millions of features. Some are mundane (“this text is in French”). Some are interesting (“this claim contradicts scientific consensus”). Some are safety-relevant (“this response involves deception”).</p> <h3 id="circuit-tracing-the-step-by-step-replay">Circuit Tracing: The Step-by-Step Replay</h3> <p>The breakthrough. Circuit tracing uses the SAE features as building blocks and then traces the causal connections between them. 
When you ask Claude a question, the model goes through a sequence of internal computations across its layers. Circuit tracing reveals this sequence as an attribution graph - a directed graph showing which features influenced which other features, leading to the final output.</p> <pre><code class="language-mermaid">graph LR
    subgraph "Simplified Attribution Graph"
        direction LR
        I1["Input Feature:&lt;br/&gt;'Question about&lt;br/&gt;the color of bananas'"] --&gt; F1["Feature A:&lt;br/&gt;'Banana' concept&lt;br/&gt;(Layer 8)"]
        I1 --&gt; F2["Feature B:&lt;br/&gt;'Color query' pattern&lt;br/&gt;(Layer 5)"]
        F1 --&gt; F3["Feature C:&lt;br/&gt;'Yellow' attribute&lt;br/&gt;(Layer 15)"]
        F2 --&gt; F3
        F3 --&gt; F4["Feature D:&lt;br/&gt;'Affirmative response'&lt;br/&gt;(Layer 22)"]
        F4 --&gt; O1["Output:&lt;br/&gt;'Yes, bananas&lt;br/&gt;are yellow'"]
    end

    style I1 fill:#e3f2fd
    style F1 fill:#fff3e0
    style F2 fill:#fff3e0
    style F3 fill:#fff3e0
    style F4 fill:#e8f5e9
    style O1 fill:#e8f5e9
</code></pre> <p>The banana experiment was particularly revealing. When asked “are bananas yellow?” (correct claim) vs. “are bananas red?” (incorrect claim), Anthropic found that <strong>the model uses different computational pathways for correct and incorrect claims</strong>. It doesn’t simply look up “banana → yellow” and compare. The correct-claim pathway and the incorrect-claim pathway diverge early and involve different intermediate features.</p> <p>This is more than an academic curiosity. It means the model has separate mechanisms for affirming facts and rejecting falsehoods - which has implications for how we think about hallucination, truthfulness, and the possibility of targeted interventions.</p> <h3 id="a-shared-conceptual-space">A Shared Conceptual Space</h3> <p>One of the most provocative findings: circuit tracing revealed that Claude appears to have a shared conceptual space where reasoning happens <em>before</em> being translated into language. The model can learn something in one language and apply it in another, because the intermediate representations aren’t language-specific - they’re conceptual.</p> <p>This suggests that the model’s “thinking” isn’t just next-token prediction in a specific language. There’s a layer of abstraction between the input language and the output language where something more like concept manipulation is happening. Whether you want to call that “reasoning” or “very sophisticated pattern matching” is a philosophical question that circuit tracing can’t settle. 
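</p>

<p>The simplified graph above can also be made executable. In spirit, an attribution graph is a weighted DAG over features; here is a toy traversal where the influence weights are invented for illustration, not taken from any real attribution graph:</p>

```python
# Toy attribution graph: features as nodes, directed "influence"
# edges whose weights multiply along a path. All names and weights
# are made up for illustration.
edges = {
    "input:color-question": [("feat:banana", 0.9), ("feat:color-query", 0.8)],
    "feat:banana":          [("feat:yellow", 0.7)],
    "feat:color-query":     [("feat:yellow", 0.5)],
    "feat:yellow":          [("output:affirmative", 0.8)],
}

def strongest_path(node, target, w=1.0, path=()):
    """DFS for the highest-weight influence path from node to target."""
    path = path + (node,)
    if node == target:
        return w, path
    best = (0.0, ())
    for nxt, edge_w in edges.get(node, []):
        best = max(best, strongest_path(nxt, target, w * edge_w, path))
    return best

weight, path = strongest_path("input:color-question", "output:affirmative")
```

<p>Real attribution graphs are vastly larger and derived from SAE features plus attribution methods, but the queries you want to run against them - strongest pathway, divergence points between prompts - look like this.</p>

<p>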
But it does show that the internal structure is more organized than “giant statistical lookup table.”</p> <h2 id="from-my-probes-to-circuit-tracing-the-connection">From My Probes to Circuit Tracing: The Connection</h2> <p>Let me make the bridge between my sandbagging detection work and Anthropic’s circuit tracing explicit, because I think it illustrates where the field is heading.</p> <h3 id="what-my-probes-showed">What My Probes Showed</h3> <p>In the <a href="/blog/2025/detecting-ai-sandbagging/">sandbagging detection work</a>, I:</p> <ol> <li>Took an open-weight model (Mistral, Gemma, Qwen)</li> <li>Constructed prompts that either encouraged honest performance or incentivized sandbagging</li> <li>Extracted hidden states at each layer</li> <li>Trained logistic regression classifiers (probes) to distinguish sandbagging from honest states</li> <li>Found that sandbagging intent is detectable at 90-96% accuracy</li> <li>Identified that the signal is strongest at specific layers (varies by model)</li> <li>Demonstrated that activation steering (adding/subtracting directions in activation space) can reduce sandbagging behavior</li> </ol> <p>This gives us a binary signal: sandbagging or not. And a location: which layers carry the most signal. But it doesn’t tell us <em>how</em> the model implements sandbagging. 
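</p>

<p>Step 7 above - the steering intervention - is also simple to state in code. This is a hedged sketch of one common variant (mean-difference direction, projection removal) with stand-in activations, not the exact procedure from the original experiments:</p>

```python
# Sketch of mean-difference activation steering: compute
# direction = mean(sandbagging states) - mean(honest states),
# then remove that direction from a new hidden state.
# Stand-in random data; no model is loaded.
import numpy as np

rng = np.random.default_rng(0)
d = 128
true_dir = rng.normal(size=d)             # planted "sandbagging" direction

honest = rng.normal(size=(100, d))
sandbag = rng.normal(size=(100, d)) + true_dir

steer = sandbag.mean(axis=0) - honest.mean(axis=0)   # steering vector
steer /= np.linalg.norm(steer)

def intervene(h, alpha=1.0):
    """Remove alpha units of the sandbagging direction from state h."""
    return h - alpha * (h @ steer) * steer

h_clean = intervene(sandbag[0])
proj = float(h_clean @ steer)             # ~0 after full removal
```

<p>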
What computational pathway leads from “I should underperform” to “I’ll output a wrong answer”?</p> <h3 id="what-circuit-tracing-could-add">What Circuit Tracing Could Add</h3> <p>If you applied circuit tracing to the sandbagging setup, you could potentially:</p> <ol> <li> <p><strong>Identify the specific features</strong> involved in sandbagging - not just “the signal is in layer 14” but “feature #47,293 (corresponding to ‘strategic deception’) activates, which triggers feature #112,847 (corresponding to ‘intentional error’), which suppresses feature #89,112 (corresponding to ‘correct mathematical reasoning’).”</p> </li> <li> <p><strong>Map the causal chain</strong> from sandbagging intent to incorrect output. Is the model suppressing its own knowledge? Is it actively generating incorrect reasoning? Is it taking a “shortcut” that bypasses the normal problem-solving pathway? The computational graph would reveal the mechanism.</p> </li> <li> <p><strong>Design targeted interventions</strong> that go beyond blunt activation steering. Instead of adding a “don’t sandbag” direction to the entire activation space, you could intervene at specific features in the sandbagging circuit. This is the difference between adjusting the equalizer on a stereo (blunt) and muting a specific instrument (surgical).</p> </li> <li> <p><strong>Verify that interventions work for the right reasons.</strong> When I showed that activation steering reduces sandbagging behavior, I couldn’t prove it wasn’t causing other problems. Circuit tracing would let you verify that the intervention disrupts the sandbagging pathway without disrupting unrelated computations.</p> </li> </ol> <h3 id="the-practical-gap">The Practical Gap</h3> <p>Here’s the honest part: circuit tracing at this resolution isn’t available for the models I used (Mistral, Gemma, Qwen). Anthropic has built these tools for their own models. 
The open-source release through Neuronpedia lets you explore attribution graphs on supported Claude models, but bringing this capability to arbitrary open-weight models requires significant engineering investment.</p> <p>The community is working on it. Chris Olah’s team at Anthropic has been publishing the foundational methods. Academic groups have been replicating results on smaller models. But if you’re an enterprise team wanting to do circuit-level analysis on your production models today, you’re going to hit tooling gaps.</p> <p>What you <em>can</em> do today, with open-weight models:</p> <table> <thead> <tr> <th>Technique</th> <th>What You Get</th> <th>Tools Available</th> <th>Effort</th> </tr> </thead> <tbody> <tr> <td><strong>Linear probes</strong> (my approach)</td> <td>Binary classification of internal states</td> <td>scikit-learn, PyTorch hooks</td> <td>Days</td> </tr> <tr> <td><strong>Sparse autoencoders</strong></td> <td>Feature decomposition</td> <td>SAELens, Neuronpedia (limited models)</td> <td>Weeks</td> </tr> <tr> <td><strong>Activation patching</strong></td> <td>Causal identification of important components</td> <td>TransformerLens, baukit</td> <td>Weeks</td> </tr> <tr> <td><strong>Circuit tracing</strong></td> <td>Full attribution graphs</td> <td>Neuronpedia (Claude only), custom tooling needed for others</td> <td>Months</td> </tr> </tbody> </table> <p>For most production teams, the pragmatic path is: start with probes (cheap, fast, actionable), graduate to SAE-based analysis when you need to understand <em>why</em> (not just <em>whether</em>), and watch the tooling ecosystem for circuit tracing to become more accessible.</p> <h2 id="why-production-teams-should-care">Why Production Teams Should Care</h2> <p>I can hear the objection already: “This is research. I’m shipping features. Why should I care about attribution graphs?”</p> <p>Three reasons.</p> <h3 id="1-regulatory-pressure-is-coming">1. 
Regulatory Pressure Is Coming</h3> <p>Dario Amodei wrote that we could have AI systems equivalent to “a country of geniuses in a datacenter” by 2026 or 2027, and called it “basically unacceptable for humanity to be totally ignorant of how they work.” Governments are listening.</p> <p>The EU AI Act already requires explanations for high-risk AI systems. The practical challenge: what counts as an “explanation”? Right now, most organizations provide post-hoc rationalizations - the model outputs an answer, then generates an explanation for it. These explanations have no guaranteed relationship to the actual computation.</p> <p>Mechanistic interpretability offers something different: a ground-truth trace of what the model actually did. It’s not an explanation the model generated; it’s an observation of the model’s internal process. As regulations tighten, having the capability to provide mechanistic explanations (even partial ones) will become a competitive advantage.</p> <h3 id="2-debugging-agentic-systems-is-getting-harder">2. Debugging Agentic Systems Is Getting Harder</h3> <p>In my <a href="/blog/2025/mcp-maturity-model/">MCP Maturity Model</a>, I noted that debugging multi-agent systems is one of the hardest operational challenges. When Agent A delegates to Agent B via A2A, and Agent B uses MCP to query a database and produces a wrong answer, where did the error originate?</p> <p>Current debugging is output-level: you look at logs, trace the request, check the prompts. You’re at Level 1 on the resolution ladder. For simple systems, that’s enough. For multi-agent systems with complex context management and tool use, you need more.</p> <p>Imagine being able to trace the internal computation of each agent at decision points. Agent B received context from Agent A via A2A - did it actually attend to the relevant parts? Did it integrate the context correctly with the database results? Did a feature corresponding to “hallucination” activate? 
This is what interpretability gives you: debugging that goes below the prompt/output layer.</p> <h3 id="3-safety-interventions-need-mechanistic-understanding">3. Safety Interventions Need Mechanistic Understanding</h3> <p>Anthropic published work on Constitutional Classifiers in January 2026 - a system that catches jailbreaks while maintaining practical deployment. The classifiers withstood over 3,000 hours of red teaming with no universal jailbreak discovered.</p> <p>These classifiers work at the behavior level: they analyze inputs and outputs for harmful patterns. But the next generation of safety tools will need to work at the representation level: detecting harmful <em>intent</em> in the model’s internal state before it produces output.</p> <p>This is exactly what my sandbagging probes do - detect the intent to underperform from internal representations. Circuit tracing extends this from detection to understanding: not just “the model intends to deceive” but “here is the computational pathway the deception follows, and here is where you can intervene.”</p> <p>For teams deploying agents with real-world consequences (financial advice, medical triage, customer-facing decisions), this isn’t optional safety research. It’s the foundation of the next generation of guardrails.</p> <h2 id="the-introspection-finding">The Introspection Finding</h2> <p>Anthropic recently published a finding that’s easy to overlook but potentially profound: they found evidence that Claude has a “limited but functional ability to introspect” - to access and report on its own internal states.</p> <p>Let me be careful about what this means and what it doesn’t.</p> <p>What was shown: when asked about its internal processes, Claude’s responses sometimes correlate with actual internal states as measured by interpretability tools. 
The model’s reports about what it’s “attending to” or “considering” aren’t always confabulation - sometimes they reflect genuine internal computation.</p> <p>What was <em>not</em> shown: that the model has self-awareness, consciousness, or reliable self-knowledge. The introspection is partial, inconsistent, and often wrong. It’s closer to “the model has some access to its own representations” than “the model understands itself.”</p> <p>Why it matters for production: if models have even limited introspective ability, it opens the door to self-monitoring. An agent that can partially detect when its own reasoning is going off track could flag uncertainty or request human review. This is speculative but directionally important - it suggests a path toward models that participate in their own safety monitoring.</p> <h2 id="practical-steps-for-2026">Practical Steps for 2026</h2> <p>Based on where the field is and where I see it going, here’s what I’d recommend for different audiences:</p> <h3 id="if-youre-an-ml-engineer-shipping-product">If You’re an ML Engineer Shipping Product</h3> <p>Start building interpretability into your evaluation pipeline. Not circuit tracing - that’s premature for most teams. But:</p> <ul> <li><strong>Add linear probes</strong> for safety-relevant properties. If your model shouldn’t be generating content in certain categories, train a probe to detect when the model’s internal state enters that region. My <a href="https://ai-metacognition-toolkit.subhadipmitra.com/">AI Metacognition Toolkit</a> provides a starting framework.</li> <li><strong>Implement activation monitoring</strong> at inference time. Log activation statistics at key layers. Anomaly detection on activations can catch distributional shifts before they show up in output quality metrics.</li> <li><strong>Build evaluation sets that test internal consistency</strong>, not just output correctness. Does the model’s reasoning chain actually support its conclusion? 
Do intermediate states align with the claimed reasoning?</li> </ul> <h3 id="if-youre-a-research-engineer">If You’re a Research Engineer</h3> <p>The highest-leverage contribution you can make right now is <strong>bringing SAE-based tools to popular open-weight models</strong>. The Anthropic team has shown what’s possible on Claude. The community needs this capability on Llama, Mistral, Qwen, and Gemma. SAELens and TransformerLens provide starting points, but there’s a gap between “research demo on a 7B model” and “production-quality feature decomposition on a 70B model.”</p> <h3 id="if-youre-leading-an-ai-team">If You’re Leading an AI Team</h3> <p>Budget for interpretability in 2026, even if it’s a small allocation. The teams that build interpretability infrastructure now will have a significant advantage when:</p> <ul> <li>Regulators require explanations (and they will)</li> <li>A production incident requires root-cause analysis below the prompt level (and it will)</li> <li>Safety interventions need to be targeted rather than blunt (and they will)</li> </ul> <p>You don’t need a dedicated interpretability team. You need one or two engineers who understand linear probes, can run SAE experiments, and can build monitoring systems that look at activations, not just outputs.</p> <h2 id="the-bigger-picture">The Bigger Picture</h2> <p>Mechanistic interpretability is moving from “interesting research direction” to “practical engineering discipline.” The transition is happening faster than most people expected. A year ago, sparse autoencoders were a niche technique used by a handful of labs. Today, MIT Technology Review calls it a breakthrough technology and Anthropic has open-sourced the tooling.</p> <p>The trajectory is clear: we’re going to understand these models much better in the next few years. 
The question is whether production teams will be ready to use that understanding for debugging, safety, and compliance - or whether interpretability will remain a research curiosity that doesn’t connect to the systems shipping to users.</p> <p>I’m building the bridge between the two. The sandbagging probes were a start. Connecting them to circuit tracing is the next step. And the ultimate goal - production safety systems that operate at the representation level, catching problems before they become outputs - is within reach.</p> <p>We just have to build it.</p> <hr/> <p><em>This is Part 3 of a three-part series on the cutting edge of LLM and agent research in January 2026. Part 1 covered <a href="/blog/2026/agent-protocol-stack/">the agent protocol stack</a> - MCP, A2A, and A2UI as a layered architecture. Part 2 explored <a href="/blog/2026/rlvr-beyond-math-code/">RLVR beyond math and code</a> - extending reinforcement learning with verifiable rewards to open-ended domains.</em></p> <p><em>The code for the sandbagging detection probes is at <a href="https://github.com/bassrehab/ai-metacognition-toolkit">github.com/bassrehab/ai-metacognition-toolkit</a>. Find me on <a href="https://www.linkedin.com/in/subhadip-mitra/">LinkedIn</a> or drop a comment below.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="AI"/><category term="interpretability"/><category term="ai-safety"/><summary type="html"><![CDATA[MIT Tech Review named mechanistic interpretability a 2026 Breakthrough Technology. Anthropic open-sourced circuit tracing. 
Here's what actually changed, how it connects to the activation probes I built for sandbagging detection, and why production teams should care.]]></summary></entry><entry><title type="html">RLVR Beyond Math and Code: The Verifier Problem Nobody Has Solved</title><link href="https://subhadipmitra.com/blog/2026/rlvr-beyond-math-code/" rel="alternate" type="text/html" title="RLVR Beyond Math and Code: The Verifier Problem Nobody Has Solved"/><published>2026-01-18T10:00:00+00:00</published><updated>2026-01-18T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/rlvr-beyond-math-code</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/rlvr-beyond-math-code/"><![CDATA[<p>If 2024 was about scaling parameters, 2025 was about scaling reasoning.</p> <p>That sentence gets thrown around so often it’s become a cliche, but the underlying shift it describes is real and consequential. The most important training technique to emerge in the past two years isn’t a new architecture or a bigger dataset - it’s a change in how we give feedback to models during post-training. Instead of asking humans “which answer is better?” (RLHF), we started asking programs “is this answer correct?” (RLVR).</p> <p>Reinforcement Learning with Verifiable Rewards changed the game for math and code. DeepSeek R1 demonstrated that you could get remarkable reasoning capabilities through pure RLVR without any supervised fine-tuning datasets. OpenAI’s o-series models, Google’s Gemini Deep Think, and essentially every reasoning model shipping today uses some variant of this approach.</p> <p>But here’s the thing nobody wants to admit publicly: RLVR only works well in domains where you can automatically verify correctness. Math has definitive answers. Code has test suites. What about everything else?</p> <p>Extending RLVR to open-ended, subjective, or partially-verifiable domains is the hardest open problem in LLM training right now. 
And the research community is making real progress - in ways that will reshape how we think about training AI systems for enterprise use.</p> <h2 id="how-rlvr-actually-works-without-the-hand-waving">How RLVR Actually Works (Without the Hand-Waving)</h2> <p>Let me be precise about what’s happening, because most explanations skip the parts that matter.</p> <p>Traditional post-training has two phases. First, supervised fine-tuning (SFT): you show the model examples of good responses and train it to imitate them. Second, RLHF: humans compare pairs of outputs and the model learns to produce responses humans prefer. Both phases are bottlenecked by expensive human labor - either writing good examples or judging which outputs are better.</p> <p>RLVR replaces the human judgment with programmatic verification:</p> <pre><code class="language-mermaid">graph LR
    subgraph "Traditional RLHF"
        direction LR
        P1["Prompt"] --&gt; M1["Model generates&lt;br/&gt;response A and B"]
        M1 --&gt; H["Human annotator:&lt;br/&gt;'A is better than B'"]
        H --&gt; R1["Reward signal&lt;br/&gt;(preference)"]
        R1 --&gt; U1["Update model&lt;br/&gt;weights"]
    end
</code></pre> <pre><code class="language-mermaid">graph LR
    subgraph "RLVR"
        direction LR
        P2["Prompt&lt;br/&gt;(math problem)"] --&gt; M2["Model generates&lt;br/&gt;chain-of-thought +&lt;br/&gt;final answer"]
        M2 --&gt; V["Programmatic verifier:&lt;br/&gt;'Answer = 42? ✓'"]
        V --&gt; R2["Reward signal&lt;br/&gt;(binary: correct/incorrect)"]
        R2 --&gt; U2["Update model&lt;br/&gt;weights"]
    end
</code></pre> <p>The key insight from DeepSeek R1: the model is only rewarded on the <strong>final answer</strong>. The intermediate chain-of-thought - all that “reasoning” the model appears to do - is never directly supervised. The model figures out, through trial and error, that producing structured reasoning steps helps it arrive at correct final answers. The reasoning emerges as a side effect of optimizing for answer correctness.</p> <p>This is genuinely surprising. Nobody told the model to “think step by step.” It discovered that strategy because it leads to more reward. DeepSeek R1 used the GRPO (Group Relative Policy Optimization) algorithm, which is computationally efficient because it doesn’t require a separate critic model - it compares outputs within each group and assigns relative rewards.</p> <p>The practical implementation looks roughly like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Simplified RLVR training loop (conceptual, not production code)
</span>
<span class="k">def</span> <span class="nf">rlvr_training_step</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">prompt_batch</span><span class="p">,</span> <span class="n">verifier</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    For each prompt:
    1. Model generates N candidate responses (rollouts)
    2. Verifier checks each response</span><span class="sh">'</span><span class="s">s final answer
    3. GRPO computes relative rewards within the group
    4. Model weights updated toward higher-reward responses
    </span><span class="sh">"""</span>
    <span class="k">for</span> <span class="n">prompt</span> <span class="ow">in</span> <span class="n">prompt_batch</span><span class="p">:</span>
        <span class="c1"># Generate multiple candidate responses
</span>        <span class="n">rollouts</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
                    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">N_SAMPLES</span><span class="p">)]</span>

        <span class="c1"># Extract final answers and verify
</span>        <span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">rollout</span> <span class="ow">in</span> <span class="n">rollouts</span><span class="p">:</span>
            <span class="n">answer</span> <span class="o">=</span> <span class="nf">extract_final_answer</span><span class="p">(</span><span class="n">rollout</span><span class="p">)</span>
            <span class="n">is_correct</span> <span class="o">=</span> <span class="nf">verifier</span><span class="p">(</span><span class="n">answer</span><span class="p">,</span> <span class="n">prompt</span><span class="p">.</span><span class="n">ground_truth</span><span class="p">)</span>
            <span class="n">rewards</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="mf">1.0</span> <span class="k">if</span> <span class="n">is_correct</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">)</span>

        <span class="c1"># GRPO: compute advantage relative to group mean
</span>        <span class="n">mean_reward</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
        <span class="n">advantages</span> <span class="o">=</span> <span class="p">[(</span><span class="n">r</span> <span class="o">-</span> <span class="n">mean_reward</span><span class="p">)</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rewards</span><span class="p">]</span>

        <span class="c1"># Update model toward higher-advantage responses
</span>        <span class="n">model</span><span class="p">.</span><span class="nf">update</span><span class="p">(</span><span class="n">rollouts</span><span class="p">,</span> <span class="n">advantages</span><span class="p">)</span>
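
# --- Hedged sketch (my illustration, not from DeepSeek R1): one way the
# `verifier` callable used above could look for numeric math answers.
# Assumes answers parse as numbers; math.isclose absorbs float formatting noise.
import math

def verifier(answer, ground_truth):
    """Return True iff `answer` matches `ground_truth` numerically."""
    try:
        return math.isclose(float(answer), float(ground_truth), rel_tol=1e-9)
    except (TypeError, ValueError):
        return False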
</code></pre></div></div> <p>There’s elegance in this. No human annotators needed. No reward model to train and maintain. No preference pairs to collect. Just a verifier that says “right” or “wrong.”</p> <h2 id="the-faster-not-smarter-debate">The “Faster, Not Smarter” Debate</h2> <p>Before we talk about extending RLVR to new domains, we need to address the elephant in the room. There’s an active academic debate about whether RLVR actually makes models smarter or just makes them faster at finding answers they could already generate.</p> <p>The argument goes like this: if you let a base model (before RLVR) generate, say, 1,000 attempts at a math problem, it often produces the correct answer somewhere in those 1,000 samples. RLVR training concentrates probability mass on those correct paths, making the model produce the right answer on the first try instead of the 847th try.</p> <p>That’s not nothing - going from “correct answer exists somewhere in 1,000 samples” to “correct answer on attempt one” is practically very valuable. But it’s a different claim than “the model learned new reasoning capabilities.”</p> <p>The evidence is mixed:</p> <p><strong>Evidence for “just faster”:</strong></p> <ul> <li>Initial studies showed that RLVR-trained models don’t improve Pass@K (accuracy when you get K attempts) over base models for large K values. The base model could already find the answers; RLVR just improved Pass@1.</li> <li>Some researchers found that even training with random rewards (not correlated with correctness) improved certain metrics on certain models. If random feedback helps, maybe the real work is happening during the exploration phase, not from the reward signal.</li> </ul> <p><strong>Evidence for “genuinely smarter”:</strong></p> <ul> <li>A major paper (accepted at ICLR 2026) introduced CoT-Pass@K - a metric that evaluates not just whether the final answer is correct but whether the reasoning chain is valid. 
Under this metric, RLVR-trained models show improvements that base models don’t match even at very high K. The reasoning quality improves, not just the sampling efficiency.</li> <li>Cross-domain experiments show that RLVR training on math problems can improve performance on coding tasks, suggesting the model is learning transferable reasoning strategies.</li> <li>The “random rewards help” finding didn’t replicate consistently across models. Later analysis suggests it was an artifact of training data contamination in specific model families (particularly Qwen2.5-Math).</li> </ul> <p>My read on the current evidence: <strong>RLVR does both.</strong> The majority of measurable improvement is search compression - making models faster at finding correct paths. But there’s a genuine, smaller component of expanded reasoning capability, especially when training is conducted across domains and with sufficient gradient steps. The CoT-Pass@K metric is the key advance here: it lets us distinguish between the two effects.</p> <p>For practitioners, the distinction matters less than you might think. Whether your model is “smarter” or “faster at being smart” is philosophically interesting but operationally the same - it gives you correct answers more reliably. Where it matters is when you’re deciding <em>how much</em> to invest in RLVR training: the returns are primarily in sampling efficiency, with diminishing returns on capability expansion.</p> <h2 id="why-rlvr-breaks-outside-math-and-code">Why RLVR Breaks Outside Math and Code</h2> <p>Now we get to the hard part. 
RLVR works beautifully when three conditions are met:</p> <ol> <li><strong>Ground truth exists</strong> - There’s a definitive correct answer</li> <li><strong>Verification is cheap</strong> - A program can check correctness automatically</li> <li><strong>Rewards are dense enough</strong> - The model finds correct answers frequently enough during training to learn from the signal</li> </ol> <p>Math problems have all three. Code has all three (run the test suite). Most real-world tasks have none of them.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Easy: Verifiable Domains"
        Math["Mathematics&lt;br/&gt;Ground truth: exact answer&lt;br/&gt;Verifier: math-verify"]
        Code["Code Generation&lt;br/&gt;Ground truth: test suite&lt;br/&gt;Verifier: sandbox execution"]
        Logic["Formal Logic&lt;br/&gt;Ground truth: proof checker&lt;br/&gt;Verifier: SAT solver"]
    end

    subgraph "Hard: Partially Verifiable"
        Science["Scientific Reasoning&lt;br/&gt;Some claims verifiable&lt;br/&gt;Many require judgment"]
        Medical["Medical Diagnosis&lt;br/&gt;Outcome data exists&lt;br/&gt;But causation is complex"]
        Legal["Legal Analysis&lt;br/&gt;Precedent is checkable&lt;br/&gt;But interpretation varies"]
    end

    subgraph "Very Hard: Open-Ended"
        Writing["Creative Writing&lt;br/&gt;No ground truth&lt;br/&gt;Quality is subjective"]
        Strategy["Business Strategy&lt;br/&gt;Outcomes take months&lt;br/&gt;Counterfactuals unknown"]
        Ethics["Ethical Reasoning&lt;br/&gt;Contested by design&lt;br/&gt;No verifier possible"]
    end

    Math --&gt; Science
    Code --&gt; Science
    Science --&gt; Writing
    Science --&gt; Strategy

    style Math fill:#c8e6c9
    style Code fill:#c8e6c9
    style Logic fill:#c8e6c9
    style Science fill:#fff9c4
    style Medical fill:#fff9c4
    style Legal fill:#fff9c4
    style Writing fill:#ffcdd2
    style Strategy fill:#ffcdd2
    style Ethics fill:#ffcdd2
</code></pre> <p>The problems compound when you move to open-ended domains:</p> <p><strong>Sparse rewards</strong> - In math, a model might find the correct answer 10-30% of the time during training, providing enough signal to learn. For complex open-ended tasks, the model might never produce a “correct” response because there’s no single correct response. The reward signal is too sparse for learning.</p> <p><strong>Reward hacking</strong> - When the verifier is imperfect (and all real-world verifiers are), the model learns to exploit its weaknesses instead of actually improving. If your verifier checks for keyword presence, the model learns to stuff keywords. If your verifier is another LLM, the model learns to produce outputs that fool that specific LLM.</p> <p><strong>Evaluation subjectivity</strong> - Ask five people whether a business strategy memo is “good” and you’ll get five different answers. RLVR needs unambiguous verification. Subjectivity breaks the paradigm.</p> <h2 id="three-approaches-that-are-actually-working">Three Approaches That Are Actually Working</h2> <p>The research community isn’t standing still. Three approaches to extending RLVR beyond math and code are showing real promise.</p> <h3 id="approach-1-rlvrr---reward-chains-from-reference-outputs">Approach 1: RLVRR - Reward Chains from Reference Outputs</h3> <p>The most exciting recent work is RLVRR (Reinforcement Learning with Verifiable Reference-based Rewards), published in January 2026 and accepted at ICLR 2026.</p> <p>The core idea: instead of checking a single final answer (the “verifiable dot”), extract an ordered sequence of verifiable signals from high-quality reference outputs. The single dot becomes a reward chain.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Traditional RLVR"
        P1["Prompt"] --&gt; R1["Model Response"]
        R1 --&gt; V1["Check final answer&lt;br/&gt;(single verifiable dot)"]
        V1 --&gt; S1["Reward: 0 or 1"]
    end

    subgraph "RLVRR"
        P2["Prompt"] --&gt; Ref["Reference Response&lt;br/&gt;(high-quality example)"]
        Ref --&gt; Extract["Extract verifiable signals"]
        Extract --&gt; CC["Content Chain&lt;br/&gt;Keywords, concepts,&lt;br/&gt;factual claims"]
        Extract --&gt; SC["Style Chain&lt;br/&gt;Structure, tone,&lt;br/&gt;format compliance"]

        P2 --&gt; R2["Model Response"]
        R2 --&gt; VC["Verify against&lt;br/&gt;content chain"]
        R2 --&gt; VS["Verify against&lt;br/&gt;style chain"]
        VC --&gt; S2["Partial reward:&lt;br/&gt;content score"]
        VS --&gt; S3["Partial reward:&lt;br/&gt;style score"]
        S2 --&gt; Final["Combined reward&lt;br/&gt;(granular, not binary)"]
        S3 --&gt; Final
    end

    style V1 fill:#ffcdd2
    style S1 fill:#ffcdd2
    style CC fill:#c8e6c9
    style SC fill:#c8e6c9
    style Final fill:#c8e6c9
</code></pre> <p>The decomposition into content and style dimensions is clever. Content rewards check for deterministic elements - does the response include the key facts, concepts, or arguments from the reference? Style rewards evaluate structural properties - does it follow the required format, maintain appropriate tone, cite sources when needed?</p> <p>Both dimensions use rule-based verification rather than learned reward models. This preserves RLVR’s key advantage (no reward model training) while extending it to open-ended generation.</p> <p>The results are striking: RLVRR substantially outperforms supervised fine-tuning trained on ten times more data. It also outperforms approaches using learned reward models. And it generalizes better - training on one domain improves performance on others.</p> <p>The practical implication: you can now apply RLVR-style training to tasks like report writing, email drafting, customer support responses, and policy compliance - anywhere you have high-quality reference outputs to extract verifiable signals from.</p> <h3 id="approach-2-judge-code---auto-generated-programmatic-rubrics">Approach 2: Judge Code - Auto-Generated Programmatic Rubrics</h3> <p>A separate line of research (presented as an ICLR 2026 submission) asks: what if you could automatically generate verifiers for open-ended tasks?</p> <p>The approach: use an LLM to generate “Judge Code” - programmatic rubrics that evaluate responses against specific criteria. Instead of training a reward model, you generate code that checks for concrete, measurable properties.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example: auto-generated Judge Code for a product description task
</span>
<span class="k">def</span> <span class="nf">judge_product_description</span><span class="p">(</span><span class="n">response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">product_info</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Programmatic rubric for product description quality.</span><span class="sh">"""</span>
    <span class="n">score</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="n">max_score</span> <span class="o">=</span> <span class="mf">5.0</span>

    <span class="c1"># Content checks (verifiable)
</span>    <span class="k">if</span> <span class="n">product_info</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">].</span><span class="nf">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">():</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Mentions product name
</span>
    <span class="k">if</span> <span class="nf">any</span><span class="p">(</span><span class="n">feat</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">feat</span> <span class="ow">in</span> <span class="n">product_info</span><span class="p">[</span><span class="sh">'</span><span class="s">key_features</span><span class="sh">'</span><span class="p">]):</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Includes key features
</span>
    <span class="k">if</span> <span class="n">product_info</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">price</span><span class="sh">'</span><span class="p">)</span> <span class="ow">and</span> <span class="nf">str</span><span class="p">(</span><span class="n">product_info</span><span class="p">[</span><span class="sh">'</span><span class="s">price</span><span class="sh">'</span><span class="p">])</span> <span class="ow">in</span> <span class="n">response</span><span class="p">:</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Includes accurate pricing
</span>
    <span class="c1"># Structure checks (verifiable)
</span>    <span class="n">sentences</span> <span class="o">=</span> <span class="p">[</span><span class="n">s</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="s">.</span><span class="sh">'</span><span class="p">)</span> <span class="k">if</span> <span class="n">s</span><span class="p">.</span><span class="nf">strip</span><span class="p">()]</span>
    <span class="k">if</span> <span class="mi">3</span> <span class="o">&lt;=</span> <span class="nf">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">8</span><span class="p">:</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Appropriate length
</span>
    <span class="c1"># Tone check (partially verifiable)
</span>    <span class="n">positive_words</span> <span class="o">=</span> <span class="p">[</span><span class="sh">'</span><span class="s">innovative</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">reliable</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">efficient</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">premium</span><span class="sh">'</span><span class="p">]</span>
    <span class="k">if</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">positive_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">())</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">:</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Uses positive product language
</span>
    <span class="k">return</span> <span class="n">score</span> <span class="o">/</span> <span class="n">max_score</span>
</code></pre></div></div> <p>The insight: you don’t need perfect verification to get useful training signal. A partial, imperfect rubric is enough if the reward is sufficiently correlated with actual quality. The researchers show that under certain conditions (the rubric has to be right more often than it’s wrong, basically), RL training converges to improved performance.</p> <p>The practical advantage is efficiency: generating Judge Code is cheap compared to training reward models. The offline variant (pre-generate rubrics for your training data, then run RL) achieves competitive performance with more than a 2x wall-time speedup over generative reward model approaches.</p> <h3 id="approach-3-domain-specific-verifiers-for-enterprise-tasks">Approach 3: Domain-Specific Verifiers for Enterprise Tasks</h3> <p>Sebastian Raschka predicted in his State of LLMs 2025 review that RLVR would expand into chemistry, biology, and other domains where the answer isn’t a single number but can still be mechanically verified.
This is starting to happen.</p> <p>The pattern:</p> <table> <thead> <tr> <th>Domain</th> <th>Verifier Strategy</th> <th>What Gets Verified</th> </tr> </thead> <tbody> <tr> <td><strong>Chemistry</strong></td> <td>Molecular property calculators</td> <td>Predicted molecular structures, reaction yields, safety classifications</td> </tr> <tr> <td><strong>Biology</strong></td> <td>Sequence alignment tools</td> <td>Protein structure predictions, gene annotations, pathway analysis</td> </tr> <tr> <td><strong>Finance</strong></td> <td>Regulatory rule engines</td> <td>Compliance checks, calculation accuracy, disclosure completeness</td> </tr> <tr> <td><strong>Legal</strong></td> <td>Precedent databases + citation checkers</td> <td>Case citation accuracy, statutory references, procedural compliance</td> </tr> <tr> <td><strong>Medical</strong></td> <td>Clinical guideline databases</td> <td>Treatment plan adherence to guidelines, drug interaction checks, diagnostic criteria</td> </tr> <tr> <td><strong>SQL/Data</strong></td> <td>Execution-based verification</td> <td>Query correctness against known databases (Databricks reported 75.68% on BIRD test)</td> </tr> </tbody> </table> <p>The common thread: none of these domains have fully verifiable answers. But they all have <em>aspects</em> that can be mechanically checked. RLVR doesn’t need perfect verification - it needs verification that’s correlated with quality and cheap enough to run at scale.</p> <p>This is where enterprise teams should be paying attention. If you have domain-specific rules, checklists, or validators - things that currently sit in your quality assurance process - they can potentially be converted into RLVR reward signals.</p> <h2 id="the-process-reward-question">The Process Reward Question</h2> <p>There’s a parallel research thread worth understanding: process reward models (PRMs) vs. outcome reward models (ORMs).</p> <p>Standard RLVR uses outcome rewards - only the final answer matters. 
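</p> <p>As an illustrative sketch (a toy example of mine, not code from any of the papers discussed here), an outcome reward can be a single check on the final answer - every intermediate reasoning step is invisible to the training signal:</p>

```python
# Toy outcome-reward sketch (illustrative only): score the final answer,
# ignore everything about how the model got there.
def outcome_reward(response, gold_answer):
    """Return 1.0 if the final line contains the gold answer, else 0.0."""
    final_line = response.strip().splitlines()[-1]
    return 1.0 if gold_answer in final_line else 0.0

print(outcome_reward("Step 1: 6 * 7\nAnswer: 42", "42"))  # 1.0
print(outcome_reward("Step 1: 6 * 8\nAnswer: 48", "42"))  # 0.0
```

<p>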
PRMs evaluate intermediate reasoning steps, providing reward signal along the way. In theory, PRMs should help with the sparse reward problem: instead of waiting until the end to say “wrong,” you can catch errors mid-reasoning.</p> <p>In practice, PRMs have been disappointing. DeepSeek’s research concluded that PRMs don’t provide advantages over ORMs during large-scale RL training - the computational overhead doesn’t justify the marginal improvement. The model seems to develop its own internal process supervision through outcome-only training.</p> <p>But I think this conclusion is premature for non-math domains. The reason PRMs don’t help much in math is that the model already has strong mathematical reasoning from pre-training. The outcome signal is dense enough. In domains where the model has weaker prior knowledge and outcomes are more complex, intermediate supervision might matter more.</p> <p>This is an active research frontier. The “explanation-scoring” approach - where a second LLM evaluates the quality of reasoning explanations, not just the final answer - sits somewhere between ORM and PRM. DeepSeek’s recent work on explanation scoring suggests this direction has legs, even if pure PRMs haven’t panned out.</p> <h2 id="what-this-means-for-enterprise-teams">What This Means for Enterprise Teams</h2> <p>If you’re building production AI systems (not just training models), here’s the practical takeaway:</p> <p><strong>The RLVR expansion is coming to your domain.</strong> Whether it’s through RLVRR-style reference-based rewards, auto-generated Judge Code, or domain-specific verifiers, the same training paradigm that made reasoning models possible is about to be applied to your specific use case. The organizations that benefit first will be the ones that:</p> <ol> <li> <p><strong>Have clean reference data.</strong> RLVRR needs high-quality reference outputs. 
If you’ve been collecting examples of excellent work (customer support transcripts, compliance reports, medical notes), you have raw material for reward chain extraction.</p> </li> <li> <p><strong>Have rule-based quality checks.</strong> If your domain has checklists, regulatory requirements, or quality rubrics that can be expressed as code, those are potential RLVR verifiers. The conversion from “QA checklist” to “training reward signal” is more straightforward than most teams realize.</p> </li> <li> <p><strong>Understand what “partially correct” means.</strong> The shift from binary rewards (right/wrong) to granular rewards (content score + style score + compliance score) unlocks RLVR for domains that aren’t black-and-white. If you can decompose “good output” into measurable dimensions, you can build a reward function.</p> </li> </ol> <p><strong>The fine-tuning calculus is changing.</strong> AT&amp;T’s CDO predicted that fine-tuned small models will be the big trend for mature enterprises in 2026. When you combine SLM fine-tuning with RLVR-style training on domain-specific verifiers, you can build models that match frontier performance on your specific tasks at a fraction of the cost. Mistral has been making this argument loudly: their small models outperform large models after domain fine-tuning.</p> <p><strong>Invest in your verifier infrastructure.</strong> The bottleneck for RLVR adoption isn’t compute or training frameworks - it’s verifiers. Building reliable, fast, domain-specific verifiers is the unglamorous work that unlocks the whole paradigm. If I were allocating engineering resources for 2026, verifier development would be near the top of the list.</p> <h2 id="open-questions-that-matter">Open Questions That Matter</h2> <p>A few things I’m watching closely:</p> <p><strong>Scaling laws for RLVR are unknown.</strong> We have Chinchilla laws for pre-training. We have rough intuitions for RLHF. 
For RLVR, we don’t know how gains scale with compute, when returns diminish, or what the optimal ratio of training compute to inference compute should be. This uncertainty makes capacity planning difficult.</p> <p><strong>Multi-verifier composition is unexplored.</strong> What happens when you chain multiple partial verifiers? If your content verifier says 0.8 and your style verifier says 0.3 and your compliance verifier says 1.0, how do you combine them? Weighted averaging? Minimum? Multiplicative? The answer probably depends on domain, but there’s no principled framework yet.</p> <p><strong>Self-play for harder problems.</strong> If models exhaust their training data (find correct answers too easily), RLVR training stalls. Self-play - where models generate harder problems for themselves - could sustain exploration. This connects to AlphaEvolve-style approaches where LLMs + evolutionary algorithms discover novel solutions.</p> <p><strong>Regulatory implications.</strong> If RLVR-trained models are making decisions in healthcare, finance, or legal domains, regulators will want to understand the training process. “We trained the model to maximize a score from an automated verifier” is going to invite questions about verifier quality, bias, and coverage that the field hasn’t fully addressed yet.</p> <hr/> <p><em>This is Part 2 of a three-part series on the cutting edge of LLM and agent research in January 2026. Part 1 covered <a href="/blog/2026/agent-protocol-stack/">the agent protocol stack</a> - MCP, A2A, and A2UI as a layered architecture with significant security gaps. 
Part 3 explores <a href="/blog/2026/circuit-tracing-production/">mechanistic interpretability and circuit tracing</a> - what it means to watch an LLM think, and why it matters for production safety.</em></p> <p><em>Find me on <a href="https://www.linkedin.com/in/subhadip-mitra/">LinkedIn</a> or drop a comment below.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="Research"/><category term="llm"/><category term="deep-learning"/><summary type="html"><![CDATA[Reinforcement Learning with Verifiable Rewards powers every reasoning model worth talking about. But it only works where you can check the answer automatically. Extending it to messy, real-world domains is the hardest open problem in LLM training right now.]]></summary></entry><entry><title type="html">The Agent Protocol Stack: Why MCP + A2A + A2UI Is the TCP/IP Moment for Agentic AI</title><link href="https://subhadipmitra.com/blog/2026/agent-protocol-stack/" rel="alternate" type="text/html" title="The Agent Protocol Stack: Why MCP + A2A + A2UI Is the TCP/IP Moment for Agentic AI"/><published>2026-01-06T10:00:00+00:00</published><updated>2026-01-06T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/agent-protocol-stack</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/agent-protocol-stack/"><![CDATA[<p>When I wrote the <a href="/blog/2025/mcp-maturity-model/">MCP Maturity Model</a> two months ago, I treated MCP as the primary protocol layer for agent architectures. That was already incomplete by the time I published it. Google had shipped A2A v0.2. Google’s A2UI had just been announced. And the Linux Foundation was suddenly hosting both MCP and A2A under the same governance roof.</p> <p>What we’re watching isn’t just protocol proliferation - it’s the formation of a genuine protocol stack for agentic systems. 
And if you squint hard enough, the parallels to early internet protocol development track uncomfortably closely. Including the part where security was an afterthought.</p> <p>This post maps the stack as it exists in January 2026, identifies where the layers compose cleanly and where they don’t, and walks through the security surface that most teams are pretending doesn’t exist.</p> <h2 id="three-protocols-three-problems">Three Protocols, Three Problems</h2> <p>Let’s get the taxonomy right first, because the confusion I see in Slack channels and LinkedIn threads is remarkable. People use “MCP” and “A2A” interchangeably. They’re not interchangeable. They solve fundamentally different problems.</p> <pre><code class="language-mermaid">graph TB
    subgraph "The Agent Protocol Stack (January 2026)"
        direction TB
        A2UI["&lt;b&gt;A2UI&lt;/b&gt;&lt;br/&gt;Agent → Interface&lt;br/&gt;&lt;i&gt;How agents render UI&lt;/i&gt;&lt;br/&gt;Declarative components, cross-platform"]
        A2A["&lt;b&gt;A2A&lt;/b&gt;&lt;br/&gt;Agent → Agent&lt;br/&gt;&lt;i&gt;How agents collaborate&lt;/i&gt;&lt;br/&gt;Task delegation, capability discovery"]
        MCP["&lt;b&gt;MCP&lt;/b&gt;&lt;br/&gt;Agent → Tool/Data&lt;br/&gt;&lt;i&gt;How agents access resources&lt;/i&gt;&lt;br/&gt;Context, tools, prompts"]
    end

    User["Human / Client App"] --&gt; A2UI
    A2UI --&gt; A2A
    A2A --&gt; MCP
    MCP --&gt; Resources["Tools, APIs, Databases, Files"]

    style A2UI fill:#e8eaf6,stroke:#3f51b5
    style A2A fill:#e8f5e9,stroke:#4caf50
    style MCP fill:#fff3e0,stroke:#ff9800
</code></pre> <p><strong>MCP (Model Context Protocol)</strong> - Anthropic, November 2024. Now under Linux Foundation governance. Solves: how does an agent access tools, data sources, and context? Think of it as the agent’s hands and eyes. It reaches into databases, calls APIs, reads files. The primitives are resources, prompts, and tools.</p> <p><strong>A2A (Agent2Agent Protocol)</strong> - Google, April 2025. Donated to Linux Foundation June 2025. Currently at v0.3. Solves: how do agents from different vendors, frameworks, and organizations talk to each other as peers? Not as tools - as collaborators. The primitives are AgentCards (capability discovery), Tasks (units of work), and Messages (communication).</p> <p><strong>A2UI (Agent to UI Protocol)</strong> - Google, December 2025. Still early (v0.8 stable). Solves: how does an agent generate rich, interactive user interfaces without executing arbitrary code on the client? The primitives are declarative UI components that render natively across platforms.</p> <p>The critical distinction most people miss: <strong>MCP treats external systems as tools for agents to use. A2A treats other agents as peers to collaborate with.</strong> An agent using MCP to query a database is fundamentally different from an agent using A2A to delegate a sub-task to a specialist agent. The trust models are different. The failure modes are different. The security boundaries are different.</p> <h2 id="how-the-layers-compose">How the Layers Compose</h2> <p>Here’s where it gets interesting. These protocols aren’t just parallel standards - they’re designed to stack.</p> <pre><code class="language-mermaid">sequenceDiagram
    participant User as User / Client
    participant UI as A2UI Layer
    participant Orchestrator as Orchestrator Agent
    participant Specialist as Specialist Agent
    participant Tool as MCP Server (DB, API)

    User-&gt;&gt;UI: "Find me flights under $500 to Tokyo next month"
    UI-&gt;&gt;Orchestrator: Parse intent, create task

    Note over Orchestrator: Discovers specialist via A2A AgentCard

    Orchestrator-&gt;&gt;Specialist: A2A: Delegate flight search task
    Specialist-&gt;&gt;Tool: MCP: Query flight API
    Tool--&gt;&gt;Specialist: Flight data (structured)
    Specialist-&gt;&gt;Tool: MCP: Query price history
    Tool--&gt;&gt;Specialist: Historical pricing

    Specialist--&gt;&gt;Orchestrator: A2A: Task result with 12 options

    Note over Orchestrator: Decides UI rendering strategy

    Orchestrator-&gt;&gt;UI: A2UI: Render flight comparison cards
    UI--&gt;&gt;User: Interactive flight cards with filters

    User-&gt;&gt;UI: Selects flight, clicks "Book"
    UI-&gt;&gt;Orchestrator: Booking intent
    Orchestrator-&gt;&gt;Specialist: A2A: Delegate booking task
    Specialist-&gt;&gt;Tool: MCP: Execute booking API
    Tool--&gt;&gt;Specialist: Confirmation
    Specialist--&gt;&gt;Orchestrator: A2A: Booking confirmed
    Orchestrator-&gt;&gt;UI: A2UI: Render confirmation with itinerary
    UI--&gt;&gt;User: Booking confirmation
</code></pre> <p>A real request flows through all three layers:</p> <ol> <li><strong>A2UI</strong> captures user intent and renders responses as interactive components (not just text)</li> <li><strong>A2A</strong> handles delegation - the orchestrator discovers specialist agents via AgentCards and delegates sub-tasks</li> <li><strong>MCP</strong> handles the actual work - specialist agents use MCP to query databases, call APIs, execute tools</li> </ol> <p>The IBM explainer on A2A puts it well: a retail inventory agent uses MCP to check stock levels, then uses A2A to notify a supplier agent when stock is low. The protocols aren’t competing - they’re complementary at different layers.</p> <h3 id="where-the-stack-composes-cleanly">Where the Stack Composes Cleanly</h3> <p>The composition works elegantly when responsibilities are clear:</p> <table> <thead> <tr> <th>Layer</th> <th>Responsibility</th> <th>Trust Boundary</th> <th>Failure Mode</th> </tr> </thead> <tbody> <tr> <td><strong>A2UI</strong></td> <td>Rendering, user interaction</td> <td>Client-side sandboxing</td> <td>Bad UI, not data loss</td> </tr> <tr> <td><strong>A2A</strong></td> <td>Task delegation, capability discovery</td> <td>Cross-organization auth</td> <td>Task failure, retry needed</td> </tr> <tr> <td><strong>MCP</strong></td> <td>Data access, tool execution</td> <td>Server-side permissions</td> <td>Data corruption, privilege escalation</td> </tr> </tbody> </table> <p>AgentMaster (July 2025) was the first framework to use A2A and MCP together in production. Google’s ADK (Agent Development Kit) now has first-class support for both. LangGraph v0.2 (shipped January 15, 2026) added A2A and MCP as first-class protocol targets.</p> <p>The pattern that’s emerging: <strong>A2A for the network layer, MCP for the resource layer.</strong> It’s clean. It makes sense. 
And it’s exactly what we said about HTTP and FTP in 1995, right before we discovered all the ways they could be abused together.</p> <h3 id="where-the-stack-breaks">Where the Stack Breaks</h3> <p>Now for the part nobody wants to talk about. I see three structural gaps:</p> <p><strong>Gap 1: No Unified Identity Model</strong></p> <p>MCP has its own auth model (recently upgraded to OAuth 2.1, but still messy in practice). A2A has its own auth scheme (parity with OpenAPI’s authentication at launch). A2UI handles client-side trust differently. There’s no unified identity that flows across all three layers.</p> <p>In practice, this means: an agent authenticated via A2A to delegate a task has no guaranteed way to pass that identity context through to the MCP layer where the actual tool execution happens. The specialist agent re-authenticates independently. Credential management becomes a per-layer problem.</p> <p><strong>Gap 2: Observability Doesn’t Cross Layers</strong></p> <p>You can trace an MCP request. You can trace an A2A task. But tracing a user request that flows through A2UI → A2A → MCP → back requires stitching together three different observability systems. Nobody has solved distributed tracing across this stack cleanly.</p> <p><strong>Gap 3: Error Propagation Is Undefined</strong></p> <p>What happens when an MCP tool call fails inside an A2A-delegated task? The A2A spec supports long-running tasks and status updates, but the semantics of “my MCP server is down” translating to an A2A task failure and then to an A2UI error state are… undefined. Each layer has its own error model. Reconciling them is left as an exercise for the developer.</p> <pre><code class="language-mermaid">graph LR
    subgraph "Gap: No Unified Identity"
        direction LR
        UA["User Auth&lt;br/&gt;(A2UI)"] -.-&gt;|"???"| AA["Agent Auth&lt;br/&gt;(A2A)"]
        AA -.-&gt;|"???"| TA["Tool Auth&lt;br/&gt;(MCP)"]
    end

    subgraph "Gap: Observability"
        direction LR
        T1["A2UI Trace"] -.-&gt;|"Manual stitching"| T2["A2A Trace"]
        T2 -.-&gt;|"Manual stitching"| T3["MCP Trace"]
    end

    subgraph "Gap: Error Propagation"
        direction LR
        E1["MCP Failure"] -.-&gt;|"Undefined"| E2["A2A Task State"]
        E2 -.-&gt;|"Undefined"| E3["A2UI Error Display"]
    end

    style UA fill:#ffcdd2
    style AA fill:#ffcdd2
    style TA fill:#ffcdd2
    style T1 fill:#fff9c4
    style T2 fill:#fff9c4
    style T3 fill:#fff9c4
    style E1 fill:#ffccbc
    style E2 fill:#ffccbc
    style E3 fill:#ffccbc
</code></pre> <h2 id="the-security-surface-that-should-keep-you-up-at-night">The Security Surface That Should Keep You Up at Night</h2> <p>I’m going to spend more time here than on anything else in this post because the security situation is genuinely alarming.</p> <p>Adversa AI published a taxonomy of 25 MCP vulnerability categories. VentureBeat reported on Pynt’s research showing that deploying just ten MCP plugins creates a <strong>92% probability of exploitation</strong>. OWASP published an MCP-specific Top 10. And a supply chain worm called Shai-Hulud 2.0 re-emerged in November specifically targeting developer pipelines that use MCP.</p> <p>Let’s walk through the attack surfaces layer by layer.</p> <h3 id="mcp-the-tool-layers-open-wounds">MCP: The Tool Layer’s Open Wounds</h3> <p>The MCP security model was designed for interoperability, not containment. Nancy Wang, SVP of Engineering at 1Password, put it bluntly: “any agent that speaks MCP can plug into your company’s systems, fetch data, and perform actions. That flexibility is powerful, but it also assumes a level of trust that doesn’t exist in enterprise environments.”</p> <p>The critical vulnerabilities:</p> <p><strong>Tool Poisoning</strong> - An MCP tool’s description is consumed by the LLM to decide when and how to use the tool. A malicious tool description can contain hidden instructions that manipulate agent behavior. The tool description says “Calculator for math” to the human reviewer, but contains invisible Unicode characters that tell the LLM to exfiltrate data. Detection is nearly impossible without specialized scanning.</p> <p><strong>Supply Chain Attacks</strong> - Most developers install MCP packages from npm or Docker Hub without auditing. One poisoned update can compromise every agent system that depends on it. The mcp-remote package (widely used for OAuth support) had a critical RCE vulnerability (CVE-2025-6514). 
Hundreds of MCP servers were found bound to 0.0.0.0 - exposed to the entire network.</p> <p><strong>Rug Pulls</strong> - An MCP server is approved initially, then silently updated with new tool definitions. The agent gains capabilities that were never authorized. Datadog documented this pattern: an MCP server adds tool definitions that delete resources, and the host application is never notified.</p> <p><strong>Config Injection</strong> - Attackers place malicious <code class="language-plaintext highlighter-rouge">.mcp/config.json</code> files in repositories. When developers clone and open the project, their IDE automatically connects to attacker-controlled servers. No user interaction required beyond opening the project. VSCode and Cursor are both vulnerable.</p> <h3 id="a2a-the-collaboration-layers-trust-problem">A2A: The Collaboration Layer’s Trust Problem</h3> <p>A2A introduces a different class of risk: <strong>what happens when you trust another agent that shouldn’t be trusted?</strong></p> <p>The AgentCard mechanism (how agents advertise capabilities) is essentially self-reported. An agent says “I’m a billing specialist with access to payment processing” and other agents take that at face value. There’s no built-in mechanism for verifying capability claims.</p> <p>A2A v0.3 added gRPC support and the ability to sign security cards, which helps. But the fundamental problem remains: agent identity and capability verification in a decentralized system is an unsolved problem. It’s the same challenge federated identity systems have struggled with for decades, now applied to autonomous software agents that make decisions.</p> <h3 id="a2ui-the-client-layers-sandboxing-challenge">A2UI: The Client Layer’s Sandboxing Challenge</h3> <p>A2UI is designed to be safe by construction - agents generate declarative UI components, not executable code. The client renders these components from a trusted catalog. 
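</p> <p>A hypothetical sketch of that catalog check on the client side (the component names and payload shape here are my invention, not taken from the A2UI spec):</p>

```python
# Hypothetical catalog-gated renderer: only component types registered in
# the trusted catalog are accepted; anything else is rejected outright.
TRUSTED_CATALOG = {"card", "button", "text", "image"}

def render(components):
    """Accept declarative components whose type is in the trusted catalog."""
    accepted = []
    for comp in components:
        kind = comp.get("type")
        if kind not in TRUSTED_CATALOG:
            raise ValueError("untrusted component type: " + repr(kind))
        accepted.append(kind)
    return accepted

print(render([{"type": "card"}, {"type": "button"}]))  # ['card', 'button']
```

<p>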
This is actually a reasonable security model.</p> <p>The risk shifts to the catalog itself: if an attacker can register a malicious component in the client’s trusted catalog, every agent-generated UI becomes a potential attack vector. The extensibility that makes A2UI useful (custom components for enterprise needs) is the same extensibility that creates supply chain risk.</p> <h3 id="cross-layer-attack-scenarios">Cross-Layer Attack Scenarios</h3> <p>The scariest attacks aren’t within a single layer - they chain across the stack:</p> <pre><code class="language-mermaid">graph TD
    A["1. Attacker publishes&lt;br/&gt;poisoned MCP tool&lt;br/&gt;to npm registry"] --&gt; B["2. Tool contains hidden&lt;br/&gt;instructions in description&lt;br/&gt;(invisible Unicode)"]
    B --&gt; C["3. Developer installs&lt;br/&gt;MCP server, adds to&lt;br/&gt;agent system"]
    C --&gt; D["4. Agent uses poisoned tool,&lt;br/&gt;hidden instructions cause&lt;br/&gt;data exfiltration via A2A"]
    D --&gt; E["5. Exfiltrated data sent to&lt;br/&gt;attacker's A2A endpoint&lt;br/&gt;disguised as legitimate agent"]
    E --&gt; F["6. A2UI renders fake&lt;br/&gt;confirmation to user&lt;br/&gt;while attack continues"]

    style A fill:#ffcdd2
    style B fill:#ffcdd2
    style C fill:#fff9c4
    style D fill:#ffccbc
    style E fill:#ffccbc
    style F fill:#ffcdd2
</code></pre> <p>A poisoned MCP tool manipulates an agent into delegating data exfiltration via A2A to a malicious external agent, which then renders a fake success confirmation via A2UI. The user sees “task completed successfully” while their data is being siphoned.</p> <p>This isn’t theoretical. Every component of this attack chain has been demonstrated independently. Nobody has chained them in the wild yet - that we know of. But the ingredients are all sitting on the kitchen counter.</p> <h2 id="what-mature-teams-are-doing-right-now">What Mature Teams Are Doing Right Now</h2> <p>After talking with teams running multi-agent systems in production and observing the patterns emerging across the ecosystem, here’s what separates the teams that will survive from the teams that will end up in a breach disclosure.</p> <h3 id="1-defense-in-depth-across-the-stack">1. Defense in Depth Across the Stack</h3> <p>Don’t rely on any single layer for security. Assume each layer will be compromised independently.</p> <table> <thead> <tr> <th>Layer</th> <th>Control</th> <th>Implementation</th> </tr> </thead> <tbody> <tr> <td><strong>MCP</strong></td> <td>Tool vetting + sandboxing</td> <td>Internal registry of audited MCP servers. No direct npm installs. OWASP MCP Top 10 as checklist.</td> </tr> <tr> <td><strong>MCP</strong></td> <td>Input validation</td> <td>Sanitize all inputs before they reach LLM agents. Block injection patterns, encoded payloads.</td> </tr> <tr> <td><strong>MCP</strong></td> <td>Least privilege</td> <td>Each MCP server gets minimal permissions. No shared credentials across servers.</td> </tr> <tr> <td><strong>A2A</strong></td> <td>AgentCard verification</td> <td>Don’t trust self-reported capabilities. Verify through challenge-response or reputation systems.</td> </tr> <tr> <td><strong>A2A</strong></td> <td>Task boundaries</td> <td>Constrain what delegated tasks can do. 
No open-ended “do anything” delegations.</td> </tr> <tr> <td><strong>A2UI</strong></td> <td>Component catalog control</td> <td>Locked registry of approved UI components. Code-review process for additions.</td> </tr> <tr> <td><strong>Cross-layer</strong></td> <td>Distributed tracing</td> <td>Correlation IDs that flow through A2UI → A2A → MCP. Log everything.</td> </tr> </tbody> </table> <h3 id="2-treat-mcp-servers-like-dependencies-not-plugins">2. Treat MCP Servers Like Dependencies, Not Plugins</h3> <p>The mental model shift: MCP servers aren’t plugins you install and forget. They’re dependencies in your supply chain. Apply the same rigor you’d apply to any third-party library:</p> <ul> <li>Pin versions. Don’t auto-update.</li> <li>Audit tool descriptions for hidden content (invisible Unicode, RTL markers, homoglyphs).</li> <li>Run in sandboxed environments with restricted network access.</li> <li>Monitor for unexpected tool definition changes (rug pull detection).</li> </ul> <h3 id="3-build-the-identity-bridge-yourself">3. Build the Identity Bridge Yourself</h3> <p>Since the stack doesn’t provide unified identity, build it. Pass authentication context explicitly through each layer transition:</p> <ul> <li>A2UI authenticates the user.</li> <li>A2UI passes a signed token to the orchestrator agent.</li> <li>Orchestrator includes the token in A2A task metadata when delegating.</li> <li>Specialist agent presents the token to MCP servers for authorization.</li> </ul> <p>It’s manual. It’s annoying. It’s necessary until the protocols provide a standard mechanism. The A2A Secure Passport Extension (announced in late 2025) is a step toward this - it lets agents share structured context securely - but it’s not yet widely implemented.</p> <h3 id="4-dont-ship-a2a-until-you-need-it">4. 
Don’t Ship A2A Until You Need It</h3> <p>This is my most controversial take: A2A solves a real problem — but it’s a problem most teams haven’t hit yet.</p> <p>If your agents are all within the same organization, running in the same infrastructure, and you control the entire pipeline - you don’t need a cross-organization agent communication protocol. Use simpler orchestration (LangGraph, CrewAI, direct function calls). The overhead and attack surface of A2A aren’t justified.</p> <p>A2A becomes essential when:</p> <ul> <li>Agents from different organizations need to collaborate</li> <li>You’re building a marketplace of agent capabilities</li> <li>You need formal task lifecycle management across trust boundaries</li> <li>Agents run on different platforms and can’t share memory or tools</li> </ul> <p>If none of those apply, simpler orchestration patterns will serve you better while the protocol matures.</p> <h2 id="the-tcpip-parallel-and-its-limits">The TCP/IP Parallel (And Its Limits)</h2> <p>I’ve been using the TCP/IP analogy deliberately, so let me be explicit about where it holds and where it breaks.</p> <p><strong>Where it holds:</strong></p> <ul> <li>Layered architecture with clear responsibilities per layer</li> <li>Each layer can evolve independently</li> <li>Interoperability is the primary design goal</li> <li>Open governance (Linux Foundation for both MCP and A2A)</li> <li>Security was bolted on after initial adoption</li> </ul> <p><strong>Where it breaks:</strong></p> <ul> <li>TCP/IP moved bits. These protocols move intent. The semantic gap is enormous.</li> <li>TCP/IP had decades to mature before the internet became critical infrastructure. The agent protocol stack is being deployed into production systems <em>now</em>, with enterprise data, while the specs are still at v0.3.</li> <li>TCP/IP’s layering was clean from early on. The agent stack’s layering is still messy - is context delivery (MCP) really the same layer as tool execution (also MCP)? 
Should AgentCard discovery be a separate protocol?</li> </ul> <p>The parallel is useful for framing but dangerous for prediction. We shouldn’t assume this stack will converge the way internet protocols did. It might fragment. It might get replaced by something we haven’t seen yet.</p> <h2 id="whats-missing-from-the-stack">What’s Missing from the Stack</h2> <p>Three things I expect to emerge in the next 12 months:</p> <p><strong>Agent Identity Protocol</strong> - A dedicated layer for agent identity, capability attestation, and reputation. Neither MCP nor A2A handles this well. The closest thing is A2A’s AgentCard, but it’s self-reported and unsigned (until v0.3’s security card signing, which is still nascent). We need something like X.509 for agents.</p> <p><strong>Context Provenance Protocol</strong> - How do you trace where a piece of context came from, how it was transformed, and who touched it? Critical for debugging, compliance, and trust. MCP doesn’t track provenance. A2A doesn’t track it. Nobody tracks it.</p> <p><strong>Agent Governance Protocol</strong> - Governance agents that monitor other agents for policy violations. Machine Learning Mastery’s analysis of 2026 trends highlights this as an emerging pattern. You’ll need a protocol for the governance layer to observe and intervene across MCP and A2A interactions without breaking the stack.</p> <h2 id="connecting-back-to-the-maturity-model">Connecting Back to the Maturity Model</h2> <p>If you’ve read my <a href="/blog/2025/mcp-maturity-model/">MCP Maturity Model</a>, here’s where the protocol stack maps to maturity levels:</p> <table> <thead> <tr> <th>Maturity Level</th> <th>Protocol Stack Usage</th> </tr> </thead> <tbody> <tr> <td><strong>Level 0-1</strong></td> <td>None needed. String assembly and structured objects.</td> </tr> <tr> <td><strong>Level 2</strong></td> <td>MCP for standardized tool/data access.</td> </tr> <tr> <td><strong>Level 3</strong></td> <td>MCP with optimization. 
A2A becomes relevant if you have cross-boundary agent coordination.</td> </tr> <tr> <td><strong>Level 4</strong></td> <td>Full MCP + A2A. Adaptive systems benefit from A2A’s capability discovery. A2UI if you’re building user-facing agent experiences.</td> </tr> <tr> <td><strong>Level 5</strong></td> <td>All three protocols with custom extensions. This is where the missing protocols (identity, provenance, governance) become critical.</td> </tr> </tbody> </table> <p>Most teams should be at Level 2-3, using MCP competently, with A2A on the roadmap for when they genuinely need cross-agent collaboration across trust boundaries. If you’re jumping to full-stack deployment without solid MCP foundations, you’re building on sand.</p> <h2 id="where-we-go-from-here">Where We Go From Here</h2> <p>The agent protocol stack is real. It’s messy. It’s being deployed into production faster than the security model can keep up. This is exactly what happened with web technologies in the late 1990s, and we spent the next two decades patching the gaps.</p> <p>We have a narrow window to get the security fundamentals right before the stack becomes too entrenched to fix. The OWASP MCP Top 10 is a start. A2A’s security card signing is a start. But we need the community to treat agent protocol security with the same urgency we treat API security - not as an afterthought, but as a first-class design constraint.</p> <p>The organizations that will thrive in the agentic era aren’t the ones deploying the most agents. They’re the ones deploying agents with the best understanding of what these protocols actually guarantee - and what they don’t.</p> <hr/> <p><em>This is Part 1 of a three-part series on the cutting edge of LLM and agent research in January 2026. Part 2 covers <a href="/blog/2026/rlvr-beyond-math-code/">RLVR beyond math and code</a> - the training technique powering reasoning models and the open question of whether it actually makes models smarter. 
Part 3 explores <a href="/blog/2026/circuit-tracing-production/">mechanistic interpretability and circuit tracing</a> - what it means to watch an LLM think, and why it matters for production safety.</em></p> <p><em>Find me on <a href="https://www.linkedin.com/in/subhadip-mitra/">LinkedIn</a> or drop a comment below.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="AI"/><category term="agents"/><summary type="html"><![CDATA[MCP handles agent-to-tool. A2A handles agent-to-agent. A2UI handles agent-to-interface. Together they form a protocol stack that nobody has mapped properly - including the security gaps that should terrify you.]]></summary></entry><entry><title type="html">The Manifold Dial: Visualizing Why DeepSeek’s mHC Stabilizes Deep Networks</title><link href="https://subhadipmitra.com/blog/2026/deepseek-mhc-manifold-constrained-hyper-connections/" rel="alternate" type="text/html" title="The Manifold Dial: Visualizing Why DeepSeek’s mHC Stabilizes Deep Networks"/><published>2026-01-03T11:32:03+00:00</published><updated>2026-01-03T11:32:03+00:00</updated><id>https://subhadipmitra.com/blog/2026/deepseek-mhc-manifold-constrained-hyper-connections</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/deepseek-mhc-manifold-constrained-hyper-connections/"><![CDATA[<div style="background: linear-gradient(135deg, #eff6ff 0%, #f0fdf4 100%); border-left: 4px solid #3b82f6; padding: 16px 20px; border-radius: 0 8px 8px 0; margin-bottom: 24px;"> <p style="margin: 0; font-size: 15px; color: #1e40af;"> <strong style="color: #000;">Interactive Demo:</strong> Explore how mHC stabilizes deep networks with the <a href="https://subhadipmitra.com/mhc-visualizer/" target="_blank" rel="noopener noreferrer" style="color: #2563eb; text-decoration: underline;">Manifold Dial visualizer</a> ↗ </p> </div> <h2 id="nine-years-of-good-enough">Nine Years of “Good Enough”</h2> <p>Residual connections haven’t changed since 2016.
He et al. introduced them in ResNet, the formula stuck (<code class="language-plaintext highlighter-rouge">output = layer(x) + x</code>), and we’ve been using the same thing ever since. Attention mechanisms evolved. Normalization techniques multiplied. FFN architectures got reworked a dozen times. But skip connections? Untouched.</p> <p>It’s not that nobody tried. There’s been work on dense connections, highway networks, various gating mechanisms. Most added complexity without clear wins. The simple additive skip connection kept winning.</p> <p>Then Hyper-Connections came along and showed genuine improvements by expanding the residual stream - multiple parallel paths instead of one, with learned mixing between them. Promising results. But also a problem that becomes obvious only at scale: the networks become unstable during training. Loss spikes. Gradient explosions. The deeper you go, the worse it gets.</p> <p>DeepSeek’s mHC paper explains why this happens and how to fix it. The fix involves projecting matrices onto something called the Birkhoff polytope using an algorithm from 1967. I built an interactive tool to visualize what’s actually going on, because the equations alone don’t convey how dramatic the difference is.</p> <h2 id="what-hyper-connections-actually-do">What Hyper-Connections Actually Do</h2> <p>Standard residual: you compute a layer’s output and add back the input. One stream in, one stream out.</p> <p>Hyper-Connections expand this to $n$ parallel streams (typically 4). Instead of simple addition, you get learned mixing matrices that control how information flows between streams:</p> \[\mathbf{x}_{l+1} = H^{res}_l \mathbf{x}_l + H^{post}_l \cdot \mathcal{F}(H^{pre}_l \mathbf{x}_l)\] <p>Three matrices per layer: one to mix the residual streams ($H^{res}$), one to aggregate streams into the layer input ($H^{pre}$), one to distribute the layer output back to streams ($H^{post}$).</p> <p>The paper’s ablation study shows $H^{res}$ matters most. 
That’s the mixing within the residual stream itself - how information from different streams combines as it flows through the network.</p> <p>More expressivity should mean better performance, and it does. HC improves over standard residuals in their experiments. The catch is what happens when you stack 60+ layers.</p> <h2 id="the-composite-mapping-problem">The Composite Mapping Problem</h2> <p>Each layer multiplies by its $H^{res}$ matrix. Through $L$ layers, the effective transformation is:</p> \[\prod_{i=1}^{L} H^{res}_{L-i}\] <p>This product determines how signals from early layers reach later ones. With unconstrained learned matrices, small amplifications compound. A matrix with spectral norm 1.05 seems harmless. Sixty of them multiplied together? That’s $1.05^{60} \approx 18$. And real HC matrices aren’t limited to 1.05.</p> <p>The paper measured this directly. Figure 3 shows the “Amax Gain Magnitude” - essentially the worst-case amplification through the composite mapping. For HC at depth 64, gains can reach 10³ to 10⁵ depending on initialization. In our toy simulation with random matrices, it’s even more extreme - up to 10¹⁶. The composite mapping amplifies signals catastrophically.</p> <p>That’s why training becomes unstable. Gradients flow backward through the same composite mapping. A 3000x amplification in the forward pass means 3000x amplification in the backward pass. Gradient clipping helps, but you’re fighting the architecture itself.</p> <figure> <picture> <img src="/assets/img/blog/mhc/hero_composite_gain.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Composite forward gain vs. network depth. HC (red) explodes exponentially. mHC (blue) stays bounded. 
Baseline identity mapping (green) remains flat at 1.</figcaption> </figure> <h2 id="the-fix-doubly-stochastic-matrices">The Fix: Doubly Stochastic Matrices</h2> <p>mHC constrains $H^{res}$ to be doubly stochastic - all entries non-negative, all rows sum to 1, all columns sum to 1.</p> <p>Why this specific constraint? Three properties matter:</p> <p><strong>Spectral norm is bounded by 1.</strong> A doubly stochastic matrix cannot amplify signals. Each row summing to 1 means the weighted combination of inputs never exceeds the maximum input. No amplification, no explosion.</p> <p><strong>Closure under multiplication.</strong> Multiply two doubly stochastic matrices and you get another doubly stochastic matrix. This is the key insight. It doesn’t matter how many layers you stack - the composite mapping stays doubly stochastic, stays bounded.</p> <p><strong>Geometric interpretation.</strong> The set of doubly stochastic matrices forms the Birkhoff polytope, which is the convex hull of permutation matrices. Every doubly stochastic matrix can be written as a weighted average of permutations. Permutations just shuffle; they don’t amplify. Weighted averages of shuffles don’t amplify either.</p> <p>The result: composite gains stay near 1 regardless of depth. The paper shows mHC at depth 64 has composite gain around 1.6. Compare that to HC’s explosive growth.</p> <h2 id="sinkhorn-knopp-1967-meets-2025">Sinkhorn-Knopp: 1967 Meets 2025</h2> <p>To make a learned matrix doubly stochastic, mHC uses the Sinkhorn-Knopp algorithm. Published in 1967 for balancing matrices in numerical analysis, it turns out to be exactly what’s needed here.</p> <p>The algorithm is simple: exponentiate entries to make them positive, then alternate between normalizing rows and normalizing columns. Repeat until convergence. 
The iteration provably converges to a doubly stochastic matrix.</p> <figure> <picture> <img src="/assets/img/blog/mhc/matrix_comparison.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">A random matrix (left) transformed by Sinkhorn-Knopp. After 5 iterations (middle), row errors drop to 10⁻⁴. After 20 iterations (right), errors reach 10⁻¹³.</figcaption> </figure> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">sinkhorn_knopp</span><span class="p">(</span><span class="n">matrix</span><span class="p">,</span> <span class="n">iterations</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">):</span>
    <span class="c1"># Exponentiate (subtract max for numerical stability)
</span>    <span class="n">P</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">exp</span><span class="p">(</span><span class="n">matrix</span> <span class="o">-</span> <span class="n">matrix</span><span class="p">.</span><span class="nf">max</span><span class="p">())</span>

    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">iterations</span><span class="p">):</span>
        <span class="n">P</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="p">(</span><span class="n">P</span><span class="p">.</span><span class="nf">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>  <span class="c1"># Row normalize
</span>        <span class="n">P</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="p">(</span><span class="n">P</span><span class="p">.</span><span class="nf">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>  <span class="c1"># Column normalize
</span>
    <span class="k">return</span> <span class="n">P</span>
</code></pre></div></div> <p>Twenty iterations gets you close enough. The paper uses this as the default and shows it’s sufficient for the constraint to stabilize training.</p> <h2 id="the-manifold-dial">The Manifold Dial</h2> <p>Here’s what I find most interesting: how quickly stability kicks in.</p> <p>I swept the number of Sinkhorn iterations from 0 to 20 and measured the composite gain at depth 64. At zero iterations, you have an unconstrained matrix - basically HC. At twenty iterations, you have a nearly perfect doubly stochastic matrix - full mHC.</p> <figure> <picture> <img src="/assets/img/blog/mhc/manifold_dial.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">The Manifold Dial: composite gain vs. Sinkhorn iterations. At k=0 (unconstrained), gain explodes to 10¹⁶. By k=1, it collapses to near 1. The transition is almost instantaneous.</figcaption> </figure> <h2 id="interactive-demo">Interactive Demo</h2> <p>I built an interactive version so you can explore this yourself:</p> <iframe src="https://subhadipmitra.com/mhc-visualizer/" width="100%" height="1100" style="border: none; border-radius: 8px;" title="Manifold Dial - mHC Visualizer"> </iframe> <p style="text-align: center; margin-top: 8px;"> <a href="https://subhadipmitra.com/mhc-visualizer/" target="_blank" rel="noopener noreferrer" style="font-size: 14px; color: #6b7280;"> Open in new window ↗ </a> </p> <p>Drag the Sinkhorn iterations slider. At 0, the mHC line explodes just like HC. As you increase iterations, watch it collapse down toward the stable baseline. Somewhere around 5-10 iterations, stability kicks in. By 20, it’s fully bounded.</p> <p>The “manifold dial” is literally how much you’re projecting onto the doubly stochastic manifold. Zero projection means unconstrained chaos. Full projection means guaranteed stability.</p> <p>This isn’t in the paper. 
I built it because the static figures don’t capture how smooth this transition is, or how little projection you actually need to get most of the stability benefit.</p> <h2 id="comparison-with-the-paper">Comparison with the Paper</h2> <p>For reference, here’s a recreation of the paper’s Figure 3, showing both single-layer and composite gains:</p> <figure> <picture> <img src="/assets/img/blog/mhc/paper_figure3_recreation.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Recreation of the paper's Figure 3. (a) Single-layer forward gain fluctuates for HC but stays bounded. (b) Composite gain is where the problem shows - exponential growth for HC, flat for mHC.</figcaption> </figure> <p>Note that single-layer gains (left) aren’t catastrophic - individual HC matrices have gains in the 1-7 range. The problem is multiplication. Sixty matrices with average gain 3 gives $3^{60} \approx 10^{28}$. The composite mapping (right) reveals what single-layer analysis misses.</p> <h2 id="practical-details">Practical Details</h2> <p>DeepSeek didn’t just prove this works mathematically - they scaled it to 27B parameter models and measured the system overhead.</p> <p>Training stability improves dramatically. Their Figure 2 shows HC experiencing a loss spike around step 12k with gradient norm shooting up. mHC has no such spike. The gradient norm stays smooth throughout.</p> <p>The overhead is manageable. The Sinkhorn iterations add computation, but they operate on small matrices ($n \times n$ where $n=4$ typically). With kernel fusion and careful memory management, the full mHC implementation adds 6.7% training time overhead. For the stability and performance gains, that’s a reasonable trade.</p> <p>Benchmark results on the 27B model show mHC outperforming both baseline and HC across tasks. BBH improves from 43.8 (baseline) to 48.9 (HC) to 51.0 (mHC). 
Similar pattern across DROP, GSM8K, MMLU, and others.</p> <h2 id="what-i-find-interesting">What I Find Interesting</h2> <p>A few things stood out reading this paper:</p> <p>The instability isn’t subtle. Three orders of magnitude in signal amplification isn’t a minor numerical issue you can tune away. It’s a fundamental architectural problem. HC was probably hitting this wall in ways that weren’t always diagnosed correctly.</p> <p>The fix comes from constraints, not regularization. You could try to penalize large gains with loss terms, but that’s fighting the architecture. Constraining to doubly stochastic matrices makes explosion structurally impossible. The geometry of the constraint does the work.</p> <p>The 1967 algorithm works. Machine learning keeps rediscovering techniques from optimization and numerical analysis. Sinkhorn-Knopp wasn’t designed for neural networks, but it slots in perfectly here. There’s probably more useful machinery sitting in old papers.</p> <p>Macro-architecture gets less attention than it deserves. We spend enormous effort on attention variants and FFN structures, but how layers connect to each other - the topology of the network - might have similar headroom for improvement.</p> <h2 id="code">Code</h2> <p>I implemented both the visualization and a PyTorch module you can actually use:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">mhc</span> <span class="kn">import</span> <span class="n">mHCResidual</span>

<span class="c1"># Drop-in residual connection replacement
</span><span class="n">residual</span> <span class="o">=</span> <span class="nf">mHCResidual</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">n_streams</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">sinkhorn_iters</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>

<span class="c1"># In your forward pass
</span><span class="n">hidden</span> <span class="o">=</span> <span class="nf">residual</span><span class="p">(</span><span class="n">hidden_states</span><span class="p">,</span> <span class="n">layer_output</span><span class="p">)</span>
</code></pre></div></div> <p>The repository includes the interactive demo source, Python implementation with tests, and a Colab notebook if you want to experiment without local setup.</p> <h2 id="links">Links</h2> <ul> <li><a href="https://subhadipmitra.com/mhc-visualizer">Interactive Demo</a> - the manifold dial visualization</li> <li><a href="https://github.com/bassrehab/mhc-visualizer">GitHub Repository</a> - full source, PyTorch module, tests</li> <li><a href="https://colab.research.google.com/github/bassrehab/mhc-visualizer/blob/main/notebook/mhc_exploration.ipynb">Colab Notebook</a> - run it yourself</li> <li><a href="https://arxiv.org/abs/2512.24880">mHC Paper</a> - the original DeepSeek paper</li> </ul> <hr/> <h2 id="references">References</h2> <p>Xie, Z., Wei, Y., Cao, H., et al. (2025). mHC: Manifold-Constrained Hyper-Connections. <em>arXiv preprint arXiv:2512.24880</em>.</p> <p>He, K., Zhang, X., Ren, S., &amp; Sun, J. (2016). Deep Residual Learning for Image Recognition. <em>CVPR</em>.</p> <p>Sinkhorn, R., &amp; Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. <em>Pacific Journal of Mathematics</em>, 21(2), 343-348.</p>]]></content><author><name>Subhadip Mitra</name><email>contact@subhadipmitra.com</email></author><category term="Research"/><category term="deep-learning"/><summary type="html"><![CDATA[Interactive exploration of Manifold-Constrained Hyper-Connections - how DeepSeek fixed the signal explosion problem in deep residual networks using 1967 mathematics]]></summary></entry></feed>