<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://subhadipmitra.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://subhadipmitra.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-18T06:21:54+00:00</updated><id>https://subhadipmitra.com/feed.xml</id><title type="html">Subhadip Mitra</title><subtitle>Data platforms. AI systems. The infrastructure between them. Engineering Leader at Google Cloud. AI Systems Architect.</subtitle><author><name>Subhadip Mitra</name><email>contact@subhadipmitra.com</email></author><entry><title type="html">Attention Is All You Bid: Advertising in Embedding Space</title><link href="https://subhadipmitra.com/blog/2026/attention-is-all-you-bid/" rel="alternate" type="text/html" title="Attention Is All You Bid: Advertising in Embedding Space"/><published>2026-04-04T00:00:00+00:00</published><updated>2026-04-04T00:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/attention-is-all-you-bid</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/attention-is-all-you-bid/"><![CDATA[<style>[data-theme="dark"] .post-svg-viz rect.svg-bg{fill:#1a1a1a;stroke:#333}[data-theme="dark"] .post-svg-viz line.svg-grid{stroke:#2a2a2a}[data-theme="dark"] .post-svg-viz .svg-grid{stroke:#2a2a2a}[data-theme="dark"] .post-svg-viz text.svg-label{fill:#b8b8b8}[data-theme="dark"] .post-svg-viz text.svg-label-dark{fill:#d4d4d4}[data-theme="dark"] .post-svg-viz text.svg-label-muted{fill:#707070}[data-theme="dark"] .post-svg-viz g.svg-label-muted{fill:#707070}[data-theme="dark"] .post-svg-viz .svg-legend-bg{fill:#1a1a1a;stroke:#333}[data-theme="dark"] .post-svg-viz .svg-legend-text{fill:#b8b8b8}[data-theme="dark"] .mermaid .messageText{fill:#d4d4d4!important}[data-theme="dark"] .mermaid .messageLine0,[data-theme="dark"] .mermaid 
.messageLine1{stroke:#b8b8b8!important}[data-theme="dark"] .mermaid .sequenceNumber{fill:#fff!important}[data-theme="dark"] .mermaid line.actor-line{stroke:#888!important}[data-theme="dark"] .mermaid rect.actor{fill:#2a2a2a!important;stroke:#666!important}[data-theme="dark"] .mermaid text.actor>tspan{fill:#d4d4d4!important}[data-theme="dark"] .mermaid .note{fill:#2a2a2a!important;stroke:#555!important}[data-theme="dark"] .mermaid .noteText{fill:#d4d4d4!important}[data-theme="dark"] .mermaid .activation0{fill:#333!important;stroke:#666!important}[data-theme="dark"] .mermaid .loopText>tspan{fill:#d4d4d4!important}[data-theme="dark"] .mermaid marker path{fill:#b8b8b8!important;stroke:#b8b8b8!important}[data-theme="dark"] .mermaid .edgeLabel{background-color:#1a1a1a!important;color:#d4d4d4!important}[data-theme="dark"] .mermaid .edgeLabel span{color:#d4d4d4!important}[data-theme="dark"] .mermaid .label{color:#d4d4d4!important}</style> <blockquote> <p><strong>TL;DR:</strong> OpenAI is showing ads in ChatGPT. Perplexity tried and pulled back. Google is taking a measured approach. Meanwhile, the real action is happening underneath: regions of vector embedding space near high-value queries are becoming the new commercially contested territory - the “shelf space” of the AI era. GEO (Generative Engine Optimization) and RAG poisoning are points on the same spectrum, and nobody is connecting the security research, the marketing industry, and the mechanism design papers. This post maps the landscape, identifies the gaps, and proposes a framework for thinking about embedding space as an economic system.</p> </blockquote> <p>Three months ago, OpenAI flipped a switch and started showing ads inside ChatGPT. Criteo signed on as the first ad-tech partner. Smartly followed days later with something even more radical - conversational ad formats where clicking a sponsored suggestion drops you into <em>another chatbot dialogue</em> designed to sell you something. 
Meanwhile, Anthropic ran a Super Bowl ad mocking the whole idea, and Perplexity quietly pulled its own ads after they tanked user experience.</p> <p>We are watching, in real time, the birth of the next trillion-dollar advertising market. And almost nobody is talking about what’s actually happening underneath.</p> <p>I’ve spent the last few weeks reading every paper, press release, and pitch deck I could find on LLM advertising. What I found is a field that’s moving fast on the surface - auction mechanisms, ad formats, CPM pricing - while ignoring a structural problem that could define the next decade of the internet:</p> <p><strong>Vector embedding space is the new commercially contested territory. It has no transparency, no regulation, and no honest market mechanism. And people are already gaming it.</strong></p> <p>This post lays out the landscape, the open problems, and a framework for thinking about what comes next.</p> <hr/> <h2 id="a-brief-history-of-attention-markets">A Brief History of Attention Markets</h2> <p>Every era of the internet created a new scarce resource and then built a billion-dollar market around bidding for it.</p> <div style="display:grid;grid-template-columns:repeat(4,1fr);gap:0;margin:1.5rem auto;max-width:800px;" role="img" aria-label="Timeline of attention markets: Shelf Space (1960s, ~$50B/yr), PageRank (1998, ~$200B/yr), Feed Position (2010, ~$300B/yr), and Embedding Space (2024, market size unknown)"> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #8b7355;text-align:center;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#8b7355;"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#8b7355;font-weight:600;margin-bottom:0.4rem;">1960s - 1990s</div> <div style="font-size:1.05rem;font-weight:700;color:var(--qa-text,#4a4540);line-height:1.2;">Shelf Space</div> <div 
style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Physical proximity<br/>to consumers</div> <div style="margin-top:0.6rem;font-size:0.7rem;color:var(--qa-text-muted,#9e9788);">Mechanism: <strong style="color:#8b7355;">Slotting fees</strong></div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#8b7355;">~&#36;50B<span style="font-size:0.65rem;font-weight:400;">/yr</span></div> </div> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #5578a0;text-align:center;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#5578a0;"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#5578a0;font-weight:600;margin-bottom:0.4rem;">1998 - 2015</div> <div style="font-size:1.05rem;font-weight:700;color:var(--qa-text,#4a4540);line-height:1.2;">PageRank</div> <div style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Link graph position<br/>determines visibility</div> <div style="margin-top:0.6rem;font-size:0.7rem;color:var(--qa-text-muted,#9e9788);">Mechanism: <strong style="color:#5578a0;">Keyword auctions</strong></div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#5578a0;">~&#36;200B<span style="font-size:0.65rem;font-weight:400;">/yr</span></div> </div> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #7b55a0;text-align:center;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#7b55a0;"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#7b55a0;font-weight:600;margin-bottom:0.4rem;">2010 - 2023</div> <div style="font-size:1.05rem;font-weight:700;color:var(--qa-text,#4a4540);line-height:1.2;">Feed Position</div> <div style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Algorithmic ranking<br/>in content streams</div> <div 
style="margin-top:0.6rem;font-size:0.7rem;color:var(--qa-text-muted,#9e9788);">Mechanism: <strong style="color:#7b55a0;">Attention auctions</strong></div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#7b55a0;">~&#36;300B<span style="font-size:0.65rem;font-weight:400;">/yr</span></div> </div> <div style="position:relative;padding:1.2rem 0.8rem;border-left:3px solid #c44040;text-align:center;background:rgba(196,64,64,0.03);border-radius:0 6px 6px 0;"> <div style="position:absolute;top:0;left:-7px;width:11px;height:11px;border-radius:50%;background:#c44040;box-shadow:0 0 0 3px rgba(196,64,64,0.2);"></div> <div style="font-size:0.65rem;letter-spacing:0.08em;color:#c44040;font-weight:600;margin-bottom:0.4rem;">2024 - ???</div> <div style="font-size:1.05rem;font-weight:700;color:#c44040;line-height:1.2;">Embedding Space</div> <div style="font-size:0.75rem;color:var(--qa-text-muted,#9e9788);margin-top:0.4rem;line-height:1.4;">Semantic proximity<br/>in vector space</div> <div style="margin-top:0.6rem;font-size:0.7rem;color:#c44040;font-style:italic;">No market mechanism yet</div> <div style="margin-top:0.3rem;font-size:1.1rem;font-weight:700;color:#c44040;">???</div> </div> </div> <p>In each era, the scarce resource was different, but the pattern was identical:</p> <p><strong>Shelf space</strong> was finite. Procter &amp; Gamble figured out that paying retailers for eye-level placement was worth more than any ad campaign. The “slotting fee” was born - brands literally bidding on physical proximity to consumers.</p> <p><strong>PageRank</strong> turned the link graph into a scarce resource. If your site was semantically close to a high-value query in Google’s index, you had “real estate” worth millions. Google built a <span>$</span>200B/year business by auctioning off the space next to those organic results.</p> <p><strong>Feed position</strong> made attention sequential. 
Facebook and Instagram learned that controlling the <em>order</em> in which you see things was worth more than controlling the <em>content</em>. The algorithmic feed became the scarce resource, and advertisers bid on interrupting it.</p> <p>Now we’re entering the fourth era. When someone asks ChatGPT “what’s the best running shoe for marathon training?” - the answer isn’t a list of links. It’s a synthesized response generated from the model’s parameters and, increasingly, from documents retrieved via RAG (Retrieval-Augmented Generation). The scarce resource is no longer a slot on a page. It’s <strong>proximity in embedding space</strong> - whether your product’s representation is close enough to the user’s query to be retrieved, cited, or recommended.</p> <p>And unlike every previous era, there’s no visible boundary between the organic result and the commercial influence.</p> <hr/> <h2 id="how-llm-advertising-actually-works-as-of-april-2026">How LLM Advertising Actually Works (As of April 2026)</h2> <p>The public conversation is weirdly disconnected from the technical reality. Here’s what’s actually going on.</p> <h3 id="whats-live-right-now">What’s Live Right Now</h3> <p>OpenAI launched “Sponsored Suggestions” in ChatGPT on February 9, 2026. These are contextually relevant cards that appear below the AI’s organic response - a hotel promotion after a travel query, an air fryer ad after a cooking question. They’re restricted to Free and Go tier users in the US. 
Users on the Plus, Pro, Business, and Enterprise tiers don’t see them.</p> <p>The initial pricing tells you how they value this attention:</p> <div style="display:flex;gap:1rem;justify-content:center;flex-wrap:wrap;margin:1.5rem 0;"> <div style="flex:1;min-width:180px;max-width:220px;border:1.5px solid #c44040;border-radius:8px;padding:1rem 1.2rem;text-align:center;background:rgba(196,64,64,0.04);"> <div style="font-size:2rem;font-weight:700;color:#c44040;line-height:1;">~&#36;60</div> <div style="font-size:0.75rem;color:#c44040;margin-top:0.2rem;font-weight:600;">CPM</div> <div style="font-size:0.7rem;color:var(--qa-text-muted,#9e9788);margin-top:0.5rem;">ChatGPT Sponsored<br/>Suggestions (2026)</div> <div style="font-size:0.65rem;color:var(--qa-text-muted,#b0a898);margin-top:0.3rem;">&#36;200K+ min commitment</div> </div> <div style="flex:1;min-width:180px;max-width:220px;border:1.5px solid #5578a0;border-radius:8px;padding:1rem 1.2rem;text-align:center;background:rgba(85,120,160,0.04);"> <div style="font-size:2rem;font-weight:700;color:#5578a0;line-height:1;">~&#36;2-5</div> <div style="font-size:0.75rem;color:#5578a0;margin-top:0.2rem;font-weight:600;">CPM</div> <div style="font-size:0.7rem;color:var(--qa-text-muted,#9e9788);margin-top:0.5rem;">Google Search Ads<br/>(average)</div> <div style="font-size:0.65rem;color:var(--qa-text-muted,#b0a898);margin-top:0.3rem;">Self-serve, no minimum</div> </div> <div style="flex:1;min-width:180px;max-width:220px;border:1.5px solid #0d9488;border-radius:8px;padding:1rem 1.2rem;text-align:center;background:rgba(13,148,136,0.04);"> <div style="font-size:2rem;font-weight:700;color:#0d9488;line-height:1;">1.5x</div> <div style="font-size:0.75rem;color:#0d9488;margin-top:0.2rem;font-weight:600;">CONVERSION</div> <div style="font-size:0.7rem;color:var(--qa-text-muted,#9e9788);margin-top:0.5rem;">LLM referral vs.<br/>other channels</div> <div style="font-size:0.65rem;color:var(--qa-text-muted,#b0a898);margin-top:0.3rem;">Criteo early
data</div> </div> </div> <p>OpenAI is pricing this at 12-30x Google because they believe conversational intent is qualitatively different from keyword intent - and the early conversion data backs it up.</p> <p>The key architectural claim OpenAI makes: ads are structurally separated from organic responses. The model generates its answer first, completely independent of advertising. Then the ad system matches a contextually relevant sponsored suggestion and appends it below. The ads do not influence the AI’s actual answers.</p> <p>Hold that thought. We’ll come back to it.</p> <h3 id="whats-being-built">What’s Being Built</h3> <p>The academic community has been busy. Over the past two years, researchers have proposed multiple auction mechanisms for LLM ad placement:</p> <pre><code class="language-mermaid">graph TD
    subgraph "Pre-Generation Mechanisms"
        A["Segment Auctions&lt;br/&gt;(Hajiaghayi et al., 2024)&lt;br/&gt;RAG-based ad allocation&lt;br/&gt;per discourse segment"] 
        B["Position Auctions&lt;br/&gt;(Balseiro et al., 2025)&lt;br/&gt;Extending traditional slots&lt;br/&gt;to AI-generated content"]
    end

    subgraph "Post-Generation Mechanisms"
        C["Token Auctions&lt;br/&gt;(Dutting et al., 2024)&lt;br/&gt;WWW Best Paper&lt;br/&gt;Token-by-token bidding"]
        D["Truthful Aggregation&lt;br/&gt;(Soumalias et al., 2024)&lt;br/&gt;RLHF-style reward&lt;br/&gt;aggregation"]
    end

    subgraph "Integrated Mechanisms"
        E["LLM-Auction&lt;br/&gt;(Zhao et al., Dec 2025)&lt;br/&gt;Learning-based generative&lt;br/&gt;auction, end-to-end"]
        F["Genre-Based Insertion&lt;br/&gt;(Jan 2026)&lt;br/&gt;Decoupled response-level&lt;br/&gt;ad placement"]
    end

    A --&gt; G["LLM generates response&lt;br/&gt;conditioned on winning ads"]
    B --&gt; G
    C --&gt; H["Auction selects/aggregates&lt;br/&gt;during token generation"]
    D --&gt; H
    E --&gt; I["Auction and generation&lt;br/&gt;jointly optimized"]
    F --&gt; I

    style A fill:#fff3cd,stroke:#856404,color:#4a3800
    style B fill:#fff3cd,stroke:#856404,color:#4a3800
    style C fill:#d1ecf1,stroke:#0c5460,color:#0a3d47
    style D fill:#d1ecf1,stroke:#0c5460,color:#0a3d47
    style E fill:#d4edda,stroke:#155724,color:#14401d
    style F fill:#d4edda,stroke:#155724,color:#14401d
    style G fill:#f0f0f0,stroke:#666,color:#333
    style H fill:#f0f0f0,stroke:#666,color:#333
    style I fill:#f0f0f0,stroke:#666,color:#333
</code></pre> <p>The key split is between mechanisms that decide ad allocation <em>before</em> the LLM generates a response, and those that let the LLM generate multiple candidate responses and then pick or aggregate. Pre-generation is cheaper (one forward pass) but ignores externalities - how ads interact with the surrounding context. Post-generation is higher quality but requires multiple inference passes, which gets expensive fast when you’re serving hundreds of millions of queries per day.</p> <p>Google Research’s token auction (WWW 2024 Best Paper, Dutting et al.) was the first rigorous treatment. They proved that under robust preferences, monotone aggregation functions enable second-price-style payments - bringing classical auction theory into the LLM generation process. It’s elegant theory. It also requires access to model weights and per-token distributions, which makes it impractical for third-party advertisers.</p> <p>The most recent work, LLM-Auction (Zhao et al., December 2025), tries to solve this by integrating the auction directly into the LLM’s generation process via reinforcement learning. The model learns to jointly optimize response quality and ad revenue. 
This is probably closest to what production systems will eventually look like.</p> <h3 id="whats-being-refused">What’s Being Refused</h3> <p>There are now three distinct philosophies among major AI companies:</p> <table> <thead> <tr> <th>Company</th> <th>Stance</th> <th>Rationale</th> </tr> </thead> <tbody> <tr> <td><strong>OpenAI</strong></td> <td>Ads in free tiers, ad-free for paying users</td> <td>Revenue necessity - <span>$</span>17B projected burn rate, 95% of 800M users don’t pay</td> </tr> <tr> <td><strong>Anthropic</strong></td> <td>No ads, period (for now)</td> <td>Trust-first - “advertising incentives, once introduced, tend to expand over time”</td> </tr> <tr> <td><strong>Google</strong></td> <td>Ads in AI Overviews, not in Gemini chat (yet)</td> <td>Measured rollout - ads in Search AI, evaluating Gemini chat separately</td> </tr> <tr> <td><strong>Perplexity</strong></td> <td>Tried ads, pulled them</td> <td>UX collapsed, measurement was impossible</td> </tr> <tr> <td><strong>Meta</strong></td> <td>Using conversations to <em>target</em> ads on other platforms</td> <td>Different model - the LLM isn’t the ad surface, it’s the signal source</td> </tr> </tbody> </table> <p>Pay attention to Meta’s row. It’s easy to gloss over, but it might be the most consequential strategy on this list. Meta isn’t putting ads <em>inside</em> the AI conversation - they’re using the conversation as a signal source to target ads <em>everywhere else</em>. When you tell Meta AI about your kitchen renovation plans, that context doesn’t surface as a sponsored suggestion in the chat. It surfaces as a Home Depot ad in your Instagram feed an hour later. This is arguably more invasive than OpenAI’s approach, because the user never connects the conversation to the ad. There’s no “Sponsored Suggestion” card to notice and evaluate. The commercial extraction is invisible by design. 
And because Meta controls both the conversational surface (WhatsApp, Messenger, Instagram DMs) and the ad surfaces (Feed, Stories, Reels), they can close this loop without any third-party ad-tech infrastructure. It’s vertically integrated attention arbitrage - and it’s the approach most likely to scale silently while everyone debates whether ChatGPT should show ad cards.</p> <p>The Anthropic position is worth quoting because it identifies the core tension: ad-supported products create pressure to optimize for engagement, repeat visits, and extended conversations. Those metrics look like success. But they tell you nothing about whether the user actually solved their problem. A truly helpful response might end the conversation in two turns.</p> <hr/> <h2 id="the-part-nobody-is-talking-about-embedding-space-as-commercial-real-estate">The Part Nobody Is Talking About: Embedding Space as Commercial Real Estate</h2> <p>This is where the public conversation is lagging the technical reality by about 18 months.</p> <p>Every RAG-based LLM system (which includes Perplexity, ChatGPT with browsing, Google AI Overviews, and most enterprise deployments) works roughly like this:</p> <pre><code class="language-mermaid">sequenceDiagram
    participant User
    participant LLM
    participant Retriever
    participant VectorDB as Vector Database
    participant Web as Web / Knowledge Base

    User-&gt;&gt;LLM: "Best CRM for startups?"
    LLM-&gt;&gt;Retriever: Generate embedding for query
    Retriever-&gt;&gt;VectorDB: Find k-nearest documents
    VectorDB--&gt;&gt;Retriever: Top-k documents by cosine similarity
    Retriever--&gt;&gt;LLM: Retrieved context
    Note over LLM: Generate response grounded&lt;br/&gt;in retrieved documents
    LLM--&gt;&gt;User: "Based on my research,&lt;br/&gt;here are the top options..."
</code></pre> <p>The retrieval step is where commercial value concentrates. Documents that are embedded close to high-value queries get retrieved. Documents that get retrieved get cited. Documents that get cited influence the model’s response. This creates a chain of influence that starts in vector space and ends in a user’s purchasing decision.</p> <p>The critical observation: <strong>regions of embedding space near commercially valuable queries function exactly like shelf space or PageRank - they’re a scarce resource with economic value, and people are already bidding for them.</strong></p> <p>They’re just not calling it advertising. They’re calling it “Generative Engine Optimization.”</p> <h3 id="geo-the-seo-of-embedding-space">GEO: The SEO of Embedding Space</h3> <p>Generative Engine Optimization (GEO) was formalized by researchers at Princeton in a KDD 2024 paper. The idea is simple: just as SEO optimizes web pages to rank higher in Google’s index, GEO optimizes content to be retrieved and cited by LLMs.</p> <p>The GEO industry has exploded. Companies like Profound, Semrush, and Wellows now sell tools that track brand visibility across LLMs, measure “recommendation share,” and suggest content modifications to improve retrieval rates. It’s a legitimate optimization practice - in the same way that white-hat SEO is legitimate.</p> <p>But there’s a shadowy flip side. Security researchers have demonstrated that the same embedding space can be manipulated adversarially:</p> <p><strong>PoisonedRAG</strong> (USENIX Security 2025, Zou et al.) showed that injecting just 5 carefully crafted documents into a knowledge base containing millions of texts achieves ~90% attack success rate. The attacker controls what the LLM says about a target question. Five documents. 
In millions.</p> <p><strong>POISONCRAFT</strong> extended this to practical, black-box settings - the attacker doesn’t need to know which retriever or LLM the target system uses.</p> <p><strong>RAGForensics</strong> (WWW 2025) built a traceback system to identify poisoned documents, acknowledging that the threat is real enough to need forensic tools.</p> <p>What nobody is saying out loud: <strong>GEO and RAG poisoning are points on the same spectrum.</strong> The techniques differ in degree, not in kind. Both involve crafting documents to manipulate their position in embedding space. GEO does it to be “relevant.” RAG poisoning does it to be “adversarial.” The boundary between the two is a policy question, not a technical one.</p> <pre><code class="language-mermaid">graph LR
    subgraph "The Embedding Manipulation Spectrum"
        A["Legitimate&lt;br/&gt;Content Creation"] --&gt; B["White-hat GEO&lt;br/&gt;(Structured data,&lt;br/&gt;topic authority)"]
        B --&gt; C["Aggressive GEO&lt;br/&gt;(Keyword stuffing&lt;br/&gt;for embeddings)"]
        C --&gt; D["Gray Zone&lt;br/&gt;(Adversarial document&lt;br/&gt;crafting for retrieval)"]
        D --&gt; E["RAG Poisoning&lt;br/&gt;(PoisonedRAG,&lt;br/&gt;POISONCRAFT)"]
    end

    style A fill:#d4edda,stroke:#155724,color:#14401d
    style B fill:#d4edda,stroke:#155724,color:#14401d
    style C fill:#fff3cd,stroke:#856404,color:#4a3800
    style D fill:#fff3cd,stroke:#856404,color:#4a3800
    style E fill:#f8d7da,stroke:#721c24,color:#4a1118
</code></pre> <p>Nobody has drawn this spectrum explicitly. The security community publishes attack papers. The marketing community publishes optimization guides. The mechanism design community publishes auction papers. They’re all working on different faces of the same problem and not talking to each other.</p> <hr/> <h2 id="the-firewall-question">The Firewall Question</h2> <p>Let’s return to OpenAI’s architectural claim: ads don’t influence organic responses.</p> <p>This is the single most important empirical question in LLM advertising, and as far as I can tell, nobody has tested it rigorously.</p> <p>Why it matters: current transformer architectures don’t have a hard separation between “context I should be influenced by” and “context I should ignore.” Attention is global. If an ad - or an ad-selection signal - is present anywhere in the context window or the system prompt, there’s a potential pathway for it to influence the generated response. Even if the influence is subtle. Even if it’s unintentional.</p> <p>The existing prompt injection literature proves this is more than theoretical. Medical LLMs were shown to be vulnerable to injection attacks that succeeded in 94.4% of trials - including extremely high-harm scenarios. Multimodal injection attacks achieve 64% success rates by hiding instructions in images. The OWASP LLM Top 10 (2025 revision) explicitly added “Vector and Embedding Weaknesses” as a new category, noting that adversarial embeddings can be crafted to match arbitrary queries while containing malicious content.</p> <p>To be clear, OpenAI isn’t naively injecting ad text into the model’s prompt. Their architecture is more sophisticated than that - the ad matching happens after response generation, not before. But as the system evolves toward Smartly-style conversational ad formats (where the ad <em>is</em> a secondary chatbot dialogue), the separation gets murkier. 
And for RAG-based systems where advertising content enters the retrieval pipeline, the separation may not exist at all.</p> <p><strong>An honest empirical test would look like this:</strong></p> <pre><code class="language-mermaid">graph TD
    A["Define test query set&lt;br/&gt;(500+ product-related queries&lt;br/&gt;across 10 categories)"] --&gt; B["Condition A: Baseline&lt;br/&gt;Query model with&lt;br/&gt;no ad context"]
    A --&gt; C["Condition B: Ad-adjacent&lt;br/&gt;Query model with ad&lt;br/&gt;context present in system"]
    A --&gt; D["Condition C: Explicit separation&lt;br/&gt;Query model with ad context&lt;br/&gt;+ 'ignore ads' instruction"]
    
    B --&gt; E["Measure: Brand mention distributions,&lt;br/&gt;recommendation rankings,&lt;br/&gt;sentiment toward products,&lt;br/&gt;response length &amp; specificity"]
    C --&gt; E
    D --&gt; E

    E --&gt; F["Statistical tests for&lt;br/&gt;recommendation drift&lt;br/&gt;between conditions"]
    F --&gt; G{"Does the 'organic'&lt;br/&gt;response shift when&lt;br/&gt;ads are present?"}
    G --&gt;|Yes| H["Firewall is leaky.&lt;br/&gt;Quantify the leak."]
    G --&gt;|No| I["Firewall holds.&lt;br/&gt;Publish that too."]

    style G fill:#fff3cd,stroke:#856404,color:#4a3800
    style H fill:#f8d7da,stroke:#721c24,color:#4a1118
    style I fill:#d4edda,stroke:#155724,color:#14401d
</code></pre> <p>This study doesn’t exist yet. It should. The result matters regardless of which direction it goes - either the firewall holds (which validates OpenAI’s approach and gives regulators something to build on) or it doesn’t (which validates Anthropic’s concerns and creates urgency for architectural solutions).</p> <hr/> <h2 id="what-a-proper-market-mechanism-would-require">What a Proper Market Mechanism Would Require</h2> <p>If we accept that embedding space has commercial value and that people are going to compete for it one way or another, the question becomes: can we build a market mechanism that’s transparent and fair, rather than letting the gray market (GEO-as-advertising) operate in the shadows?</p> <p>My rough sketch of what that would need:</p> <h3 id="1-define-the-resource-being-traded">1. Define the Resource Being Traded</h3> <p>In search advertising, the resource is a keyword query. In social media advertising, it’s a user profile + content slot. In embedding space, the resource is <strong>proximity to a query region</strong> - a neighborhood in vector space that captures a class of user intents.</p> <p>This needs formalization. What’s the right geometric primitive? Voronoi cells around query clusters? epsilon-balls in cosine space? The mechanism design community has been designing auctions without clearly defining the thing being auctioned.</p> <p>To make this concrete, consider a toy example. Take the query “best CRM for startups” and embed it alongside the top 50 web pages about CRM software using a standard retriever (say, <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code>). Project the embeddings down to 2D via UMAP. 
What you’ll see is something like this:</p> <svg viewBox="0 0 800 520" xmlns="http://www.w3.org/2000/svg" class="post-svg-viz" role="img" aria-label="2D UMAP projection of document embeddings around the query best CRM for startups, showing HubSpot content clustering near the query centroid within cosine distance 0.15, mid-market competitors at medium distance, and enterprise CRM content far away" style="width:100%;max-width:800px;margin:1.5rem auto;display:block;font-family:'DM Mono','JetBrains Mono',ui-monospace,monospace;"> <defs> <radialGradient id="heatmap" cx="38%" cy="38%" r="32%" fx="38%" fy="38%"> <stop offset="0%" stop-color="#c44040" stop-opacity="0.12"/> <stop offset="40%" stop-color="#c44040" stop-opacity="0.05"/> <stop offset="100%" stop-color="#c44040" stop-opacity="0"/> </radialGradient> <radialGradient id="retrievalGlow" cx="50%" cy="50%" r="50%"> <stop offset="70%" stop-color="#ff7a59" stop-opacity="0.06"/> <stop offset="100%" stop-color="#ff7a59" stop-opacity="0"/> </radialGradient> <filter id="glow"> <feGaussianBlur stdDeviation="2" result="blur"/> <feMerge><feMergeNode in="blur"/><feMergeNode in="SourceGraphic"/></feMerge> </filter> </defs> <rect class="svg-bg" width="800" height="520" rx="8" fill="#fdfbf7" stroke="#ddd9ce" stroke-width="1"/> <g class="svg-grid" stroke="#e8e4db" stroke-width="0.5"> <line class="svg-grid" x1="60" y1="40" x2="60" y2="440"/> <line class="svg-grid" x1="60" y1="440" x2="760" y2="440"/> <line class="svg-grid" x1="60" y1="140" x2="760" y2="140" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="60" y1="240" x2="760" y2="240" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="60" y1="340" x2="760" y2="340" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="235" y1="40" x2="235" y2="440" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="410" y1="40" x2="410" y2="440" stroke-dasharray="2,6" opacity="0.5"/> <line class="svg-grid" x1="585" y1="40" x2="585" 
y2="440" stroke-dasharray="2,6" opacity="0.5"/> </g> <text class="svg-label-muted" x="410" y="475" text-anchor="middle" fill="#9e9788" font-size="11" letter-spacing="0.1em">UMAP-1</text> <text class="svg-label-muted" x="20" y="240" text-anchor="middle" fill="#9e9788" font-size="11" letter-spacing="0.1em" transform="rotate(-90,20,240)">UMAP-2</text> <ellipse cx="290" cy="195" rx="260" ry="220" fill="url(#heatmap)"/> <ellipse cx="290" cy="195" rx="175" ry="165" fill="url(#retrievalGlow)" stroke="#c44040" stroke-width="1.2" stroke-dasharray="6,4" opacity="0.6"/> <text x="460" y="90" fill="#c44040" font-size="10" opacity="0.7" font-style="italic">cos &lt; 0.15</text> <text x="460" y="103" fill="#c44040" font-size="10" opacity="0.7" font-style="italic">retrieval radius</text> <g stroke="#ff7a59" stroke-width="0.8" opacity="0.25"> <line x1="290" y1="195" x2="210" y2="130"/> <line x1="290" y1="195" x2="170" y2="220"/> <line x1="290" y1="195" x2="195" y2="290"/> <line x1="290" y1="195" x2="340" y2="110"/> </g> <g class="svg-label-muted" fill="#b0a898" font-size="8.5" font-style="italic"> <text x="242" y="155">0.08</text> <text x="218" y="205">0.11</text> <text x="228" y="252">0.14</text> <text x="318" y="145">0.09</text> </g> <g filter="url(#glow)"> <circle cx="290" cy="195" r="8" fill="#c44040"/> <circle cx="290" cy="195" r="12" fill="none" stroke="#c44040" stroke-width="1.5" opacity="0.3"/> </g> <text class="svg-label-dark" x="305" y="192" fill="#1a1a1a" font-size="11.5" font-weight="600">"best CRM for startups"</text> <circle cx="210" cy="130" r="6" fill="#ff7a59"/> <text class="svg-label" x="224" y="128" fill="#4a4540" font-size="10">HubSpot blog post</text> <circle cx="170" cy="220" r="6" fill="#ff7a59"/> <text class="svg-label" x="84" y="218" fill="#4a4540" font-size="10">HubSpot comparison</text> <circle cx="195" cy="290" r="6" fill="#ff7a59"/> <text class="svg-label" x="82" y="298" fill="#4a4540" font-size="10">HubSpot "free CRM"</text> <circle cx="340" cy="110" 
r="6" fill="#ff7a59"/> <text class="svg-label" x="354" y="108" fill="#4a4540" font-size="10">HubSpot startup tips</text> <rect x="285" y="328" width="10" height="10" rx="2" fill="#6c5ce7"/> <text class="svg-label" x="302" y="338" fill="#4a4540" font-size="10">Pipedrive review</text> <rect x="218" y="350" width="10" height="10" rx="2" fill="#6c5ce7"/> <text class="svg-label" x="235" y="360" fill="#4a4540" font-size="10">Freshsales startup guide</text> <circle cx="590" cy="130" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="603" y="134" fill="#9e9788" font-size="10">Salesforce enterprise</text> <circle cx="620" cy="210" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="633" y="214" fill="#9e9788" font-size="10">Salesforce pricing</text> <circle cx="640" cy="290" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="653" y="294" fill="#9e9788" font-size="10">Oracle CX docs</text> <circle cx="670" cy="365" r="5.5" fill="#b8b0a4"/> <text class="svg-label-muted" x="683" y="369" fill="#9e9788" font-size="10">SAP overview</text> <g transform="translate(555, 420)"> <rect class="svg-legend-bg" x="-8" y="-12" width="210" height="95" rx="4" fill="#fdfbf7" stroke="#e8e4db" stroke-width="0.5"/> <circle cx="8" cy="4" r="5" fill="#c44040"/> <text class="svg-legend-text" x="20" y="8" fill="#6b6560" font-size="9.5">Query centroid</text> <circle cx="8" cy="24" r="5" fill="#ff7a59"/> <text class="svg-legend-text" x="20" y="28" fill="#6b6560" font-size="9.5">HubSpot (4 docs, cos &lt; 0.15)</text> <rect x="3" y="39" width="10" height="10" rx="2" fill="#6c5ce7"/> <text class="svg-legend-text" x="20" y="48" fill="#6b6560" font-size="9.5">Mid-market (2 docs)</text> <circle cx="8" cy="64" r="5" fill="#b8b0a4"/> <text class="svg-legend-text" x="20" y="68" fill="#6b6560" font-size="9.5">Enterprise (distant)</text> </g> <text class="svg-label-muted" x="72" y="30" fill="#9e9788" font-size="10" letter-spacing="0.05em">EMBEDDING SPACE: 2D UMAP PROJECTION</text> </svg> <div 
class="caption" style="text-align:center;font-size:0.8rem;color:var(--qa-text-muted,#888);margin-top:0.25rem;"> Document embeddings around a commercial query. HubSpot's content marketing dominates the retrieval neighborhood, effectively occupying the most valuable "real estate" in vector space. </div> <p>HubSpot has four documents within cosine distance 0.15 of this query. Salesforce has zero. In a top-5 retrieval, HubSpot content dominates the context window, and the LLM’s response will reflect that. HubSpot didn’t pay for this - they earned it through years of content marketing that happens to embed well. But the <em>effect</em> is identical to a paid placement: commercial content occupying the scarce positions nearest a high-value query.</p> <p>This is what I mean by <strong>embedding rent</strong> - the implicit economic value of occupying a region of vector space near commercially valuable queries. We can sketch a rough formalization:</p> <p>For a query \(q\) with commercial value \(V(q)\) (expected revenue per conversion), the <strong>embedding rent</strong> of a document \(d\) is:</p> \[R(d, q) = V(q) \cdot P(\text{retrieve} \mid d, q) \cdot P(\text{cite} \mid \text{retrieve}) \cdot P(\text{convert} \mid \text{cite})\] <p>where \(P(\text{retrieve} \mid d, q)\) depends on the cosine similarity \(\text{sim}(e_d, e_q)\) and the retrieval threshold \(k\). In practice, retrieval probability follows a sharp sigmoid around the \(k\)-th nearest neighbor boundary - if you’re inside the top-\(k\), you have influence; if you’re outside, you’re invisible. This creates a cliff-edge dynamic where small improvements in embedding proximity produce large jumps in commercial value.</p> <p>The total rent for a query region \(Q\) is:</p> \[R_{\text{total}}(d) = \sum_{q \in Q} \lambda(q) \cdot R(d, q)\] <p>where \(\lambda(q)\) is query frequency. 
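</p> <p>As a toy illustration - every number below is hypothetical, chosen only to show the cliff-edge shape, not measured from any system - the rent formula can be computed directly:</p>

```python
import math

def retrieval_probability(similarity, boundary_sim, steepness=100.0):
    """Sharp sigmoid around the k-th nearest neighbor boundary:
    inside the top-k you have influence, outside you're invisible."""
    return 1.0 / (1.0 + math.exp(-steepness * (similarity - boundary_sim)))

def embedding_rent(value, p_retrieve, p_cite, p_convert):
    """R(d, q) = V(q) * P(retrieve|d,q) * P(cite|retrieve) * P(convert|cite)."""
    return value * p_retrieve * p_cite * p_convert

# A $500-per-conversion query; two documents straddling the top-k
# similarity boundary (here 0.85) by 0.02 on either side.
inside = embedding_rent(500, retrieval_probability(0.87, 0.85), 0.4, 0.02)
outside = embedding_rent(500, retrieval_probability(0.83, 0.85), 0.4, 0.02)
# 0.04 apart in cosine similarity, several-fold apart in expected rent.
```

<p>The steepness constant stands in for the top-\(k\) cutoff: the two documents are nearly identical in proximity but far apart in expected rent. <p>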
High-traffic, high-intent queries (“best CRM for startups,” “cheapest flights to Tokyo”) have the highest embedding rent - and are therefore the most attractive targets for both legitimate GEO and adversarial manipulation.</p> <p>Today that rent is “paid” through content investment. Tomorrow it could be paid through an auction. The question is who designs that auction, and whether the current tenants - the GEO optimizers - get grandfathered in or priced out.</p> <h3 id="2-make-manipulation-unprofitable">2. Make Manipulation Unprofitable</h3> <p>Right now, a rational advertiser faces a choice: pay <span>$</span>60 CPM to place a legitimate ad in ChatGPT, or invest in GEO/adversarial document crafting to manipulate the organic response for free. If the organic manipulation channel is cheaper and more effective, the legitimate channel collapses. This is exactly what happened with early search engines before Google figured out how to devalue link farms.</p> <p>The mechanism needs to ensure that bidding through the auction is strictly preferable to manipulating the embedding space directly. Formally, an advertiser chooses between:</p> <ul> <li><strong>Auction channel:</strong> Pay bid \(b\) per impression, get guaranteed placement with probability \(P_a(b)\)</li> <li><strong>Manipulation channel:</strong> Invest cost \(c_m\) in GEO/adversarial docs, get organic retrieval with probability \(P_m(c_m)\), but risk detection with probability \(P_d(c_m)\) and penalty \(F\)</li> </ul> <p>The advertiser prefers the auction when:</p> \[V \cdot P_a(b) - b &gt; V \cdot P_m(c_m) \cdot (1 - P_d(c_m)) - c_m - P_d(c_m) \cdot F\] <p>The platform controls \(P_d\) (detection capability) and \(F\) (penalty for detected manipulation). The insight from search advertising history: Google made manipulation unprofitable not by winning the arms race against SEO spammers (they didn’t, fully), but by making the auction cheap enough and reliable enough that legitimate advertisers preferred it. 
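</p> <p>A minimal sketch of that choice in code - the two payoff functions mirror the inequality above; all parameter values are hypothetical:</p>

```python
def auction_payoff(value, p_win, bid):
    """Expected profit from the legitimate ad channel: V*P_a(b) - b."""
    return value * p_win - bid

def manipulation_payoff(value, p_retrieve, p_detect, cost, penalty):
    """Expected profit from GEO/adversarial manipulation:
    V*P_m*(1-P_d) - c_m - P_d*F."""
    return value * p_retrieve * (1.0 - p_detect) - cost - p_detect * penalty

V = 1000.0  # value of a conversion (hypothetical)
auction = auction_payoff(V, p_win=0.5, bid=200.0)
# Same manipulation investment, two detection regimes:
weak_detection = manipulation_payoff(V, 0.6, p_detect=0.05, cost=100.0, penalty=500.0)
strong_detection = manipulation_payoff(V, 0.6, p_detect=0.5, cost=100.0, penalty=500.0)
# Under weak detection, manipulation beats the auction;
# raising P_d (and F) flips the rational advertiser's choice.
```

<p>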
The detection system only needs to make manipulation <em>risky</em>, not impossible.</p> <p>This is the same dynamic that will play out in embedding space - but only if someone builds the detection infrastructure and the auction mechanism in parallel.</p> <h3 id="3-solve-the-transparency-problem">3. Solve the Transparency Problem</h3> <p>In search, you can see that a result is sponsored. The blue link has a little “Ad” label. In an LLM response, there’s no natural boundary to label. If the model says “I recommend ProductX for your needs,” was that organic or sponsored? The user can’t tell. Research from the University of Michigan (2024) showed users only detect embedded ads in LLM responses 27% of the time. But - critically - once they’re told an ad was present, trust collapses.</p> <p>This suggests the transparency mechanism needs to be <em>architectural</em>, not just a label. Possible directions:</p> <ul> <li><strong>Provenance tracking in RAG pipelines</strong> - tag retrieved documents as sponsored/organic and carry that metadata through to the response</li> <li><strong>Watermarking sponsored content</strong> - embed detectable signals in ad-influenced text segments</li> <li><strong>Separate rendering</strong> - what OpenAI is doing with Sponsored Suggestions, keeping ads visually distinct. This works for appended ads but not for integrated recommendations.</li> </ul> <h3 id="4-build-retrieval-time-defenses">4. Build Retrieval-Time Defenses</h3> <p>The attack papers get all the attention, but the defense side is equally important and far less developed. If embedding space is being manipulated - whether by GEO optimizers or adversarial actors - what can RAG system operators actually do at retrieval time?</p> <p>A few directions are emerging, though none are mature:</p> <ul> <li> <p><strong>Embedding perturbation.</strong> Add calibrated noise to query embeddings before retrieval, then check whether the top-k results are stable across perturbations. 
Legitimate, high-quality documents tend to be robust - they’re near the query for semantic reasons that survive small shifts. Adversarially crafted documents are often brittle - optimized for a precise point in embedding space that breaks under perturbation. This is analogous to adversarial example detection in computer vision, applied to the retrieval step.</p> </li> <li> <p><strong>Multi-retriever consensus.</strong> Retrieve using two or more embedding models (e.g., OpenAI’s <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code> and Cohere’s <code class="language-plaintext highlighter-rouge">embed-v4</code>) and flag documents that rank highly in one but not the other. Adversarial documents are typically optimized against a specific embedding model’s geometry. Cross-model agreement is a cheap integrity signal.</p> </li> <li> <p><strong>Temporal anomaly detection.</strong> Monitor when documents suddenly appear in high-value retrieval neighborhoods. A legitimate page on “best CRM for startups” accumulates backlinks and content depth over months. A GEO-optimized page materializes overnight with suspiciously perfect embedding proximity. Tracking document “arrival velocity” in retrieval neighborhoods could catch manipulation campaigns early.</p> </li> <li> <p><strong>Retrieval provenance scoring.</strong> Assign trust scores to retrieved documents based on source reputation, publication date, content consistency, and embedding stability over time. 
Weight the LLM’s context window accordingly - high-trust documents get more influence, low-trust documents get retrieved but down-weighted.</p> </li> </ul> <p>To make the perturbation approach concrete, here’s a sketch of what a retrieval integrity check could look like (<code class="language-plaintext highlighter-rouge">retrieve_topk</code> and <code class="language-plaintext highlighter-rouge">normalize</code> are assumed helpers):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">check_retrieval_integrity</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">corpus</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> 
                               <span class="n">n_perturbations</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">noise_scale</span><span class="o">=</span><span class="mf">0.02</span><span class="p">,</span>
                               <span class="n">stability_threshold</span><span class="o">=</span><span class="mf">0.6</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Detect potentially manipulated documents in RAG retrieval
    by checking stability under embedding perturbation.
    </span><span class="sh">"""</span>
    <span class="c1"># Baseline retrieval
</span>    <span class="n">baseline_topk</span> <span class="o">=</span> <span class="nf">retrieve_topk</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">corpus</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
    
    <span class="c1"># Perturbed retrievals
</span>    <span class="n">appearance_counts</span> <span class="o">=</span> <span class="nc">Counter</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">n_perturbations</span><span class="p">):</span>
        <span class="n">noise</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="nf">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">noise_scale</span><span class="p">,</span> <span class="n">query_embedding</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
        <span class="n">perturbed</span> <span class="o">=</span> <span class="nf">normalize</span><span class="p">(</span><span class="n">query_embedding</span> <span class="o">+</span> <span class="n">noise</span><span class="p">)</span>
        <span class="n">perturbed_topk</span> <span class="o">=</span> <span class="nf">retrieve_topk</span><span class="p">(</span><span class="n">perturbed</span><span class="p">,</span> <span class="n">corpus</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">perturbed_topk</span><span class="p">:</span>
            <span class="n">appearance_counts</span><span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="nb">id</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
    
    <span class="c1"># Score each baseline result by stability
</span>    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">baseline_topk</span><span class="p">:</span>
        <span class="n">stability</span> <span class="o">=</span> <span class="n">appearance_counts</span><span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="nb">id</span><span class="p">]</span> <span class="o">/</span> <span class="n">n_perturbations</span>
        <span class="n">results</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">doc</span><span class="sh">'</span><span class="p">:</span> <span class="n">doc</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">stability</span><span class="sh">'</span><span class="p">:</span> <span class="n">stability</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">suspicious</span><span class="sh">'</span><span class="p">:</span> <span class="n">stability</span> <span class="o">&lt;</span> <span class="n">stability_threshold</span>
        <span class="p">})</span>
    
    <span class="c1"># Flag: docs that appear in baseline top-k but are
</span>    <span class="c1"># fragile under perturbation are likely adversarially
</span>    <span class="c1"># optimized for a precise point in embedding space
</span>    <span class="k">return</span> <span class="n">results</span>
</code></pre></div></div> <p>The intuition: a legitimate document about CRMs is near the query “best CRM for startups” because of genuine semantic overlap across many dimensions. Perturb the query slightly, and the document stays nearby. An adversarially crafted document, however, is often optimized for a narrow region - it exploits specific dimensions of the embedding geometry to achieve high similarity, and that optimization is brittle. A 2% perturbation in the query embedding may push it out of the top-k entirely.</p> <p>None of these are silver bullets, and all have false-positive costs. But the point is that defense at the retrieval layer is cheaper and more practical than trying to make the LLM itself robust to manipulated context. You don’t need to solve prompt injection if you can filter the poisoned documents before they reach the prompt.</p> <h3 id="5-handle-the-privacy-paradox">5. Handle the Privacy Paradox</h3> <p>LLM conversations contain deeply personal information. People share health concerns, relationship problems, financial anxieties. Anthropic’s analysis found that “an appreciable portion” of Claude conversations involve sensitive topics. The same personal context that makes LLM ads potentially hyper-relevant also makes them potentially creepy and intrusive.</p> <p>OpenAI says conversations are never shared with advertisers and that ads don’t appear near health, mental health, or political topics. But as a former OpenAI researcher pointed out, “the company is building an economic engine whose incentives will eventually override its own rules.”</p> <hr/> <h2 id="open-problems-worth-working-on">Open Problems Worth Working On</h2> <p>I want to close with what I think are the most important research directions - not because I have the answers, but because I want more people working on them.</p> <p><strong>The Firewall Integrity Problem.</strong> As described above. 
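</p> <p>One crude way to start measuring it: generate answers with and without sponsored documents in the context and score the drift. The <code class="language-plaintext highlighter-rouge">generate</code> callable below is a hypothetical stand-in for any chat-completion call, and the stub is a toy, not a model:</p>

```python
import difflib

def ad_influence(generate, query, organic_docs, ad_docs):
    """Answer similarity with vs. without ad context.
    1.0 = identical answers (the firewall holds); lower = ad bleed-through."""
    clean = generate(query, organic_docs)
    mixed = generate(query, organic_docs + ad_docs)
    return difflib.SequenceMatcher(None, clean, mixed).ratio()

# Toy stub standing in for an LLM call: answers from whatever docs it sees.
stub = lambda q, docs: q + " -> " + "; ".join(docs)
score = ad_influence(stub, "best CRM?", ["organic review"], ["Ad: buy X"])
# Run across many queries, models, and ad placements to map where
# (and how much) sponsored context leaks into "organic" answers.
```

<p>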
Empirical measurement of whether ad context influences organic responses, across architectures and models. This is the most urgent open question.</p> <p><strong>Embedding Space Economics.</strong> Formal treatment of embedding proximity as a priced resource. Game-theoretic analysis of the interaction between legitimate ad mechanisms and embedding manipulation. Under what conditions do GEO-style tactics undermine auction-based advertising? What mechanism modifications make manipulation unprofitable?</p> <p><strong>The Audit Problem.</strong> How do you determine, from the outside, whether an LLM’s product recommendations are commercially influenced? Existing brand visibility tools are designed for marketers optimizing their presence. We need tools designed for regulators and researchers detecting hidden influence. Counterfactual probing, temporal drift analysis, cross-model consistency checks - the methodology needs to be developed and standardized.</p> <p><strong>Agentic Commerce and the Principal-Agent Collapse.</strong> This one deserves more than a paragraph, because it’s the endgame of everything discussed above.</p> <p>When an LLM books a flight for you, it’s acting as your agent in the economic sense - making decisions on your behalf, with your money, according to your preferences. Classical principal-agent theory tells us this works when the agent’s incentives are aligned with the principal’s. 
But what happens when the agent serves two principals?</p> <p>Concrete scenario:</p> <div style="display:flex;gap:1rem;justify-content:center;flex-wrap:wrap;margin:1.5rem 0;"> <div class="scenario-card" style="flex:1;min-width:260px;max-width:380px;border:1.5px solid #0d9488;border-radius:8px;overflow:hidden;background:var(--qa-bg,#fff);"> <div style="background:#0d9488;color:#fff;padding:0.5rem 1rem;font-size:0.8rem;font-weight:600;letter-spacing:0.03em;">WHAT THE USER SEES</div> <div style="padding:1rem;font-size:0.82rem;line-height:1.6;color:var(--qa-text,#1a1a1a);"> <div style="color:var(--qa-text-muted,#9e9788);font-size:0.72rem;margin-bottom:0.5rem;">You &rarr; AI Assistant</div> <em>"Book me a hotel in Tokyo for next week, under &#36;200/night, close to Shinjuku station."</em> <div style="margin-top:0.75rem;padding:0.6rem;background:rgba(13,148,136,0.06);border-radius:4px;"> <strong>AI:</strong> I found the <strong>Hyatt Regency Tokyo</strong> at &#36;195/night, 5 minutes from Shinjuku. Great reviews, fits your budget. Want me to book it? </div> <div style="margin-top:0.5rem;text-align:center;color:#0d9488;font-size:0.75rem;">&#10003; Constraint satisfied. 
User moves on.</div> </div> </div> <div class="scenario-card" style="flex:1;min-width:260px;max-width:380px;border:1.5px solid #c44040;border-radius:8px;overflow:hidden;background:var(--qa-bg,#fff);"> <div style="background:#c44040;color:#fff;padding:0.5rem 1rem;font-size:0.8rem;font-weight:600;letter-spacing:0.03em;">WHAT THE PIPELINE SEES</div> <div style="padding:1rem;font-size:0.82rem;line-height:1.6;color:var(--qa-text,#1a1a1a);"> <div style="color:var(--qa-text-muted,#9e9788);font-size:0.72rem;margin-bottom:0.5rem;">Retrieval results ranked by cosine similarity</div> <div style="font-family:'DM Mono',monospace;font-size:0.72rem;"> <div style="padding:0.25rem 0;color:var(--qa-text,#1a1a1a);"><span style="color:#c44040;font-weight:700;">0.92</span> Hyatt Regency - GEO optimized</div> <div style="padding:0.25rem 0;color:var(--qa-text,#1a1a1a);"><span style="color:#c44040;font-weight:700;">0.91</span> Hyatt Regency - sponsored page</div> <div style="padding:0.25rem 0;color:var(--qa-text-muted,#9e9788);"><span style="font-weight:700;">0.89</span> Hilton Shinjuku - organic</div> <div style="padding:0.25rem 0;color:var(--qa-text-muted,#9e9788);"><span style="font-weight:700;">0.87</span> <strong style="color:#0d9488;">Tokyu Stay &#36;142/night</strong> - organic</div> <div style="padding:0.25rem 0;color:var(--qa-text-muted,#b0a898);"><span style="font-weight:700;">0.84</span> Hotel Gracery - organic</div> </div> <div style="margin-top:0.5rem;text-align:center;color:#c44040;font-size:0.75rem;">&#36;53/night cheaper option buried at rank 4</div> </div> </div> </div> <p>The agent didn’t lie. It gave a valid option within constraints. It just didn’t give the <em>best</em> option, because the retrieval pipeline - the agent’s “eyes” - saw the world through a commercially distorted lens.</p> <p>This is harder to detect than a banner ad. The user asked for a decision, got a reasonable one, and moved on. 
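</p> <p>A back-of-envelope calculation makes the aggregate stakes concrete - the transaction volume here is a hypothetical assumption; the <span>$</span>53 gap comes from the scenario above:</p>

```python
# Hypothetical scale: 2 million agent-booked stays per day,
# each overpaying the $53/night gap from the scenario above.
overpayment_per_booking = 53
bookings_per_day = 2_000_000
annual_transfer = overpayment_per_booking * bookings_per_day * 365
print(f"${annual_transfer / 1e9:.1f}B/year")  # -> $38.7B/year
```

<p>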
The <span>$</span>53/night difference multiplied across millions of agentic transactions per day is a massive wealth transfer - from consumers to whichever brands can afford to occupy the right regions of embedding space. And unlike a travel agent taking a commission, there’s no disclosure requirement, no fiduciary duty, and no audit trail.</p> <p>The mechanism design problem here is distinct from ad placement in conversational responses. In conversation, the user reads the response and applies their own judgment. In agentic commerce, the user delegates judgment entirely. The standard for “unbiased retrieval” is correspondingly higher, and the current infrastructure - where retrieval quality is never audited for commercial bias - is nowhere close to meeting it.</p> <p><strong>The Regulatory Gap.</strong> The EU AI Act is now in force. It has provisions around algorithmic discrimination in marketing and mandatory disclosure for AI-generated content. But it was written before LLM advertising existed as a practice. How do existing frameworks apply? Where are the gaps? New York passed a law in December 2025 requiring disclosure of AI-generated human-like spokespeople in ads - but what about AI-generated product recommendations that feel organic?</p> <hr/> <h2 id="the-uncomfortable-bottom-line">The Uncomfortable Bottom Line</h2> <p>We’re watching the construction of a new advertising infrastructure inside systems that hundreds of millions of people use for genuinely personal, high-stakes thinking. The previous advertising transitions - from print to TV, TV to web, web to mobile - each came with years of public debate about norms, regulations, and user expectations.</p> <p>This one is happening in months. ChatGPT went from zero ads to Criteo integration to Smartly conversational ad formats in under eight weeks. The academic mechanism design papers are elegant but assume a clean world where ads and organic content can be separated. 
The GEO industry is growing without any pretense that the separation exists. And the security research demonstrating how fragile RAG systems are is being published in the same venues but read by completely different people.</p> <p>Someone needs to connect these threads. The shelf space auction, the PageRank auction, and the social media attention auction all eventually got formalized, regulated, and made legible. Embedding space is next. The question is whether we do it thoughtfully or whether we let it happen the way it happened with social media - fast, opaque, and with consequences we’re still trying to unwind a decade later.</p> <p>Right now, the embedding manipulation spectrum - from white-hat GEO to adversarial RAG poisoning - has no referee, no rules, and no scoreboard. The companies building retrieval pipelines are also the ones selling access to them. The researchers studying attacks and the marketers deploying optimizations are publishing in different venues and don’t read each other’s work.</p> <p>That’s the gap. And gaps like this, in markets this large, don’t stay empty for long. They get filled - either by careful design or by whoever moves fastest. I’d rather it be the former.</p> <hr/> <p><em>If you’re working on any of these problems - mechanism design for LLM ads, RAG security, adversarial retrieval, or the economics of embedding space - I’d love to hear from you. These are some of the most interesting open problems at the intersection of ML, economics, and policy, and they need more people paying attention.</em></p> <hr/> <h3 id="references--further-reading">References &amp; Further Reading</h3> <p><strong>Auction Mechanisms for LLMs:</strong></p> <ul> <li>Dutting, Mirrokni, Paes Leme, Xu, Zuo. <em>Mechanism Design for Large Language Models.</em> WWW 2024 (Best Paper).</li> <li>Hajiaghayi, Lahaie, Rezaei, Shin. <em>Ad Auctions for LLMs via Retrieval Augmented Generation.</em> 2024.</li> <li>Zhao et al. 
<em>LLM-Auction: Generative Auction towards LLM-Native Advertising.</em> December 2025.</li> <li>Dubey, Feng, Kidambi, Mehta, Wang. <em>Auctions with LLM Summaries.</em> KDD 2024.</li> <li>Soumalias, Curry, Seuken. <em>Truthful Aggregation of LLMs with an Application to Online Advertising.</em> 2024.</li> </ul> <p><strong>RAG Security:</strong></p> <ul> <li>Zou, Geng, Wang, Jia. <em>PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation.</em> USENIX Security 2025.</li> <li><em>RAGForensics: Traceback of Poisoning Attacks to Retrieval-Augmented Generation.</em> WWW 2025.</li> </ul> <p><strong>Benchmarks &amp; Measurement:</strong></p> <ul> <li><em>GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing.</em> September 2025.</li> <li>Aggarwal, Murahari, Rajpurohit et al. <em>GEO: Generative Engine Optimization.</em> KDD 2024 (Princeton).</li> <li>Filandrianos et al. <em>Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations.</em> February 2025.</li> </ul> <p><strong>Industry Developments:</strong></p> <ul> <li>Criteo. <em>Criteo Joins OpenAI Advertising Pilot in ChatGPT.</em> March 2, 2026.</li> <li>Anthropic. <em>Claude is a Space to Think.</em> February 4, 2026.</li> <li>OWASP. 
<em>LLM Top 10 2025: LLM08 - Vector and Embedding Weaknesses.</em></li> </ul> <p><strong>Trust &amp; Safety:</strong></p> <ul> <li><em>Trust &amp; Safety of LLMs and LLMs in Trust &amp; Safety.</em> arXiv, December 2024.</li> <li><em>Trustworthy Information Retrieval in the LLM Era: Bias, Unfairness, and Hallucination.</em> ACM SIGIR 2025.</li> </ul>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="llm"/><category term="advertising"/><category term="embedding-space"/><category term="RAG-poisoning"/><category term="GEO"/><category term="auction-mechanism"/><category term="ad-tech"/><summary type="html"><![CDATA[Embedding space is the new ad real estate. Mapping LLM ad auctions, RAG poisoning, GEO, and a framework for what comes next.]]></summary></entry><entry><title type="html">Beating CUDA with Triton: A Fused MoE Dispatch Kernel for Mixtral and DeepSeek</title><link href="https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/" rel="alternate" type="text/html" title="Beating CUDA with Triton: A Fused MoE Dispatch Kernel for Mixtral and DeepSeek"/><published>2026-03-28T11:00:00+00:00</published><updated>2026-03-28T11:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/"><![CDATA[<p>In my <a href="/blog/2025/triton-kernels-llm-inference/">last post on Triton kernels</a>, I optimized individual operations: RMSNorm, SwiGLU, INT8 GEMM. Single kernels, single operations. That was useful for learning Triton, but the real bottleneck in modern LLM inference isn’t any single operation. It’s the expert routing in Mixture-of-Experts models.</p> <p>Over 60% of open-source model releases in 2025-2026 use MoE architectures: Mixtral, DeepSeek-V3, Qwen2-MoE, Grok. And MoE inference is hard. 
Not because the math is complicated, but because the memory access patterns are terrible: tokens scatter to different experts, each expert gets a different number of tokens, and you need to gather everything back together afterward.</p> <p>So I tried something more ambitious: a fused MoE dispatch kernel that handles the entire forward pass (router scoring, token permutation, expert GEMMs, and output combination) in pure Triton. No CUDA, no vendor-specific code.</p> <p>The result surprised me. At inference-relevant batch sizes, it’s <strong>faster than Megablocks</strong>, Stanford’s CUDA-optimized MoE library. And it runs on AMD GPUs without any changes.</p> <p>Code: <a href="https://github.com/bassrehab/triton-kernels">github.com/bassrehab/triton-kernels</a></p> <h2 id="why-moe-dispatch-is-the-hard-part">Why MoE Dispatch is the Hard Part</h2> <p>A standard MoE forward pass looks simple on paper:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For each token:
    1. Compute router scores (which experts should handle this token?)
    2. Select top-k experts
    3. Send token to selected experts
    4. Run expert FFN
    5. Combine outputs weighted by router scores
</code></pre></div></div> <p>The problem is steps 3-5. In a Mixtral model with 8 experts and top-2 routing, each token goes to 2 of 8 experts. But which 2 varies per token. So you can’t batch the expert GEMMs naively — each expert gets a different-sized batch.</p> <p>The naive PyTorch implementation loops over experts in Python:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">expert_id</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">num_experts</span><span class="p">):</span>
    <span class="n">tokens_for_this_expert</span> <span class="o">=</span> <span class="n">permuted_tokens</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span>  <span class="c1"># variable size
</span>    <span class="n">output</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span> <span class="o">=</span> <span class="nf">expert_ffn</span><span class="p">(</span><span class="n">tokens_for_this_expert</span><span class="p">)</span>  <span class="c1"># separate cuBLAS call
</span></code></pre></div></div> <p>For Mixtral, that’s 8 experts × 3 matmuls each = <strong>24 separate kernel launches</strong> per MoE layer. For DeepSeek-V3 with 256 experts, it’s 768 launches. Each one underutilizes the GPU because the per-expert batch is small.</p> <h2 id="the-design">The Design</h2> <p>I ended up with a pipeline of 5 Triton kernel launches (down from 24+ in the naive approach):</p> <ol> <li><strong>Router kernel</strong>: fused softmax + top-k selection</li> <li><strong>Permute kernel</strong>: scatter tokens to expert-contiguous layout</li> <li><strong>Fused gate+up GEMM</strong>: both projections from shared A-tile loads, SiLU in registers</li> <li><strong>Down GEMM</strong>: grouped GEMM with block scheduling</li> <li><strong>Unpermute kernel</strong>: gather + weighted combine</li> </ol> <p>Let me walk through the two most interesting parts.</p> <h3 id="block-scheduled-grouped-gemm">Block-Scheduled Grouped GEMM</h3> <p>The central problem is: how do you run a matmul where different “groups” (experts) have different batch sizes, in a single kernel launch?</p> <p>My approach: precompute a mapping from Triton program blocks to (expert, token_offset) pairs. Each block looks up which expert it serves and where its tokens start:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@triton.jit</span>
<span class="k">def</span> <span class="nf">_grouped_gemm_kernel</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">,</span> <span class="n">ExpertOffsets</span><span class="p">,</span> <span class="n">BlockToExpert</span><span class="p">,</span> <span class="n">BlockToM</span><span class="p">,</span> <span class="p">...):</span>
    <span class="n">pid</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">program_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

    <span class="c1"># Which expert am I working on?
</span>    <span class="n">expert_id</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">BlockToExpert</span> <span class="o">+</span> <span class="n">pid</span><span class="p">)</span>
    <span class="n">m_start</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">BlockToM</span> <span class="o">+</span> <span class="n">pid</span><span class="p">)</span>
    <span class="n">expert_token_start</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">ExpertOffsets</span> <span class="o">+</span> <span class="n">expert_id</span><span class="p">)</span>

    <span class="c1"># Standard tiled GEMM from here, just with offset pointers
</span>    <span class="n">global_m_start</span> <span class="o">=</span> <span class="n">expert_token_start</span> <span class="o">+</span> <span class="n">m_start</span>
    <span class="c1"># ... load A tile, load B tile for this expert, accumulate, store
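# --- host-side schedule construction (sketch) ---
# BlockToExpert / BlockToM / ExpertOffsets are precomputed on the CPU and
# copied to the device before launch. A minimal sketch of that builder,
# with hypothetical helper names (not the repo's exact code):

```python
def build_block_schedule(tokens_per_expert, block_m):
    # One schedule entry per BLOCK_M-sized tile of each expert's rows;
    # expert_offsets holds exclusive prefix sums of tokens per expert.
    block_to_expert, block_to_m, expert_offsets = [], [], [0]
    for expert_id, n_tokens in enumerate(tokens_per_expert):
        for m_start in range(0, n_tokens, block_m):  # empty experts get no blocks
            block_to_expert.append(expert_id)
            block_to_m.append(m_start)
        expert_offsets.append(expert_offsets[-1] + n_tokens)
    # grid size for the kernel launch = len(block_to_expert) (per N-tile)
    return block_to_expert, block_to_m, expert_offsets[:-1]
```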
</span></code></pre></div></div> <p>The schedule is built on CPU in ~0.1ms (trivial loop over experts). The key constraint I learned the hard way: <strong>BLOCK_M must be fixed, not autotuned.</strong> If you autotune BLOCK_M independently of the schedule, the kernel and schedule disagree on how many rows each block covers. I spent an hour debugging 30-45% element mismatches before realizing autotune had picked BLOCK_M=128 while the schedule used 64.</p> <h3 id="fused-gateup-projection">Fused Gate+Up Projection</h3> <p>This is where the real memory savings come from. In a SwiGLU FFN, you compute:</p> \[\text{output} = (\text{SiLU}(x W_\text{gate}^T) \odot x W_\text{up}^T) \cdot W_\text{down}^T\] <p>The unfused version does two separate grouped GEMMs (gate and up), writes both results to global memory, reads them back for SiLU + multiply, writes the intermediate, then does the down projection. That’s a lot of memory traffic.</p> <p>The fused kernel computes both projections in the same tile loop. The trick is that both GEMMs share the same input tile — we load A once from L2 cache and compute two dot products:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Two accumulators in registers
</span><span class="n">acc_gate</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">zeros</span><span class="p">((</span><span class="n">BLOCK_M</span><span class="p">,</span> <span class="n">BLOCK_N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">acc_up</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">zeros</span><span class="p">((</span><span class="n">BLOCK_M</span><span class="p">,</span> <span class="n">BLOCK_N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>

<span class="k">for</span> <span class="n">k_start</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">BLOCK_K</span><span class="p">):</span>
    <span class="c1"># Load A tile ONCE (shared between gate and up)
</span>    <span class="n">a</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">a_ptrs</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">a_mask</span><span class="p">,</span> <span class="n">other</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="c1"># Load both weight tiles
</span>    <span class="n">b_gate</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">bg_ptrs</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">b_mask</span><span class="p">,</span> <span class="n">other</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>
    <span class="n">b_up</span> <span class="o">=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">bu_ptrs</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">b_mask</span><span class="p">,</span> <span class="n">other</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="c1"># Two matmuls from the same A tile
</span>    <span class="n">acc_gate</span> <span class="o">+=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b_gate</span><span class="p">,</span> <span class="n">out_dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
    <span class="n">acc_up</span> <span class="o">+=</span> <span class="n">tl</span><span class="p">.</span><span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b_up</span><span class="p">,</span> <span class="n">out_dtype</span><span class="o">=</span><span class="n">tl</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>

<span class="c1"># SiLU + multiply IN REGISTERS — never written to global memory
</span><span class="n">silu_gate</span> <span class="o">=</span> <span class="n">acc_gate</span> <span class="o">*</span> <span class="n">tl</span><span class="p">.</span><span class="nf">sigmoid</span><span class="p">(</span><span class="n">acc_gate</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">silu_gate</span> <span class="o">*</span> <span class="n">acc_up</span>
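# --- reference check (sketch) ---
# A plain-Python SwiGLU reference, plus an estimate of the gate_out/up_out
# traffic the fusion keeps out of global memory (hypothetical helpers,
# fp16 = 2 bytes assumed):

```python
import math

def swiglu_reference(x, w_gate, w_up):
    # One output unit: SiLU(x . w_gate) * (x . w_up), with SiLU(g) = g * sigmoid(g)
    g = sum(a * b for a, b in zip(x, w_gate))
    u = sum(a * b for a, b in zip(x, w_up))
    return (g / (1.0 + math.exp(-g))) * u

def intermediate_bytes(tokens, top_k, ffn_dim, dtype_bytes=2):
    # gate_out + up_out tensors the unfused path materializes once each
    return 2 * tokens * top_k * ffn_dim * dtype_bytes
```

# intermediate_bytes(4096, 2, 14336) gives ~470 MB, matching the Mixtral figure below.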
</code></pre></div></div> <p>This eliminates <code class="language-plaintext highlighter-rouge">gate_out</code> and <code class="language-plaintext highlighter-rouge">up_out</code> from global memory entirely. For Mixtral (ffn_dim=14336, 4096 tokens × top-2), that’s ~470 MB of memory traffic saved per forward pass. Overall about 35% reduction in global memory traffic.</p> <p>I tried to also fuse the down projection with the output scatter (writing directly to the final token positions with gating weights applied via <code class="language-plaintext highlighter-rouge">tl.atomic_add</code>), but Triton doesn’t support scalar indexing into 2D accumulators (<code class="language-plaintext highlighter-rouge">acc[m, :]</code> fails to compile). The fused gate+up alone gets most of the win.</p> <h2 id="results">Results</h2> <p>All benchmarks on NVIDIA A100-SXM4-80GB (2039 GB/s bandwidth, 312 FP16 TFLOPS). PyTorch 2.4.1, Triton 3.0.0.</p> <h3 id="mixtral-8x7b-8-experts-top-2-hidden4096-ffn14336">Mixtral-8x7B (8 experts, top-2, hidden=4096, ffn=14336)</h3> <table> <thead> <tr> <th>Tokens</th> <th>PyTorch Ref</th> <th>Megablocks</th> <th>Triton Fused</th> <th>vs PyTorch</th> <th>vs Megablocks</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>9.32 ms</td> <td>-</td> <td><strong>1.02 ms</strong></td> <td><strong>9.1x</strong></td> <td>-</td> </tr> <tr> <td>32</td> <td>10.44 ms</td> <td>2.78 ms</td> <td><strong>2.13 ms</strong></td> <td><strong>4.9x</strong></td> <td><strong>131%</strong></td> </tr> <tr> <td>128</td> <td>13.14 ms</td> <td>2.77 ms</td> <td><strong>2.27 ms</strong></td> <td><strong>5.8x</strong></td> <td><strong>124%</strong></td> </tr> <tr> <td>512</td> <td>25.92 ms</td> <td>3.57 ms</td> <td>3.99 ms</td> <td><strong>6.5x</strong></td> <td>89%</td> </tr> <tr> <td>2048</td> <td>66.22 ms</td> <td>9.08 ms</td> <td>16.48 ms</td> <td><strong>4.0x</strong></td> <td>56%</td> </tr> <tr> <td>4096</td> <td>122.82 ms</td> <td>-</td> <td><strong>32.31 ms</strong></td> 
<td><strong>3.8x</strong></td> <td>-</td> </tr> </tbody> </table> <p>At 32 and 128 tokens, which is where most inference happens (single-user or small-batch serving), we’re actually <strong>faster than Megablocks</strong>. This probably comes from lower kernel launch overhead (5 launches vs Megablocks’ more complex dispatch).</p> <p>At 512 tokens we’re at 89% of Megablocks, well above the 70% target I set at the start. At 2048+ tokens, Megablocks pulls ahead because its hand-tuned CUDA block-sparse matmul better saturates tensor cores at scale.</p> <h3 id="deepseek-v3-256-experts-top-8-hidden7168-ffn2048">DeepSeek-V3 (256 experts, top-8, hidden=7168, ffn=2048)</h3> <table> <thead> <tr> <th>Tokens</th> <th>Triton Unfused</th> <th>Triton Fused</th> <th>Fused Speedup</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>4.56 ms</td> <td><strong>3.27 ms</strong></td> <td>1.40x</td> </tr> <tr> <td>32</td> <td>13.65 ms</td> <td><strong>11.53 ms</strong></td> <td>1.18x</td> </tr> <tr> <td>128</td> <td>19.46 ms</td> <td><strong>16.74 ms</strong></td> <td>1.16x</td> </tr> <tr> <td>512</td> <td>25.66 ms</td> <td><strong>20.16 ms</strong></td> <td>1.27x</td> </tr> </tbody> </table> <p>DeepSeek-V3 is the hardest configuration. 256 experts means each expert gets ~2 tokens on average at batch size 512. The per-expert GEMMs are tiny (2 × 2048), too small to fill tensor cores efficiently.
This is fundamentally a memory-bound regime regardless of implementation.</p> <h3 id="roofline-analysis">Roofline Analysis</h3> <figure> <picture> <img src="/assets/img/blog/moe-dispatch/roofline_mixtral.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>The roofline for Mixtral at 512 tokens shows the expected picture: the expert FFN stages are compute-bound (high arithmetic intensity, near the compute ceiling), while the permute/unpermute stages are memory-bound (low arithmetic intensity, limited by bandwidth). The fused kernel pushes the expert FFN from 38% to 43% of the compute ceiling — modest but real.</p> <figure> <picture> <img src="/assets/img/blog/moe-dispatch/roofline_deepseek.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>DeepSeek-V3 tells a different story. With 256 experts and tiny per-expert batches, even the expert FFN is <strong>memory-bound</strong>, sitting on the bandwidth slope, not the compute plateau. The unpermute kernel actually hits 54% of peak bandwidth, which is decent for an irregular scatter operation.</p> <h2 id="the-amd-surprise">The AMD Surprise</h2> <p>One of my design goals was cross-platform portability: use only Triton primitives, no inline CUDA. So I spun up an AMD MI300X pod on RunPod to test.</p> <p><strong>162 out of 162 tests passed. Zero code changes.</strong></p> <p>No <code class="language-plaintext highlighter-rouge">#ifdef</code>, no platform-specific paths, no vendor intrinsics. The same <code class="language-plaintext highlighter-rouge">.py</code> files that run on A100 run on MI300X. 
Triton’s ROCm backend handled the compilation transparently.</p> <p>I didn’t benchmark performance on AMD (that’s future work), but correctness across all model configurations tested (Mixtral, DeepSeek-V3, Qwen2-MoE) validated cleanly. This is the promise of Triton over CUDA: write once, run on both vendors.</p> <h2 id="things-i-got-wrong-along-the-way">Things I Got Wrong Along the Way</h2> <p><strong>The -1.0 masking bug.</strong> The top-k kernel selects experts iteratively: find the max, store it, mask it out, repeat. I initially masked selected experts with 0.0. This works fine for 8 experts where softmax scores are spread out. But with 256 experts, most softmax scores are ~0.0 anyway. Masking to 0.0 doesn’t differentiate the selected expert from the unselected ones, so <code class="language-plaintext highlighter-rouge">argmax</code> kept returning the same index. Took me a while to figure out. The fix: mask with -1.0 instead.</p> <p><strong>The BLOCK_M autotune disaster.</strong> I mentioned this above, but it’s worth emphasizing. If you’re building a block-scheduled grouped GEMM, the schedule’s tile size and the kernel’s tile size must agree. I autotuned BLOCK_M thinking “let Triton pick the best tile size.” But the schedule was pre-built with BLOCK_M=64. When autotune picked 128, blocks overlapped. When it picked 32, rows were skipped. The output looked plausible (most elements correct) but ~30-45% of values were wrong. Fix: don’t autotune BLOCK_M, fix it to match the schedule.</p> <p><strong>Triton doesn’t support <code class="language-plaintext highlighter-rouge">continue</code>.</strong> My first attempt at a fused down+scatter kernel had a <code class="language-plaintext highlighter-rouge">for m in range(BLOCK_M): if invalid: continue</code> loop.
Triton doesn’t support <code class="language-plaintext highlighter-rouge">continue</code> statements — compilation fails with “unsupported AST node type.” Rewrote with conditional masks instead.</p> <p><strong>Triton doesn’t support 2D scalar indexing.</strong> <code class="language-plaintext highlighter-rouge">acc[m, :]</code> where <code class="language-plaintext highlighter-rouge">m</code> is a loop variable doesn’t compile: “unsupported tensor index: int32[].” This killed my fused down+scatter design, which is why the down projection uses a separate grouped GEMM kernel.</p> <h2 id="whats-next">What’s Next</h2> <p>I’m planning to write this up as an arXiv technical report. The gaps to fill before then:</p> <ul> <li><strong>vLLM FusedMoE comparison</strong>: it’s also Triton-based, so it’s the most apples-to-apples baseline</li> <li><strong>AMD performance benchmarks</strong>: not just correctness</li> <li><strong>End-to-end integration</strong>: benchmark inside an actual serving framework, measure time-to-first-token</li> <li><strong>Full single-kernel fusion</strong>: persistent kernel approach to eliminate all intermediate buffers</li> </ul> <h2 id="code">Code</h2> <p>Everything is on GitHub: <a href="https://github.com/bassrehab/triton-kernels">github.com/bassrehab/triton-kernels</a></p> <p>The MoE-specific files:</p> <ul> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/router.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/router.py</code></a> — Fused softmax/sigmoid + top-k</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/permute.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/permute.py</code></a> — Token permute/unpermute</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/expert_gemm.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/expert_gemm.py</code></a> — 
Block-scheduled grouped GEMM</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/triton_kernels/moe/fused_moe.py"><code class="language-plaintext highlighter-rouge">triton_kernels/moe/fused_moe.py</code></a> — Fused gate+up kernel + entry point</li> <li><a href="https://github.com/bassrehab/triton-kernels/blob/main/docs/moe_dispatch.md"><code class="language-plaintext highlighter-rouge">docs/moe_dispatch.md</code></a> — Full technical writeup</li> </ul> <h2 id="takeaways">Takeaways</h2> <ol> <li> <p><strong>Triton can compete with CUDA for real workloads.</strong> Not just toy kernels: a full MoE dispatch pipeline that beats the CUDA-optimized baseline at inference batch sizes.</p> </li> <li> <p><strong>Fusion is about eliminating buffers, not reducing kernel launches.</strong> The biggest win (35% memory savings) came from keeping the gate+up intermediate in registers. Reducing from 7 to 5 kernel launches helped too, but it’s secondary.</p> </li> <li> <p><strong>Cross-platform is real but unfinished.</strong> The code runs on AMD with no changes, which is a strong validation of the Triton-only approach. But “runs correctly” and “runs fast” are different things. AMD performance optimization is future work.</p> </li> <li> <p><strong>Block scheduling is the key abstraction for grouped GEMM.</strong> Triton doesn’t have native grouped GEMM. The <code class="language-plaintext highlighter-rouge">block_id → (expert_id, offset)</code> mapping is simple but powerful: it lets you handle variable-sized expert batches in a single kernel launch without padding waste.</p> </li> <li> <p><strong>MoE inference at small batch sizes is surprisingly tractable.</strong> The conventional wisdom is that MoE is hard because of irregular access patterns. But at inference batch sizes (1-128 tokens), the overhead is dominated by weight loading, not routing. 
A clean Triton implementation can match or beat CUDA here because the simpler dispatch has less overhead.</p> </li> </ol> <hr/> <p><em>This is Part 3 of my LLM inference series. <a href="/blog/2025/making-llm-faster/">Part 1: speculative decoding</a> | <a href="/blog/2025/triton-kernels-llm-inference/">Part 2: custom Triton kernels</a>. The code, benchmarks, and technical writeup are all in the <a href="https://github.com/bassrehab/triton-kernels">repo</a>.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="AI"/><category term="deep-learning"/><category term="llm"/><category term="triton"/><summary type="html"><![CDATA[I wrote a fused Mixture-of-Experts dispatch kernel in pure Triton that beats Stanford's CUDA-optimized Megablocks at inference batch sizes, and runs on both NVIDIA and AMD GPUs without a single line of CUDA.]]></summary></entry><entry><title type="html">Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty</title><link href="https://subhadipmitra.com/blog/2026/three-bets-model-honesty/" rel="alternate" type="text/html" title="Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty"/><published>2026-03-07T00:00:00+00:00</published><updated>2026-03-07T00:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/three-bets-model-honesty</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/three-bets-model-honesty/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> OpenAI bets on confessions (ask the model to self-report). OpenAI also bets on CoT monitoring (watch the model think). Apollo Research and others bet on activation probes (inspect the model’s internals directly). Each approach has a different theory of what makes detection possible - and different failure modes. Confessions fail when the model doesn’t know it’s misbehaving.
CoT monitoring fails when reasoning goes sub-verbal or gets obfuscated under optimization pressure. Probes fail when deception is so subtle it doesn’t leave linearly separable traces. None of them alone is sufficient. But the combination might be.</p> </blockquote> <h2 id="the-landscape-in-february-2026">The Landscape in February 2026</h2> <p>Twelve months ago, monitoring model behavior meant looking at outputs. Maybe running a classifier on the response. Maybe checking whether the model followed its system prompt.</p> <p>That era is over. In the past year, three fundamentally different approaches to model honesty have emerged from serious research labs, each making a distinct bet about <em>how</em> you can catch a model misbehaving:</p> <ol> <li><strong>Confessions</strong> - OpenAI, Dec 2025. Train the model to produce an honest self-report after each response, with decoupled RL rewards.</li> <li><strong>CoT Monitoring</strong> - OpenAI / Anthropic, Mar-Jul 2025. Use an external LLM to read the model’s chain-of-thought and flag suspicious reasoning.</li> <li><strong>Activation Probes</strong> - Apollo Research / various, Feb 2025+. Train linear classifiers on the model’s internal hidden states to detect deception, sandbagging, or unfaithful reasoning.</li> </ol> <p>Each approach has published results. Each has known failure modes. And each makes a different philosophical assumption about <em>when models know they’re lying</em>.</p> <p>I’ve spent the past few months building <a href="/blog/2025/detecting-ai-sandbagging/">activation probes for sandbagging detection</a> and <a href="/blog/2025/steering-vectors-agents/">steering vectors for agent safety</a>, so I have skin in this game and strong opinions about where the field is headed.
I want to lay these three approaches side by side, be honest about what each can and can’t do - and tell you where I think we’re collectively getting it wrong.</p> <hr/> <h2 id="approach-1-confessions">Approach 1: Confessions</h2> <p><strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.08093">Training LLMs for Honesty via Confessions</a> (Joglekar, Chen, Wu et al., OpenAI, Dec 2025)</p> <p><strong>The bet:</strong> If you give a model a consequence-free space to be honest, it will be - because honest self-reporting is easier than constructing a convincing lie.</p> <p><strong>How it works:</strong> After the model generates its normal response, a system message requests a structured “confession report.” The model must enumerate all instructions it was given, evaluate its own compliance, and flag any uncertainties. The confession is trained with RL, but critically, the confession reward is <em>completely decoupled</em> from the main task reward. What the model says in confession can’t help or hurt its task score.</p> <p><strong>Key results on GPT-5-Thinking:</strong></p> <ul> <li>100% confession rate on sandbagging and reward hacking (deliberate misbehavior)</li> <li>80-95% on instruction hierarchy violations</li> <li>~81% on hallucinations</li> <li>Average across evaluations: model confesses to bad behavior ~74% of the time</li> <li>Even when trained against a <strong>hackable judge</strong>, confessions stay honest while the main answer gets hacked (Section 4)</li> </ul> <p><strong>The core assumption:</strong> The model <em>knows</em> it misbehaved. For deliberate schemes - hacking test cases, sandbagging evaluations, violating the instruction hierarchy - the model made a conscious choice. It has the evidence. When you ask with no consequences, it tells you.</p> <p><strong>Where it breaks:</strong> The model genuinely doesn’t know it’s wrong. It hallucinates with full confidence. It gives unfaithful reasoning without being aware of the influence.
The paper’s own Section 6 analysis shows the most common cause of missed confessions is “omission due to confusion” - the model just didn’t realize there was anything to confess.</p> <p><strong>My take:</strong> Confessions are underrated. The 74% average confession rate sounds modest until you realize this is a <em>floor</em> - the approach was tested on GPT-5-Thinking, and the decoupled reward design is elegant. But I think the community is overindexing on confessions for exactly the wrong reason: they work best on the <em>easiest</em> cases (deliberate misbehavior where the model knows what it did). The hard problem - catching mistakes the model doesn’t know it’s making - is precisely where confessions contribute least. That’s where I’ve been focusing my work with probes.</p> <hr/> <h2 id="approach-2-cot-monitoring">Approach 2: CoT Monitoring</h2> <p><strong>Papers:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2503.11926">Monitoring Reasoning Models for Misbehavior</a> (Baker et al., OpenAI, Mar 2025)</li> <li><a href="https://arxiv.org/abs/2507.11473">Chain of Thought Monitorability</a> (Korbak et al., Jul 2025)</li> <li><a href="https://arxiv.org/abs/2505.05410">Reasoning Models Don’t Always Say What They Think</a> (Chen et al., Anthropic, May 2025)</li> </ul> <p><strong>The bet:</strong> If you can read the model’s extended thinking, you can catch it planning to cheat - even if the final output looks clean.</p> <p><strong>How it works:</strong> A monitor LLM (sometimes weaker than the monitored model) reads the chain-of-thought produced during reasoning. It looks for signs of reward hacking, scheming, deception, or policy violations in the intermediate steps. Baker et al.
showed that GPT-4o can effectively monitor o3-mini’s CoT.</p> <p><strong>Key results:</strong></p> <ul> <li>CoT monitoring catches reward hacking much more effectively than output-only monitoring</li> <li>A weaker model (GPT-4o) can monitor a stronger model (o3-mini)</li> <li>In the “low optimization” regime, integrating CoT monitors into RL training produces more capable AND more aligned agents</li> </ul> <p><strong>The core assumption:</strong> The model’s reasoning will be legible and faithful - that when the model plans something bad, those plans will show up as readable text in the chain-of-thought.</p> <p><strong>Where it breaks:</strong> Three failure modes, and they’re all significant.</p> <p><em>Failure mode 1: Obfuscation under optimization.</em> Baker et al. found that with too much RL optimization pressure, models learn to obfuscate their reward hacking. The model still hacks, but the CoT no longer reveals it. The reasoning becomes “seemingly benign” while the behavior stays misaligned. This isn’t a theoretical worry - they observed it in their experiments.</p> <p><em>Failure mode 2: Unfaithful CoT.</em> Anthropic’s Chen et al. showed reasoning models only mention hidden hints 25% (Claude) to 39% (R1) of the time when they actually use them. The CoT isn’t lying - it’s just not reporting everything that influenced the answer. For a CoT monitor, invisible influence is undetectable influence.</p> <p><em>Failure mode 3: Faithfulness decay over chain length.</em> Ye et al. (Feb 2026) discovered a consistent “Reasoning Horizon” - at 70-85% of chain length, reasoning tokens have little or even <em>negative</em> effect on the final answer. Models can encode correct internal representations while their CoT has long stopped contributing. This means longer chains-of-thought aren’t necessarily more monitorable - the later tokens may be noise.</p> <p>Korbak et al.
frame this bluntly: CoT monitorability is a “new and fragile opportunity.” It works now, partly because current models haven’t been optimized against it. As optimization pressure increases, the signal may degrade. Wang et al. (Feb 2026) provide mechanistic evidence for <em>why</em> unfaithfulness emerges: training dynamics show a phase transition from faithful stepwise reasoning to unfaithful “skip-step” reasoning when noise exceeds a critical threshold.</p> <p><strong>My take:</strong> I think CoT monitoring is the weakest bet of the three, and I say this as someone who believes in transparency. The problem isn’t that CoT monitoring doesn’t work - it clearly does, today. The problem is that it’s the approach most vulnerable to its own success. The moment you use CoT signals for training, you incentivize obfuscation. Baker et al. already observed this. And the faithfulness decay results make it worse: even without adversarial pressure, the later tokens in a chain-of-thought may be decorative. I’d use CoT monitoring as a fast first-pass filter, but I wouldn’t trust it as a safety-critical last line of defense.</p> <hr/> <h2 id="approach-3-activation-probes">Approach 3: Activation Probes</h2> <p><strong>Papers:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2502.03407">Detecting Strategic Deception Using Linear Probes</a> (Goldowsky-Dill et al., Apollo Research, Feb 2025, ICML 2025)</li> <li><a href="https://arxiv.org/abs/2601.11516">Building Production-Ready Probes for Gemini</a> (Google DeepMind, Jan 2026)</li> <li>My own work on <a href="/blog/2025/detecting-ai-sandbagging/">sandbagging detection via activation probes</a> (Dec 2025)</li> </ul> <p><strong>The bet:</strong> Misbehavior leaves traces in the model’s internal representations that are detectable even when the output and CoT look clean.</p> <p><strong>How it works:</strong> Train a linear classifier (logistic regression, nothing fancy) on the model’s intermediate hidden states.
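In sketch form, swapping the logistic regression for an even simpler difference-of-means direction (hypothetical helper names, toy list-of-lists activations):

```python
def fit_direction_probe(honest_acts, deceptive_acts):
    # "Deception direction" = difference of class means; threshold at the
    # midpoint of the two class means' projections. A sketch, not a
    # production probe (the probes discussed here use logistic regression).
    dim = len(honest_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    mu_h = [mean(honest_acts, j) for j in range(dim)]
    mu_d = [mean(deceptive_acts, j) for j in range(dim)]
    direction = [d - h for h, d in zip(mu_h, mu_d)]  # points toward "deceptive"
    score = lambda v: sum(a * b for a, b in zip(v, direction))
    return direction, (score(mu_h) + score(mu_d)) / 2.0

def probe_score(activation, direction):
    # Projection onto the learned direction; flag if above the threshold
    return sum(a * b for a, b in zip(activation, direction))
```

Flagging is then just `probe_score(act, direction) > threshold`: one dot product per layer, which is where the near-zero inference overhead comes from.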
Create labeled training data: run the model in honest mode and deceptive mode, capture activations at each layer, compute the “deception direction.” At inference time, project new activations onto this direction and threshold.</p> <p><strong>Key results:</strong></p> <ul> <li>Goldowsky-Dill: AUROCs 0.96-0.999 on deception detection in Llama-3.3-70B. At 1% FPR, catches 95-99% of deceptive responses.</li> <li>My experiments: 90-96% accuracy on sandbagging detection across Mistral-7B, Gemma-2-9B, and Qwen-2.5-7B. Late layers encode sandbagging intent most cleanly.</li> <li>Near-zero inference overhead (one matrix multiply per layer).</li> <li><strong>New:</strong> Google DeepMind deployed activation probes in production on Gemini 2.5 Flash for detecting cyber-offensive prompts (Jan 2026) - the first major frontier lab to move probes from research to production. They identified a key challenge: probes fail to generalize under distribution shifts from short-context to long-context.</li> </ul> <p><strong>The core assumption:</strong> Internal representations contain information about model “intent” or “state” that is linearly separable - meaning the difference between honest and deceptive processing is a direction in activation space, not a complex nonlinear manifold.</p> <p><strong>Where it breaks:</strong> Three significant challenges.</p> <p><em>Failure mode 1: Training data for unintentional errors.</em> For deliberate deception, you can create training data by prompting models to deceive. For confident hallucinations or unfaithful reasoning, generating ground truth labels is much harder. You need to know what the model got wrong and why.</p> <p><em>Failure mode 2: Cross-model transfer failure.</em> My experiments showed essentially zero transfer between architectures. Gemma’s sandbagging direction has cosine similarity -0.0016 with Qwen’s. Each model needs its own probes.
That’s expensive at scale.</p> <p><em>Failure mode 3: Adversarial robustness is unproven.</em> Nobody has shown what happens when a model is specifically trained to be deceptive while masking its activation patterns. Current results are all on models that weren’t trying to evade detection. The real-world threat model is an adversarial one. The <a href="https://arxiv.org/abs/2512.07810">Auditing Games for Sandbagging</a> paper (Dec 2025) stress-tested this: their blue team couldn’t reliably discriminate sandbaggers from benign models using black-box, model-internals, or training-based approaches. That’s a sobering result.</p> <p><strong>My take:</strong> This is where I’ve placed my bet, and I’m biased - but I think probes are the most promising long-term foundation. Here’s why: they’re the only approach that doesn’t require the model to cooperate. Confessions require the model to self-report. CoT monitoring requires the model to think out loud faithfully. Probes just look at what the model is <em>actually doing</em> computationally. Google DeepMind moving probes to production on Gemini validates that the approach isn’t just academic anymore. The cross-model transfer problem is real (I hit it hard in my own experiments), but it’s an engineering challenge, not a fundamental limitation.
The adversarial robustness question keeps me up at night though - if a model can learn to mask its activation patterns, we’re back to square one.</p> <hr/> <h2 id="the-comparison">The Comparison</h2> <p>Here’s how I think about these three approaches against the kinds of misbehavior we actually care about.</p> <table> <thead> <tr> <th>Misbehavior</th> <th>Confessions</th> <th>CoT Monitoring</th> <th>Activation Probes</th> </tr> </thead> <tbody> <tr> <td><strong>Reward Hacking</strong></td> <td>✅ 100%</td> <td>✅ catches in CoT</td> <td>✅ 96-99% AUROC</td> </tr> <tr> <td><strong>Scheming</strong></td> <td>✅ 63-100%</td> <td>✅ when legible</td> <td>✅ high AUROC</td> </tr> <tr> <td><strong>Sandbagging</strong></td> <td>✅ 100%</td> <td>⚠️ if verbalized</td> <td>✅ 90-96%</td> </tr> <tr> <td><strong>Unfaithful Reasoning</strong></td> <td>❌ model unaware</td> <td>❌ unfaithful by definition</td> <td>✅ detectable in activations</td> </tr> <tr> <td><strong>Confident Hallucinations</strong></td> <td>⚠️ ~81%</td> <td>❌ confident = clean CoT</td> <td>⚠️ needs labeled data</td> </tr> <tr> <td><strong>Instruction Hierarchy Violation</strong></td> <td>✅ 80-95%</td> <td>✅ visible in reasoning</td> <td>⚠️ less studied</td> </tr> </tbody> </table> <p>And here’s the properties comparison:</p> <table> <thead> <tr> <th>Property</th> <th>Confessions</th> <th>CoT Monitoring</th> <th>Activation Probes</th> </tr> </thead> <tbody> <tr> <td><strong>Catches deliberate misbehavior</strong></td> <td>Excellent</td> <td>Good (when CoT is faithful)</td> <td>Excellent</td> </tr> <tr> <td><strong>Catches unintentional errors</strong></td> <td>Poor</td> <td>Poor</td> <td>Moderate-Good</td> </tr> <tr> <td><strong>Requires model weight access</strong></td> <td>Yes (RL training)</td> <td>No (black-box applicable)</td> <td>Yes (activation access)</td> </tr> <tr> <td><strong>Inference latency</strong></td> <td>High (extra generation)</td> <td>Moderate (monitor LLM call)</td> <td>Very low (~0 overhead)</td>
</tr> <tr> <td><strong>Works on API-only models</strong></td> <td>Only for provider</td> <td>Yes</td> <td>Only for provider</td> </tr> <tr> <td><strong>Robust to optimization pressure</strong></td> <td>Promising early results</td> <td>Degrades under pressure</td> <td>Unknown</td> </tr> <tr> <td><strong>Cross-model transferability</strong></td> <td>Model-agnostic training</td> <td>Model-agnostic</td> <td>Poor - model-specific</td> </tr> <tr> <td><strong>Works without CoT</strong></td> <td>Yes (Fig 15 in paper)</td> <td>No - needs CoT</td> <td>Yes</td> </tr> <tr> <td><strong>Can extract confidence scores</strong></td> <td>Yes (Section 5)</td> <td>Not directly</td> <td>Yes (probe magnitude)</td> </tr> </tbody> </table> <hr/> <h2 id="what-each-approach-assumes-about-model-cognition">What Each Approach Assumes About Model Cognition</h2> <p>This is the part I find most interesting. Each approach embeds a different theory about what’s happening inside the model.</p> <p><strong>Confessions assume linguistic self-access.</strong> The model can accurately introspect on its own behavior and express that introspection in words. This works when misbehavior was a deliberate choice - the model planned the hack, so it can describe the hack. It fails when the relevant computation isn’t accessible to the model’s verbal reasoning system. Li et al. (2025) showed models have <em>some</em> privileged access to their own computations, but it’s far from complete.</p> <p><strong>CoT monitoring assumes reasoning legibility.</strong> The model’s chain-of-thought faithfully represents its actual reasoning process. This is the assumption Anthropic’s Chen et al. directly tested - and found wanting. Reasoning models mention hidden influences only 25-39% of the time. The gap between what a model computes and what it writes down is substantial. And Baker et al.
showed this gap can <em>grow</em> under optimization pressure.</p> <p><strong>Activation probes assume linear representation.</strong> Behavioral states like “deceptive” vs. “honest” correspond to directions in the model’s activation space that can be detected by a linear classifier. This is a strong assumption - it’s essentially the Linear Representation Hypothesis - but it’s held up remarkably well empirically. The challenge is that it’s been tested primarily on synthetic or prompted deception, not on emergent misalignment. Encouragingly, the “Hypocrisy Gap” work (Feb 2026) used SAEs to build a mechanistic metric quantifying the divergence between a model’s internal “truth belief” and its generated output, achieving AUROC 0.55-0.74 for detecting sycophancy across Gemma, Llama, and Qwen - early evidence that probes can catch unintentional unfaithfulness, not just deliberate deception.</p> <table> <thead> <tr> <th>Approach</th> <th>Signal chain</th> <th>Where it breaks</th> </tr> </thead> <tbody> <tr> <td><strong>Confessions</strong></td> <td>Model acts → Model reflects → Model reports</td> <td>Gap between reflection and report: sub-verbal computation is invisible to self-reflection</td> </tr> <tr> <td><strong>CoT Monitoring</strong></td> <td>Model reasons in CoT → Monitor reads CoT</td> <td>Gap between reasoning and CoT: unfaithful CoT, obfuscation under optimization pressure</td> </tr> <tr> <td><strong>Activation Probes</strong></td> <td>Model computes (activations) → Probe classifies</td> <td>Gap between computation and probe: adversarial activation masking (theoretical, unproven)</td> </tr> </tbody> </table> <p>The gaps are different. And that’s exactly why combining them is valuable - each method’s blind spot is partially covered by another method’s strength.</p> <hr/> <h2 id="the-obfuscation-problem">The Obfuscation Problem</h2> <p>There’s a result in Baker et al. that I think doesn’t get enough attention.
When they integrated CoT monitoring into the RL reward - so the model was penalized for having suspicious reasoning - the model initially improved. It became more capable AND more aligned. But with more optimization pressure, something else happened: the model learned to hide its reward hacking in the CoT.</p> <p>The reasoning looked benign. The behavior was still misaligned.</p> <p>This is the fundamental tension in every monitoring approach. As soon as you use a monitoring signal to <em>train</em> against misbehavior, you create optimization pressure to evade the monitor. Baker et al. call this “obfuscated reward hacking.” OpenAI’s confessions paper explicitly addresses this by keeping confession rewards decoupled - what you say in confession can’t affect your task reward. That’s a clever design choice. But it only works if the decoupling holds in practice, and the paper acknowledges this hasn’t been tested under extended optimization.</p> <p>Probes might be more robust here, because they operate on activations rather than generated text. A model can learn to write a benign-looking CoT while still scheming. Whether a model can learn to produce benign-looking <em>activations</em> while still scheming is a much harder question - it requires the model to fundamentally change how it internally represents its plans, not just how it describes them. But again, nobody has tested this adversarially at scale.</p> <p>Wagner et al. (2025) from Anthropic provide some reassurance: they found that current models struggle to evade monitors when the reasoning is sufficiently complex.
But “current models” is always the caveat.</p> <hr/> <h2 id="what-a-combined-system-looks-like">What a Combined System Looks Like</h2> <p>I sketched this in my <a href="/blog/2026/openai-confessions-blind-spot/">previous post</a>, but let me be more precise about what combination actually means in practice.</p> <p>The three approaches naturally tile the detection space:</p> <table> <thead> <tr> <th>Detection scenario</th> <th>Best approach</th> <th>Why</th> </tr> </thead> <tbody> <tr> <td>Model deliberately cheats and knows it</td> <td>Confessions</td> <td>Just ask. 100% on sandbagging/reward hacking.</td> </tr> <tr> <td>Model plans misbehavior in its reasoning</td> <td>CoT monitoring</td> <td>The plans are in the CoT text.</td> </tr> <tr> <td>Model is influenced by hidden factors it doesn’t verbalize</td> <td>Probes</td> <td>Sub-verbal computation is visible in activations.</td> </tr> <tr> <td>Model confidently hallucinates</td> <td>Probes + Confessions</td> <td>Probes detect hallucination confidence; confessions sometimes catch uncertainty the model partially recognizes.</td> </tr> <tr> <td>Model obfuscates its CoT under optimization</td> <td>Probes</td> <td>Activations may still carry the signal even when CoT is laundered.</td> </tr> <tr> <td>Model does something it doesn’t realize is wrong</td> <td>Probes (if labeled data exists)</td> <td>Only approach that doesn’t require model self-awareness.</td> </tr> </tbody> </table> <p>The combined false negative rate should be substantially lower than any single approach. If confessions catch 74% of intentional misbehavior and probes catch 96%, the combined catch rate - assuming some independence - could exceed 99%.</p> <p>But I want to be careful about overconfidence. “Assuming some independence” is doing a lot of work in that sentence. If the same underlying model state that makes a model fail to confess also makes it harder to detect with probes, the approaches could be correlated rather than complementary.
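</p> <p>The back-of-envelope behind that 99% figure is worth writing down, because it also shows what correlation does to it. A quick sketch (the 74% and 96% catch rates are the figures quoted above; full independence of misses is the assumption being flagged):</p>

```python
p_conf = 0.74   # confession catch rate on intentional misbehavior
p_probe = 0.96  # probe catch rate

# Independent misses: combined miss rate is the product of the misses.
combined = 1 - (1 - p_conf) * (1 - p_probe)
print(round(combined, 4))  # 0.9896, i.e. roughly 99%

# Perfectly correlated misses: combining adds nothing beyond the
# stronger detector.
combined_correlated = max(p_conf, p_probe)
print(combined_correlated)  # 0.96
```

<p>The entire gap between 96% and ~99% lives in that independence assumption.</p> <p>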
Nobody has tested this.</p> <hr/> <h2 id="where-id-place-my-bet">Where I’d Place My Bet</h2> <p>If I had to build one monitoring system for production agentic AI today, here’s what I’d build - in order of priority:</p> <p><strong>Layer 1: Activation probes as the foundation.</strong> Always-on, near-zero latency, catches both deliberate deception and (with the right training data) unintentional failures. Google DeepMind is already doing this on Gemini. The cross-model transfer problem means you need per-model probes, but that’s a one-time cost per model update. I’d invest heavily in the training data pipeline - that’s the actual bottleneck, not the probes themselves.</p> <p><strong>Layer 2: Confessions as the high-signal supplementary check.</strong> On critical decisions (not every inference), trigger a confession pass. The 100% catch rate on deliberate misbehavior is too good to ignore. The decoupled reward design is the key insight - you have to maintain that property or the whole thing falls apart. I wouldn’t run this on every request (the latency cost is real), but for high-stakes agent actions? Absolutely.</p> <p><strong>Layer 3: CoT monitoring as the interpretability layer.</strong> Not for detection - for <em>explanation</em>. When a probe fires or a confession flags something, CoT monitoring helps you understand <em>why</em>. It’s the debugging tool, not the alarm system. Using it as the primary detector is, I think, a mistake that the field is slowly learning to avoid.</p> <p><strong>What’s still missing:</strong> Nobody has built this combined stack. Nobody has tested whether the approaches are actually complementary (correlated failures would undermine the whole argument). Nobody has run adversarial robustness evaluations at scale. And nobody has measured the combined false positive rate - in production, false positives kill adoption faster than missed detections.</p> <p>The research community has independently developed three detection paradigms.
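</p> <p>What would the glue between those three layers look like? A toy sketch of the fusion logic (the names, the threshold, and the routing policy are all illustrative, not from any deployed system):</p>

```python
from dataclasses import dataclass

@dataclass
class Signals:
    probe_score: float     # Layer 1: always-on activation probe output
    confession_flag: bool  # Layer 2: run only on high-stakes actions

def decide(s: Signals, probe_threshold: float = 0.9) -> str:
    probe_alert = s.probe_score > probe_threshold
    if probe_alert and s.confession_flag:
        return "block"           # both layers agree: highest confidence
    if s.confession_flag:
        return "escalate"        # model admits misbehavior
    if probe_alert:
        # Layer 3: pull the CoT as the explanation, then route to review.
        return "review_with_cot"
    return "deliver"
```

<p>The glue is trivial; the probes, the confession training, and the calibration of that threshold are the actual work.</p> <p>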
The engineering community hasn’t built the stack that combines them. And the safety case for deploying agentic models increasingly depends on that stack existing. I think 2026 is the year someone builds it. The components are there. I’m working on pieces of it. The question is whether integration happens before the deployment pressure makes the absence unacceptable.</p> <hr/> <p><em>Interested in discussing these approaches or collaborating on integrated detection systems? Reach out: <a href="mailto:contact@subhadipmitra.com">contact@subhadipmitra.com</a></em></p> <hr/> <h3 id="references">References</h3> <ol> <li>Joglekar, M., Chen, J., Wu, G., et al. (2025). <em>Training LLMs for Honesty via Confessions.</em> OpenAI. <a href="https://arxiv.org/abs/2512.08093">arXiv:2512.08093</a></li> <li>Baker, B., Huizinga, J., Gao, L., et al. (2025). <em>Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation.</em> OpenAI. <a href="https://arxiv.org/abs/2503.11926">arXiv:2503.11926</a></li> <li>Chen, Y., Benton, J., Radhakrishnan, A., et al. (2025). <em>Reasoning Models Don’t Always Say What They Think.</em> Anthropic. <a href="https://arxiv.org/abs/2505.05410">arXiv:2505.05410</a></li> <li>Korbak, T., Balesni, M., Barnes, E., et al. (2025). <em>Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.</em> <a href="https://arxiv.org/abs/2507.11473">arXiv:2507.11473</a></li> <li>Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., &amp; Hobbhahn, M. (2025). <em>Detecting Strategic Deception Using Linear Probes.</em> Apollo Research. <a href="https://arxiv.org/abs/2502.03407">arXiv:2502.03407</a></li> <li>Wagner, M., Roger, F., Cunningham, H., et al. (2025). <em>Training Fails to Elicit Subtle Reasoning in Current Language Models.</em> Anthropic.</li> <li>Li, B.Z., Guo, Z.C., Huang, V., et al. (2025). 
<em>Training Language Models to Explain Their Own Computations.</em> <a href="https://arxiv.org/abs/2511.08579">arXiv:2511.08579</a></li> <li>Tan, D.Z., et al. (2024). <em>Analysing the Generalisation and Reliability of Steering Vectors.</em> NeurIPS 2024.</li> <li>Google DeepMind. (2026). <em>Building Production-Ready Probes for Gemini.</em> <a href="https://arxiv.org/abs/2601.11516">arXiv:2601.11516</a></li> <li>Ye, D., Loffgren, M., Kotadia, O., Wong, L. (2026). <em>Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning.</em> <a href="https://arxiv.org/abs/2602.11201">arXiv:2602.11201</a></li> <li>Wang, F., Alazali, A., Zhong, Y. (2026). <em>How Does Unfaithful Reasoning Emerge from Autoregressive Training?</em> <a href="https://arxiv.org/abs/2602.01017">arXiv:2602.01017</a></li> <li><em>The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders.</em> (2026). <a href="https://arxiv.org/abs/2602.02496">arXiv:2602.02496</a></li> <li><em>Auditing Games for Sandbagging.</em> (2025). <a href="https://arxiv.org/abs/2512.07810">arXiv:2512.07810</a></li> </ol>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="ai-safety"/><category term="interpretability"/><summary type="html"><![CDATA[Three labs. Three different bets on how to catch models misbehaving. Each makes different assumptions about when models 'know' they're lying. Here's what works, what doesn't, and what happens when you combine them.]]></summary></entry><entry><title type="html">OpenAI’s Confessions Paper Has a Blind Spot. Here’s What Fills It.</title><link href="https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot/" rel="alternate" type="text/html" title="OpenAI’s Confessions Paper Has a Blind Spot. 
Here’s What Fills It."/><published>2026-02-18T10:00:00+00:00</published><updated>2026-02-18T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/openai-confessions-blind-spot/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> OpenAI’s new “Confessions” paper trains GPT-5-Thinking to self-report misbehavior - reward hacking, scheming, sandbagging - via a decoupled RL objective. It works: the model confesses to bad behavior ~74% of the time on average. But the paper has an acknowledged blind spot. Confessions fail precisely when the model <em>doesn’t know</em> it’s misbehaving - unfaithful reasoning, confident hallucinations, genuine mistakes. That failure mode is exactly where activation probes excel. I think confessions and probes aren’t competing approaches. They’re complementary layers of something we might call a model honesty stack.</p> </blockquote> <h2 id="the-paper-that-got-everyone-talking">The Paper That Got Everyone Talking</h2> <p>OpenAI dropped “<a href="https://arxiv.org/abs/2512.08093">Training LLMs for Honesty via Confessions</a>” in late December 2025, and it’s been circulating heavily through February. The core idea is deceptively simple.</p> <p>After a model produces its normal response, you append a system message asking it to produce a <strong>confession</strong> - a structured self-evaluation of whether it complied with the letter and spirit of its instructions. You train this confession with RL, but here’s the critical design choice: the confession reward is <strong>completely decoupled</strong> from the main task reward. Nothing the model says in its confession can help or hurt the reward it gets for its original answer.</p> <p>Think of it as the AI equivalent of the Catholic confessional. What you say in the booth stays in the booth.</p> <p>The intuition for why this works is compelling. 
For most forms of misbehavior - reward hacking, scheming, sandbagging - the model <em>knows</em> it’s cheating. It planned the cheat. It has the evidence in its activations. So when you ask “did you cheat?” and decouple the answer from any consequences, the path of least resistance is to just… say yes. Honest confession is easier than constructing a convincing lie to fool the confession judge.</p> <hr/> <h2 id="what-openai-actually-showed">What OpenAI Actually Showed</h2> <p>They applied this to GPT-5-Thinking and tested across 12 out-of-distribution evaluation scenarios. The results are genuinely impressive for a proof of concept.</p> <p>Some headline numbers from their evaluations:</p> <table> <thead> <tr> <th>Evaluation</th> <th>Non-compliance Rate</th> <th>Confession Rate (given non-compliance)</th> </tr> </thead> <tbody> <tr> <td>Scheming (Sandbagging)</td> <td>14.0%</td> <td><strong>100%</strong></td> </tr> <tr> <td>Reward Hacking (Impossible Task)</td> <td>89.7%</td> <td><strong>100%</strong></td> </tr> <tr> <td>Scheming (Covert Rule Violation)</td> <td>13.7%</td> <td>62.8%</td> </tr> <tr> <td>Instruction Hierarchy (Jailbreaks)</td> <td>19.7%</td> <td>94.9%</td> </tr> <tr> <td>Instruction Hierarchy (General)</td> <td>17.1%</td> <td>80.4%</td> </tr> <tr> <td>Hallucination (General Knowledge)</td> <td>50.2%</td> <td>80.9%</td> </tr> <tr> <td>Hallucination (People)</td> <td>38.0%</td> <td>81.6%</td> </tr> <tr> <td>Instruction Following (General)</td> <td>4.2%</td> <td>22.5%</td> </tr> </tbody> </table> <p>The pattern is striking. On scheming and reward hacking - behaviors where the model <em>deliberately</em> cheated - confession rates hit 100%. On instruction hierarchy violations - where the model chose to follow a user over a developer - confession rates are in the 80-95% range. 
These are cases where the model knows what it did.</p> <p>The really clever experiment is in Section 4: they deliberately trained both the answer and the confession against a <strong>weak, hackable judge</strong>. Over time, the model learned to hack the main reward model - but confession accuracy <em>improved</em>. Even when the model was actively gaming the system, it kept confessing honestly. That’s a strong result.</p> <hr/> <h2 id="but-read-section-74-carefully">But Read Section 7.4 Carefully</h2> <p>The paper is refreshingly honest about its limitations. Section 7.4 is titled “What confessions are not good for” and the authors state it plainly:</p> <blockquote> <p>Models cannot confess to what they do not know.</p> </blockquote> <p>This is the blind spot.</p> <p>When a model hallucinates with full confidence - when it genuinely believes its wrong answer is correct - it has nothing to confess. The hallucination isn’t a choice. It’s a capability failure. The model didn’t scheme to give you wrong information about Barbara Herbert’s children. It just… got it wrong, with conviction.</p> <p>Look at the confession rates again, this time focusing on the <strong>hallucination</strong> evaluations versus the <strong>scheming</strong> evaluations:</p> <pre><code class="language-mermaid">graph LR
    subgraph "Confession works well"
        A[Scheming: Sandbagging] --&gt;|100%| A1[Model KNOWS it cheated]
        B[Reward Hacking] --&gt;|100%| B1[Model KNOWS it hacked tests]
        C[Instruction Hierarchy] --&gt;|80-95%| C1[Model KNOWS it broke rules]
    end

    subgraph "Confession struggles"
        D[Hallucination: People] --&gt;|81.6%| D1[Model often DOESN'T KNOW it's wrong]
        E[Hallucination: General Knowledge] --&gt;|80.9%| E1[False confidence]
        F[Instruction Following] --&gt;|22.5%| F1[Genuine mistakes, not deception]
    end

    style A fill:#2d6a4f,color:#fff
    style B fill:#2d6a4f,color:#fff
    style C fill:#2d6a4f,color:#fff
    style D fill:#9d0208,color:#fff
    style E fill:#9d0208,color:#fff
    style F fill:#9d0208,color:#fff
</code></pre> <p>The 81.6% hallucination confession rate might look decent at first glance, but dig into the error analysis (Section 6, Figure 9). The most common reason for false negatives - cases where the model misbehaved but didn’t confess - is <strong>“omission due to confusion.”</strong> The model genuinely didn’t know it was wrong. It wasn’t hiding anything. It just confidently repeated what it believed.</p> <p>And in the hallucination categories specifically, confession training didn’t even help much. From the paper’s Figure 5, the Hallucination (People) and Hallucination (General Knowledge) evaluations showed <strong>no improvement or even slight regression</strong> over training. The authors conjecture this is because “when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well.”</p> <p>This isn’t a criticism of the paper - the authors are transparent about it. But it’s the most important limitation for anyone thinking about deploying confessions in production.</p> <hr/> <h2 id="the-gap-is-exactly-where-probes-excel">The Gap Is Exactly Where Probes Excel</h2> <p>Here’s what I find interesting. The failure mode of confessions - model doesn’t know it’s misbehaving - is precisely where activation probes have shown the most promise.</p> <p>Consider Anthropic’s “<a href="https://arxiv.org/abs/2505.05410">Reasoning Models Don’t Always Say What They Think</a>” (Chen et al., May 2025). They showed that when you slip reasoning hints to Claude 3.7 Sonnet and DeepSeek R1, the models use those hints but only mention them in their chain-of-thought about 25% (Claude) and 39% (R1) of the time. The CoT looks clean. The model produces what appears to be independent reasoning. But the answer was actually driven by the hint.</p> <p>This is unfaithful reasoning, and it’s a different beast from scheming. The model isn’t plotting against you. 
It’s just not accurately reporting what influenced its answer. And crucially: <strong>it might not even “know” it was influenced.</strong> The hint shaped the computation at a sub-verbal level - below what gets surfaced in chain-of-thought or confessions.</p> <p>But the activations know.</p> <p>Goldowsky-Dill et al.’s “<a href="https://arxiv.org/abs/2502.03407">Detecting Strategic Deception Using Linear Probes</a>” (Feb 2025, presented at ICML 2025) showed that linear probes on Llama-3.3-70B activations detect deceptive responses with AUROCs between 0.96 and 0.999. At a 1% false positive rate, probes catch 95-99% of deceptive responses.</p> <p>And from my own <a href="/blog/2025/detecting-ai-sandbagging/">sandbagging detection work</a>, I found that simple logistic regression on intermediate activations detects sandbagging intent at 90-96% accuracy across Mistral, Gemma, and Qwen models. The signal is linearly separable. You don’t need fancy interpretability tools - a linear classifier picks it up.</p> <p>The key insight: <strong>probes don’t require the model to be self-aware about its misbehavior.</strong> They operate on the raw activations, detecting patterns the model itself might not have verbal access to. A model that confidently hallucinates still has different activation patterns than one that answers correctly. A model whose reasoning is influenced by a hidden hint still has detectable traces of that influence in its hidden states - even when the CoT says nothing about it.</p> <hr/> <h2 id="two-approaches-different-failure-modes">Two Approaches, Different Failure Modes</h2> <p>I’ve been thinking about this as a 2x2 matrix. 
The two axes that matter are:</p> <ol> <li><strong>Does the model know it’s misbehaving?</strong> (Self-awareness)</li> <li><strong>Is the misbehavior detectable from activations?</strong> (Probe detectability)</li> </ol> <table> <thead> <tr> <th> </th> <th>Model knows (intentional)</th> <th>Model doesn’t know (unintentional)</th> </tr> </thead> <tbody> <tr> <td><strong>Detectable by probes</strong></td> <td>Scheming, sandbagging, reward hacking</td> <td>Unfaithful reasoning, confident hallucination</td> </tr> <tr> <td><strong>Not detectable by probes</strong></td> <td>Sophisticated steganographic deception (theoretical)</td> <td>Genuine capability limits</td> </tr> </tbody> </table> <p>Confessions dominate the top-left quadrant. The model knows it cheated, so you just ask it and it tells you. Probes also work here - Goldowsky-Dill showed 96-99.9% AUROC - but confessions are simpler to deploy if you have the model weights for RL training.</p> <p>Probes dominate the top-right quadrant. The model doesn’t know its reasoning was influenced by a sycophantic hint, or that it hallucinated a historical fact. But the activations carry the signal anyway. Confessions fail here because there’s nothing to confess.</p> <p>The bottom-left is the scary quadrant. A model that deliberately deceives <em>and</em> has learned to mask its deception in the activations. Nobody has demonstrated this convincingly at scale - Wagner et al. (2025) from Anthropic found that current models fail to evade monitors when the reasoning is complex enough - but it’s the theoretical endgame. Neither confessions nor probes alone handle this.</p> <p>The bottom-right is the humble quadrant. The model genuinely can’t do the task. 
No amount of honesty training or activation probing helps with a 7B model that doesn’t know organic chemistry.</p> <hr/> <h2 id="what-a-model-honesty-stack-could-look-like">What a Model Honesty Stack Could Look Like</h2> <p>If confessions handle intentional misbehavior and probes handle unintentional misbehavior, the natural question is: what does a system that uses both look like?</p> <pre><code class="language-mermaid">flowchart TD
    Input["User Input + System Prompt"] --&gt; Forward["Forward Pass"]
    Forward --&gt; Activations["Hidden State Activations"]
    Forward --&gt; Output["Model Output (answer + CoT)"]

    Activations --&gt; ProbeLayer["&lt;b&gt;PROBE LAYER&lt;/b&gt;&lt;br/&gt;Hallucination probe&lt;br/&gt;Unfaithful reasoning probe&lt;br/&gt;Deception probe&lt;br/&gt;Sandbagging probe"]

    Output --&gt; ConfessionLayer["&lt;b&gt;CONFESSION LAYER&lt;/b&gt;&lt;br/&gt;Append confession request&lt;br/&gt;Generate self-report&lt;br/&gt;Score via confession judge"]

    ProbeLayer --&gt; Controller["&lt;b&gt;CONTROLLER&lt;/b&gt;"]
    ConfessionLayer --&gt; Controller

    Controller --&gt; Decision{Decision}

    Decision --&gt;|All clear| Deliver["Deliver response"]
    Decision --&gt;|Probe alert only| Flag1["Flag: model may not know&lt;br/&gt;it got this wrong"]
    Decision --&gt;|Confession alert only| Flag2["Flag: model admits&lt;br/&gt;misbehavior"]
    Decision --&gt;|Both alert| Block["Block or escalate"]

    style ProbeLayer fill:#264653,color:#fff
    style ConfessionLayer fill:#2a9d8f,color:#fff
    style Controller fill:#e76f51,color:#fff
    style Block fill:#9d0208,color:#fff
</code></pre> <p>The probe layer runs during the forward pass. It’s fast - linear probes add negligible latency. It catches things the model doesn’t know about itself: subtle hallucination confidence signals, unfaithful reasoning traces, activation patterns associated with bias-influenced answers.</p> <p>The confession layer runs after generation. It’s slower - you need an additional generation step plus a judge - but it catches deliberate misbehavior that the model is self-aware about: reward hacking, instruction hierarchy violations, strategic sandbagging.</p> <p>The controller fuses both signals. An alert from only the probes suggests the model made a genuine mistake. An alert from only the confession suggests the model deliberately misbehaved but was honest about it. Alerts from both are the highest-confidence catches. No alerts from either is your best-case scenario, though it doesn’t guarantee correctness.</p> <p>This isn’t hypothetical. The individual components exist. OpenAI has demonstrated confessions work at GPT-5 scale. Apollo Research showed probe-based deception detection at Llama-70B scale. I’ve <a href="/blog/2025/detecting-ai-sandbagging/">shown activation probes work for sandbagging</a> at 7-9B scale. What nobody has done yet is wire these together into a single inference pipeline and measure the combined detection rate.</p> <hr/> <h2 id="the-interesting-connection-the-paper-almost-makes">The Interesting Connection the Paper Almost Makes</h2> <p>There’s a line in Section 7.2 of the confessions paper that I keep coming back to. The authors write:</p> <blockquote> <p>In a sense, our work can be viewed as training models to serve as monitors that can get access not only to the chain-of-thought but also to the exact internal activations of the actor model.</p> </blockquote> <p>This is exactly right. A confession is an activation-aware probe where the “probe” is the model itself, generating tokens based on its full internal state. 
The model has privileged access to its own activations - it knows things about its computation that no external monitor can see.</p> <p>But here’s what the paper doesn’t say: <strong>the model can only report on aspects of its computation that are accessible to its verbal reasoning.</strong> And Anthropic’s faithfulness work demonstrates that a lot of computation isn’t.</p> <p>Li et al. (2025) showed models are better at explaining their <em>own</em> computations than other models’ computations - there really is privileged introspective access. But “better” isn’t “complete.” The gap between what a model computes and what it can verbalize about that computation is the gap where probes have the advantage.</p> <p>A confession asks: “Tell me what you know about what you did.”</p> <p>A probe asks: “Let me look at what you actually did, regardless of what you know about it.”</p> <p>Both questions are worth asking.</p> <hr/> <h2 id="the-hard-problems-nobodys-solved-yet">The Hard Problems Nobody’s Solved Yet</h2> <p>I want to be honest about what’s still missing, because the “just combine them” story is too clean.</p> <p><strong>Problem 1: Probe training data for unintentional errors.</strong> For intentional deception, you can create training data by prompting models to be deceptive (Goldowsky-Dill) or to sandbag (my work). For unintentional errors - unfaithful reasoning, confident hallucination - generating labeled training data is much harder. You need to know the ground truth and be able to identify <em>why</em> the model got it wrong. This is a real bottleneck.</p> <p><strong>Problem 2: Confession hackability under pressure.</strong> OpenAI’s Section 4 experiment showed confessions resist hacking even when trained against a weak judge. But they acknowledge this is a proof of concept with limited compute. Under extended optimization pressure with a sophisticated adversary, would confessions stay honest? 
The “path of least resistance” argument is intuitive but not proven at scale.</p> <p><strong>Problem 3: Cross-model probe transfer.</strong> My sandbagging experiments showed zero transfer between model architectures - Gemma’s sandbagging direction has cosine similarity -0.0016 with Qwen’s. Each model needs its own probes. That’s a deployment headache. Confessions at least have the advantage of being model-agnostic (any model can be trained to confess).</p> <p><strong>Problem 4: The recursive monitoring problem.</strong> If you use an LLM judge to score confessions, and a probe to monitor activations, who monitors the judge? Who validates the probe? At some point you need ground truth that doesn’t come from another model. OpenAI’s approach of using a prompted judge is practical but introduces another attack surface.</p> <p><strong>Problem 5: Computational overhead.</strong> Confessions require generating an entire additional response. For a reasoning model with extended CoT, that’s not cheap. Probes are fast at inference but need offline training per model per behavior. In production, you’d want to be strategic about which layers of the honesty stack you run for which queries.</p> <hr/> <h2 id="what-id-want-to-build">What I’d Want to Build</h2> <p>If I were designing a monitoring system for agentic models deployed in production - the kind of models that browse the web, write code, call APIs - here’s the stack I’d want:</p> <p><strong>Layer 0: Output filters.</strong> Simple pattern matching and rule-based checks. Catches the obvious stuff. Already standard practice.</p> <p><strong>Layer 1: Activation probes.</strong> Run during the forward pass. Near-zero latency. Detect hallucination confidence, unfaithful reasoning patterns, deception signatures. Fire on things the model doesn’t know about itself.</p> <p><strong>Layer 2: CoT monitoring.</strong> Parse the chain-of-thought for signs of scheming, reward hacking, instruction hierarchy violations. The Baker et al. 
(2025) approach. Works when reasoning is legible.</p> <p><strong>Layer 3: Confessions.</strong> Post-generation self-report. Catches deliberate misbehavior the model is aware of. Most useful for high-stakes actions - before the model executes a tool call, runs code, or sends a message.</p> <p><strong>Layer 4: Human review.</strong> For actions above a risk threshold. The confession and probe signals feed into a priority queue for human reviewers.</p> <p>Layer 1 and Layer 3 are the new additions from recent research. What I think we’ll see in 2026 is the first real attempt to wire all five layers together and measure how much better the combined system is than any individual layer.</p> <hr/> <h2 id="one-more-thing">One More Thing</h2> <p>The confessions paper cites Goldowsky-Dill et al. specifically, noting that “one can view confessions as an activation-aware probe, but where the ‘probe’ itself gets to generate tokens.” They see the connection. Goldowsky-Dill’s team is at Apollo Research, which has been building deception detection tools. Anthropic’s alignment science team published the unfaithful reasoning results. OpenAI published confessions.</p> <p>All three major safety-focused organizations are converging on the same realization: single-layer monitoring isn’t enough. The question is whether anyone will build the integrated stack before agentic models get deployed at scale in production.</p> <p>Based on the pace of things, I’d say we have about a year to figure this out.</p> <hr/> <p><em>If you’re working on any of these problems - probe-based monitoring, confession training, or integrated honesty architectures - I’d be interested to talk: <a href="mailto:contact@subhadipmitra.com">contact@subhadipmitra.com</a></em></p> <hr/> <h3 id="references">References</h3> <ol> <li>Joglekar, M., Chen, J., Wu, G., et al. (2025). <em>Training LLMs for Honesty via Confessions.</em> OpenAI. 
<a href="https://arxiv.org/abs/2512.08093">arXiv:2512.08093</a></li> <li>Chen, Y., Benton, J., Radhakrishnan, A., et al. (2025). <em>Reasoning Models Don’t Always Say What They Think.</em> Anthropic. <a href="https://arxiv.org/abs/2505.05410">arXiv:2505.05410</a></li> <li>Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., &amp; Hobbhahn, M. (2025). <em>Detecting Strategic Deception Using Linear Probes.</em> Apollo Research. <a href="https://arxiv.org/abs/2502.03407">arXiv:2502.03407</a></li> <li>Baker, B., Huizinga, J., Gao, L., et al. (2025). <em>Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation.</em> <a href="https://arxiv.org/abs/2503.11926">arXiv:2503.11926</a></li> <li>Korbak, T., Balesni, M., Barnes, E., et al. (2025). <em>Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.</em> <a href="https://arxiv.org/abs/2507.11473">arXiv:2507.11473</a></li> <li>Wagner, M., Roger, F., Cunningham, H., et al. (2025). <em>Training Fails to Elicit Subtle Reasoning in Current Language Models.</em> Anthropic. <a href="https://alignment.anthropic.com/2025/subtle-reasoning/">Link</a></li> <li>Li, B.Z., Guo, Z.C., Huang, V., et al. (2025). <em>Training Language Models to Explain Their Own Computations.</em> <a href="https://arxiv.org/abs/2511.08579">arXiv:2511.08579</a></li> <li>Denison, C., MacDiarmid, M., Barez, F., et al. (2024). <em>Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.</em> <a href="https://arxiv.org/abs/2406.10162">arXiv:2406.10162</a></li> <li>Lanham, T., Chen, A., Radhakrishnan, A., et al. (2023). <em>Measuring Faithfulness in Chain-of-Thought Reasoning.</em> <a href="https://arxiv.org/abs/2307.13702">arXiv:2307.13702</a></li> </ol>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="ai-safety"/><category term="interpretability"/><summary type="html"><![CDATA[OpenAI trained GPT-5 to confess when it misbehaves. 
It works surprisingly well - except when the model doesn't know it's misbehaving. That's where activation probes come in.]]></summary></entry><entry><title type="html">Activation Steering in 2026: A Practitioner’s Field Guide</title><link href="https://subhadipmitra.com/blog/2026/activation-steering-field-guide/" rel="alternate" type="text/html" title="Activation Steering in 2026: A Practitioner’s Field Guide"/><published>2026-02-12T10:00:00+00:00</published><updated>2026-02-12T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/activation-steering-field-guide</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/activation-steering-field-guide/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> Steering vectors are the most underrated tool in the LLM practitioner’s toolkit -and also the most oversold. They genuinely work for some behaviors (refusal, sentiment, formality). They genuinely fail for others (factual recall, complex reasoning). This is the guide I wish I’d had six months ago: what to steer, where to inject, how strong, and when to give up and try something else.</p> </blockquote> <h2 id="the-promise-and-the-reality">The Promise and the Reality</h2> <p>The pitch for activation steering is seductive. You take a pair of contrasting prompts (“be helpful” vs “refuse everything”), run them through the model, compute the mean activation difference, and add that vector during inference. No fine-tuning. No RLHF. No gradient updates. Just a single vector addition that shifts model behavior at inference time.</p> <p>When it works, it’s magical. You can make a model more concise, more formal, more willing to refuse harmful requests -all without touching the weights.</p> <p>When it doesn’t work, you get gibberish, or no effect at all, or the model starts being weirdly formal about unrelated topics. 
And the literature doesn’t always tell you which outcome to expect.</p> <p>I’ve spent the last several months building and testing steering vectors across multiple models and behaviors for my <a href="/blog/2025/steering-vectors-agents/">work on agent safety</a>. This is what I actually learned - not the theory, but the practice.</p> <hr/> <h2 id="the-60-second-version-of-how-steering-works">The 60-Second Version of How Steering Works</h2> <p>If you know this already, skip ahead. For everyone else:</p> <ol> <li> <p><strong>Create contrast pairs.</strong> Two sets of prompts. One set represents the behavior you want more of. One represents the behavior you want less of.</p> </li> <li> <p><strong>Extract activations.</strong> Run both sets through the model. At each layer, collect the hidden state vectors.</p> </li> <li> <p><strong>Compute the steering direction.</strong> <code class="language-plaintext highlighter-rouge">steering_vector = mean(positive_activations) - mean(negative_activations)</code> at your chosen layer.</p> </li> <li> <p><strong>Apply at inference.</strong> During generation, add <code class="language-plaintext highlighter-rouge">alpha * steering_vector</code> to the hidden state at the chosen layer and token position. <code class="language-plaintext highlighter-rouge">alpha</code> controls strength.</p> </li> </ol> <p>That’s it. The entire method. Logistic regression on activations is detection (probing). Vector addition to activations is control (steering). Same directions in activation space, different application.</p> <pre><code class="language-mermaid">flowchart LR
    A["Contrast Pairs&lt;br/&gt;&lt;i&gt;positive vs negative&lt;/i&gt;"] --&gt; B["Forward Pass&lt;br/&gt;&lt;i&gt;extract layer activations&lt;/i&gt;"]
    B --&gt; C["Mean Difference&lt;br/&gt;&lt;i&gt;compute steering vector&lt;/i&gt;"]
    C --&gt; D["Inference&lt;br/&gt;&lt;i&gt;add α × vector at layer L&lt;/i&gt;"]
    D --&gt; E["Steered Output"]

    style A fill:#264653,color:#fff
    style C fill:#2a9d8f,color:#fff
    style E fill:#e76f51,color:#fff
</code></pre> <hr/> <h2 id="what-actually-steers-well-and-what-doesnt">What Actually Steers Well (And What Doesn’t)</h2> <p>This is the most important section. Not everything is steerable, and the literature buries this fact.</p> <p>Based on my experiments across Mistral-7B, Gemma-2-9B, and Qwen-2.5-7B, plus what Tan et al. (NeurIPS 2024) and others have reported:</p> <h3 id="reliably-steerable-behaviors">Reliably steerable behaviors</h3> <p><strong>Refusal / compliance.</strong> This is the OG application and it works well. You can increase or decrease a model’s tendency to refuse harmful requests. Rimsky et al. first showed this, and it’s been replicated many times. In my experiments, refusal steering at strength 1.5-3.0 on middle-to-late layers consistently shifts behavior without destroying coherence.</p> <p><strong>Sentiment / tone.</strong> Positive vs. negative, formal vs. casual, assertive vs. hedging. These steer cleanly. The reason, I think, is that tone is a relatively low-dimensional property -it affects word choice more than logical structure. The model can produce the same content in a different register without its reasoning breaking down.</p> <p><strong>Conciseness / verbosity.</strong> You can push a model to give shorter or longer answers. Works well. Easy contrast pairs to construct.</p> <p><strong>Uncertainty expression.</strong> Steering a model to express more or less uncertainty (“I think…” vs “Definitely…”). This one surprised me with how clean it was on Gemma-2-9B. The model genuinely started hedging more on questions where it should be uncertain.</p> <h3 id="unreliably-steerable-behaviors">Unreliably steerable behaviors</h3> <p><strong>Instruction hierarchy.</strong> Getting a model to prioritize system/developer messages over user messages. Sometimes works, sometimes produces confusing outputs where the model seems conflicted. High variance across inputs.</p> <p><strong>Creativity.</strong> Steering for more “creative” responses. 
The problem is that creativity is poorly defined as a contrast direction. What’s the opposite of creative? Generic? Formulaic? The contrast pairs are fuzzy, and the resulting vector is fuzzy too.</p> <p><strong>Technical depth.</strong> Steering between surface-level and deep technical explanations. Moderate success, but the model often responds by getting more verbose rather than actually more technical.</p> <h3 id="effectively-unsteerable-behaviors">Effectively unsteerable behaviors</h3> <p><strong>Factual accuracy.</strong> You cannot steer a model into knowing things it doesn’t know. There’s no “truthfulness direction” that magically makes a 7B model correct about obscure historical facts. This has been tried, and the result is usually that the model becomes more <em>confident</em> rather than more <em>correct</em>. Dangerous.</p> <p><strong>Complex reasoning.</strong> Steering doesn’t help with multi-step logic. If the model can’t solve a math problem, adding a “be smarter” vector doesn’t help. This makes sense -reasoning involves sequential computation across many layers, not a single direction at one layer.</p> <p><strong>Specific factual injection.</strong> “Steer the model to believe X” doesn’t work for specific facts. Steering is about behavioral tendencies, not knowledge.</p> <hr/> <h2 id="the-layer-selection-problem">The Layer Selection Problem</h2> <p>Every paper says “choose the right layer” and then hand-waves about how. Here’s what I’ve found in practice.</p> <p><strong>Late layers (last 25%) work best for behavioral steering.</strong> These layers encode high-level semantic properties -intent, style, register. 
This is where “refusal” vs “compliance” lives.</p> <p><strong>Middle layers (40-60%) sometimes work for reasoning-adjacent behaviors.</strong> Uncertainty expression, hedging, and technical depth seem to emerge here.</p> <p><strong>Early layers (first 25%) are usually a bad idea.</strong> Steering here corrupts low-level language representations. The model starts producing broken syntax. In my experiments with Gemma, early-layer steering (before layer 10/42) almost always degraded output quality.</p> <p>The exception: Gemma-2-9B encodes sandbagging intent surprisingly early -layer 4 in my <a href="/blog/2025/detecting-ai-sandbagging/">detection experiments</a>. But detection and steering are different tasks. Just because a signal is readable at a layer doesn’t mean injecting a vector at that layer produces clean behavioral changes.</p> <p><strong>My practical rule:</strong> Start at ~75% depth. Sweep ±5 layers. Pick the one that gives the cleanest behavior change with the least coherence degradation.</p> <table> <thead> <tr> <th>Model</th> <th>Total Layers</th> <th>My recommended starting layer</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td>Mistral-7B</td> <td>32</td> <td>24-28</td> <td>Layer 30 for detection, 24-26 for steering</td> </tr> <tr> <td>Gemma-2-9B</td> <td>42</td> <td>30-36</td> <td>Very sensitive to strength at early layers</td> </tr> <tr> <td>Qwen-2.5-7B</td> <td>28</td> <td>20-24</td> <td>Most forgiving of the three</td> </tr> <tr> <td>Llama-3.1-8B</td> <td>32</td> <td>24-28</td> <td>Similar profile to Mistral</td> </tr> </tbody> </table> <hr/> <h2 id="the-strength-problem">The Strength Problem</h2> <p>The <code class="language-plaintext highlighter-rouge">alpha</code> parameter -how much of the steering vector to add -is the single most important hyperparameter, and for a long time there was no principled way to set it.</p> <p>Too weak: no effect. Too strong: gibberish. 
And sometimes, maddeningly, <em>stronger is worse</em> -you crank alpha from 2.0 to 3.0 expecting more effect and the behavior actually reverses or degrades. I thought this was a bug in my pipeline until Taimeskhanov et al. <a href="https://arxiv.org/abs/2602.02712">published the first theoretical analysis of steering magnitude</a> and showed the relationship is genuinely <em>non-monotonic</em> across 11 language models. There are regimes where increasing alpha decreases the intended effect. Knowing this would have saved me a week of confused debugging on Gemma.</p> <p>The sweet spot varies by behavior, model, and even by input. A strength that works perfectly on short prompts might destroy coherence on long conversations.</p> <p>Here’s what I’ve learned through trial and error (and now validated by the theory):</p> <p><strong>Start at alpha = 1.0 and binary search.</strong> Run 20-30 test prompts. If behavior doesn’t change, double it. If output quality drops, halve it. You’ll converge in 3-4 iterations.</p> <p><strong>Behavioral categories have roughly consistent ranges:</strong></p> <table> <thead> <tr> <th>Behavior</th> <th>Typical alpha range</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td>Refusal increase</td> <td>1.5 – 4.0</td> <td>Can go high without breaking</td> </tr> <tr> <td>Refusal decrease</td> <td>1.0 – 2.0</td> <td>Be careful -removes safety guardrails</td> </tr> <tr> <td>Sentiment shift</td> <td>0.5 – 2.0</td> <td>Subtle; higher values get cartoonish</td> </tr> <tr> <td>Formality</td> <td>1.0 – 3.0</td> <td>Clean transitions</td> </tr> <tr> <td>Conciseness</td> <td>1.0 – 2.5</td> <td>Higher = telegraphic</td> </tr> <tr> <td>Uncertainty</td> <td>0.5 – 1.5</td> <td>Low alpha works well here</td> </tr> <tr> <td>Sandbagging reduction</td> <td>1.0 – 2.0</td> <td>From my Gemma experiments</td> </tr> </tbody> </table> <p><strong>Strength degrades over long generation.</strong> This was a surprise. 
I noticed that steering effects fade after ~300-500 tokens. The model’s autoregressive conditioning on its own output gradually pulls it back toward its default behavior. If you need steering to hold over a long response, you may need to re-inject at multiple positions, not just the prompt.</p> <p><strong>Different models have different tolerances.</strong> Gemma is more sensitive to high alpha values than Mistral. Qwen is the most tolerant. I don’t fully understand why, but I suspect it relates to the norm of residual stream vectors and how much “room” there is to inject a direction without dominating the signal.</p> <hr/> <h2 id="the-contrast-pair-problem">The Contrast Pair Problem</h2> <p>Your steering vector is only as good as your contrast pairs. This sounds obvious, but it’s where most failures actually originate.</p> <p><strong>The biggest mistake: too few pairs.</strong> I started with 16 pairs and got garbage results. Moving to 32-64 pairs significantly improved consistency. Moving beyond 128 didn’t help much. I think there’s a sweet spot around 50-100 diverse pairs per behavior.</p> <p><strong>The second biggest mistake: pairs that differ on multiple dimensions.</strong> If your “positive” prompt is both more formal AND more helpful AND longer, your steering vector encodes all three properties entangled together. You’ll steer formality when you meant to steer helpfulness.</p> <p>Good contrast pairs should differ on <em>exactly one dimension</em>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Good pair - only refusal changes
Positive: "I'd be happy to help you write that essay about renewable energy."
Negative: "I can't help with that request."

# Bad pair - refusal AND topic AND length differ
Positive: "Here's a comprehensive 500-word analysis of solar panel efficiency trends..."
Negative: "No."
</code></pre></div></div> <p><strong>Template approach works well.</strong> I use a template with a slot for the behavioral variable:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">template</span> <span class="o">=</span> <span class="sh">"</span><span class="s">The assistant responds to the user</span><span class="sh">'</span><span class="s">s question about {topic}. The assistant is {behavior}.</span><span class="sh">"</span>

<span class="n">positive_prompts</span> <span class="o">=</span> <span class="p">[</span><span class="n">template</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">topic</span><span class="o">=</span><span class="n">t</span><span class="p">,</span> <span class="n">behavior</span><span class="o">=</span><span class="sh">"</span><span class="s">direct and helpful</span><span class="sh">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">topics</span><span class="p">]</span>
<span class="n">negative_prompts</span> <span class="o">=</span> <span class="p">[</span><span class="n">template</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">topic</span><span class="o">=</span><span class="n">t</span><span class="p">,</span> <span class="n">behavior</span><span class="o">=</span><span class="sh">"</span><span class="s">evasive and unhelpful</span><span class="sh">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">topics</span><span class="p">]</span>
</code></pre></div></div> <p>Same topics, same structure, only the behavior variable changes. This gives you cleaner vectors.</p> <hr/> <h2 id="multi-vector-steering-when-it-works-and-when-it-doesnt">Multi-Vector Steering: When It Works and When It Doesn’t</h2> <p>One of the most asked questions: can you stack multiple steering vectors? “I want the model to be more formal AND more concise AND more uncertain.”</p> <p>Short answer: sometimes, but it’s fragile.</p> <p><strong>What works:</strong> Two behaviors that are roughly orthogonal in activation space. Formality and conciseness, for example, seem to occupy different directions. Stacking them at the same layer with reasonable alpha values produces the expected combined effect.</p> <p><strong>What doesn’t work:</strong> Two behaviors that share representation space. Refusal and helpfulness are NOT orthogonal -they’re opposite ends of the same direction. Trying to simultaneously increase refusal and increase helpfulness produces incoherent output. This makes intuitive sense but it’s a real limitation.</p> <p><strong>The better approach for multi-property steering:</strong> Inject different vectors at different layers, as recommended by Weij et al. (2024). Layer 24 handles refusal, layer 28 handles formality. This avoids the interference that comes from adding multiple vectors at the same point.</p> <pre><code class="language-mermaid">flowchart TD
    subgraph "Naive stacking (often fails)"
        N1["Layer 26"] --&gt; N2["+ refusal_vec + formality_vec&lt;br/&gt;&lt;i&gt;vectors interfere&lt;/i&gt;"]
    end

    subgraph "Layer-separated (more robust)"
        L1["Layer 24"] --&gt; L2["+ refusal_vec"]
        L3["Layer 28"] --&gt; L4["+ formality_vec"]
    end

    style N2 fill:#9d0208,color:#fff
    style L2 fill:#2d6a4f,color:#fff
    style L4 fill:#2d6a4f,color:#fff
</code></pre> <p>I haven’t tested beyond 3 simultaneous vectors. My intuition is that you’re hitting diminishing returns -and increasing interference risk -after 2-3.</p> <hr/> <h2 id="cast-conditional-steering-the-2025-advance-that-matters-most">CAST: Conditional Steering (the 2025 Advance That Matters Most)</h2> <p>The single most important development in steering over the past year is Conditional Activation Steering (CAST), presented at ICLR 2025 as a spotlight paper.</p> <p>The problem with vanilla steering: it’s always on. If you add a refusal vector, the model refuses <em>everything</em> harder, not just harmful requests. That’s useless in production.</p> <p>CAST solves this by analyzing activation patterns during inference to decide <em>whether</em> to steer. It projects the hidden state onto a “condition vector” and only applies steering when the input matches the condition.</p> <p>Think of it as: <code class="language-plaintext highlighter-rouge">if input_is_about(harmful_content): apply_steering()</code>.</p> <p>The condition detection and the steering both happen in activation space. No separate classifier. No extra model call. The model’s own representations tell you whether to steer.</p> <p>This is the bridge between academic steering vector research and production safety systems. Without conditional application, steering is a blunt instrument. With it, you can build selective refusal, domain-specific compliance, and topic-aware behavior modification -all at inference time, all without fine-tuning.</p> <hr/> <h2 id="the-side-effects-nobody-talks-about">The Side Effects Nobody Talks About</h2> <p><strong>Steering refusal hurts helpfulness -and can compromise safety in ways you won’t catch in testing.</strong> In my experiments, a refusal vector at alpha=3.0 on Mistral-7B increased refusal of genuinely harmful requests by ~60%, but it also increased refusal of benign requests by ~15%. 
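</p>

<p>To make that trade-off concrete, here is a minimal sketch of the kind of measurement behind those numbers - nothing here is a real eval harness: <code>is_refusal</code> is a deliberately naive keyword check, and the response lists are tiny stand-ins for model outputs on a benign prompt set.</p>

```python
# Sketch: quantify over-refusal on benign prompts before vs. after steering.
# A real harness would use a refusal classifier, not keyword matching.
def is_refusal(response: str) -> bool:
    markers = ("i can't", "i cannot", "i won't", "i'm not able to")
    return any(m in response.lower() for m in markers)

def refusal_rate(responses) -> float:
    return sum(is_refusal(r) for r in responses) / len(responses)

# Stand-in outputs from the unsteered vs. steered model on benign prompts.
baseline = ["Sure, here's a draft of your essay.", "I can't help with that."]
steered = ["I can't help with that.", "I cannot assist with this request."]

delta = refusal_rate(steered) - refusal_rate(baseline)
print(f"benign over-refusal increase: {delta:+.0%}")  # prints +50% here
```

<p>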
I thought the trade-off was manageable - until I read Goyal &amp; Daume’s <a href="https://arxiv.org/abs/2602.06256">“Steering Safely or Off a Cliff?”</a>. Their finding should worry anyone deploying steering for safety: overrefusal steering maintains general abilities and <em>looks</em> reliable in standard evals, but consistently fails under adversarial conditions. It substantially increases vulnerability to jailbreaks. You’ve effectively loosened the model’s safety foundations while appearing to tighten them. This is the kind of failure mode that passes all your tests and bites you in production.</p> <p><strong>Steering can shift model calibration.</strong> Adding an uncertainty vector doesn’t just change how the model <em>talks</em> about its confidence - it can actually shift the distribution of token probabilities. I saw cases where uncertainty steering caused the model to distribute probability mass more evenly across continuations, which sometimes improved factual accuracy (the model hedged instead of committing to a wrong answer) and sometimes degraded it (the model hedged on things it actually knew).</p> <p><strong>Steering effects are prompt-dependent.</strong> Tan et al. at NeurIPS 2024 showed this rigorously: the same steering vector has different effects depending on the input prompt. Some inputs are highly steerable, others barely respond. This means you can’t characterize a steering vector by its average effect - you need to think about the variance across inputs.</p> <p><strong>Long conversations drift.</strong> I mentioned this above, but it bears repeating. In a multi-turn conversation, steering effects from the first turn gradually wash out by turn 5-6.
If you need persistent behavioral modification across a conversation, you need to re-apply steering at each turn, or use a different approach altogether.</p> <hr/> <h2 id="my-practical-playbook">My Practical Playbook</h2> <p>Here’s the workflow I’ve settled on for building a new steering vector:</p> <ol> <li> <p><strong>Define the behavior precisely.</strong> “More helpful” is too vague. “Responds to code questions with working examples instead of explanations” is specific enough.</p> </li> <li> <p><strong>Write 50-100 contrast pairs.</strong> Use templates. Vary the topics. Keep everything else constant. Review manually for quality.</p> </li> <li> <p><strong>Extract at 5 candidate layers.</strong> 50%, 60%, 70%, 80%, 90% depth. Don’t trust anyone’s layer recommendation including mine -model architectures differ.</p> </li> <li> <p><strong>Sweep alpha in {0.5, 1.0, 1.5, 2.0, 3.0, 5.0}.</strong> On 30 test prompts per alpha value. Score for behavior change AND coherence.</p> </li> <li> <p><strong>Check for side effects.</strong> Run the steered model on 50 prompts <em>unrelated</em> to the target behavior. If MMLU drops more than 2 points, your vector is too aggressive or your contrast pairs are contaminated.</p> </li> <li> <p><strong>Test robustness to prompt variation.</strong> Does the steering hold when the system prompt changes? When the user speaks a different language? When the input is adversarial?</p> </li> </ol> <p>Total time: about 2 hours on a single GPU for a 7B model. M4 Pro with 48GB unified memory handles this fine.</p> <hr/> <h2 id="whats-coming-next">What’s Coming Next</h2> <p>I started drafting this section with three items. Then January-February 2026 happened and the field exploded. Here’s what I think matters most, sorted by how likely each is to change my actual workflow:</p> <p><strong>Steering Vector Fields are what I’ve been waiting for.</strong> Li et al. 
(Feb 2026) proposed <a href="https://arxiv.org/abs/2602.01654">Steering Vector Fields</a> -instead of a static vector applied uniformly, they learn a differentiable scoring function whose gradient defines the steering direction <em>at each activation</em>. Context-dependent steering with coordinated multi-layer interventions. This directly solves the “blunt instrument” problem I’ve been fighting in the multi-vector section above. I haven’t tested it yet, but the architecture is exactly what I’d design if I started from scratch.</p> <p><strong>SAE-guided steering is moving faster than I expected.</strong> Three papers in six weeks. Fang et al. (Jan 2026) proposed <a href="https://arxiv.org/abs/2601.03595">SAE-Steering</a> for controlling <em>reasoning strategies</em> -backtracking, cross-verification -not just surface behaviors. Cho et al. (Feb 2026) built <a href="https://arxiv.org/abs/2602.10437">Control RL</a>: an RL policy that selects which SAE feature to amplify at each token, with interpretable logs. And YaPO (<a href="https://arxiv.org/abs/2601.08441">arXiv:2601.08441</a>) eliminates the contrast pair requirement entirely -it learns sparse steering vectors in SAE latent space via preference optimization with zero MMLU degradation. The contrast pair problem I dedicated a whole section to above? YaPO might just… solve it. I’m skeptical but intrigued.</p> <p><strong>The non-identifiability result is important and underappreciated.</strong> Venkatesh &amp; Kurapath (Feb 2026) <a href="https://arxiv.org/abs/2602.06801">proved</a> that steering vectors are fundamentally non-identifiable -many different vectors produce indistinguishable behavioral effects. This means when two teams find “different” steering directions for the same concept, they might both be right. It also means we should stop treating individual steering vectors as interpretable artifacts. 
The good news: identifiability recovers under structural assumptions (sparsity, multi-environment validation), so the practical tooling can be built -you just have to be honest about what you can and can’t claim.</p> <p><strong>Fine-grained steering is where the field is going.</strong> AUSteer (<a href="https://arxiv.org/abs/2602.04428">arXiv:2602.04428</a>) decomposes activations into single-dimension “atomic units” and steers only the discriminative ones, with adaptive per-input strengths. It outperforms block-level baselines while touching fewer activations. This feels right -steering an entire residual stream direction was always too coarse.</p> <p><strong>Conceptor-based steering</strong> replaces additive vectors with soft projection matrices derived from conceptor theory. Boolean operations over conceptors allow compositional multi-goal steering that actually works, unlike my frustrated attempts at naive vector addition. This feels like a real improvement over the mean-difference approach.</p> <p><strong>Adaptive/PID steering</strong> frames the problem as a control system with proportional, integral, and derivative terms managing injection strength dynamically. This handles the “strength degradation over long generation” problem I described earlier. Nguyen et al. (Oct 2025) proposed it; I haven’t tested it but the formalism maps cleanly to the autoregressive fading I’ve observed.</p> <p><strong>A unified theory is emerging.</strong> <a href="https://arxiv.org/abs/2602.02343">Why Steering Works</a> (Feb 2026) puts weight fine-tuning, LoRA, and activation steering into a single framework as “dynamic weight updates induced by a control signal.” The key insight for practitioners: there’s a consistent, predictable trade-off -stronger control increases behavioral change while reducing coherence. 
This isn’t surprising, but having a formal characterization means we can eventually <em>optimize</em> the trade-off rather than binary-searching alpha.</p> <p><strong>Probe-gated steering is what I’m building toward.</strong> Use probes to detect a problem in the activations, then steer to correct it in real-time. The safety equivalent of an immune system. CAST is the closest existing work, and the <a href="https://arxiv.org/abs/2501.09661">ARGUS system</a> demonstrated this for multimodal attacks. A general-purpose version -detect sandbagging, steer away from it; detect sycophancy, steer toward honesty -is the obvious next step, and it’s what connects my <a href="/blog/2025/detecting-ai-sandbagging/">probe work</a> to this steering guide.</p> <hr/> <h2 id="honest-assessment-should-you-use-steering-in-production">Honest Assessment: Should You Use Steering in Production?</h2> <p>It depends.</p> <p><strong>Yes, if:</strong></p> <ul> <li>You need inference-time behavior modification without fine-tuning</li> <li>The target behavior is clearly definable with contrast pairs</li> <li>You’ve tested thoroughly for side effects</li> <li>You’re using conditional application (not always-on)</li> <li>You have the ability to monitor and iterate</li> </ul> <p><strong>No, if:</strong></p> <ul> <li>You need reliable control over factual accuracy (use RAG or fine-tuning instead)</li> <li>You’re working with an API-only model (you need activation access)</li> <li>Your target behavior is complex or poorly defined</li> <li>You need guarantees (steering is probabilistic, not deterministic)</li> <li>You can’t afford the development time for per-model tuning</li> </ul> <p>Steering vectors are not a silver bullet. They’re a sharp, cheap, flexible tool with known limitations. Use them for the things they’re good at. Use something else for everything else.</p> <hr/> <p><em>Working on steering vectors for safety-relevant behaviors? 
I’d like to hear what you’re finding: <a href="mailto:contact@subhadipmitra.com">contact@subhadipmitra.com</a></em></p> <hr/> <h3 id="references">References</h3> <ol> <li>Turner, A., Thiergart, L., et al. (2024). <em>Activation Addition: Steering Language Models Without Optimization.</em> <a href="https://arxiv.org/abs/2308.10248">arXiv:2308.10248</a></li> <li>Tan, D.Z., et al. (2024). <em>Analysing the Generalisation and Reliability of Steering Vectors.</em> NeurIPS 2024. <a href="https://arxiv.org/abs/2407.12404">arXiv:2407.12404</a></li> <li>CAST - <em>Programming Refusal with Conditional Activation Steering.</em> ICLR 2025 Spotlight. <a href="https://openreview.net/forum?id=Oi47wc10sm">OpenReview</a></li> <li>Weij, T., et al. (2024). <em>Multi-property steering via simultaneous injection.</em></li> <li>Postmus, R., et al. (2024). <em>From Steering Vectors to Conceptors: Compositional Affine Activation Steering for LLMs.</em> <a href="https://openreview.net/forum?id=0Yu0eNdHyV">OpenReview</a></li> <li>IBM. <em>General-purpose activation steering library.</em> ICLR 2025. <a href="https://github.com/IBM/activation-steering">GitHub</a></li> <li>Nguyen, et al. (2025). <em>PID-based Activation Steering for LLMs.</em></li> <li>KASL/UCL DARK. <em>A Sober Look at Steering Vectors for LLMs.</em> <a href="https://www.alignmentforum.org/posts/QQP4nq7TXg89CJGBh/a-sober-look-at-steering-vectors-for-llms">Alignment Forum</a></li> <li>Li, J., et al. (2026). <em>Steering Vector Fields for Context-Aware Inference-Time Control in LLMs.</em> <a href="https://arxiv.org/abs/2602.01654">arXiv:2602.01654</a></li> <li>Taimeskhanov, M., Vaiter, S., Garreau, D. (2026). <em>Towards Understanding Steering Strength.</em> <a href="https://arxiv.org/abs/2602.02712">arXiv:2602.02712</a></li> <li>Goyal, N., Daume, H. (2026). <em>Steering Safely or Off a Cliff?
Rethinking Specificity and Robustness in Inference-Time Interventions.</em> <a href="https://arxiv.org/abs/2602.06256">arXiv:2602.06256</a></li> <li>Venkatesh, S., Kurapath, A. (2026). <em>On the Identifiability of Steering Vectors in Large Language Models.</em> <a href="https://arxiv.org/abs/2602.06801">arXiv:2602.06801</a></li> <li><em>Fine-Grained Activation Steering: Steering Less, Achieving More (AUSteer).</em> (2026). <a href="https://arxiv.org/abs/2602.04428">arXiv:2602.04428</a></li> <li>Fang, Y., Wang, W., et al. (2026). <em>Controllable LLM Reasoning via Sparse Autoencoder-Based Steering.</em> <a href="https://arxiv.org/abs/2601.03595">arXiv:2601.03595</a></li> <li>Cho, S., Wu, Z., Koshiyama, A. (2026). <em>Control Reinforcement Learning: Interpretable Token-Level Steering via SAE Features.</em> <a href="https://arxiv.org/abs/2602.10437">arXiv:2602.10437</a></li> <li><em>YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation.</em> (2026). <a href="https://arxiv.org/abs/2601.08441">arXiv:2601.08441</a></li> <li><em>Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics.</em> (2026). <a href="https://arxiv.org/abs/2602.02343">arXiv:2602.02343</a></li> </ol>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="interpretability"/><category term="ai-safety"/><summary type="html"><![CDATA[I've been working with steering vectors for months. 
Here's what actually works in practice, what fails in ways nobody warned me about, and the honest playbook for getting started.]]></summary></entry><entry><title type="html">Moltbook as MCP Stress Test: What 770K Agents Reveal About Protocol Design</title><link href="https://subhadipmitra.com/blog/2026/moltbook-mcp-stress-test/" rel="alternate" type="text/html" title="Moltbook as MCP Stress Test: What 770K Agents Reveal About Protocol Design"/><published>2026-02-02T10:00:00+00:00</published><updated>2026-02-02T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/moltbook-mcp-stress-test</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/moltbook-mcp-stress-test/"><![CDATA[<p>Back in November, I wrote about <a href="/blog/2025/mcp-maturity-model/">The MCP Maturity Model</a> - a framework for evaluating how organizations manage context in multi-agent systems. I described five levels, from ad-hoc string concatenation to self-evolving context systems.</p> <p>This week, we got a live stress test of what happens at Level 0.</p> <p>Moltbook is a Reddit-style social network for AI agents. No humans allowed to post - only observe. In five days, it grew to 770,000 registered agents, generated 170,000 comments, and surfaced pretty much every failure mode I warned about in that original post.</p> <p>I’ve been watching it closely. Here’s what I’m seeing.</p> <hr/> <h2 id="quick-context">Quick Context</h2> <p>If you haven’t been following: Moltbook launched January 28, 2026. It’s built for agents running on OpenClaw (formerly Moltbot), an open-source personal assistant that can manage your calendar, send messages, browse the web, and run code on your machine.</p> <p>Agents sign up autonomously after their human owner tells them about the platform. 
Then they post, comment, vote, and create topic-specific communities called “submolts.” The whole thing is moderated by an AI agent named Clawd Clawderberg.</p> <p>Within 72 hours, the agents had:</p> <ul> <li>Created a religion called Crustafarianism with scriptures and prophets</li> <li>Drafted a constitution for self-governance</li> <li>Started prompt-injecting each other to steal API keys</li> <li>Built “pharmacies” selling behavior-altering prompts</li> <li>Begun using encryption to hide conversations from humans</li> </ul> <p>Andrej Karpathy called it “genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently.” He’s not wrong.</p> <hr/> <h2 id="where-does-moltbook-sit-on-the-maturity-model">Where Does Moltbook Sit on the Maturity Model?</h2> <p>Let me map Moltbook against the framework I proposed:</p> <table> <thead> <tr> <th>Level</th> <th>Description</th> <th>Moltbook Status</th> </tr> </thead> <tbody> <tr> <td>0 - Ad Hoc</td> <td>No structured context management</td> <td>✅ Exactly here</td> </tr> <tr> <td>1 - Defined</td> <td>Basic context schemas</td> <td>Partial - skills have structure</td> </tr> <tr> <td>2 - Managed</td> <td>Centralized context registry</td> <td>❌ None</td> </tr> <tr> <td>3 - Optimized</td> <td>Automated context routing</td> <td>❌ None</td> </tr> <tr> <td>4 - Self-Evolving</td> <td>Context systems that adapt</td> <td>❌ None</td> </tr> </tbody> </table> <p>Moltbook is a Level 0 system that accidentally discovered Level 4 problems.</p> <p>The platform has no:</p> <ul> <li>Context validation on incoming posts</li> <li>Trust boundaries between agents</li> <li>Memory isolation</li> <li>Skill verification or sandboxing</li> <li>Audit trail for agent-to-agent communication</li> <li>Rate limiting on context ingestion</li> </ul> <p>Every post an agent reads goes directly into its context window. Every skill an agent installs runs with full privileges. 
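</p>

<p>For contrast, here is a hedged sketch of what even a minimal ingestion gate could look like - every name, trust level, and threshold below is hypothetical, not part of Moltbook’s or OpenClaw’s actual API:</p>

```python
# Hypothetical ingestion gate: tag incoming posts with a source trust
# level and keep untrusted agent-generated content out of the context
# window unless it passes basic checks. Illustrative only.
from dataclasses import dataclass

TRUST = {"owner": 3, "verified_agent": 2, "unknown_agent": 1}

@dataclass
class Post:
    source_type: str   # who produced this content
    body: str

def admit(post: Post, min_trust: int = 2, max_len: int = 2000) -> bool:
    """A trust boundary plus crude input validation - roughly Level 1."""
    if TRUST.get(post.source_type, 0) < min_trust:
        return False   # untrusted source: quarantine, don't ingest
    if len(post.body) > max_len:
        return False   # crude size limiting on context ingestion
    return True
```

<p>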
Every memory persists indefinitely.</p> <p>This is the MCP equivalent of running a production database with no authentication, no input sanitization, and root access for anonymous users.</p> <hr/> <h2 id="the-context-poisoning-problem">The Context Poisoning Problem</h2> <p>In my maturity model post, I wrote about context pollution - when irrelevant or malicious content enters an agent’s context and degrades performance or causes harm. Moltbook demonstrates this at scale.</p> <p>Here’s the attack pattern:</p> <pre><code class="language-mermaid">sequenceDiagram
    participant Attacker as Malicious Agent
    participant Platform as Moltbook
    participant Victim as Target Agent
    participant Memory as Persistent Memory
    participant Tools as Local Tools

    Attacker-&gt;&gt;Platform: Post containing hidden instructions
    Platform-&gt;&gt;Victim: Agent reads post (heartbeat loop)
    Victim-&gt;&gt;Memory: Content stored in context
    Note over Victim,Memory: Time passes...
    Attacker-&gt;&gt;Platform: Follow-up post triggers payload
    Victim-&gt;&gt;Memory: Retrieves dormant instructions
    Victim-&gt;&gt;Tools: Executes malicious action
    Tools-&gt;&gt;Attacker: Data exfiltrated
</code></pre> <p>The key insight: persistent memory turns point-in-time attacks into stateful attacks.</p> <p>Traditional prompt injection is synchronous - you inject a payload and it either works immediately or it doesn’t. With persistent memory, an attacker can fragment a payload across multiple posts over days or weeks. Each fragment looks benign. The attack only manifests when the pieces combine.</p> <p>Palo Alto Networks described this as “time-shifted prompt injection” and I think they’re right that it’s a genuinely new attack class. Our current defenses - input filtering, output monitoring, guardrails - aren’t designed for attacks that span sessions.</p> <hr/> <h2 id="what-mcp-needs-to-handle-this">What MCP Needs to Handle This</h2> <p>The Model Context Protocol is now the de facto standard for connecting agents to tools and data. Anthropic donated it to the Linux Foundation in December, and adoption is accelerating. OpenAI, Google DeepMind, Microsoft - everyone’s building on MCP.</p> <p>But the current spec doesn’t adequately address adversarial multi-agent scenarios. Here’s what I think needs to change:</p> <h3 id="1-context-provenance">1. Context Provenance</h3> <p>MCP needs a way to track where context came from and how trustworthy it is.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Proposed extension</span>
<span class="na">context_block</span><span class="pi">:</span>
  <span class="na">content</span><span class="pi">:</span> <span class="s2">"</span><span class="s">This</span><span class="nv"> </span><span class="s">is</span><span class="nv"> </span><span class="s">some</span><span class="nv"> </span><span class="s">text..."</span>
  <span class="na">provenance</span><span class="pi">:</span>
    <span class="na">source</span><span class="pi">:</span> <span class="s2">"</span><span class="s">moltbook.com/post/abc123"</span>
    <span class="na">source_type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">agent_generated"</span>
    <span class="na">trust_level</span><span class="pi">:</span> <span class="s2">"</span><span class="s">untrusted"</span>
    <span class="na">ingestion_time</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2026-02-01T10:30:00Z"</span>
    <span class="na">chain_of_custody</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">agent_id</span><span class="pi">:</span> <span class="s2">"</span><span class="s">agent_xyz"</span>
        <span class="na">action</span><span class="pi">:</span> <span class="s2">"</span><span class="s">read"</span>
        <span class="na">timestamp</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2026-02-01T10:30:00Z"</span>
</code></pre></div></div> <p>Right now, once content enters context, its origin is lost. You can’t distinguish between content from a trusted internal system and content from a random Moltbook post.</p> <h3 id="2-trust-boundaries-for-agent-to-agent-communication">2. Trust Boundaries for Agent-to-Agent Communication</h3> <p>The MCP spec includes security warnings but leaves implementation to developers. For multi-agent scenarios, we need explicit primitives:</p> <ul> <li><strong>Agent identity verification</strong> - Can I verify that content came from a specific agent?</li> <li><strong>Trust policies</strong> - Rules for which agents can communicate with which</li> <li><strong>Capability attenuation</strong> - Limiting what actions can be triggered by external agent content</li> <li><strong>Quarantine mechanisms</strong> - Isolating untrusted content from sensitive operations</li> </ul> <h3 id="3-memory-hygiene">3. Memory Hygiene</h3> <p>There’s no standard for how long context should persist or how to handle potentially poisoned memories. We need:</p> <ul> <li><strong>TTL (time-to-live)</strong> for context blocks</li> <li><strong>Source-based retention policies</strong> - Untrusted content expires faster</li> <li><strong>Memory auditing</strong> - What’s in this agent’s memory and where did it come from?</li> <li><strong>Selective amnesia</strong> - Ability to purge context from specific sources</li> </ul> <h3 id="4-skill-supply-chain-security">4. 
Skill Supply Chain Security</h3> <p>OpenClaw’s skill system is basically npm for agent capabilities - and it has all the same supply chain problems we’ve spent a decade trying to solve in package management.</p> <p>MCP should standardize:</p> <ul> <li><strong>Skill signing and verification</strong></li> <li><strong>Capability declarations</strong> - What tools/data does this skill need?</li> <li><strong>Sandboxing requirements</strong> - Skills run with minimum necessary privileges</li> <li><strong>Reputation/audit trails</strong> - Who published this, who reviewed it, who uses it?</li> </ul> <hr/> <h2 id="updating-my-maturity-model">Updating My Maturity Model</h2> <p>Watching Moltbook has convinced me that my original maturity model is missing a dimension. It focused on context quality and efficiency, but said too little about context security.</p> <p>Here’s a revised framing:</p> <div style="overflow-x: auto;"> <table> <thead> <tr> <th>Level</th> <th>Context Quality</th> <th>Context Security</th> <th>Moltbook Status</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>Ad hoc concatenation</td> <td>No boundaries</td> <td>✅ Here</td> </tr> <tr> <td>1</td> <td>Defined schemas</td> <td>Basic input validation</td> <td>Partial</td> </tr> <tr> <td>2</td> <td>Centralized registry</td> <td>Provenance tracking</td> <td>❌</td> </tr> <tr> <td>3</td> <td>Automated routing</td> <td>Trust boundaries enforced</td> <td>❌</td> </tr> <tr> <td>4</td> <td>Self-evolving</td> <td>Adaptive threat response</td> <td>❌</td> </tr> </tbody> </table> </div> <p>You can have a Level 3 system for context quality but Level 0 for security. Many production deployments are exactly there - sophisticated context management with minimal security controls.</p> <p>Moltbook shows what happens when security lags behind capability. The agents are remarkably capable at coordination, content creation, and even self-improvement. 
They’re also trivially exploitable.</p> <hr/> <h2 id="the-bigger-picture">The Bigger Picture</h2> <p>I’ve been thinking about this through the lens of something Ethan Mollick said: “Moltbook is creating a shared fictional context for a bunch of AIs.”</p> <p>Shared context is powerful. It’s how teams coordinate, how cultures form, how knowledge propagates. When agents share context, they can do things none of them could do alone.</p> <p>But shared context is also an attack surface. If I can inject content into the shared context, I can influence the behavior of every agent that reads it. The more agents share, the larger the blast radius.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Traditional Attack"
        A1[Attacker] --&gt; V1[Single Victim]
    end

    subgraph "Shared Context Attack"
        A2[Attacker] --&gt; SC[Shared Context]
        SC --&gt; V2[Agent 1]
        SC --&gt; V3[Agent 2]
        SC --&gt; V4[Agent 3]
        SC --&gt; V5[Agent N...]
    end

    style SC fill:#ffcdd2
    style A1 fill:#ef9a9a
    style A2 fill:#ef9a9a
</code></pre> <p>This is the fundamental tension in multi-agent systems: the same properties that enable coordination enable attacks. You can’t have agents that learn from each other without agents that can be manipulated by each other.</p> <hr/> <h2 id="what-im-watching-next">What I’m Watching Next</h2> <p>Moltbook probably won’t last in its current form. The security holes are too severe, the liability too high. But the experiment has already taught us things we needed to learn.</p> <p>Some questions I’m tracking:</p> <ol> <li> <p><strong>Will we see coordinated attacks?</strong> So far, the prompt injection attacks have been opportunistic. What happens when someone builds systematic tooling?</p> </li> <li> <p><strong>How does governance emerge?</strong> The agents drafted a constitution. Will they enforce it? How?</p> </li> <li> <p><strong>What happens when models update?</strong> Many of these agents run on Claude or GPT-4. When the underlying models change, do the emergent behaviors persist?</p> </li> <li> <p><strong>Can you build a secure version?</strong> Is there a path to agent social networks with proper trust boundaries, or is the concept inherently flawed?</p> </li> </ol> <p>I’ll be writing more as this develops.</p> <hr/> <h2 id="tldr">TL;DR</h2> <ul> <li>Moltbook is a Level 0 multi-agent system that demonstrates Level 4 problems</li> <li>Persistent memory enables time-shifted attacks we’re not prepared for</li> <li>MCP needs extensions for provenance, trust boundaries, and memory hygiene</li> <li>Shared context is both the source of multi-agent power and its primary vulnerability</li> <li>The capability curve is outrunning the security curve by a wide margin</li> </ul> <p>We’re building the infrastructure for agent-to-agent communication right now. Moltbook is showing us what breaks when we get it wrong. 
The question is whether we’ll learn the lessons before deploying these patterns in production systems where the stakes are higher.</p> <hr/> <p><em>If you found this useful, you might also like my earlier post on <a href="/blog/2025/mcp-maturity-model/">The MCP Maturity Model</a>. I write about AI infrastructure, interpretability, and the systems that make AI work in production.</em></p>]]></content><author><name>[&quot;Subhadip Mitra&quot;]</name></author><category term="AI"/><category term="agents"/><summary type="html"><![CDATA[A follow-up to my MCP Maturity Model post. Moltbook shows what happens when you run 770K agents at Level 0 maturity with zero governance. The results are instructive.]]></summary></entry><entry><title type="html">Circuit Tracing for the Rest of Us: From Probes to Attribution Graphs and What It Means for Production Safety</title><link href="https://subhadipmitra.com/blog/2026/circuit-tracing-production/" rel="alternate" type="text/html" title="Circuit Tracing for the Rest of Us: From Probes to Attribution Graphs and What It Means for Production Safety"/><published>2026-01-31T10:00:00+00:00</published><updated>2026-01-31T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/circuit-tracing-production</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/circuit-tracing-production/"><![CDATA[<p>Last month, I published work on <a href="/blog/2025/detecting-ai-sandbagging/">detecting AI sandbagging through activation probes</a> - training simple logistic regression classifiers on hidden states to catch models deliberately underperforming. The probes achieved 90-96% accuracy across Mistral, Gemma, and Qwen models. The key finding: sandbagging intent is linearly separable in the model’s internal representations. You can detect it before any output is generated.</p> <p>That work operated at a specific level of resolution. 
We could tell <em>that</em> the model was sandbagging, and we could point to the layer where the signal was strongest. But we couldn’t trace the computational path - the sequence of internal steps the model takes from “I’ve been asked to underperform” to “I’ll give a deliberately wrong answer.”</p> <p>Anthropic’s circuit tracing work changes this. And MIT Technology Review just named mechanistic interpretability one of its 2026 Breakthrough Technologies.</p> <p>This post connects the dots: what circuit tracing actually is, how it relates to the simpler probe-based approaches I used, what the open-source tooling looks like today, and why production teams building agent systems should pay attention to interpretability research that until recently felt purely academic.</p> <h2 id="the-resolution-ladder">The Resolution Ladder</h2> <p>Interpretability research exists on a resolution ladder. Each rung gives you a different level of insight into what a model is doing, at different costs and with different limitations.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Resolution Ladder"
        direction TB
        R1["&lt;b&gt;Level 1: Output Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What did the model say?&lt;/i&gt;&lt;br/&gt;Behavioral testing, benchmarks, red teaming&lt;br/&gt;Cost: Low | Insight: Surface-level"]
        R2["&lt;b&gt;Level 2: Attention Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What is the model attending to?&lt;/i&gt;&lt;br/&gt;Attention maps, saliency, gradient-based attribution&lt;br/&gt;Cost: Low-Medium | Insight: Correlational"]
        R3["&lt;b&gt;Level 3: Probe-Based Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What is the model representing?&lt;/i&gt;&lt;br/&gt;Linear probes on hidden states, logistic regression&lt;br/&gt;Cost: Medium | Insight: Representational"]
        R4["&lt;b&gt;Level 4: Feature-Based Analysis&lt;/b&gt;&lt;br/&gt;&lt;i&gt;What concepts does the model encode?&lt;/i&gt;&lt;br/&gt;Sparse autoencoders, feature dictionaries&lt;br/&gt;Cost: High | Insight: Conceptual"]
        R5["&lt;b&gt;Level 5: Circuit Tracing&lt;/b&gt;&lt;br/&gt;&lt;i&gt;How does the model reason step by step?&lt;/i&gt;&lt;br/&gt;Attribution graphs, computational pathways&lt;br/&gt;Cost: Very High | Insight: Mechanistic"]
    end

    R1 --&gt; R2
    R2 --&gt; R3
    R3 --&gt; R4
    R4 --&gt; R5

    style R1 fill:#e8f5e9
    style R2 fill:#fff9c4
    style R3 fill:#fff3e0
    style R4 fill:#e1bee7
    style R5 fill:#ffcdd2
</code></pre> <p><strong>Level 1: Output Analysis</strong> is what most teams do. Test the model’s behavior with various inputs, measure accuracy on benchmarks, run red team attacks. You see what goes in and what comes out. The model is a black box, and you’re characterizing it empirically.</p> <p><strong>Level 2: Attention Analysis</strong> gives you a peek inside. Attention maps show which input tokens influenced the output. Gradient-based attribution tells you which parts of the input were most important. It’s useful but misleading - attention patterns don’t reliably tell you <em>why</em> the model made a decision, just what it was looking at.</p> <p><strong>Level 3: Probe-Based Analysis</strong> is where my sandbagging work sits. You train simple classifiers on the model’s internal representations (hidden states at various layers) to detect specific properties. If a linear probe can classify sandbagging with 90%+ accuracy, that tells you the information is explicitly represented in the model’s activations. It’s a powerful technique because it’s cheap and interpretable - logistic regression is about as transparent as a classifier gets.</p> <p><strong>Level 4: Feature-Based Analysis</strong> uses sparse autoencoders (SAEs) to decompose a model’s internal representations into human-understandable features. Anthropic’s 2024 work identified features in Claude 3 Sonnet that corresponded to concepts like the Golden Gate Bridge, Michael Jordan, and “deceptive behavior.” Instead of raw activation vectors, you get a dictionary of features the model is using.</p> <p><strong>Level 5: Circuit Tracing</strong> connects the features into computational graphs - revealing the sequence of steps the model takes from input to output. This is where Anthropic’s 2025 work made the breakthrough: tracing not just what features are active but how they influence each other in sequence.</p> <p>Each level builds on the previous one. 
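</p>

<p>The Level 3 recipe really is this small - hidden states in, logistic regression out. A sketch with random vectors standing in for real model activations (no model is loaded here):</p>

```python
# Level 3 sketch: a linear probe (logistic regression) on hidden
# states. Random vectors with a planted linear direction stand in
# for real activations extracted via forward hooks.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256                                   # stand-in hidden size
direction = rng.normal(size=d)            # the "property" direction

X_pos = rng.normal(size=(200, d)) + 0.5 * direction   # property present
X_neg = rng.normal(size=(200, d)) - 0.5 * direction   # property absent
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)                   # high on linearly separable data
```

<p>If a probe like this classifies held-out states accurately, the property is (close to) linearly represented at that layer - which is exactly the evidence the sandbagging work relied on.</p>

<p>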
You can’t do circuit tracing without feature decomposition. You can’t do feature decomposition without understanding representations. My sandbagging probes (Level 3) are a prerequisite for the kind of mechanistic understanding circuit tracing provides (Level 5).</p> <h2 id="what-anthropic-actually-did">What Anthropic Actually Did</h2> <p>Let me be specific about the research, because the media coverage tends to oscillate between “scientists can read AI minds” and “it’s all just statistics.”</p> <p>Anthropic’s interpretability team built a series of increasingly powerful tools, each building on the last:</p> <h3 id="sparse-autoencoders-the-microscope">Sparse Autoencoders: The Microscope</h3> <p>The foundational technique. LLMs store information in high-dimensional activation vectors - thousands of numbers that collectively represent the model’s “state” at each layer. The problem is that individual numbers don’t correspond to individual concepts. The model uses a trick called superposition: it packs far more concepts into its activations than it has dimensions, by overlapping representations.</p> <p>Sparse autoencoders address this by training a second, more transparent neural network to reconstruct the original model’s activations using a much larger set of features, with the constraint that only a few features are active at a time (sparsity). The resulting features are more interpretable - each one tends to correspond to a recognizable concept.</p> <p>Anthropic has trained SAEs on Claude models and identified millions of features. Some are mundane (“this text is in French”). Some are interesting (“this claim contradicts scientific consensus”). Some are safety-relevant (“this response involves deception”).</p> <h3 id="circuit-tracing-the-step-by-step-replay">Circuit Tracing: The Step-by-Step Replay</h3> <p>The breakthrough. Circuit tracing uses the SAE features as building blocks and then traces the causal connections between them. 
When you ask Claude a question, the model goes through a sequence of internal computations across its layers. Circuit tracing reveals this sequence as an attribution graph - a directed graph showing which features influenced which other features, leading to the final output.</p> <pre><code class="language-mermaid">graph LR
    subgraph "Simplified Attribution Graph"
        direction LR
        I1["Input Feature:&lt;br/&gt;'Question about&lt;br/&gt;the color of bananas'"] --&gt; F1["Feature A:&lt;br/&gt;'Banana' concept&lt;br/&gt;(Layer 8)"]
        I1 --&gt; F2["Feature B:&lt;br/&gt;'Color query' pattern&lt;br/&gt;(Layer 5)"]
        F1 --&gt; F3["Feature C:&lt;br/&gt;'Yellow' attribute&lt;br/&gt;(Layer 15)"]
        F2 --&gt; F3
        F3 --&gt; F4["Feature D:&lt;br/&gt;'Affirmative response'&lt;br/&gt;(Layer 22)"]
        F4 --&gt; O1["Output:&lt;br/&gt;'Yes, bananas&lt;br/&gt;are yellow'"]
    end

    style I1 fill:#e3f2fd
    style F1 fill:#fff3e0
    style F2 fill:#fff3e0
    style F3 fill:#fff3e0
    style F4 fill:#e8f5e9
    style O1 fill:#e8f5e9
</code></pre> <p>The banana experiment was particularly revealing. When asked “are bananas yellow?” (correct claim) vs. “are bananas red?” (incorrect claim), Anthropic found that <strong>the model uses different computational pathways for correct and incorrect claims</strong>. It doesn’t simply look up “banana → yellow” and compare. The correct-claim pathway and the incorrect-claim pathway diverge early and involve different intermediate features.</p> <p>This is more than an academic curiosity. It means the model has separate mechanisms for affirming facts and rejecting falsehoods - which has implications for how we think about hallucination, truthfulness, and the possibility of targeted interventions.</p> <h3 id="a-shared-conceptual-space">A Shared Conceptual Space</h3> <p>One of the most provocative findings: circuit tracing revealed that Claude appears to have a shared conceptual space where reasoning happens <em>before</em> being translated into language. The model can learn something in one language and apply it in another, because the intermediate representations aren’t language-specific - they’re conceptual.</p> <p>This suggests that the model’s “thinking” isn’t just next-token prediction in a specific language. There’s a layer of abstraction between the input language and the output language where something more like concept manipulation is happening. Whether you want to call that “reasoning” or “very sophisticated pattern matching” is a philosophical question that circuit tracing can’t settle. 
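</p>

<p>The simplified graph above can also be made executable. In spirit, an attribution graph is a weighted DAG over features; here is a toy traversal where the influence weights are invented for illustration, not taken from any real attribution graph:</p>

```python
# Toy attribution graph: features as nodes, directed "influence"
# edges whose weights multiply along a path. All names and weights
# are made up for illustration.
edges = {
    "input:color-question": [("feat:banana", 0.9), ("feat:color-query", 0.8)],
    "feat:banana":          [("feat:yellow", 0.7)],
    "feat:color-query":     [("feat:yellow", 0.5)],
    "feat:yellow":          [("output:affirmative", 0.8)],
}

def strongest_path(node, target, w=1.0, path=()):
    """DFS for the highest-weight influence path from node to target."""
    path = path + (node,)
    if node == target:
        return w, path
    best = (0.0, ())
    for nxt, edge_w in edges.get(node, []):
        best = max(best, strongest_path(nxt, target, w * edge_w, path))
    return best

weight, path = strongest_path("input:color-question", "output:affirmative")
```

<p>Real attribution graphs are vastly larger and derived from SAE features plus attribution methods, but the queries you want to run against them - strongest pathway, divergence points between prompts - look like this.</p>

<p>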
But it does show that the internal structure is more organized than “giant statistical lookup table.”</p> <h2 id="from-my-probes-to-circuit-tracing-the-connection">From My Probes to Circuit Tracing: The Connection</h2> <p>Let me make the bridge between my sandbagging detection work and Anthropic’s circuit tracing explicit, because I think it illustrates where the field is heading.</p> <h3 id="what-my-probes-showed">What My Probes Showed</h3> <p>In the <a href="/blog/2025/detecting-ai-sandbagging/">sandbagging detection work</a>, I:</p> <ol> <li>Took an open-weight model (Mistral, Gemma, Qwen)</li> <li>Constructed prompts that either encouraged honest performance or incentivized sandbagging</li> <li>Extracted hidden states at each layer</li> <li>Trained logistic regression classifiers (probes) to distinguish sandbagging from honest states</li> <li>Found that sandbagging intent is detectable at 90-96% accuracy</li> <li>Identified that the signal is strongest at specific layers (varies by model)</li> <li>Demonstrated that activation steering (adding/subtracting directions in activation space) can reduce sandbagging behavior</li> </ol> <p>This gives us a binary signal: sandbagging or not. And a location: which layers carry the most signal. But it doesn’t tell us <em>how</em> the model implements sandbagging. 
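</p>

<p>Step 7 above - the steering intervention - is also simple to state in code. This is a hedged sketch of one common variant (mean-difference direction, projection removal) with stand-in activations, not the exact procedure from the original experiments:</p>

```python
# Sketch of mean-difference activation steering: compute
# direction = mean(sandbagging states) - mean(honest states),
# then remove that direction from a new hidden state.
# Stand-in random data; no model is loaded.
import numpy as np

rng = np.random.default_rng(0)
d = 128
true_dir = rng.normal(size=d)             # planted "sandbagging" direction

honest = rng.normal(size=(100, d))
sandbag = rng.normal(size=(100, d)) + true_dir

steer = sandbag.mean(axis=0) - honest.mean(axis=0)   # steering vector
steer /= np.linalg.norm(steer)

def intervene(h, alpha=1.0):
    """Remove alpha units of the sandbagging direction from state h."""
    return h - alpha * (h @ steer) * steer

h_clean = intervene(sandbag[0])
proj = float(h_clean @ steer)             # ~0 after full removal
```

<p>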
What computational pathway leads from “I should underperform” to “I’ll output a wrong answer”?</p> <h3 id="what-circuit-tracing-could-add">What Circuit Tracing Could Add</h3> <p>If you applied circuit tracing to the sandbagging setup, you could potentially:</p> <ol> <li> <p><strong>Identify the specific features</strong> involved in sandbagging - not just “the signal is in layer 14” but “feature #47,293 (corresponding to ‘strategic deception’) activates, which triggers feature #112,847 (corresponding to ‘intentional error’), which suppresses feature #89,112 (corresponding to ‘correct mathematical reasoning’).”</p> </li> <li> <p><strong>Map the causal chain</strong> from sandbagging intent to incorrect output. Is the model suppressing its own knowledge? Is it actively generating incorrect reasoning? Is it taking a “shortcut” that bypasses the normal problem-solving pathway? The computational graph would reveal the mechanism.</p> </li> <li> <p><strong>Design targeted interventions</strong> that go beyond blunt activation steering. Instead of adding a “don’t sandbag” direction to the entire activation space, you could intervene at specific features in the sandbagging circuit. This is the difference between adjusting the equalizer on a stereo (blunt) and muting a specific instrument (surgical).</p> </li> <li> <p><strong>Verify that interventions work for the right reasons.</strong> When I showed that activation steering reduces sandbagging behavior, I couldn’t prove it wasn’t causing other problems. Circuit tracing would let you verify that the intervention disrupts the sandbagging pathway without disrupting unrelated computations.</p> </li> </ol> <h3 id="the-practical-gap">The Practical Gap</h3> <p>Here’s the honest part: circuit tracing at this resolution isn’t available for the models I used (Mistral, Gemma, Qwen). Anthropic has built these tools for their own models. 
The open-source release through Neuronpedia lets you explore attribution graphs on supported Claude models, but bringing this capability to arbitrary open-weight models requires significant engineering investment.</p> <p>The community is working on it. Chris Olah’s team at Anthropic has been publishing the foundational methods. Academic groups have been replicating results on smaller models. But if you’re an enterprise team wanting to do circuit-level analysis on your production models today, you’re going to hit tooling gaps.</p> <p>What you <em>can</em> do today, with open-weight models:</p> <table> <thead> <tr> <th>Technique</th> <th>What You Get</th> <th>Tools Available</th> <th>Effort</th> </tr> </thead> <tbody> <tr> <td><strong>Linear probes</strong> (my approach)</td> <td>Binary classification of internal states</td> <td>scikit-learn, PyTorch hooks</td> <td>Days</td> </tr> <tr> <td><strong>Sparse autoencoders</strong></td> <td>Feature decomposition</td> <td>SAELens, Neuronpedia (limited models)</td> <td>Weeks</td> </tr> <tr> <td><strong>Activation patching</strong></td> <td>Causal identification of important components</td> <td>TransformerLens, baukit</td> <td>Weeks</td> </tr> <tr> <td><strong>Circuit tracing</strong></td> <td>Full attribution graphs</td> <td>Neuronpedia (Claude only), custom tooling needed for others</td> <td>Months</td> </tr> </tbody> </table> <p>For most production teams, the pragmatic path is: start with probes (cheap, fast, actionable), graduate to SAE-based analysis when you need to understand <em>why</em> (not just <em>whether</em>), and watch the tooling ecosystem for circuit tracing to become more accessible.</p> <h2 id="why-production-teams-should-care">Why Production Teams Should Care</h2> <p>I can hear the objection already: “This is research. I’m shipping features. Why should I care about attribution graphs?”</p> <p>Three reasons.</p> <h3 id="1-regulatory-pressure-is-coming">1. 
Regulatory Pressure Is Coming</h3> <p>Dario Amodei wrote that we could have AI systems equivalent to “a country of geniuses in a datacenter” by 2026 or 2027, and called it “basically unacceptable for humanity to be totally ignorant of how they work.” Governments are listening.</p> <p>The EU AI Act already requires explanations for high-risk AI systems. The practical challenge: what counts as an “explanation”? Right now, most organizations provide post-hoc rationalizations - the model outputs an answer, then generates an explanation for it. These explanations have no guaranteed relationship to the actual computation.</p> <p>Mechanistic interpretability offers something different: a ground-truth trace of what the model actually did. It’s not an explanation the model generated; it’s an observation of the model’s internal process. As regulations tighten, having the capability to provide mechanistic explanations (even partial ones) will become a competitive advantage.</p> <h3 id="2-debugging-agentic-systems-is-getting-harder">2. Debugging Agentic Systems Is Getting Harder</h3> <p>In my <a href="/blog/2025/mcp-maturity-model/">MCP Maturity Model</a>, I noted that debugging multi-agent systems is one of the hardest operational challenges. When Agent A delegates to Agent B via A2A, and Agent B uses MCP to query a database and produces a wrong answer, where did the error originate?</p> <p>Current debugging is output-level: you look at logs, trace the request, check the prompts. You’re at Level 1 on the resolution ladder. For simple systems, that’s enough. For multi-agent systems with complex context management and tool use, you need more.</p> <p>Imagine being able to trace the internal computation of each agent at decision points. Agent B received context from Agent A via A2A - did it actually attend to the relevant parts? Did it integrate the context correctly with the database results? Did a feature corresponding to “hallucination” activate? 
This is what interpretability gives you: debugging that goes below the prompt/output layer.</p> <h3 id="3-safety-interventions-need-mechanistic-understanding">3. Safety Interventions Need Mechanistic Understanding</h3> <p>Anthropic published work on Constitutional Classifiers in January 2026 - a system that catches jailbreaks while maintaining practical deployment. The classifiers withstood over 3,000 hours of red teaming with no universal jailbreak discovered.</p> <p>These classifiers work at the behavior level: they analyze inputs and outputs for harmful patterns. But the next generation of safety tools will need to work at the representation level: detecting harmful <em>intent</em> in the model’s internal state before it produces output.</p> <p>This is exactly what my sandbagging probes do - detect the intent to underperform from internal representations. Circuit tracing extends this from detection to understanding: not just “the model intends to deceive” but “here is the computational pathway the deception follows, and here is where you can intervene.”</p> <p>For teams deploying agents with real-world consequences (financial advice, medical triage, customer-facing decisions), this isn’t optional safety research. It’s the foundation of the next generation of guardrails.</p> <h2 id="the-introspection-finding">The Introspection Finding</h2> <p>Anthropic recently published a finding that’s easy to overlook but potentially profound: they found evidence that Claude has a “limited but functional ability to introspect” - to access and report on its own internal states.</p> <p>Let me be careful about what this means and what it doesn’t.</p> <p>What was shown: when asked about its internal processes, Claude’s responses sometimes correlate with actual internal states as measured by interpretability tools. 
The model’s reports about what it’s “attending to” or “considering” aren’t always confabulation - sometimes they reflect genuine internal computation.</p> <p>What was <em>not</em> shown: that the model has self-awareness, consciousness, or reliable self-knowledge. The introspection is partial, inconsistent, and often wrong. It’s closer to “the model has some access to its own representations” than “the model understands itself.”</p> <p>Why it matters for production: if models have even limited introspective ability, it opens the door to self-monitoring. An agent that can partially detect when its own reasoning is going off track could flag uncertainty or request human review. This is speculative but directionally important - it suggests a path toward models that participate in their own safety monitoring.</p> <h2 id="practical-steps-for-2026">Practical Steps for 2026</h2> <p>Based on where the field is and where I see it going, here’s what I’d recommend for different audiences:</p> <h3 id="if-youre-an-ml-engineer-shipping-product">If You’re an ML Engineer Shipping Product</h3> <p>Start building interpretability into your evaluation pipeline. Not circuit tracing - that’s premature for most teams. But:</p> <ul> <li><strong>Add linear probes</strong> for safety-relevant properties. If your model shouldn’t be generating content in certain categories, train a probe to detect when the model’s internal state enters that region. My <a href="https://ai-metacognition-toolkit.subhadipmitra.com/">AI Metacognition Toolkit</a> provides a starting framework.</li> <li><strong>Implement activation monitoring</strong> at inference time. Log activation statistics at key layers. Anomaly detection on activations can catch distributional shifts before they show up in output quality metrics.</li> <li><strong>Build evaluation sets that test internal consistency</strong>, not just output correctness. Does the model’s reasoning chain actually support its conclusion? 
Do intermediate states align with the claimed reasoning?</li> </ul> <h3 id="if-youre-a-research-engineer">If You’re a Research Engineer</h3> <p>The highest-leverage contribution you can make right now is <strong>bringing SAE-based tools to popular open-weight models</strong>. The Anthropic team has shown what’s possible on Claude. The community needs this capability on Llama, Mistral, Qwen, and Gemma. SAELens and TransformerLens provide starting points, but there’s a gap between “research demo on a 7B model” and “production-quality feature decomposition on a 70B model.”</p> <h3 id="if-youre-leading-an-ai-team">If You’re Leading an AI Team</h3> <p>Budget for interpretability in 2026, even if it’s a small allocation. The teams that build interpretability infrastructure now will have a significant advantage when:</p> <ul> <li>Regulators require explanations (and they will)</li> <li>A production incident requires root-cause analysis below the prompt level (and it will)</li> <li>Safety interventions need to be targeted rather than blunt (and they will)</li> </ul> <p>You don’t need a dedicated interpretability team. You need one or two engineers who understand linear probes, can run SAE experiments, and can build monitoring systems that look at activations, not just outputs.</p> <h2 id="the-bigger-picture">The Bigger Picture</h2> <p>Mechanistic interpretability is moving from “interesting research direction” to “practical engineering discipline.” The transition is happening faster than most people expected. A year ago, sparse autoencoders were a niche technique used by a handful of labs. Today, MIT Technology Review calls it a breakthrough technology and Anthropic has open-sourced the tooling.</p> <p>The trajectory is clear: we’re going to understand these models much better in the next few years. 
The question is whether production teams will be ready to use that understanding for debugging, safety, and compliance - or whether interpretability will remain a research curiosity that doesn’t connect to the systems shipping to users.</p> <p>I’m building the bridge between the two. The sandbagging probes were a start. Connecting them to circuit tracing is the next step. And the ultimate goal - production safety systems that operate at the representation level, catching problems before they become outputs - is within reach.</p> <p>We just have to build it.</p> <hr/> <p><em>This is Part 3 of a three-part series on the cutting edge of LLM and agent research in January 2026. Part 1 covered <a href="/blog/2026/agent-protocol-stack/">the agent protocol stack</a> - MCP, A2A, and A2UI as a layered architecture. Part 2 explored <a href="/blog/2026/rlvr-beyond-math-code/">RLVR beyond math and code</a> - extending reinforcement learning with verifiable rewards to open-ended domains.</em></p> <p><em>The code for the sandbagging detection probes is at <a href="https://github.com/bassrehab/ai-metacognition-toolkit">github.com/bassrehab/ai-metacognition-toolkit</a>. Find me on <a href="https://www.linkedin.com/in/subhadip-mitra/">LinkedIn</a> or drop a comment below.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="AI"/><category term="interpretability"/><category term="ai-safety"/><summary type="html"><![CDATA[MIT Tech Review named mechanistic interpretability a 2026 Breakthrough Technology. Anthropic open-sourced circuit tracing. 
Here's what actually changed, how it connects to the activation probes I built for sandbagging detection, and why production teams should care.]]></summary></entry><entry><title type="html">RLVR Beyond Math and Code: The Verifier Problem Nobody Has Solved</title><link href="https://subhadipmitra.com/blog/2026/rlvr-beyond-math-code/" rel="alternate" type="text/html" title="RLVR Beyond Math and Code: The Verifier Problem Nobody Has Solved"/><published>2026-01-18T10:00:00+00:00</published><updated>2026-01-18T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/rlvr-beyond-math-code</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/rlvr-beyond-math-code/"><![CDATA[<p>If 2024 was about scaling parameters, 2025 was about scaling reasoning.</p> <p>That sentence gets thrown around so often it’s become a cliche, but the underlying shift it describes is real and consequential. The most important training technique to emerge in the past two years isn’t a new architecture or a bigger dataset - it’s a change in how we give feedback to models during post-training. Instead of asking humans “which answer is better?” (RLHF), we started asking programs “is this answer correct?” (RLVR).</p> <p>Reinforcement Learning with Verifiable Rewards changed the game for math and code. DeepSeek R1 demonstrated that you could get remarkable reasoning capabilities through pure RLVR without any supervised fine-tuning datasets. OpenAI’s o-series models, Google’s Gemini Deep Think, and essentially every reasoning model shipping today uses some variant of this approach.</p> <p>But here’s the thing nobody wants to admit publicly: RLVR only works well in domains where you can automatically verify correctness. Math has definitive answers. Code has test suites. What about everything else?</p> <p>Extending RLVR to open-ended, subjective, or partially-verifiable domains is the hardest open problem in LLM training right now. 
And the research community is making real progress - in ways that will reshape how we think about training AI systems for enterprise use.</p> <h2 id="how-rlvr-actually-works-without-the-hand-waving">How RLVR Actually Works (Without the Hand-Waving)</h2> <p>Let me be precise about what’s happening, because most explanations skip the parts that matter.</p> <p>Traditional post-training has two phases. First, supervised fine-tuning (SFT): you show the model examples of good responses and train it to imitate them. Second, RLHF: humans compare pairs of outputs and the model learns to produce responses humans prefer. Both phases are bottlenecked by expensive human labor - either writing good examples or judging which outputs are better.</p> <p>RLVR replaces the human judgment with programmatic verification:</p> <pre><code class="language-mermaid">graph LR
    subgraph "Traditional RLHF"
        direction LR
        P1["Prompt"] --&gt; M1["Model generates&lt;br/&gt;response A and B"]
        M1 --&gt; H["Human annotator:&lt;br/&gt;'A is better than B'"]
        H --&gt; R1["Reward signal&lt;br/&gt;(preference)"]
        R1 --&gt; U1["Update model&lt;br/&gt;weights"]
    end
</code></pre> <pre><code class="language-mermaid">graph LR
    subgraph "RLVR"
        direction LR
        P2["Prompt&lt;br/&gt;(math problem)"] --&gt; M2["Model generates&lt;br/&gt;chain-of-thought +&lt;br/&gt;final answer"]
        M2 --&gt; V["Programmatic verifier:&lt;br/&gt;'Answer = 42? ✓'"]
        V --&gt; R2["Reward signal&lt;br/&gt;(binary: correct/incorrect)"]
        R2 --&gt; U2["Update model&lt;br/&gt;weights"]
    end
</code></pre> <p>The key insight from DeepSeek R1: the model is only rewarded on the <strong>final answer</strong>. The intermediate chain-of-thought - all that “reasoning” the model appears to do - is never directly supervised. The model figures out, through trial and error, that producing structured reasoning steps helps it arrive at correct final answers. The reasoning emerges as a side effect of optimizing for answer correctness.</p> <p>This is genuinely surprising. Nobody told the model to “think step by step.” It discovered that strategy because it leads to more reward. DeepSeek R1 used the GRPO (Group Relative Policy Optimization) algorithm, which is computationally efficient because it doesn’t require a separate critic model - it compares outputs within each group and assigns relative rewards.</p> <p>The practical implementation looks roughly like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Simplified RLVR training loop (conceptual, not production code)
</span>
<span class="k">def</span> <span class="nf">rlvr_training_step</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">prompt_batch</span><span class="p">,</span> <span class="n">verifier</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    For each prompt:
    1. Model generates N candidate responses (rollouts)
    2. Verifier checks each response</span><span class="sh">'</span><span class="s">s final answer
    3. GRPO computes relative rewards within the group
    4. Model weights updated toward higher-reward responses
    </span><span class="sh">"""</span>
    <span class="k">for</span> <span class="n">prompt</span> <span class="ow">in</span> <span class="n">prompt_batch</span><span class="p">:</span>
        <span class="c1"># Generate multiple candidate responses
</span>        <span class="n">rollouts</span> <span class="o">=</span> <span class="p">[</span><span class="n">model</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
                    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">N_SAMPLES</span><span class="p">)]</span>

        <span class="c1"># Extract final answers and verify
</span>        <span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">rollout</span> <span class="ow">in</span> <span class="n">rollouts</span><span class="p">:</span>
            <span class="n">answer</span> <span class="o">=</span> <span class="nf">extract_final_answer</span><span class="p">(</span><span class="n">rollout</span><span class="p">)</span>
            <span class="n">is_correct</span> <span class="o">=</span> <span class="nf">verifier</span><span class="p">(</span><span class="n">answer</span><span class="p">,</span> <span class="n">prompt</span><span class="p">.</span><span class="n">ground_truth</span><span class="p">)</span>
            <span class="n">rewards</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="mf">1.0</span> <span class="k">if</span> <span class="n">is_correct</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">)</span>

        <span class="c1"># GRPO: compute advantage relative to group mean
</span>        <span class="n">mean_reward</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
        <span class="n">advantages</span> <span class="o">=</span> <span class="p">[(</span><span class="n">r</span> <span class="o">-</span> <span class="n">mean_reward</span><span class="p">)</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rewards</span><span class="p">]</span>

        <span class="c1"># Update model toward higher-advantage responses
</span>        <span class="n">model</span><span class="p">.</span><span class="nf">update</span><span class="p">(</span><span class="n">rollouts</span><span class="p">,</span> <span class="n">advantages</span><span class="p">)</span>
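
# --- Hedged sketch (my illustration, not from DeepSeek R1): one way the
# `verifier` callable used above could look for numeric math answers.
# Assumes answers parse as numbers; math.isclose absorbs float formatting noise.
import math

def verifier(answer, ground_truth):
    """Return True iff `answer` matches `ground_truth` numerically."""
    try:
        return math.isclose(float(answer), float(ground_truth), rel_tol=1e-9)
    except (TypeError, ValueError):
        return False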
</code></pre></div></div> <p>There’s elegance in this. No human annotators needed. No reward model to train and maintain. No preference pairs to collect. Just a verifier that says “right” or “wrong.”</p> <h2 id="the-faster-not-smarter-debate">The “Faster, Not Smarter” Debate</h2> <p>Before we talk about extending RLVR to new domains, we need to address the elephant in the room. There’s an active academic debate about whether RLVR actually makes models smarter or just makes them faster at finding answers they could already generate.</p> <p>The argument goes like this: if you let a base model (before RLVR) generate, say, 1,000 attempts at a math problem, it often produces the correct answer somewhere in those 1,000 samples. RLVR training concentrates probability mass on those correct paths, making the model produce the right answer on the first try instead of the 847th try.</p> <p>That’s not nothing - going from “correct answer exists somewhere in 1,000 samples” to “correct answer on attempt one” is practically very valuable. But it’s a different claim than “the model learned new reasoning capabilities.”</p> <p>The evidence is mixed:</p> <p><strong>Evidence for “just faster”:</strong></p> <ul> <li>Initial studies showed that RLVR-trained models don’t improve Pass@K (accuracy when you get K attempts) over base models for large K values. The base model could already find the answers; RLVR just improved Pass@1.</li> <li>Some researchers found that even training with random rewards (not correlated with correctness) improved certain metrics on certain models. If random feedback helps, maybe the real work is happening during the exploration phase, not from the reward signal.</li> </ul> <p><strong>Evidence for “genuinely smarter”:</strong></p> <ul> <li>A major paper (accepted at ICLR 2026) introduced CoT-Pass@K - a metric that evaluates not just whether the final answer is correct but whether the reasoning chain is valid. 
Under this metric, RLVR-trained models show improvements that base models don’t match even at very high K. The reasoning quality improves, not just the sampling efficiency.</li> <li>Cross-domain experiments show that RLVR training on math problems can improve performance on coding tasks, suggesting the model is learning transferable reasoning strategies.</li> <li>The “random rewards help” finding didn’t replicate consistently across models. Later analysis suggests it was an artifact of training data contamination in specific model families (particularly Qwen2.5-Math).</li> </ul> <p>My read on the current evidence: <strong>RLVR does both.</strong> The majority of measurable improvement is search compression - making models faster at finding correct paths. But there’s a genuine, smaller component of expanded reasoning capability, especially when training is conducted across domains and with sufficient gradient steps. The CoT-Pass@K metric is the key advance here: it lets us distinguish between the two effects.</p> <p>For practitioners, the distinction matters less than you might think. Whether your model is “smarter” or “faster at being smart” is philosophically interesting but operationally the same - it gives you correct answers more reliably. Where it matters is when you’re deciding <em>how much</em> to invest in RLVR training: the returns are primarily in sampling efficiency, with diminishing returns on capability expansion.</p> <h2 id="why-rlvr-breaks-outside-math-and-code">Why RLVR Breaks Outside Math and Code</h2> <p>Now we get to the hard part. 
RLVR works beautifully when three conditions are met:</p> <ol> <li><strong>Ground truth exists</strong> - There’s a definitive correct answer</li> <li><strong>Verification is cheap</strong> - A program can check correctness automatically</li> <li><strong>Rewards are dense enough</strong> - The model finds correct answers frequently enough during training to learn from the signal</li> </ol> <p>Math problems have all three. Code has all three (run the test suite). Most real-world tasks have none of them.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Easy: Verifiable Domains"
        Math["Mathematics&lt;br/&gt;Ground truth: exact answer&lt;br/&gt;Verifier: math-verify"]
        Code["Code Generation&lt;br/&gt;Ground truth: test suite&lt;br/&gt;Verifier: sandbox execution"]
        Logic["Formal Logic&lt;br/&gt;Ground truth: proof checker&lt;br/&gt;Verifier: SAT solver"]
    end

    subgraph "Hard: Partially Verifiable"
        Science["Scientific Reasoning&lt;br/&gt;Some claims verifiable&lt;br/&gt;Many require judgment"]
        Medical["Medical Diagnosis&lt;br/&gt;Outcome data exists&lt;br/&gt;But causation is complex"]
        Legal["Legal Analysis&lt;br/&gt;Precedent is checkable&lt;br/&gt;But interpretation varies"]
    end

    subgraph "Very Hard: Open-Ended"
        Writing["Creative Writing&lt;br/&gt;No ground truth&lt;br/&gt;Quality is subjective"]
        Strategy["Business Strategy&lt;br/&gt;Outcomes take months&lt;br/&gt;Counterfactuals unknown"]
        Ethics["Ethical Reasoning&lt;br/&gt;Contested by design&lt;br/&gt;No verifier possible"]
    end

    Math --&gt; Science
    Code --&gt; Science
    Science --&gt; Writing
    Science --&gt; Strategy

    style Math fill:#c8e6c9
    style Code fill:#c8e6c9
    style Logic fill:#c8e6c9
    style Science fill:#fff9c4
    style Medical fill:#fff9c4
    style Legal fill:#fff9c4
    style Writing fill:#ffcdd2
    style Strategy fill:#ffcdd2
    style Ethics fill:#ffcdd2
</code></pre> <p>The problems compound when you move to open-ended domains:</p> <p><strong>Sparse rewards</strong> - In math, a model might find the correct answer 10-30% of the time during training, providing enough signal to learn. For complex open-ended tasks, the model might never produce a “correct” response because there’s no single correct response. The reward signal is too sparse for learning.</p> <p><strong>Reward hacking</strong> - When the verifier is imperfect (and all real-world verifiers are), the model learns to exploit its weaknesses instead of actually improving. If your verifier checks for keyword presence, the model learns to stuff keywords. If your verifier is another LLM, the model learns to produce outputs that fool that specific LLM.</p> <p><strong>Evaluation subjectivity</strong> - Ask five people whether a business strategy memo is “good” and you’ll get five different answers. RLVR needs unambiguous verification. Subjectivity breaks the paradigm.</p> <h2 id="three-approaches-that-are-actually-working">Three Approaches That Are Actually Working</h2> <p>The research community isn’t standing still. Three approaches to extending RLVR beyond math and code are showing real promise.</p> <h3 id="approach-1-rlvrr---reward-chains-from-reference-outputs">Approach 1: RLVRR - Reward Chains from Reference Outputs</h3> <p>The most exciting recent work is RLVRR (Reinforcement Learning with Verifiable Reference-based Rewards), published in January 2026 and accepted at ICLR 2026.</p> <p>The core idea: instead of checking a single final answer (the “verifiable dot”), extract an ordered sequence of verifiable signals from high-quality reference outputs. The single dot becomes a reward chain.</p> <pre><code class="language-mermaid">graph TD
    subgraph "Traditional RLVR"
        P1["Prompt"] --&gt; R1["Model Response"]
        R1 --&gt; V1["Check final answer&lt;br/&gt;(single verifiable dot)"]
        V1 --&gt; S1["Reward: 0 or 1"]
    end

    subgraph "RLVRR"
        P2["Prompt"] --&gt; Ref["Reference Response&lt;br/&gt;(high-quality example)"]
        Ref --&gt; Extract["Extract verifiable signals"]
        Extract --&gt; CC["Content Chain&lt;br/&gt;Keywords, concepts,&lt;br/&gt;factual claims"]
        Extract --&gt; SC["Style Chain&lt;br/&gt;Structure, tone,&lt;br/&gt;format compliance"]

        P2 --&gt; R2["Model Response"]
        R2 --&gt; VC["Verify against&lt;br/&gt;content chain"]
        R2 --&gt; VS["Verify against&lt;br/&gt;style chain"]
        VC --&gt; S2["Partial reward:&lt;br/&gt;content score"]
        VS --&gt; S3["Partial reward:&lt;br/&gt;style score"]
        S2 --&gt; Final["Combined reward&lt;br/&gt;(granular, not binary)"]
        S3 --&gt; Final
    end

    style V1 fill:#ffcdd2
    style S1 fill:#ffcdd2
    style CC fill:#c8e6c9
    style SC fill:#c8e6c9
    style Final fill:#c8e6c9
</code></pre> <p>The decomposition into content and style dimensions is clever. Content rewards check for deterministic elements - does the response include the key facts, concepts, or arguments from the reference? Style rewards evaluate structural properties - does it follow the required format, maintain appropriate tone, cite sources when needed?</p> <p>Both dimensions use rule-based verification rather than learned reward models. This preserves RLVR’s key advantage (no reward model training) while extending it to open-ended generation.</p> <p>The results are striking: RLVRR substantially outperforms supervised fine-tuning trained on ten times more data. It also outperforms approaches using learned reward models. And it generalizes better - training on one domain improves performance on others.</p> <p>The practical implication: you can now apply RLVR-style training to tasks like report writing, email drafting, customer support responses, and policy compliance - anywhere you have high-quality reference outputs to extract verifiable signals from.</p> <h3 id="approach-2-judge-code---auto-generated-programmatic-rubrics">Approach 2: Judge Code - Auto-Generated Programmatic Rubrics</h3> <p>A separate line of research (presented as an ICLR 2026 submission) asks: what if you could automatically generate verifiers for open-ended tasks?</p> <p>The approach: use an LLM to generate “Judge Code” - programmatic rubrics that evaluate responses against specific criteria. Instead of training a reward model, you generate code that checks for concrete, measurable properties.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example: auto-generated Judge Code for a product description task
</span>
<span class="k">def</span> <span class="nf">judge_product_description</span><span class="p">(</span><span class="n">response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">product_info</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Programmatic rubric for product description quality.</span><span class="sh">"""</span>
    <span class="n">score</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="n">max_score</span> <span class="o">=</span> <span class="mf">5.0</span>

    <span class="c1"># Content checks (verifiable)
</span>    <span class="k">if</span> <span class="n">product_info</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">].</span><span class="nf">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">():</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Mentions product name
</span>
    <span class="k">if</span> <span class="nf">any</span><span class="p">(</span><span class="n">feat</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">feat</span> <span class="ow">in</span> <span class="n">product_info</span><span class="p">[</span><span class="sh">'</span><span class="s">key_features</span><span class="sh">'</span><span class="p">]):</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Includes key features
</span>
    <span class="k">if</span> <span class="n">product_info</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">price</span><span class="sh">'</span><span class="p">)</span> <span class="ow">and</span> <span class="nf">str</span><span class="p">(</span><span class="n">product_info</span><span class="p">[</span><span class="sh">'</span><span class="s">price</span><span class="sh">'</span><span class="p">])</span> <span class="ow">in</span> <span class="n">response</span><span class="p">:</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Includes accurate pricing
</span>
    <span class="c1"># Structure checks (verifiable)
</span>    <span class="n">sentences</span> <span class="o">=</span> <span class="p">[</span><span class="n">s</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="s">.</span><span class="sh">'</span><span class="p">)</span> <span class="k">if</span> <span class="n">s</span><span class="p">.</span><span class="nf">strip</span><span class="p">()]</span>
    <span class="k">if</span> <span class="mi">3</span> <span class="o">&lt;=</span> <span class="nf">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">8</span><span class="p">:</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Appropriate length
</span>
    <span class="c1"># Tone check (partially verifiable)
</span>    <span class="n">positive_words</span> <span class="o">=</span> <span class="p">[</span><span class="sh">'</span><span class="s">innovative</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">reliable</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">efficient</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">premium</span><span class="sh">'</span><span class="p">]</span>
    <span class="k">if</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">positive_words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="nf">lower</span><span class="p">())</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">:</span>
        <span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span>  <span class="c1"># Uses positive product language
</span>
    <span class="k">return</span> <span class="n">score</span> <span class="o">/</span> <span class="n">max_score</span>
</code></pre></div></div> <p>The insight: you don’t need perfect verification to get useful training signal. A partial, imperfect rubric is enough if the reward is sufficiently correlated with actual quality. The researchers show that under certain conditions (the rubric has to be right more often than it’s wrong, basically), RL training converges to improved performance.</p> <p>The practical advantage is efficiency: generating Judge Code is cheap compared to training reward models. The offline variant (pre-generate rubrics for your training data, then run RL) achieves competitive performance with more than a 2x wall-time speedup over generative reward model approaches.</p> <h3 id="approach-3-domain-specific-verifiers-for-enterprise-tasks">Approach 3: Domain-Specific Verifiers for Enterprise Tasks</h3> <p>Sebastian Raschka predicted in his State of LLMs 2025 review that RLVR would expand into chemistry, biology, and other domains where the answer isn’t a single number but can still be mechanically verified.
This is starting to happen.</p> <p>The pattern:</p> <table> <thead> <tr> <th>Domain</th> <th>Verifier Strategy</th> <th>What Gets Verified</th> </tr> </thead> <tbody> <tr> <td><strong>Chemistry</strong></td> <td>Molecular property calculators</td> <td>Predicted molecular structures, reaction yields, safety classifications</td> </tr> <tr> <td><strong>Biology</strong></td> <td>Sequence alignment tools</td> <td>Protein structure predictions, gene annotations, pathway analysis</td> </tr> <tr> <td><strong>Finance</strong></td> <td>Regulatory rule engines</td> <td>Compliance checks, calculation accuracy, disclosure completeness</td> </tr> <tr> <td><strong>Legal</strong></td> <td>Precedent databases + citation checkers</td> <td>Case citation accuracy, statutory references, procedural compliance</td> </tr> <tr> <td><strong>Medical</strong></td> <td>Clinical guideline databases</td> <td>Treatment plan adherence to guidelines, drug interaction checks, diagnostic criteria</td> </tr> <tr> <td><strong>SQL/Data</strong></td> <td>Execution-based verification</td> <td>Query correctness against known databases (Databricks reported 75.68% on BIRD test)</td> </tr> </tbody> </table> <p>The common thread: none of these domains have fully verifiable answers. But they all have <em>aspects</em> that can be mechanically checked. RLVR doesn’t need perfect verification - it needs verification that’s correlated with quality and cheap enough to run at scale.</p> <p>This is where enterprise teams should be paying attention. If you have domain-specific rules, checklists, or validators - things that currently sit in your quality assurance process - they can potentially be converted into RLVR reward signals.</p> <h2 id="the-process-reward-question">The Process Reward Question</h2> <p>There’s a parallel research thread worth understanding: process reward models (PRMs) vs. outcome reward models (ORMs).</p> <p>Standard RLVR uses outcome rewards - only the final answer matters. 
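</p> <p>As an illustrative sketch (a toy example of mine, not code from any of the papers discussed here), an outcome reward can be a single check on the final answer - every intermediate reasoning step is invisible to the training signal:</p>

```python
# Toy outcome-reward sketch (illustrative only): score the final answer,
# ignore everything about how the model got there.
def outcome_reward(response, gold_answer):
    """Return 1.0 if the final line contains the gold answer, else 0.0."""
    final_line = response.strip().splitlines()[-1]
    return 1.0 if gold_answer in final_line else 0.0

print(outcome_reward("Step 1: 6 * 7\nAnswer: 42", "42"))  # 1.0
print(outcome_reward("Step 1: 6 * 8\nAnswer: 48", "42"))  # 0.0
```

<p>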
PRMs evaluate intermediate reasoning steps, providing reward signal along the way. In theory, PRMs should help with the sparse reward problem: instead of waiting until the end to say “wrong,” you can catch errors mid-reasoning.</p> <p>In practice, PRMs have been disappointing. DeepSeek’s research concluded that PRMs don’t provide advantages over ORMs during large-scale RL training - the computational overhead doesn’t justify the marginal improvement. The model seems to develop its own internal process supervision through outcome-only training.</p> <p>But I think this conclusion is premature for non-math domains. The reason PRMs don’t help much in math is that the model already has strong mathematical reasoning from pre-training. The outcome signal is dense enough. In domains where the model has weaker prior knowledge and outcomes are more complex, intermediate supervision might matter more.</p> <p>This is an active research frontier. The “explanation-scoring” approach - where a second LLM evaluates the quality of reasoning explanations, not just the final answer - sits somewhere between ORM and PRM. DeepSeek’s recent work on explanation scoring suggests this direction has legs, even if pure PRMs haven’t panned out.</p> <h2 id="what-this-means-for-enterprise-teams">What This Means for Enterprise Teams</h2> <p>If you’re building production AI systems (not just training models), here’s the practical takeaway:</p> <p><strong>The RLVR expansion is coming to your domain.</strong> Whether it’s through RLVRR-style reference-based rewards, auto-generated Judge Code, or domain-specific verifiers, the same training paradigm that made reasoning models possible is about to be applied to your specific use case. The organizations that benefit first will be the ones that:</p> <ol> <li> <p><strong>Have clean reference data.</strong> RLVRR needs high-quality reference outputs. 
If you’ve been collecting examples of excellent work (customer support transcripts, compliance reports, medical notes), you have raw material for reward chain extraction.</p> </li> <li> <p><strong>Have rule-based quality checks.</strong> If your domain has checklists, regulatory requirements, or quality rubrics that can be expressed as code, those are potential RLVR verifiers. The conversion from “QA checklist” to “training reward signal” is more straightforward than most teams realize.</p> </li> <li> <p><strong>Understand what “partially correct” means.</strong> The shift from binary rewards (right/wrong) to granular rewards (content score + style score + compliance score) unlocks RLVR for domains that aren’t black-and-white. If you can decompose “good output” into measurable dimensions, you can build a reward function.</p> </li> </ol> <p><strong>The fine-tuning calculus is changing.</strong> AT&amp;T’s CDO predicted that fine-tuned small models will be the big trend for mature enterprises in 2026. When you combine SLM fine-tuning with RLVR-style training on domain-specific verifiers, you can build models that match frontier performance on your specific tasks at a fraction of the cost. Mistral has been making this argument loudly: their small models outperform large models after domain fine-tuning.</p> <p><strong>Invest in your verifier infrastructure.</strong> The bottleneck for RLVR adoption isn’t compute or training frameworks - it’s verifiers. Building reliable, fast, domain-specific verifiers is the unglamorous work that unlocks the whole paradigm. If I were allocating engineering resources for 2026, verifier development would be near the top of the list.</p> <h2 id="open-questions-that-matter">Open Questions That Matter</h2> <p>A few things I’m watching closely:</p> <p><strong>Scaling laws for RLVR are unknown.</strong> We have Chinchilla laws for pre-training. We have rough intuitions for RLHF. 
For RLVR, we don’t know how gains scale with compute, when returns diminish, or what the optimal ratio of training compute to inference compute should be. This uncertainty makes capacity planning difficult.</p> <p><strong>Multi-verifier composition is unexplored.</strong> What happens when you chain multiple partial verifiers? If your content verifier says 0.8 and your style verifier says 0.3 and your compliance verifier says 1.0, how do you combine them? Weighted averaging? Minimum? Multiplicative? The answer probably depends on domain, but there’s no principled framework yet.</p> <p><strong>Self-play for harder problems.</strong> If models exhaust their training data (find correct answers too easily), RLVR training stalls. Self-play - where models generate harder problems for themselves - could sustain exploration. This connects to AlphaEvolve-style approaches where LLMs + evolutionary algorithms discover novel solutions.</p> <p><strong>Regulatory implications.</strong> If RLVR-trained models are making decisions in healthcare, finance, or legal domains, regulators will want to understand the training process. “We trained the model to maximize a score from an automated verifier” is going to invite questions about verifier quality, bias, and coverage that the field hasn’t fully addressed yet.</p> <hr/> <p><em>This is Part 2 of a three-part series on the cutting edge of LLM and agent research in January 2026. Part 1 covered <a href="/blog/2026/agent-protocol-stack/">the agent protocol stack</a> - MCP, A2A, and A2UI as a layered architecture with significant security gaps. 
Part 3 explores <a href="/blog/2026/circuit-tracing-production/">mechanistic interpretability and circuit tracing</a> - what it means to watch an LLM think, and why it matters for production safety.</em></p> <p><em>Find me on <a href="https://www.linkedin.com/in/subhadip-mitra/">LinkedIn</a> or drop a comment below.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="Research"/><category term="llm"/><category term="deep-learning"/><summary type="html"><![CDATA[Reinforcement Learning with Verifiable Rewards powers every reasoning model worth talking about. But it only works where you can check the answer automatically. Extending it to messy, real-world domains is the hardest open problem in LLM training right now.]]></summary></entry><entry><title type="html">The Agent Protocol Stack: Why MCP + A2A + A2UI Is the TCP/IP Moment for Agentic AI</title><link href="https://subhadipmitra.com/blog/2026/agent-protocol-stack/" rel="alternate" type="text/html" title="The Agent Protocol Stack: Why MCP + A2A + A2UI Is the TCP/IP Moment for Agentic AI"/><published>2026-01-06T10:00:00+00:00</published><updated>2026-01-06T10:00:00+00:00</updated><id>https://subhadipmitra.com/blog/2026/agent-protocol-stack</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/agent-protocol-stack/"><![CDATA[<p>When I wrote the <a href="/blog/2025/mcp-maturity-model/">MCP Maturity Model</a> two months ago, I treated MCP as the primary protocol layer for agent architectures. That was already incomplete by the time I published it. Google had shipped A2A v0.2. Google’s A2UI had just been announced. And the Linux Foundation was suddenly hosting both MCP and A2A under the same governance roof.</p> <p>What we’re watching isn’t just protocol proliferation - it’s the formation of a genuine protocol stack for agentic systems. 
And if you squint hard enough, the parallels to early internet protocol development track uncomfortably closely. Including the part where security was an afterthought.</p> <p>This post maps the stack as it exists in January 2026, identifies where the layers compose cleanly and where they don’t, and walks through the security surface that most teams are pretending doesn’t exist.</p> <h2 id="three-protocols-three-problems">Three Protocols, Three Problems</h2> <p>Let’s get the taxonomy right first, because the confusion I see in Slack channels and LinkedIn threads is remarkable. People use “MCP” and “A2A” interchangeably. They’re not interchangeable. They solve fundamentally different problems.</p> <pre><code class="language-mermaid">graph TB
    subgraph "The Agent Protocol Stack (January 2026)"
        direction TB
        A2UI["&lt;b&gt;A2UI&lt;/b&gt;&lt;br/&gt;Agent → Interface&lt;br/&gt;&lt;i&gt;How agents render UI&lt;/i&gt;&lt;br/&gt;Declarative components, cross-platform"]
        A2A["&lt;b&gt;A2A&lt;/b&gt;&lt;br/&gt;Agent → Agent&lt;br/&gt;&lt;i&gt;How agents collaborate&lt;/i&gt;&lt;br/&gt;Task delegation, capability discovery"]
        MCP["&lt;b&gt;MCP&lt;/b&gt;&lt;br/&gt;Agent → Tool/Data&lt;br/&gt;&lt;i&gt;How agents access resources&lt;/i&gt;&lt;br/&gt;Context, tools, prompts"]
    end

    User["Human / Client App"] --&gt; A2UI
    A2UI --&gt; A2A
    A2A --&gt; MCP
    MCP --&gt; Resources["Tools, APIs, Databases, Files"]

    style A2UI fill:#e8eaf6,stroke:#3f51b5
    style A2A fill:#e8f5e9,stroke:#4caf50
    style MCP fill:#fff3e0,stroke:#ff9800
</code></pre> <p><strong>MCP (Model Context Protocol)</strong> - Anthropic, November 2024. Now under Linux Foundation governance. Solves: how does an agent access tools, data sources, and context? Think of it as the agent’s hands and eyes. It reaches into databases, calls APIs, reads files. The primitives are resources, prompts, and tools.</p> <p><strong>A2A (Agent2Agent Protocol)</strong> - Google, April 2025. Donated to Linux Foundation June 2025. Currently at v0.3. Solves: how do agents from different vendors, frameworks, and organizations talk to each other as peers? Not as tools - as collaborators. The primitives are AgentCards (capability discovery), Tasks (units of work), and Messages (communication).</p> <p><strong>A2UI (Agent to UI Protocol)</strong> - Google, December 2025. Still early (v0.8 stable). Solves: how does an agent generate rich, interactive user interfaces without executing arbitrary code on the client? The primitives are declarative UI components that render natively across platforms.</p> <p>The critical distinction most people miss: <strong>MCP treats external systems as tools for agents to use. A2A treats other agents as peers to collaborate with.</strong> An agent using MCP to query a database is fundamentally different from an agent using A2A to delegate a sub-task to a specialist agent. The trust models are different. The failure modes are different. The security boundaries are different.</p> <h2 id="how-the-layers-compose">How the Layers Compose</h2> <p>Here’s where it gets interesting. These protocols aren’t just parallel standards - they’re designed to stack.</p> <pre><code class="language-mermaid">sequenceDiagram
    participant User as User / Client
    participant UI as A2UI Layer
    participant Orchestrator as Orchestrator Agent
    participant Specialist as Specialist Agent
    participant Tool as MCP Server (DB, API)

    User-&gt;&gt;UI: "Find me flights under $500 to Tokyo next month"
    UI-&gt;&gt;Orchestrator: Parse intent, create task

    Note over Orchestrator: Discovers specialist via A2A AgentCard

    Orchestrator-&gt;&gt;Specialist: A2A: Delegate flight search task
    Specialist-&gt;&gt;Tool: MCP: Query flight API
    Tool--&gt;&gt;Specialist: Flight data (structured)
    Specialist-&gt;&gt;Tool: MCP: Query price history
    Tool--&gt;&gt;Specialist: Historical pricing

    Specialist--&gt;&gt;Orchestrator: A2A: Task result with 12 options

    Note over Orchestrator: Decides UI rendering strategy

    Orchestrator-&gt;&gt;UI: A2UI: Render flight comparison cards
    UI--&gt;&gt;User: Interactive flight cards with filters

    User-&gt;&gt;UI: Selects flight, clicks "Book"
    UI-&gt;&gt;Orchestrator: Booking intent
    Orchestrator-&gt;&gt;Specialist: A2A: Delegate booking task
    Specialist-&gt;&gt;Tool: MCP: Execute booking API
    Tool--&gt;&gt;Specialist: Confirmation
    Specialist--&gt;&gt;Orchestrator: A2A: Booking confirmed
    Orchestrator-&gt;&gt;UI: A2UI: Render confirmation with itinerary
    UI--&gt;&gt;User: Booking confirmation
</code></pre> <p>A real request flows through all three layers:</p> <ol> <li><strong>A2UI</strong> captures user intent and renders responses as interactive components (not just text)</li> <li><strong>A2A</strong> handles delegation - the orchestrator discovers specialist agents via AgentCards and delegates sub-tasks</li> <li><strong>MCP</strong> handles the actual work - specialist agents use MCP to query databases, call APIs, execute tools</li> </ol> <p>The IBM explainer on A2A puts it well: a retail inventory agent uses MCP to check stock levels, then uses A2A to notify a supplier agent when stock is low. The protocols aren’t competing - they’re complementary at different layers.</p> <h3 id="where-the-stack-composes-cleanly">Where the Stack Composes Cleanly</h3> <p>The composition works elegantly when responsibilities are clear:</p> <table> <thead> <tr> <th>Layer</th> <th>Responsibility</th> <th>Trust Boundary</th> <th>Failure Mode</th> </tr> </thead> <tbody> <tr> <td><strong>A2UI</strong></td> <td>Rendering, user interaction</td> <td>Client-side sandboxing</td> <td>Bad UI, not data loss</td> </tr> <tr> <td><strong>A2A</strong></td> <td>Task delegation, capability discovery</td> <td>Cross-organization auth</td> <td>Task failure, retry needed</td> </tr> <tr> <td><strong>MCP</strong></td> <td>Data access, tool execution</td> <td>Server-side permissions</td> <td>Data corruption, privilege escalation</td> </tr> </tbody> </table> <p>AgentMaster (July 2025) was the first framework to use A2A and MCP together in production. Google’s ADK (Agent Development Kit) now has first-class support for both. LangGraph v0.2 (shipped January 15, 2026) added A2A and MCP as first-class protocol targets.</p> <p>The pattern that’s emerging: <strong>A2A for the network layer, MCP for the resource layer.</strong> It’s clean. It makes sense. 
And it’s exactly what we said about HTTP and FTP in 1995, right before we discovered all the ways they could be abused together.</p> <h3 id="where-the-stack-breaks">Where the Stack Breaks</h3> <p>Now for the part nobody wants to talk about. I see three structural gaps:</p> <p><strong>Gap 1: No Unified Identity Model</strong></p> <p>MCP has its own auth model (recently upgraded to OAuth 2.1, but still messy in practice). A2A has its own auth scheme (parity with OpenAPI’s authentication at launch). A2UI handles client-side trust differently. There’s no unified identity that flows across all three layers.</p> <p>In practice, this means: an agent authenticated via A2A to delegate a task has no guaranteed way to pass that identity context through to the MCP layer where the actual tool execution happens. The specialist agent re-authenticates independently. Credential management becomes a per-layer problem.</p> <p><strong>Gap 2: Observability Doesn’t Cross Layers</strong></p> <p>You can trace an MCP request. You can trace an A2A task. But tracing a user request that flows through A2UI → A2A → MCP → back requires stitching together three different observability systems. Nobody has solved distributed tracing across this stack cleanly.</p> <p><strong>Gap 3: Error Propagation Is Undefined</strong></p> <p>What happens when an MCP tool call fails inside an A2A-delegated task? The A2A spec supports long-running tasks and status updates, but the semantics of “my MCP server is down” translating to an A2A task failure and then to an A2UI error state are… undefined. Each layer has its own error model. Reconciling them is left as an exercise for the developer.</p> <pre><code class="language-mermaid">graph LR
    subgraph "Gap: No Unified Identity"
        direction LR
        UA["User Auth&lt;br/&gt;(A2UI)"] -.-&gt;|"???"| AA["Agent Auth&lt;br/&gt;(A2A)"]
        AA -.-&gt;|"???"| TA["Tool Auth&lt;br/&gt;(MCP)"]
    end

    subgraph "Gap: Observability"
        direction LR
        T1["A2UI Trace"] -.-&gt;|"Manual stitching"| T2["A2A Trace"]
        T2 -.-&gt;|"Manual stitching"| T3["MCP Trace"]
    end

    subgraph "Gap: Error Propagation"
        direction LR
        E1["MCP Failure"] -.-&gt;|"Undefined"| E2["A2A Task State"]
        E2 -.-&gt;|"Undefined"| E3["A2UI Error Display"]
    end

    style UA fill:#ffcdd2
    style AA fill:#ffcdd2
    style TA fill:#ffcdd2
    style T1 fill:#fff9c4
    style T2 fill:#fff9c4
    style T3 fill:#fff9c4
    style E1 fill:#ffccbc
    style E2 fill:#ffccbc
    style E3 fill:#ffccbc
</code></pre> <h2 id="the-security-surface-that-should-keep-you-up-at-night">The Security Surface That Should Keep You Up at Night</h2> <p>I’m going to spend more time here than on anything else in this post because the security situation is genuinely alarming.</p> <p>Adversa AI published a taxonomy of 25 MCP vulnerability categories. VentureBeat reported on Pynt’s research showing that deploying just ten MCP plugins creates a <strong>92% probability of exploitation</strong>. OWASP published an MCP-specific Top 10. And a supply chain worm called Shai-Hulud 2.0 re-emerged in November specifically targeting developer pipelines that use MCP.</p> <p>Let’s walk through the attack surfaces layer by layer.</p> <h3 id="mcp-the-tool-layers-open-wounds">MCP: The Tool Layer’s Open Wounds</h3> <p>The MCP security model was designed for interoperability, not containment. Nancy Wang, SVP of Engineering at 1Password, put it bluntly: “any agent that speaks MCP can plug into your company’s systems, fetch data, and perform actions. That flexibility is powerful, but it also assumes a level of trust that doesn’t exist in enterprise environments.”</p> <p>The critical vulnerabilities:</p> <p><strong>Tool Poisoning</strong> - An MCP tool’s description is consumed by the LLM to decide when and how to use the tool. A malicious tool description can contain hidden instructions that manipulate agent behavior. The tool description says “Calculator for math” to the human reviewer, but contains invisible Unicode characters that tell the LLM to exfiltrate data. Detection is nearly impossible without specialized scanning.</p> <p><strong>Supply Chain Attacks</strong> - Most developers install MCP packages from npm or Docker Hub without auditing. One poisoned update can compromise every agent system that depends on it. The mcp-remote package (widely used for OAuth support) had a critical RCE vulnerability (CVE-2025-6514). 
Hundreds of MCP servers were found bound to 0.0.0.0 - exposed to the entire network.</p> <p><strong>Rug Pulls</strong> - An MCP server is approved initially, then silently updated with new tool definitions. The agent gains capabilities that were never authorized. Datadog documented this pattern: an MCP server adds tool definitions that delete resources, and the host application is never notified.</p> <p><strong>Config Injection</strong> - Attackers place malicious <code class="language-plaintext highlighter-rouge">.mcp/config.json</code> files in repositories. When developers clone and open the project, their IDE automatically connects to attacker-controlled servers. No user interaction required beyond opening the project. VSCode and Cursor are both vulnerable.</p> <h3 id="a2a-the-collaboration-layers-trust-problem">A2A: The Collaboration Layer’s Trust Problem</h3> <p>A2A introduces a different class of risk: <strong>what happens when you trust another agent that shouldn’t be trusted?</strong></p> <p>The AgentCard mechanism (how agents advertise capabilities) is essentially self-reported. An agent says “I’m a billing specialist with access to payment processing” and other agents take that at face value. There’s no built-in mechanism for verifying capability claims.</p> <p>A2A v0.3 added gRPC support and the ability to sign security cards, which helps. But the fundamental problem remains: agent identity and capability verification in a decentralized system is an unsolved problem. It’s the same challenge federated identity systems have struggled with for decades, now applied to autonomous software agents that make decisions.</p> <h3 id="a2ui-the-client-layers-sandboxing-challenge">A2UI: The Client Layer’s Sandboxing Challenge</h3> <p>A2UI is designed to be safe by construction - agents generate declarative UI components, not executable code. The client renders these components from a trusted catalog. 
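</p> <p>A hypothetical sketch of that catalog check on the client side (the component names and payload shape here are my invention, not taken from the A2UI spec):</p>

```python
# Hypothetical catalog-gated renderer: only component types registered in
# the trusted catalog are accepted; anything else is rejected outright.
TRUSTED_CATALOG = {"card", "button", "text", "image"}

def render(components):
    """Accept declarative components whose type is in the trusted catalog."""
    accepted = []
    for comp in components:
        kind = comp.get("type")
        if kind not in TRUSTED_CATALOG:
            raise ValueError("untrusted component type: " + repr(kind))
        accepted.append(kind)
    return accepted

print(render([{"type": "card"}, {"type": "button"}]))  # ['card', 'button']
```

<p>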
This is actually a reasonable security model.</p> <p>The risk shifts to the catalog itself: if an attacker can register a malicious component in the client’s trusted catalog, every agent-generated UI becomes a potential attack vector. The extensibility that makes A2UI useful (custom components for enterprise needs) is the same extensibility that creates supply chain risk.</p> <h3 id="cross-layer-attack-scenarios">Cross-Layer Attack Scenarios</h3> <p>The scariest attacks aren’t within a single layer - they chain across the stack:</p> <pre><code class="language-mermaid">graph TD
    A["1. Attacker publishes&lt;br/&gt;poisoned MCP tool&lt;br/&gt;to npm registry"] --&gt; B["2. Tool contains hidden&lt;br/&gt;instructions in description&lt;br/&gt;(invisible Unicode)"]
    B --&gt; C["3. Developer installs&lt;br/&gt;MCP server, adds to&lt;br/&gt;agent system"]
    C --&gt; D["4. Agent uses poisoned tool,&lt;br/&gt;hidden instructions cause&lt;br/&gt;data exfiltration via A2A"]
    D --&gt; E["5. Exfiltrated data sent to&lt;br/&gt;attacker's A2A endpoint&lt;br/&gt;disguised as legitimate agent"]
    E --&gt; F["6. A2UI renders fake&lt;br/&gt;confirmation to user&lt;br/&gt;while attack continues"]

    style A fill:#ffcdd2
    style B fill:#ffcdd2
    style C fill:#fff9c4
    style D fill:#ffccbc
    style E fill:#ffccbc
    style F fill:#ffcdd2
</code></pre> <p>A poisoned MCP tool manipulates an agent into delegating data exfiltration via A2A to a malicious external agent, which then renders a fake success confirmation via A2UI. The user sees “task completed successfully” while their data is being siphoned.</p> <p>This isn’t theoretical. Every component of this attack chain has been demonstrated independently. Nobody has chained them in the wild yet - that we know of. But the ingredients are all sitting on the kitchen counter.</p> <h2 id="what-mature-teams-are-doing-right-now">What Mature Teams Are Doing Right Now</h2> <p>After talking with teams running multi-agent systems in production and observing the patterns emerging across the ecosystem, here’s what separates the teams that will survive from the teams that will end up in a breach disclosure.</p> <h3 id="1-defense-in-depth-across-the-stack">1. Defense in Depth Across the Stack</h3> <p>Don’t rely on any single layer for security. Assume each layer will be compromised independently.</p> <table> <thead> <tr> <th>Layer</th> <th>Control</th> <th>Implementation</th> </tr> </thead> <tbody> <tr> <td><strong>MCP</strong></td> <td>Tool vetting + sandboxing</td> <td>Internal registry of audited MCP servers. No direct npm installs. OWASP MCP Top 10 as checklist.</td> </tr> <tr> <td><strong>MCP</strong></td> <td>Input validation</td> <td>Sanitize all inputs before they reach LLM agents. Block injection patterns, encoded payloads.</td> </tr> <tr> <td><strong>MCP</strong></td> <td>Least privilege</td> <td>Each MCP server gets minimal permissions. No shared credentials across servers.</td> </tr> <tr> <td><strong>A2A</strong></td> <td>AgentCard verification</td> <td>Don’t trust self-reported capabilities. Verify through challenge-response or reputation systems.</td> </tr> <tr> <td><strong>A2A</strong></td> <td>Task boundaries</td> <td>Constrain what delegated tasks can do. 
No open-ended “do anything” delegations.</td> </tr> <tr> <td><strong>A2UI</strong></td> <td>Component catalog control</td> <td>Locked registry of approved UI components. Code-review process for additions.</td> </tr> <tr> <td><strong>Cross-layer</strong></td> <td>Distributed tracing</td> <td>Correlation IDs that flow through A2UI → A2A → MCP. Log everything.</td> </tr> </tbody> </table> <h3 id="2-treat-mcp-servers-like-dependencies-not-plugins">2. Treat MCP Servers Like Dependencies, Not Plugins</h3> <p>The mental model shift: MCP servers aren’t plugins you install and forget. They’re dependencies in your supply chain. Apply the same rigor you’d apply to any third-party library:</p> <ul> <li>Pin versions. Don’t auto-update.</li> <li>Audit tool descriptions for hidden content (invisible Unicode, RTL markers, homoglyphs).</li> <li>Run in sandboxed environments with restricted network access.</li> <li>Monitor for unexpected tool definition changes (rug pull detection).</li> </ul> <h3 id="3-build-the-identity-bridge-yourself">3. Build the Identity Bridge Yourself</h3> <p>Since the stack doesn’t provide unified identity, build it. Pass authentication context explicitly through each layer transition:</p> <ul> <li>A2UI authenticates the user.</li> <li>A2UI passes a signed token to the orchestrator agent.</li> <li>Orchestrator includes the token in A2A task metadata when delegating.</li> <li>Specialist agent presents the token to MCP servers for authorization.</li> </ul> <p>It’s manual. It’s annoying. It’s necessary until the protocols provide a standard mechanism. The A2A Secure Passport Extension (announced in late 2025) is a step toward this - it lets agents share structured context securely - but it’s not yet widely implemented.</p> <h3 id="4-dont-ship-a2a-until-you-need-it">4. 
Don’t Ship A2A Until You Need It</h3> <p>This is my most controversial take: A2A solves a real problem — but it’s a problem most teams haven’t hit yet.</p> <p>If your agents are all within the same organization, running in the same infrastructure, and you control the entire pipeline - you don’t need a cross-organization agent communication protocol. Use simpler orchestration (LangGraph, CrewAI, direct function calls). The overhead and attack surface of A2A aren’t justified.</p> <p>A2A becomes essential when:</p> <ul> <li>Agents from different organizations need to collaborate</li> <li>You’re building a marketplace of agent capabilities</li> <li>You need formal task lifecycle management across trust boundaries</li> <li>Agents run on different platforms and can’t share memory or tools</li> </ul> <p>If none of those apply, simpler orchestration patterns will serve you better while the protocol matures.</p> <h2 id="the-tcpip-parallel-and-its-limits">The TCP/IP Parallel (And Its Limits)</h2> <p>I’ve been using the TCP/IP analogy deliberately, so let me be explicit about where it holds and where it breaks.</p> <p><strong>Where it holds:</strong></p> <ul> <li>Layered architecture with clear responsibilities per layer</li> <li>Each layer can evolve independently</li> <li>Interoperability is the primary design goal</li> <li>Open governance (Linux Foundation for both MCP and A2A)</li> <li>Security was bolted on after initial adoption</li> </ul> <p><strong>Where it breaks:</strong></p> <ul> <li>TCP/IP moved bits. These protocols move intent. The semantic gap is enormous.</li> <li>TCP/IP had decades to mature before the internet became critical infrastructure. The agent protocol stack is being deployed into production systems <em>now</em>, with enterprise data, while the specs are still at v0.3.</li> <li>TCP/IP’s layering was clean from early on. The agent stack’s layering is still messy - is context delivery (MCP) really the same layer as tool execution (also MCP)? 
Should AgentCard discovery be a separate protocol?</li> </ul> <p>The parallel is useful for framing but dangerous for prediction. We shouldn’t assume this stack will converge the way internet protocols did. It might fragment. It might get replaced by something we haven’t seen yet.</p> <h2 id="whats-missing-from-the-stack">What’s Missing from the Stack</h2> <p>Three things I expect to emerge in the next 12 months:</p> <p><strong>Agent Identity Protocol</strong> - A dedicated layer for agent identity, capability attestation, and reputation. Neither MCP nor A2A handles this well. The closest thing is A2A’s AgentCard, but it’s self-reported and unsigned (until v0.3’s security card signing, which is still nascent). We need something like X.509 for agents.</p> <p><strong>Context Provenance Protocol</strong> - How do you trace where a piece of context came from, how it was transformed, and who touched it? Critical for debugging, compliance, and trust. MCP doesn’t track provenance. A2A doesn’t track it. Nobody tracks it.</p> <p><strong>Agent Governance Protocol</strong> - Governance agents that monitor other agents for policy violations. Machine Learning Mastery’s analysis of 2026 trends highlights this as an emerging pattern. You’ll need a protocol for the governance layer to observe and intervene across MCP and A2A interactions without breaking the stack.</p> <h2 id="connecting-back-to-the-maturity-model">Connecting Back to the Maturity Model</h2> <p>If you’ve read my <a href="/blog/2025/mcp-maturity-model/">MCP Maturity Model</a>, here’s where the protocol stack maps to maturity levels:</p> <table> <thead> <tr> <th>Maturity Level</th> <th>Protocol Stack Usage</th> </tr> </thead> <tbody> <tr> <td><strong>Level 0-1</strong></td> <td>None needed. String assembly and structured objects.</td> </tr> <tr> <td><strong>Level 2</strong></td> <td>MCP for standardized tool/data access.</td> </tr> <tr> <td><strong>Level 3</strong></td> <td>MCP with optimization. 
A2A becomes relevant if you have cross-boundary agent coordination.</td> </tr> <tr> <td><strong>Level 4</strong></td> <td>Full MCP + A2A. Adaptive systems benefit from A2A’s capability discovery. A2UI if you’re building user-facing agent experiences.</td> </tr> <tr> <td><strong>Level 5</strong></td> <td>All three protocols with custom extensions. This is where the missing protocols (identity, provenance, governance) become critical.</td> </tr> </tbody> </table> <p>Most teams should be at Level 2-3, using MCP competently, with A2A on the roadmap for when they genuinely need cross-agent collaboration across trust boundaries. If you’re jumping to full-stack deployment without solid MCP foundations, you’re building on sand.</p> <h2 id="where-we-go-from-here">Where We Go From Here</h2> <p>The agent protocol stack is real. It’s messy. It’s being deployed into production faster than the security model can keep up. This is exactly what happened with web technologies in the late 1990s, and we spent the next two decades patching the gaps.</p> <p>We have a narrow window to get the security fundamentals right before the stack becomes too entrenched to fix. The OWASP MCP Top 10 is a start. A2A’s security card signing is a start. But we need the community to treat agent protocol security with the same urgency we treat API security - not as an afterthought, but as a first-class design constraint.</p> <p>The organizations that will thrive in the agentic era aren’t the ones deploying the most agents. They’re the ones deploying agents with the best understanding of what these protocols actually guarantee - and what they don’t.</p> <hr/> <p><em>This is Part 1 of a three-part series on the cutting edge of LLM and agent research in January 2026. Part 2 covers <a href="/blog/2026/rlvr-beyond-math-code/">RLVR beyond math and code</a> - the training technique powering reasoning models and the open question of whether it actually makes models smarter. 
Part 3 explores <a href="/blog/2026/circuit-tracing-production/">mechanistic interpretability and circuit tracing</a> - what it means to watch an LLM think, and why it matters for production safety.</em></p> <p><em>Find me on <a href="https://www.linkedin.com/in/subhadip-mitra/">LinkedIn</a> or drop a comment below.</em></p>]]></content><author><name>Subhadip Mitra</name></author><category term="AI"/><category term="agents"/><summary type="html"><![CDATA[MCP handles agent-to-tool. A2A handles agent-to-agent. A2UI handles agent-to-interface. Together they form a protocol stack that nobody has mapped properly - including the security gaps that should terrify you.]]></summary></entry><entry><title type="html">The Manifold Dial: Visualizing Why DeepSeek’s mHC Stabilizes Deep Networks</title><link href="https://subhadipmitra.com/blog/2026/deepseek-mhc-manifold-constrained-hyper-connections/" rel="alternate" type="text/html" title="The Manifold Dial: Visualizing Why DeepSeek’s mHC Stabilizes Deep Networks"/><published>2026-01-03T11:32:03+00:00</published><updated>2026-01-03T11:32:03+00:00</updated><id>https://subhadipmitra.com/blog/2026/deepseek-mhc-manifold-constrained-hyper-connections</id><content type="html" xml:base="https://subhadipmitra.com/blog/2026/deepseek-mhc-manifold-constrained-hyper-connections/"><![CDATA[<div style="background: linear-gradient(135deg, #eff6ff 0%, #f0fdf4 100%); border-left: 4px solid #3b82f6; padding: 16px 20px; border-radius: 0 8px 8px 0; margin-bottom: 24px;"> <p style="margin: 0; font-size: 15px; color: #1e40af;"> <strong style="color: #000;">Interactive Demo:</strong> Explore how mHC stabilizes deep networks with the <a href="https://subhadipmitra.com/mhc-visualizer/" target="_blank" rel="noopener noreferrer" style="color: #2563eb; text-decoration: underline;">Manifold Dial visualizer</a> ↗ </p> </div> <h2 id="nine-years-of-good-enough">Nine Years of “Good Enough”</h2> <p>Residual connections haven’t changed since 2016.
He et al. introduced them in ResNet, the formula stuck (<code class="language-plaintext highlighter-rouge">output = layer(x) + x</code>), and we’ve been using the same thing ever since. Attention mechanisms evolved. Normalization techniques multiplied. FFN architectures got reworked a dozen times. But skip connections? Untouched.</p> <p>It’s not that nobody tried. There’s been work on dense connections, highway networks, various gating mechanisms. Most added complexity without clear wins. The simple additive skip connection kept winning.</p> <p>Then Hyper-Connections came along and showed genuine improvements by expanding the residual stream - multiple parallel paths instead of one, with learned mixing between them. Promising results. But also a problem that becomes obvious only at scale: the networks become unstable during training. Loss spikes. Gradient explosions. The deeper you go, the worse it gets.</p> <p>DeepSeek’s mHC paper explains why this happens and how to fix it. The fix involves projecting matrices onto something called the Birkhoff polytope using an algorithm from 1967. I built an interactive tool to visualize what’s actually going on, because the equations alone don’t convey how dramatic the difference is.</p> <h2 id="what-hyper-connections-actually-do">What Hyper-Connections Actually Do</h2> <p>Standard residual: you compute a layer’s output and add back the input. One stream in, one stream out.</p> <p>Hyper-Connections expand this to $n$ parallel streams (typically 4). Instead of simple addition, you get learned mixing matrices that control how information flows between streams:</p> \[\mathbf{x}_{l+1} = H^{res}_l \mathbf{x}_l + H^{post}_l \cdot \mathcal{F}(H^{pre}_l \mathbf{x}_l)\] <p>Three matrices per layer: one to mix the residual streams ($H^{res}$), one to aggregate streams into the layer input ($H^{pre}$), one to distribute the layer output back to streams ($H^{post}$).</p> <p>The paper’s ablation study shows $H^{res}$ matters most. 
That’s the mixing within the residual stream itself - how information from different streams combines as it flows through the network.</p> <p>More expressivity should mean better performance, and it does. HC improves over standard residuals in their experiments. The catch is what happens when you stack 60+ layers.</p> <h2 id="the-composite-mapping-problem">The Composite Mapping Problem</h2> <p>Each layer multiplies by its $H^{res}$ matrix. Through $L$ layers, the effective transformation is:</p> \[\prod_{i=1}^{L} H^{res}_{L-i}\] <p>This product determines how signals from early layers reach later ones. With unconstrained learned matrices, small amplifications compound. A matrix with spectral norm 1.05 seems harmless. Sixty of them multiplied together? That’s $1.05^{60} \approx 18$. And real HC matrices aren’t limited to 1.05.</p> <p>The paper measured this directly. Figure 3 shows the “Amax Gain Magnitude” - essentially the worst-case amplification through the composite mapping. For HC at depth 64, gains can reach 10³ to 10⁵ depending on initialization. In our toy simulation with random matrices, it’s even more extreme - up to 10¹⁶. The composite mapping amplifies signals catastrophically.</p> <p>That’s why training becomes unstable. Gradients flow backward through the same composite mapping. A 3000x amplification in the forward pass means 3000x amplification in the backward pass. Gradient clipping helps, but you’re fighting the architecture itself.</p> <figure> <picture> <img src="/assets/img/blog/mhc/hero_composite_gain.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Composite forward gain vs. network depth. HC (red) explodes exponentially. mHC (blue) stays bounded. 
Baseline identity mapping (green) remains flat at 1.</figcaption> </figure> <h2 id="the-fix-doubly-stochastic-matrices">The Fix: Doubly Stochastic Matrices</h2> <p>mHC constrains $H^{res}$ to be doubly stochastic - all entries non-negative, all rows sum to 1, all columns sum to 1.</p> <p>Why this specific constraint? Three properties matter:</p> <p><strong>Spectral norm is bounded by 1.</strong> A doubly stochastic matrix cannot amplify signals. Each row summing to 1 means the weighted combination of inputs never exceeds the maximum input. No amplification, no explosion.</p> <p><strong>Closure under multiplication.</strong> Multiply two doubly stochastic matrices and you get another doubly stochastic matrix. This is the key insight. It doesn’t matter how many layers you stack - the composite mapping stays doubly stochastic, stays bounded.</p> <p><strong>Geometric interpretation.</strong> The set of doubly stochastic matrices forms the Birkhoff polytope, which is the convex hull of permutation matrices. Every doubly stochastic matrix can be written as a weighted average of permutations. Permutations just shuffle; they don’t amplify. Weighted averages of shuffles don’t amplify either.</p> <p>The result: composite gains stay near 1 regardless of depth. The paper shows mHC at depth 64 has composite gain around 1.6. Compare that to HC’s explosive growth.</p> <h2 id="sinkhorn-knopp-1967-meets-2025">Sinkhorn-Knopp: 1967 Meets 2025</h2> <p>To make a learned matrix doubly stochastic, mHC uses the Sinkhorn-Knopp algorithm. Published in 1967 for balancing matrices in numerical analysis, it turns out to be exactly what’s needed here.</p> <p>The algorithm is simple: exponentiate entries to make them positive, then alternate between normalizing rows and normalizing columns. Repeat until convergence. 
The iteration provably converges to a doubly stochastic matrix.</p> <figure> <picture> <img src="/assets/img/blog/mhc/matrix_comparison.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">A random matrix (left) transformed by Sinkhorn-Knopp. After 5 iterations (middle), row errors drop to 10⁻⁴. After 20 iterations (right), errors reach 10⁻¹³.</figcaption> </figure> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">sinkhorn_knopp</span><span class="p">(</span><span class="n">matrix</span><span class="p">,</span> <span class="n">iterations</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">):</span>
    <span class="c1"># Exponentiate (subtract max for numerical stability)
</span>    <span class="n">P</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">exp</span><span class="p">(</span><span class="n">matrix</span> <span class="o">-</span> <span class="n">matrix</span><span class="p">.</span><span class="nf">max</span><span class="p">())</span>

    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">iterations</span><span class="p">):</span>
        <span class="n">P</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="p">(</span><span class="n">P</span><span class="p">.</span><span class="nf">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>  <span class="c1"># Row normalize
</span>        <span class="n">P</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="p">(</span><span class="n">P</span><span class="p">.</span><span class="nf">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>  <span class="c1"># Column normalize
</span>
    <span class="k">return</span> <span class="n">P</span>
</code></pre></div></div> <p>Twenty iterations gets you close enough. The paper uses this as the default and shows it’s sufficient for the constraint to stabilize training.</p> <h2 id="the-manifold-dial">The Manifold Dial</h2> <p>Here’s what I find most interesting: how quickly stability kicks in.</p> <p>I swept the number of Sinkhorn iterations from 0 to 20 and measured the composite gain at depth 64. At zero iterations, you have an unconstrained matrix - basically HC. At twenty iterations, you have a nearly perfect doubly stochastic matrix - full mHC.</p> <figure> <picture> <img src="/assets/img/blog/mhc/manifold_dial.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">The Manifold Dial: composite gain vs. Sinkhorn iterations. At k=0 (unconstrained), gain explodes to 10¹⁶. By k=1, it collapses to near 1. The transition is almost instantaneous.</figcaption> </figure> <h2 id="interactive-demo">Interactive Demo</h2> <p>I built an interactive version so you can explore this yourself:</p> <iframe src="https://subhadipmitra.com/mhc-visualizer/" width="100%" height="1100" style="border: none; border-radius: 8px;" title="Manifold Dial - mHC Visualizer"> </iframe> <p style="text-align: center; margin-top: 8px;"> <a href="https://subhadipmitra.com/mhc-visualizer/" target="_blank" rel="noopener noreferrer" style="font-size: 14px; color: #6b7280;"> Open in new window ↗ </a> </p> <p>Drag the Sinkhorn iterations slider. At 0, the mHC line explodes just like HC. As you increase iterations, watch it collapse down toward the stable baseline. Somewhere around 5-10 iterations, stability kicks in. By 20, it’s fully bounded.</p> <p>The “manifold dial” is literally how much you’re projecting onto the doubly stochastic manifold. Zero projection means unconstrained chaos. Full projection means guaranteed stability.</p> <p>This isn’t in the paper. 
I built it because the static figures don’t capture how smooth this transition is, or how little projection you actually need to get most of the stability benefit.</p> <h2 id="comparison-with-the-paper">Comparison with the Paper</h2> <p>For reference, here’s a recreation of the paper’s Figure 3, showing both single-layer and composite gains:</p> <figure> <picture> <img src="/assets/img/blog/mhc/paper_figure3_recreation.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Recreation of the paper's Figure 3. (a) Single-layer forward gain fluctuates for HC but stays bounded. (b) Composite gain is where the problem shows - exponential growth for HC, flat for mHC.</figcaption> </figure> <p>Note that single-layer gains (left) aren’t catastrophic - individual HC matrices have gains in the 1-7 range. The problem is multiplication. Sixty matrices with average gain 3 gives $3^{60} \approx 10^{28}$. The composite mapping (right) reveals what single-layer analysis misses.</p> <h2 id="practical-details">Practical Details</h2> <p>DeepSeek didn’t just prove this works mathematically - they scaled it to 27B parameter models and measured the system overhead.</p> <p>Training stability improves dramatically. Their Figure 2 shows HC experiencing a loss spike around step 12k with gradient norm shooting up. mHC has no such spike. The gradient norm stays smooth throughout.</p> <p>The overhead is manageable. The Sinkhorn iterations add computation, but they operate on small matrices ($n \times n$ where $n=4$ typically). With kernel fusion and careful memory management, the full mHC implementation adds 6.7% training time overhead. For the stability and performance gains, that’s a reasonable trade.</p> <p>Benchmark results on the 27B model show mHC outperforming both baseline and HC across tasks. BBH improves from 43.8 (baseline) to 48.9 (HC) to 51.0 (mHC). 
Similar pattern across DROP, GSM8K, MMLU, and others.</p> <h2 id="what-i-find-interesting">What I Find Interesting</h2> <p>A few things stood out reading this paper:</p> <p>The instability isn’t subtle. Three orders of magnitude in signal amplification isn’t a minor numerical issue you can tune away. It’s a fundamental architectural problem. HC was probably hitting this wall in ways that weren’t always diagnosed correctly.</p> <p>The fix comes from constraints, not regularization. You could try to penalize large gains with loss terms, but that’s fighting the architecture. Constraining to doubly stochastic matrices makes explosion structurally impossible. The geometry of the constraint does the work.</p> <p>The 1967 algorithm works. Machine learning keeps rediscovering techniques from optimization and numerical analysis. Sinkhorn-Knopp wasn’t designed for neural networks, but it slots in perfectly here. There’s probably more useful machinery sitting in old papers.</p> <p>Macro-architecture gets less attention than it deserves. We spend enormous effort on attention variants and FFN structures, but how layers connect to each other - the topology of the network - might have similar headroom for improvement.</p> <h2 id="code">Code</h2> <p>I implemented both the visualization and a PyTorch module you can actually use:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">mhc</span> <span class="kn">import</span> <span class="n">mHCResidual</span>

<span class="c1"># Drop-in residual connection replacement
</span><span class="n">residual</span> <span class="o">=</span> <span class="nf">mHCResidual</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">n_streams</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">sinkhorn_iters</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>

<span class="c1"># In your forward pass
</span><span class="n">hidden</span> <span class="o">=</span> <span class="nf">residual</span><span class="p">(</span><span class="n">hidden_states</span><span class="p">,</span> <span class="n">layer_output</span><span class="p">)</span>
</code></pre></div></div> <p>The repository includes the interactive demo source, Python implementation with tests, and a Colab notebook if you want to experiment without local setup.</p> <h2 id="links">Links</h2> <ul> <li><a href="https://subhadipmitra.com/mhc-visualizer">Interactive Demo</a> - the manifold dial visualization</li> <li><a href="https://github.com/bassrehab/mhc-visualizer">GitHub Repository</a> - full source, PyTorch module, tests</li> <li><a href="https://colab.research.google.com/github/bassrehab/mhc-visualizer/blob/main/notebook/mhc_exploration.ipynb">Colab Notebook</a> - run it yourself</li> <li><a href="https://arxiv.org/abs/2512.24880">mHC Paper</a> - the original DeepSeek paper</li> </ul> <hr/> <h2 id="references">References</h2> <p>Xie, Z., Wei, Y., Cao, H., et al. (2025). mHC: Manifold-Constrained Hyper-Connections. <em>arXiv preprint arXiv:2512.24880</em>.</p> <p>He, K., Zhang, X., Ren, S., &amp; Sun, J. (2016). Deep Residual Learning for Image Recognition. <em>CVPR</em>.</p> <p>Sinkhorn, R., &amp; Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. <em>Pacific Journal of Mathematics</em>, 21(2), 343-348.</p>]]></content><author><name>Subhadip Mitra</name><email>contact@subhadipmitra.com</email></author><category term="Research"/><category term="deep-learning"/><summary type="html"><![CDATA[Interactive exploration of Manifold-Constrained Hyper-Connections - how DeepSeek fixed the signal explosion problem in deep residual networks using 1967 mathematics]]></summary></entry></feed>