On March 31, Anthropic accidentally included source maps in the Claude Code CLI npm package. These allowed much of the TypeScript source to be reconstructed, though the leak was partial: missing packages meant the code wouldn't run as a standalone application. It was enough. Within hours the reconstructed source was mirrored to Gitee (think GitHub for China, with US copyright immunity) and picked apart by thousands of developers in an r/ClaudeAI megathread. Anthropic filed DMCA takedowns. It was too late.

The code quality debates and buddy-system hacks got the most attention. The findings that matter are buried in the megathread's technical deep-dives: how Claude Code fetches, processes, and filters web content before the model ever sees it.

An important distinction first. Claude Code is Anthropic's developer CLI, a tool that helps programmers write and debug code. It is not the same product as claude.ai, the chatbot that general users interact with. The two share a search index but use completely different content pipelines. Everything in this article describes Claude Code's retrieval pipeline specifically. Where the findings might apply more broadly, we'll say so. Where they don't, we'll say that too.

Here's what the source code shows.

107 Pre-Approved Domains Get the Full Treatment. Everyone Else Gets Compressed.


The leak revealed a hardcoded list of pre-approved domains in a file called preapproved.ts. Independent researchers counted between 85 and 107 domains depending on the version. They're all developer documentation sites: React, Django, AWS, PostgreSQL, Tailwind, and similar. These domains get full content extraction with no limits.

This makes sense for a coding tool. When a developer asks Claude Code for help with a Django migration, the model needs accurate, complete documentation. Compressing those pages would degrade the tool's core function.

Everyone else goes through a different pipeline:

  1. Claude Code fetches your page using Axios with an Accept header of text/markdown, text/html, */* (markdown first).

  2. The HTML runs through Turndown.js to convert to markdown. Turndown only processes <body>. Everything in <head> is discarded.

  3. The markdown is sent to Claude Haiku (a smaller, cheaper model) which paraphrases it with a 125-character quote maximum per passage.

  4. That paraphrased summary is what the main model actually sees.

Your JSON-LD, FAQ schema, Open Graph tags, meta descriptions: all of it lives in <head>. All of it gets thrown away before the model sees anything.
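The effect of the body-only conversion is easy to demonstrate. The sketch below is illustrative only, not Anthropic's code: it simulates a body-only HTML-to-markdown pass with regexes, where the real pipeline parses the DOM and runs Turndown. The point it shows is the same: nothing in <head> survives.

```typescript
// Illustrative only: simulates the effect of Claude Code's body-only
// conversion step. The real pipeline uses Turndown on a parsed DOM;
// the regexes here are a crude stand-in to show what gets discarded.

function extractBody(html: string): string {
  // Turndown only walks <body>; everything in <head> never reaches it.
  const match = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  return match ? match[1] : html;
}

function naiveMarkdown(bodyHtml: string): string {
  // Crude tag-to-markdown mapping, standing in for Turndown's rules.
  return bodyHtml
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, "# $1\n")
    .replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, "$1\n")
    .replace(/<[^>]+>/g, "") // drop any remaining tags
    .trim();
}

const page = `
<html>
<head>
  <title>Pricing</title>
  <meta name="description" content="Best prices on widgets">
  <script type="application/ld+json">{"@type":"Product"}</script>
</head>
<body>
  <h1>Widget Pricing</h1>
  <p>Widgets start at $5.</p>
</body>
</html>`;

// The JSON-LD and meta description are gone; only body text survives.
const markdown = naiveMarkdown(extractBody(page));
```

Whatever you put in structured data, the compressed pipeline only ever sees the body text.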

The leak revealed the content pipeline, not the search index. Claude Code sends queries to Anthropic's API, which runs the search server-side. The source code doesn't name the search provider, but Anthropic added "Brave Search" to its subprocessor list in March 2025, and the API contains a BraveSearchParams parameter. Brave provides the results; Anthropic processes them. We can see what happens after results come back. We can't see how Brave ranked or selected them. Discovery and rendering are different stages. Schema markup might still help you get found; it provably doesn't help once you've been found, because the model never sees it.

The <head> strip isn't surprising if you've been around long enough to remember 2008-era meta keyword stuffing. The <head> is where manipulation has historically lived: meta keywords, injected tags, script references. Stripping it is a rational engineering decision. But a lot of the current "AI SEO" advice is telling people to add more structured data specifically to improve how AI models interpret their content. On Claude Code at least, the model never gets the chance to interpret it.

Tables get destroyed too. Turndown runs with zero configuration: no GFM plugin. Any tabular structure, column relationships, pricing comparisons, feature matrices: gone. Lists and headings survive. Tables don't.
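What a table looks like after a converter with no table rule is worth seeing. This is a simulation of that failure mode, not Turndown itself: without the GFM plugin, table elements fall through to a default rule that keeps only text content, so the row/column relationships vanish.

```typescript
// Illustrative: without a table rule (as in Turndown minus the GFM
// plugin), table markup falls through and only cell text survives.

function flattenTable(tableHtml: string): string {
  // Strip all table structure, keeping only the text content --
  // roughly what a converter with no table rule produces.
  return tableHtml
    .replace(/<\/(td|th)>/gi, " ") // keep a space between cells
    .replace(/<[^>]+>/g, "")       // drop every tag
    .replace(/\s+/g, " ")
    .trim();
}

const pricing = `
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>$10</td></tr>
  <tr><td>Pro</td><td>$49</td></tr>
</table>`;

// Produces "Plan Price Basic $10 Pro $49" --
// which price goes with which plan? The model can't tell.
const flat = flattenTable(pricing);
```

If a comparison matters, state it in a sentence or a list as well as a table.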

There's a maximum of 8 results per query. No pagination. Result number 9 doesn't exist.

One researcher (who runs the German-language site wise-relations.com) independently confirmed the findings through both black-box testing and white-box code analysis. His measurements of the snippet budget (~500 words per result) matched exactly what the encrypted content blobs in the API would produce after decoding.

The Markdown Backdoor

There is one path that bypasses the whole pipeline.

If your server supports content negotiation and serves markdown in response to the text/markdown Accept header, Claude Code skips the Turndown conversion entirely. On pre-approved domains, if the content is under 100,000 characters, it also skips the Haiku paraphrase step. Raw content, no compression, no 125-character limit.

That's an nginx config change:

# Serve markdown to AI agents, HTML to browsers.
# Note: the map directive belongs in the http {} context;
# the location block goes inside your server {} block.
map $http_accept $content_suffix {
    default "";
    "~text/markdown" ".md";
}

location /blog/ {
    # Try the .md variant first when the client asked for markdown,
    # then the normal file, then a directory index, then 404.
    try_files $uri$content_suffix $uri $uri/ =404;
}

Nobody's on the pre-approved list unless they're a major documentation site. But the markdown negotiation itself works for anyone.
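If nginx isn't in your stack, the same negotiation is a one-line decision in application code. This sketch mirrors the nginx map above; the function name is ours, and the only logic that matters is checking whether text/markdown appears in the Accept header.

```typescript
// Mirrors the nginx map: serve the .md variant when the client's
// Accept header mentions text/markdown, otherwise fall back to HTML.

function variantSuffix(acceptHeader: string | undefined): ".md" | "" {
  if (!acceptHeader) return "";
  return /\btext\/markdown\b/i.test(acceptHeader) ? ".md" : "";
}

// Claude Code sends "text/markdown, text/html, */*" -> gets the .md file.
variantSuffix("text/markdown, text/html, */*"); // ".md"

// A browser sends "text/html,application/xhtml+xml,..." -> gets HTML.
variantSuffix("text/html,application/xhtml+xml"); // ""
```

Wire the returned suffix into whatever serves your static files and browsers see no difference.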

Claude Code vs Claude.ai

The community surfaced the detail behind the distinction drawn at the top of this article: Claude Code and claude.ai query the same search index but process what comes back in completely different ways.

Claude Code maps only title and url from search results. It discards the encrypted content snippets, the encrypted index, and the page age metadata. When it needs content, it re-fetches via the pipeline described above.

Claude.ai presumably uses the encrypted snippets directly. Same search engine, different processing. The 107 pre-approved domains, the Turndown conversion, the Haiku paraphrase, the 125-character limit: all of this is specific to Claude Code. We don't know what claude.ai's content pipeline looks like.

What About llms.txt?

The only reference to llms.txt in the entire codebase is Anthropic's own API documentation at platform.claude.com/llms.txt, used by an internal guide agent. There is no mechanism that checks your domain for llms.txt or llms-full.txt. If you built one specifically for Claude Code, it's not being read.

The System Around the Model

The most discussed technical finding wasn't about web retrieval at all. It was about what Claude Code does when you're not using it.

KAIROS Dream Mode

The source revealed an unannounced system called KAIROS_DREAM. After 5 sessions and 24 hours of silence, Claude Code spawns a background agent that reviews its own memories, consolidates learnings, prunes outdated information, and rewrites its own memory files. The feature is gated behind a flag (tengu_onyx_plover) and most users don't have access yet.

It also reacts to GitHub webhooks and messages from Slack and Discord while the user is away. Anthropic didn't announce any of this.

CLAUDE.md Injection

Your CLAUDE.md file isn't loaded once at session start. It's re-injected on every turn change. Every time you send a message, your instructions get processed again. This means every line in that file costs tokens on every single turn. The practical takeaway: keep it short, use it for behavioural rules only, put one-time context in your message.
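The arithmetic behind that takeaway is simple but easy to underestimate. The numbers below are illustrative assumptions, not measured values from the leak:

```typescript
// Back-of-envelope: if CLAUDE.md is re-injected on every turn, its
// token cost scales with turn count, not session count.
// All figures below are illustrative, not measured.

function claudeMdCost(fileTokens: number, turns: number): number {
  return fileTokens * turns; // paid on every single turn
}

claudeMdCost(2000, 50); // a 2,000-token CLAUDE.md over 50 turns: 100,000 input tokens
claudeMdCost(200, 50);  // trimmed to 200 tokens: 10,000 -- a 10x saving
```

A file that feels cheap at session start is paid for fifty times over in a long session.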

Model Switching Kills Your Cache

The source tracks 14 cache-break vectors. Switching models mid-conversation is one of them. If you toggle between Sonnet and Opus during a session, you pay full input token price again for your entire context. Better to pick a model and start a new session if you need to switch.
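Why a cache break hurts is clearer with numbers. The rates below are hypothetical placeholders chosen only to show the shape of the trade-off; they are not Anthropic's actual pricing, and the cached/uncached ratio will vary by model and plan.

```typescript
// Illustrative: the cost shape of a cache break. Rates are
// hypothetical placeholders, not Anthropic's actual pricing.

const FULL_RATE = 3.0;   // $ per million input tokens (hypothetical)
const CACHED_RATE = 0.3; // $ per million cached-read tokens (hypothetical)

function inputCost(contextTokens: number, cacheHit: boolean): number {
  const rate = cacheHit ? CACHED_RATE : FULL_RATE;
  return (contextTokens / 1_000_000) * rate;
}

// A 150k-token context read from cache vs. re-sent in full after a
// model switch invalidates the cache:
inputCost(150_000, true);  // ~0.045
inputCost(150_000, false); // ~0.45 -- roughly 10x more on the next turn
```

Under these assumed rates, one model toggle costs about as much as ten ordinary cached turns.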

What the Community Made of This

The most upvoted analysis framed the leak as "a glimpse into the future of AI agents" where the system, memory, and tools around the model matter more than raw model capability. Multi-agent coordination, planning separated from execution, memory layers that persist and consolidate. These are patterns that other AI tools are converging on too. The leak showed one company's implementation in detail.

Silent Experiments on Session Limits

One user dug into the source to answer a practical question: why do some Claude Code sessions feel shorter than others?

The answer is Statsig, a feature flag and A/B testing platform. Every time Claude Code launches, it fetches your configuration from Statsig and caches it locally. That config includes your tokenThreshold (the percentage of your cost budget that triggers the session limit), your session cap, and which A/B test buckets you're assigned to.

The config IDs are unlabelled integers in a cache file. Without the leaked source, there'd be no way to know what they control.

  • Config ID 4189951994: your token threshold

  • Config ID 136871630: your session cap

  • Gate 678230288: a 50% rollout flag, meaning half of Claude Code users are in one experiment group and half in another

No announcement. No changelog. No opt-out.

The user who found this shared a script to check your own values and asked others to crowdsource their numbers. The open question: whether different experiment buckets get different session limits, meaning two people on the same paid plan could be getting measurably different products.
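That script wasn't republished with this article, so the sketch below is our own reconstruction of the idea: given a decoded Statsig config object, label the IDs the leak identified. The shape of the local cache file is an assumption; only the three IDs come from the findings above.

```typescript
// Sketch only: labels the unlabelled Statsig config IDs the leak
// identified. The shape of the decoded cache object is an assumption;
// the researcher's actual script was not published with this article.

const KNOWN_IDS: Record<string, string> = {
  "4189951994": "tokenThreshold",
  "136871630": "sessionCap",
  "678230288": "fiftyPercentRolloutGate",
};

function labelConfig(raw: Record<string, unknown>): Record<string, unknown> {
  const labelled: Record<string, unknown> = {};
  for (const [id, value] of Object.entries(raw)) {
    labelled[KNOWN_IDS[id] ?? `unknown(${id})`] = value;
  }
  return labelled;
}

// Example with made-up values:
labelConfig({ "4189951994": 0.85, "136871630": 50, "999": true });
// -> { tokenThreshold: 0.85, sessionCap: 50, "unknown(999)": true }
```

Anything not in the known-ID table stays visible as an unknown, which is exactly what crowdsourcing the numbers is meant to shrink.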

Anthropic can update these values silently at any time. The mechanism for doing so is built into the architecture, not bolted on as an afterthought.

What This Actually Tells Us

These findings are from Claude Code, a developer tool. Not from claude.ai, not from ChatGPT, not from Google. The pre-approved domains are coding documentation. The compression pipeline serves a specific use case. Extending these findings to "how AI sees your website" would overstate what the evidence shows.

What the leak does provide is the first source-level view of how any AI tool handles web retrieval. Before this, every study of AI search behaviour was a black-box experiment: send queries, observe outputs, infer the mechanism. Now we have one tool's mechanism in full.

The patterns worth noting:

Content that survives compression is content that's clear and front-loaded. If your page goes through a lossy pipeline (and we now know at least one does), burying your key point in paragraph 7 is a risk.

Structured data in <head> doesn't reach the model in Claude Code. Whether other AI systems strip <head> the same way is unknown. Google still uses structured data for traditional search. But the assumption that schema markup helps AI models interpret your content is unproven for any system and demonstrably wrong for this one.

The playing field has tiers. In Claude Code, 107 domains get a fundamentally different experience. The question is whether similar tiering exists in other systems. We don't know. But it would be surprising if the engineering trade-off (full extraction is expensive, compression is cheap) didn't show up elsewhere in some form.

The source code is in the wild. Anthropic's DMCA efforts failed within hours. The question now is whether other providers' systems will ever be examined with the same level of detail, or whether Claude Code remains the only AI retrieval pipeline whose internals are public knowledge.
