Burn, Baby, Burn 🔥
AI’s money moves fast; physics does not
OpenAI has an expensive appetite. Updated financial projections put its 2025 burn at over $8B, with a total of $115B in burn projected through 2029. That’s a massive jump from the prior forecast, to the tune of $80B more than expected.
In parallel, the company reportedly signed a $300B, five-year deal with Oracle starting in 2027 — one of the largest cloud contracts ever inked (ever!). This puts a dollar figure to the prior gigawatt number: 4.5 gigawatts of power capacity, comparable to the output of more than two Hoover Dams, or enough electricity for about 4 million homes.
Why this all matters:
This is the clearest signal yet that frontier labs don’t trust capacity to materialize fast enough without building and pre-buying it themselves.
Partnerships are diversifying. OpenAI is still deeply tied to Microsoft (see the end of this newsletter for their “situationship”), but Oracle now gets a seat at the table — and potentially the AI halo effect it’s been chasing. No surprise that Oracle stock jumped massively on the news (Larry Ellison was the richest man in the world on Wednesday, literally!).
The $300B commitment also raises an obvious question: OpenAI doesn’t have $300B in cash. This will rely on staggered payments, infrastructure financing, and possibly partner capital through the Stargate project. It’s a bet that future revenues and investor confidence will carry the load.
The scale of spend sets a new baseline. If $115B is what it takes for a single lab’s roadmap, the rest of the ecosystem has to adapt to second-order effects: higher barriers to entry, tighter GPU markets, and increasing pressure on smaller players to specialize.
It’s a lot of investment. A lot a lot a lot. And so, it’s with particular interest that I share the rest of this week’s news. We haven’t had a crazy “models got that much smarter” moment recently (boo to you, GPT-5), but what we do have is a lot of focus on improvements. This week’s news: How to make models more reliable. How to address the non-determinism inherent in LLMs. How to test our product - evals or no evals? MCP registries and a coalescing agentic tools ecosystem. Performance and latency. A rethinking of the open web.
Alone, perhaps none of these are headline-worthy. But piece by piece we innovate, and bring together a tighter, more reliable ecosystem – one we hope will prove worth the effort, for ourselves and for humanity. Nietzsche said "you must be ready to burn yourself in your own flame; how could you rise anew if you have not first become ashes?"
And so we keep burning.
Evals, evals, no evals?
If you, like me, are also chronically online, you might’ve seen that the beginning of this week was all about evals - or rather, a debate on evals! I have to admit there is something very endearing in seeing a very fiery “fight” online over such a technical (though important) issue. Lest you think anyone here is getting riled up over Love is Blind contestants, nope - in Silicon Valley we are going to have an “all shots fired” moment over evals.
It seems like this all started when OpenAI acquired Statsig last week, which prompted Ben Hylak from Raindrop to say:
Which was then followed by a series of posts and debates -
Ankur Goyal from Braintrust (potentially in response to Ben’s tweet) published a post titled A/B testing can’t keep up with AI:
“A/B testing is no longer sufficient for AI product optimization. The future is evals.”
Which was promptly followed by a rebuttal from Ben, Thoughts on Evals (a quite 🌶️ rebuttal at that!):
“I’m writing this because Ankur, the CEO of Braintrust, recently wrote a blog post directly dismissing A/B tests, and Raindrop specifically (without naming us). In the blog post, Ankur claims that evals are the future. He claims that they help you measure how good your product is, that they are key for rapid experimentation. He also claims that evals will become increasingly important as software becomes more personalized. I believe the opposite to be true for each of these claims.”
Which then prompted another reply, this time from Shreya Shankar, In Defense of Evals:
“When people say they “don’t do evals,” they are usually lying to themselves. Every successful product does evals somewhere in the lifecycle.”
There are two sides here:
One is - evals are a must have! 🫡
On the other side - skepticism! 🤨
I’m summarizing the debate more because I find the debate itself interesting than because I hold a really strong view on it. In my mind, I’m swayed by the idea that, rather obviously, you want to test your products in some way. This probably involves evals. And maybe A/B tests in addition! The shape of that testing may depend on your use case, and on where you are in the build journey. So honestly, my favorite thread on the subject was:
You tell me if you think differently!
Interface Power: MCP goes registry, dev mode goes write
Two MCP stories to watch:
MCP Registry (preview)
An official launch of a central registry for MCP! It creates a discovery layer where servers can be listed, searched, and eventually governed. This is pretty cool as it means we are heading to a world away from bespoke integrations to something closer to an app store. A few important points:
Open catalog + API for discoverability: Anyone running an MCP server can add it via the “Add Server” guide; clients can retrieve registry data via open APIs. This makes it easier for tools and agents to find servers reliably.
Public and private sub-registries: The Registry doesn’t try to swallow everything. Public servers go into the core, but organizations can build sub-registries — for example, private versions or opinionated “marketplaces” tied to specific clients. These should sit on top of the central registry but use its schemas and tools.
Moderation / quality control: There are guidelines; maintainers can flag or remove servers that are spammy, impersonating other services, or otherwise harmful. It’s community-driven, under permissive open source licensing.
Preview with caveats: During preview the registry doesn’t guarantee long-term durability or backward compatibility. That means early adopters should expect changes (format, data, APIs) as it evolves.
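To make the “open APIs for discoverability” point concrete, here’s a minimal sketch of client-side filtering over a registry-style listing. The payload shape and field names (“servers”, “name”, “description”) are illustrative assumptions on my part, not the registry’s actual schema — expect the real format to differ, especially during preview:

```python
import json

# A toy payload, shaped loosely like what a registry listing *might* return.
# Field names here are illustrative assumptions, not the actual MCP
# Registry schema.
sample_response = json.loads("""
{
  "servers": [
    {"name": "example/github-mcp", "description": "GitHub issues and PRs"},
    {"name": "example/postgres-mcp", "description": "Query Postgres databases"}
  ]
}
""")

def find_servers(payload: dict, keyword: str) -> list[str]:
    """Return names of listed servers whose description mentions keyword."""
    return [
        s["name"]
        for s in payload.get("servers", [])
        if keyword.lower() in s.get("description", "").lower()
    ]

print(find_servers(sample_response, "postgres"))  # ['example/postgres-mcp']
```

The interesting part isn’t the filtering — it’s that once everyone agrees on one schema, this kind of discovery code only has to be written once, instead of per-integration.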
ChatGPT “Developer Mode”
OpenAI added a developer mode that makes ChatGPT a full read/write MCP client. That transforms it from a chat box into an integration surface. It lowers friction for connecting tools, but also raises security stakes — prompt injection and data exfiltration aren’t theoretical when a client can write. OpenAI literally wrote in its opening paragraph that developer mode is “powerful but dangerous,” and comments on a Hacker News thread validate the concern:
Putting these two pieces of news together: the registry defines the catalog and the client defines the default. That’s the new distribution battle: who controls discovery, and who controls the surface where work actually gets done.
Tools That Ship Work: Claude makes files
Anthropic released native file creation/editing (spreadsheets, decks, PDFs) in Claude. It’s mundane on paper and meaningful in practice: fewer hops between “ask” and “artifact,” and a wedge into actual team workflows.
In case you’re wondering how this feature stacks up against other players in the space, someone tagged Grok on an X thread, so it’s helpfully already answered for us:
Model Corner: efficiency, images, and determinism
ByteDance Seedream 4.0: Seedream 4.0 is an image creation model that integrates image generation and image editing into a single, unified architecture. Artificial Analysis says it has surpassed Google’s Gemini 2.5 Flash (Nano-Banana) across both text-to-image and image editing. Pricing is the same as Seedream 3.0, at $30/1K generations.
Thinking Machines Lab: Mira Murati’s company (Thinky, IYKYK) released a paper tackling an important problem in LLM research - namely that LLM outputs are not reproducible. The paper, authored by Horace He, argues that user-level “nondeterminism” comes from the serving stack (batching and kernel orchestration) more than from the math itself, and outlines paths to deterministic inference for reproducibility. This is pretty cool and could have a major impact if future RFPs require some level of determinism!
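A tiny illustration of the underlying mechanism (this is a generic Python demo, not code from the paper): floating-point addition isn’t associative, so any change in the order operations happen to run in — which in LLM serving shifts with batch size and kernel choice — can change the result.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False: same numbers, different order, different answer

# In LLM serving, reduction order shifts with batch size and kernel
# strategy, which is why "temperature 0" alone doesn't guarantee
# bit-identical outputs across requests.
```

That’s why the fix has to live in the serving stack (e.g., batch-invariant kernels), not just in sampling settings.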
Qwen3-Next-80B-A3B: Alibaba introduced a new sparse MoE model with 80B total parameters, but only about 3B plus one shared expert activated per token. The design uses 512 experts and combines Gated Attention, Gated DeltaNet, and Zero-Centered RMSNorm to squeeze more efficiency out of the architecture. The result: up to ten-fold cheaper training and faster inference compared to the earlier Qwen3-32B, with strong gains on long-context tasks. In internal tests, the “Thinking” variant even outpaces Gemini-2.5-Flash-Thinking, while the Instruct variant edges close to larger flagship models. Translation: a cost-efficient way to push long-context performance without melting GPU budgets.
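To visualize “only a handful of experts fire per token,” here’s a toy top-k router in plain Python. The 512-expert pool and top-10 routed experts follow the numbers reported above; the gating math here is a generic softmax-over-top-k sketch, not Qwen’s actual implementation (which adds a shared expert, Gated Attention, Gated DeltaNet, and so on):

```python
import math
import random

random.seed(7)

NUM_EXPERTS = 512   # expert pool size, per the reported design
TOP_K = 10          # routed experts per token (a shared expert is extra)

def top_k_routing(logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax only their logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}

# Fake router scores for a single token.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = top_k_routing(logits, TOP_K)

print(len(weights))   # 10 experts active out of 512
```

The economics fall out of that last line: the token pays compute for ~10 experts’ worth of parameters while the model as a whole stores 512, which is how you get 80B total parameters at ~3B active.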
Open Web Watch: oxygen thinning
Google’s court filing says the open web is “already in rapid decline” (Google later narrowed that to “open-web display advertising” and accused reporters of cherry-picking that line, but alas – they did say it, so the headline is doing damage). Independent of semantics, the trend is clearer each week: AI Overviews reduce clicks and centralize attention. Multiple studies peg AI summaries as appearing in ~18–20% of searches and show materially lower click-through rates when summaries appear. Also troubling: a rising share of sources cited by AI outputs appear to be AI-written themselves – amplifying the “dead internet” concern.
I think it is truly critical for builders & any enterprises to pay attention to this trend; I can’t tell you how often I’m hearing in board meetings that customers discovered a company through ChatGPT. I try not to be too prescriptive here in my blog but I’ll say that if you are not already thinking about how the changing dynamics of search affect your business, you are behind.
In case you want some experts to help you think through this, I’ll do a shameless plug here for our portfolio company Scrunch – they do brand monitoring & optimization for AI, and are also building an infrastructure layer to deliver content specifically made for AI agents & crawlers.
Why models hallucinate: we rewarded guessers
Ok so when I read this paper I laughed because, OF COURSE. Remember sitting down for a multiple choice test? Since the SAT doesn’t penalize you for a wrong answer, if you don’t know the answer, you might as well consistently pick a letter as a best guess (apparently everyone loves C, though there’s no statistical evidence it’s better). You don’t want to leave the question blank!
Turns out, that’s exactly how we’ve trained LLMs 😂😂
OpenAI’s new paper argues hallucinations persist because our training and eval regimes reward confident guesses over calibrated uncertainty. The fix won’t come from more data alone; it needs objective functions that favor “I don’t know,” verifier architectures, and loss functions that penalize over-claiming. Given this finding, I’d expect training runs with explicit abstention incentives to become standard, even if it costs a few tokens per answer.
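The test-taking analogy is just expected-value arithmetic. A quick sketch (my illustrative numbers, not the paper’s):

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected points for answering; abstaining always scores 0."""
    return p_correct * 1.0 + (1 - p_correct) * (-wrong_penalty)

# With 4 choices and no clue, a random guess is right 25% of the time.
p = 0.25

# SAT-style scoring (no penalty for wrong answers): guessing beats blank.
print(expected_score(p, wrong_penalty=0.0))   # 0.25 > 0, so always guess

# Penalize wrong answers enough and abstaining becomes the optimal move —
# the paper's point about eval regimes that never reward "I don't know."
print(expected_score(p, wrong_penalty=0.5))   # -0.125 < 0, so abstain
```

Under zero penalty, the rational policy is to always answer, however unsure — which is exactly the behavior we then call “hallucination” in models.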
Dev tools musical chairs
As I’ve said many times, to the foundation model companies, developers are the real prize. And so I closely watch which foundation model company holds the dev crown. As of late, we’re seeing waves of teams shift to OpenAI’s coding agents – a post went viral this week showing that in the r/ClaudeCode subreddit, everyone is talking about OpenAI’s Codex instead:
Funnily enough, Sam’s reaction to that Reddit screenshot was basically - “are these bots on the internet?”
What?! Sam saying that developers saying they are moving from Anthropic to OpenAI… is fake news? But the trend is real? What!
Maybe this is all a hilarious 4D chess move by developers who are exceedingly loyal to Claude and want less people crowding out latency:
In all seriousness, it does seem like there was something real behind this (it can’t be that all those redditors are bots!). In fact, I dug into it, and it seems like there have been some real performance concerns and accusations of model degradation (specifically Claude Opus 4.1) this week – Anthropic officially acknowledged the feedback, has fixed two bugs, and promised further investigation. Looks like we can expect a post-mortem next week:
Who’s really winning the hearts of devs? I think it’ll only keep shifting. Expect another swing when the next Claude update lands. Underneath the hype cycles, the decision function is stable: latency under load, repo-scale context handling, tool calling that doesn’t flake, and enterprise controls.
Parting Thoughts
Just, lol:
‘Til next week, y’all! Burn, baby, burn!