The Agent OS Grows Up
From Shopify's AI-first policy to Google's A2A protocol, the pieces of a real agent OS are falling into place
2025 is shaping up to be the year the agent ecosystem starts to actually mature.
There’s plenty of excitement around agents, but also some real friction points – especially in the enterprise. Much of the tooling still feels bespoke, and a fair amount of engineering work is required to make agentic systems reliable and effective. One piece of the puzzle is the rise of specialized agent startups that focus on narrow, high-impact workflows (like Dropzone automating SOC alert reviews, Strella tackling UX & market research, or Aomni streamlining sales research). These are turnkey agents you can just… buy.
What’s interesting from where I sit is how often these startups are building ahead of where models and infra are today – betting that both will keep getting better. Their edge is in building data and feedback loops that improve fast. Models, after all, are not enough. You can’t just parachute a PhD into your systems with no context! Think of it like hiring someone brilliant who also already knows your tools, industry, and quirks. So there’s already a lot to love in the app layer, where everyone is building agents.
But for the broader ecosystem to scale, the underlying infrastructure has to keep up. This week, we got a real glimpse of what that evolution looks like:
🤝 Interoperability got a boost with near-universal adoption of MCP and Google’s launch of A2A
🧠 Memory & context made quiet but meaningful progress at OpenAI
🏢 Enterprise AI behavior hit a new milestone, with Shopify’s hiring policy shift
And of course, a little ~~drama~~ via Meta’s Llama 4 release. Let's get into it!
🤝 Interoperability
We’re finally seeing the interoperability layer take shape. At Google Cloud Next, Sundar Pichai confirmed support for MCP, as did Demis Hassabis (DeepMind’s CEO). With Anthropic, OpenAI, Google/DeepMind, Microsoft (via a C# SDK), and AWS now on board, it’s fair to say: MCP is sticking around.
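If you haven’t touched MCP yet, the core idea is small: you expose tools (plus resources and prompts) from a lightweight server, and any MCP-capable client – Claude Desktop, an IDE, an agent framework – can discover and call them. Here’s a minimal sketch using the official Python SDK’s FastMCP helper; the tool itself is a stub I made up for illustration:

```python
# pip install mcp  (the official Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

# Name the server; MCP clients see this when they connect.
mcp = FastMCP("order-lookup")

@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Look up the shipping status of an order (stubbed for illustration)."""
    # A real server would hit your database or an internal API here.
    return f"Order {order_id}: shipped, arriving Thursday"

if __name__ == "__main__":
    # Runs over stdio by default, so any MCP client can spawn and call it.
    mcp.run()
```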
Google also introduced the Agent2Agent (A2A) protocol as a complement to MCP. Where MCP is about connecting models to tools and data sources, A2A is about letting agents talk to each other.
Importantly, Google is not trying to compete with MCP (hence the express support for it from both Sundar and Demis), but to complement it. A2A focuses specifically on how agents that speak the protocol discover and integrate with one another, making it easier for developers to build multi-agent systems that are actually viable. Judging by the number of enterprises already experimenting with A2A (logos below), that future is coming fast. I’m personally very excited about this – for agents to be truly effective in the enterprise, they need to be able to interface with other agents and systems, and the push towards universal interoperability makes that dream much easier to realize.
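Mechanically, A2A starts with discovery: every agent publishes an “Agent Card”, a small JSON document served at a well-known URL that tells other agents what it can do and where to send tasks. Here’s a rough sketch of what one might look like, written as a Python dict – the agent and its skill are hypothetical, and the field names reflect my reading of the draft spec at launch, so treat them as illustrative rather than canonical:

```python
# A hypothetical A2A Agent Card for a sales-research agent, as a Python dict.
# In practice this is served as JSON, e.g. at /.well-known/agent.json.
agent_card = {
    "name": "sales-research-agent",
    "description": "Builds account briefs and prospect research on request.",
    "url": "https://agents.example.com/a2a",  # endpoint other agents call
    "version": "1.0.0",
    "capabilities": {
        "streaming": True,            # supports streamed task updates
        "pushNotifications": False,
    },
    "skills": [
        {
            "id": "account-brief",
            "name": "Account brief",
            "description": "Summarize a target account: news, stack, key contacts.",
        }
    ],
}
```

A client agent fetches the card, picks a skill, and then exchanges tasks and messages with the remote agent over HTTP – exactly the kind of boring, standardized plumbing that multi-agent systems have been missing.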
🧠 Memory
I’m not sure I’d call this a release per se (there’s not much info around it beyond Sam’s tweet, though if you go into ChatGPT you’ll notice that the memory function has improved by a lot).
I think this is interesting because memory is a core building block for useful AI systems. It’s frustrating to have to repeat the same instructions over and over, or to over-prompt. Memory covers both personalization – tailoring output to the individual user – and context. With memory, agents can evolve: personalize responses, keep context across sessions, and behave less like a tool and more like a teammate.
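To make the “teammate” framing concrete, here’s a toy sketch of what session-spanning memory looks like in a DIY agent. This is not how OpenAI implements it (they haven’t published details) – just the general pattern: persist facts about the user, then inject them back into the context at the start of the next session.

```python
import json
from pathlib import Path

MEMORY_FILE = Path("user_memory.json")  # hypothetical on-disk store

def load_memories() -> list[str]:
    """Load facts remembered from previous sessions."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def remember(fact: str) -> None:
    """Persist a new fact so future sessions can use it."""
    memories = load_memories()
    memories.append(fact)
    MEMORY_FILE.write_text(json.dumps(memories, indent=2))

def build_system_prompt() -> str:
    """Prepend remembered facts so the model stops asking for them again."""
    memory_block = "\n".join(f"- {m}" for m in load_memories()) or "- (nothing yet)"
    return f"You are a helpful assistant.\nKnown about this user:\n{memory_block}"

# After one session the agent notes a preference...
remember("Prefers bullet-point answers; works in fintech compliance")
# ...and every later session starts with that context already baked in.
print(build_system_prompt())
```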
Still early days, but the direction is encouraging.
🏢 The Shopify Memo on AI Usage
Of all the news this week, the most impactful might’ve come from Shopify.
In an internal memo, Tobi Lutke laid out a new rule: before you add headcount, you now have to prove AI can’t do the job. Not “consider AI” – rather, “AI is the default.” The assumption is that AI can handle it, and the burden is on the team to justify why a human is still needed.
It’s hard to overstate how big of a shift this is as an enterprise mandate. We’ve seen AI framed as a productivity boost, a copilot, a helper. This is very different. This is AI as a gating function on org growth. In Tobi’s own words:
“Using AI effectively is now a fundamental expectation of everyone at Shopify.”
The implications for enterprises are massive. If more companies follow suit, this could fundamentally reshape how hiring plans are made, how teams are structured, and how quickly new software is built (or not built).
It’s the clearest articulation yet of what an AI-first operating model looks like for an enterprise. The memo outlined several other expectations; I’m sharing the most salient below:
I also thought one of Tobi’s answers on the thread was interesting – it’s always great to hear where companies are actually seeing meaningful ROI:
🦙 Llama 4
The Llama 4 launch was… dramatic.
On one hand, the models themselves are impressive: better multilingual reasoning, stronger performance on coding and math, and a full multimodal stack (with audio coming soon). The release included two models: Llama 4 Scout (lightweight, optimized for fast inference) and Llama 4 Maverick (a larger, instruction-tuned variant). A 288B-active-parameter model with 16 experts, called Behemoth (and, according to Meta, outperforming GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks), is still in training, but its launch is on the horizon, too.
But the actual rollout left a lot to be desired. The largest model (Behemoth) wasn’t available at launch, and the much-hyped experimental “Maverick” variant that shot to #2 on the LM Arena leaderboard wasn’t the version that actually shipped publicly. Some developers reported major issues on real-world coding tasks (a 16% score on the aider polyglot benchmark), and Meta had to publicly deny accusations that the model had been trained on test data. On top of that, running even the smallest version of Llama 4 (Scout) requires serious hardware – think 96GB of RAM even at 4-bit quantization. This piece from The Neuron summed it up well: “Llama 4 is here… but who is it actually for?”
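The hardware complaint checks out if you run the back-of-the-envelope math. Scout is a mixture-of-experts model with roughly 109B total parameters per Meta’s announcement (only ~17B active per token, but MoE weights all have to sit in memory), so even aggressive quantization leaves a big footprint before you account for the KV cache or runtime overhead:

```python
# Back-of-the-envelope weight-memory estimate for Llama 4 Scout, using the
# ~109B total parameter count from Meta's announcement (16 experts, ~17B active).
TOTAL_PARAMS = 109e9

def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights at a given precision."""
    return total_params * bits_per_param / 8 / 1e9

for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB of weights")

# bf16: ~218 GB, int8: ~109 GB, int4: ~55 GB -- and that is before the KV cache
# for the long context window, activations, and framework overhead, which is how
# you land in ~96GB-class territory even for the 4-bit build.
```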
It’s a good reminder that in the race for open(ish) models, vibes matter. Meta’s contribution to the ecosystem is still massive, but the mismatch between expectations and reality made this launch feel off-key. The random Saturday launch didn’t help, either. Though we did get this beautifully crisp reply from Zuck:
📊 Evals API
To very little fanfare, OpenAI shipped something that might have huge implications: the Evals API.
This new API lets you systematically benchmark model performance on your own datasets and tasks—something that’s been sorely missing from the ecosystem. Until now, a lot of evals were stitched together with duct tape and Notion docs. Now, there’s an official framework to run regression tests across models, versions, prompts, and tasks.
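To see what’s being formalized, here’s roughly the duct-tape version many teams have been hand-rolling: a loop over a small golden dataset, a crude pass criterion, and a number you paste into a doc. (This sketch deliberately uses the standard chat completions endpoint rather than the new Evals API itself, and the dataset and pass criterion are made up for illustration.)

```python
# A hand-rolled regression eval of the kind the Evals API is meant to replace.
# Requires `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# A tiny "golden" dataset -- in practice, your own labeled tasks.
golden_set = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "How many days are in a leap year?", "expected": "366"},
]

def run_eval(model: str) -> float:
    """Return the fraction of golden examples the model answers correctly."""
    passed = 0
    for item in golden_set:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["question"]}],
        )
        answer = response.choices[0].message.content or ""
        # Crude pass criterion: the expected string appears in the answer.
        if item["expected"].lower() in answer.lower():
            passed += 1
    return passed / len(golden_set)

# Run the same check across versions so regressions are visible over time.
for model in ["gpt-4o-mini", "gpt-4o"]:
    print(f"{model}: {run_eval(model):.0%} pass rate")
```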
Why does this matter? Because if agents are going to be used in high-stakes environments, you need to be able to trust them – and that means measuring them. Consistently. The evals API is a step toward a world where we can more easily track performance drift, model upgrades, and task completion rates over time.
Bonus: it also supports RAG-specific evaluations, which is a nice touch.
📈 Stanford HAI report
The annual AI Index Report from Stanford HAI came out this week, and it's packed with insights (as usual).
A few highlights that stood out to me:
The U.S. now leads in both total AI investments and the number of significant model releases.
Industry dominates academic research output, further tipping the scales in favor of commercial AI labs.
AI-generated content now accounts for a measurable percentage of all internet traffic (!!), especially in image and text domains.
But this is the stat that really caught my eye: The number of AI regulations passed globally doubled year-over-year. Doubled!!! That’s a reflection of both how fast the tech is moving and how seriously policymakers are starting to take its impact. It’s clear that 2025 is shaping up to be the year the infrastructure, usage, and governance of AI all start to harden.
That’s it for this week! Between interoperability wins, memory improvements, major enterprise behavior changes, a mixed-bag launch from Meta, and some real movement in evals and adoption, the foundation for serious agent ecosystems is getting stronger.
Oh, and what are tariffs again? Worries about the semiconductor supply chain? That was soooooo week of March 31st vibes.
Bye Barbie!! 💅