the model doesn't matter as much as your agent stack
cursor shipped composer 2 and people loved it. fast, capable, cheaper than anything else at that tier. then an X user named Fynn looked at the code and found it was running on top of kimi k2.5, an open-source model from chinese startup moonshot AI.
cursor's cofounder aman sanger confirmed it: "we've evaluated a lot of base models on perplexity-based evals and kimi k2.5 proved to be the strongest." they didn't mention the base model in the original blog post. "it was a miss to not mention the kimi base in our blog from the start," sanger said.
the pricing tells the story. composer 2 costs $0.50 per million input tokens and $2.50 per million output tokens. anthropic's claude opus 4.6 costs $5 and $25 respectively. that's one-tenth the price on both input and output. and users were praising its coding ability before anyone knew what model was underneath.
this should make every company selling base model magic nervous.
what cursor and kimi say about commoditization
a $29.3 billion coding tool built its flagship product on a base model most of its users had never heard of. the model came from a chinese lab that isn't among the first names people associate with AI. and the product worked well enough that nobody noticed until someone reverse-engineered it.
that's commoditization. not in the future-tense "models might become commodities" framing that analysts use. it already happened. a strong product rode a model that buyers didn't recognize, and the product quality didn't suffer.
one X user put it directly: "as someone who basically lives in opus 4.6, seeing an open-weight kimi 2.5 fine-tune actually beat it on coding benchmarks is wild." another was more pointed: "cursor is becoming a model routing layer, not an IDE. they pick the cheapest model that clears a quality bar per task, wrap it in their UX, and pocket the margin."
that second take is probably closer to the truth than cursor would prefer, but it's also a smart business strategy. if the base model is interchangeable, the value is in everything above it.
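the routing-layer idea is simple enough to sketch: keep per-task eval scores for each candidate model and pick the cheapest one that clears a quality bar. all model names, prices, and scores below are invented for illustration, not anyone's real numbers:

```python
# hypothetical sketch of a "model routing layer": pick the cheapest
# model that clears a per-task quality bar. everything here is made up.
MODELS = [
    # (name, $ per million output tokens, eval score by task type)
    ("cheap-model",    2.50,  {"coding": 0.82, "reasoning": 0.60}),
    ("mid-model",      8.00,  {"coding": 0.85, "reasoning": 0.81}),
    ("frontier-model", 25.00, {"coding": 0.88, "reasoning": 0.90}),
]

def route(task_type, quality_bar=0.80):
    """return the cheapest model whose eval score clears the bar."""
    eligible = [(price, name) for name, price, scores in MODELS
                if scores.get(task_type, 0.0) >= quality_bar]
    if not eligible:
        raise ValueError(f"no model clears the bar for {task_type!r}")
    return min(eligible)[1]

print(route("coding"))     # cheap-model: 0.82 clears the 0.80 bar
print(route("reasoning"))  # mid-model: cheapest of the two that clear it
```

the margin in that framing is the gap between what the cheapest eligible model costs and what the buyer would have paid for the frontier one.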
coding agents win on loop design, not model branding
what makes cursor's composer good isn't the base model. it's the loop: how the agent reads your codebase, plans changes, executes edits, runs tests, and iterates. that loop is the product. the base model is one component.
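a minimal sketch of that loop, with stub model and workspace classes standing in for the real components. every name here is hypothetical, not cursor's actual implementation; the stub workspace fails its tests once so the iteration step actually fires:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    passed: bool
    log: str = ""

class StubModel:
    """stands in for any base model -- the loop never cares which one."""
    def complete(self, prompt: str) -> str:
        return f"response to: {prompt[:40]}"

class StubWorkspace:
    """stands in for a real checkout; its tests fail once, then pass."""
    def __init__(self):
        self.edit_rounds = 0
    def gather_context(self, task):
        return f"files relevant to {task}"
    def apply(self, edits):
        self.edit_rounds += 1
    def run_tests(self):
        return TestResult(passed=self.edit_rounds >= 2, log="assertion error")

def run_agent(task, model, workspace, max_iterations=5):
    context = workspace.gather_context(task)              # read the codebase
    plan = model.complete(f"plan for {task}: {context}")  # plan changes
    for i in range(max_iterations):
        edits = model.complete(f"edits for: {plan}")      # produce edits
        workspace.apply(edits)                            # execute edits
        result = workspace.run_tests()                    # run tests
        if result.passed:
            return i + 1                                  # rounds it took
        plan = model.complete(                            # iterate on failure
            f"tests failed ({result.log}); revise: {plan}")
    raise RuntimeError("agent did not converge")

rounds = run_agent("fix the login bug", StubModel(), StubWorkspace())
print(rounds)  # 2 -- the stub's tests pass on the second edit round
```

swap StubModel for any base model that clears the quality bar and the loop doesn't change, which is the point.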
cursor acknowledged this indirectly. lee robinson, their VP of developer education, said "only ~1/4 of the compute spent on the final model came from the base, the rest is from our training." they did their own reinforcement learning on top of kimi. the base model was a starting point, not the finished product.
this is the pattern i keep seeing at hype lab when we build agent systems for clients. the model matters less than the orchestration around it. how the agent decides what to do next, how it recovers from errors, how it maintains context across a multi-step task. swap the model and the system still works. break the orchestration and it doesn't matter how smart the model is.
event-driven automations are where enterprise value starts
cursor also shipped automations recently. agents that trigger on events: a new commit, a slack message, a pagerduty incident, a scheduled timer. the agent runs in the background without anyone prompting it.
this is where the enterprise conversation is heading. not "which model should we use?" but "what workflows can agents handle autonomously?" cursor is processing hundreds of automations per hour internally. their customers are setting up agents that review PRs, audit dependencies at 2am, triage bugs from slack, and post weekly digests of repo activity.
none of that value comes from the base model. it comes from trigger design, MCP integrations, failure handling, and output routing. the model is the engine. the automation framework is the car.
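reduced to a sketch, that framework is a trigger-to-handler map plus explicit failure handling. event types, handler names, and the retry policy below are all illustrative, not cursor's design:

```python
import time

def run_with_retries(handler, event, retries=3, backoff=0.1):
    """failure handling: retry transient errors instead of hoping."""
    for attempt in range(retries):
        try:
            return handler(event)
        except TimeoutError:
            time.sleep(backoff * 2 ** attempt)   # exponential backoff
    return None  # in a real system: dead-letter queue or page someone

class AutomationRouter:
    """maps event types (commit, slack message, timer) to agent handlers."""
    def __init__(self):
        self.handlers = {}
    def on(self, event_type):
        def register(fn):
            self.handlers[event_type] = fn
            return fn
        return register
    def dispatch(self, event):
        handler = self.handlers.get(event["type"])
        if handler is None:
            return None          # unknown trigger: ignore, don't crash
        return run_with_retries(handler, event)

router = AutomationRouter()

@router.on("commit")
def review_commit(event):
    # a real handler would invoke the agent loop; this one just echoes
    return f"reviewed {event['sha']}"

print(router.dispatch({"type": "commit", "sha": "abc123"}))  # reviewed abc123
print(router.dispatch({"type": "deploy"}))                   # None
```

notice the model never appears in this layer at all; it only shows up inside whatever the handler calls.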
the stack that clients are paying for
mckinsey's research shows 88% of organizations use AI in at least one function but only 7% have scaled it. gartner estimates 40% of enterprise apps will include agents by end of year. that gap between "using AI" and "scaling AI" is where the money is.
when i talk to enterprise buyers at hype lab, the conversation has shifted noticeably in the last six months. it used to be: "should we use GPT-4 or claude?" now it's: "what does the agent stack look like?"
the stack they're asking about:
orchestration. how does the agent decide what to do, in what order, with what tools? this is the workflow logic that cursor built on top of kimi and that every production agent system needs.
memory. how does the agent maintain context across sessions and tasks? raw model context windows aren't enough for agents that run for days or weeks.
evals. how do you know the agent is doing a good job? not benchmarks on public datasets, but domain-specific evaluation of whether the agent actually completed the task correctly in your environment.
tool access and governance. which systems can the agent reach? what permissions does it have? who approved that scope? this is the MCP governance layer i wrote about in my last post.
failure handling. what happens when the model hallucinates, the API times out, or the agent gets stuck? production systems need explicit error recovery, not hope.
none of those layers depend on a specific base model. all of them determine whether an agent system actually works in production.
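as one concrete example, the failure-handling layer comes down to explicit, bounded recovery instead of hope. this is an illustrative sketch under assumed names, not any particular framework's API:

```python
class StepFailed(Exception):
    """raised when a step exhausts its retries -- escalate, don't loop."""

def run_step(call_model, validate, max_retries=3):
    """run one agent step with explicit recovery: retry transient
    timeouts, reject outputs that fail a domain-specific validator."""
    last_error = "no attempts made"
    for attempt in range(max_retries):
        try:
            output = call_model()
        except TimeoutError:
            last_error = f"timeout on attempt {attempt + 1}"
            continue                          # transient: retry
        if validate(output):
            return output                     # success
        last_error = f"invalid output on attempt {attempt + 1}"
    raise StepFailed(last_error)              # bounded, then escalate

# usage: a flaky stand-in model that times out once, then answers
calls = {"n": 0}
def flaky_model():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError
    return "diff --git a/app.py"

print(run_step(flaky_model, validate=lambda s: s.startswith("diff")))
# diff --git a/app.py
```

the validator is where domain-specific evals plug in: "is this output actually a diff" is a cheap check that catches a whole class of hallucinations before they touch the repo.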
how to talk to buyers when every lab claims frontier performance
if you're selling agent services or products, the cursor/kimi story is useful framing for client conversations. when a buyer says "we need to use the best model," the honest answer is: for most agent use cases, several models clear the quality bar. the differences between them matter less than the quality of the stack around them.
that's not comfortable for companies whose pitch is "we use the most advanced model." but it's where the market is going. cursor proved it by shipping a product people loved on a model people hadn't heard of.
the value is in the stack. the model is a component. building that stack well is harder than picking the right model, and it's where production agent systems actually succeed or fail. if you're still leading with model selection, you're optimizing the wrong variable.