Most likely they built this by post-training an open model that is already strong at coding, like Qwen 2.5 (rough sketch of what that might look like below).
This data is very valuable if you're trying to create fully automated SWEs; most foundation model providers have probably been scraping together second-hand data to simulate long-horizon engineering work. Cursor probably has way more of this data, and I wonder how Microsoft's own Copilot is doing (and how they share this data with the foundation model providers)...
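For anyone wondering what "post-training an open model" means concretely, here's a minimal supervised fine-tuning sketch using Hugging Face TRL. The dataset name is a hypothetical stand-in for the kind of agentic coding traces discussed above; the Qwen checkpoint is real, but the rest is illustrative:

    # Minimal SFT sketch: post-train an open coding model on agentic traces.
    # "acme/agentic-coding-traces" is a hypothetical dataset placeholder.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("acme/agentic-coding-traces", split="train")

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-Coder-7B-Instruct",  # open base, already strong at coding
        train_dataset=dataset,
        args=SFTConfig(output_dir="qwen2.5-coder-swe-sft"),
    )
    trainer.train()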
open source alternative https://huggingface.co/SWE-bench/SWE-agent-LM-32B
though I haven't been able to find an MLX quant that wasn't completely broken.
For coding you use Anthropic or Google models; I haven't found anyone who swears by OpenAI models for coding... Their reasoning models are either too expensive or hallucinate massively to the point of being useless... I would assume the GPT-4.1 family will be popular for SWEs.
Having a smaller-scope model (agentic coding only) allows for much cheaper inference and lets Windsurf build its own moat (so far, agentic IDEs haven't had one).
This suggests OpenAI models do have tasks they're better at than the "less rounded" competition, which in turn has tasks it's weaker at. Could you name a single such task (other than image generation, which is an entirely different use case) that OpenAI models are better at than Gemini 2.5 and Claude 3.7 without costing at least 5x as much?
It is very puzzling why "wrapper" companies don't (and religiously say they won't ever) do something on this front. The only barrier is talent.
That being said, I am sure a lot of the so-called wrapper companies are paying insanely well too, but competing with FAANGMULA might be trickier for them.
Their stated goal is to improve on the frontier models. It's ambitious, but on the other hand they were a model company before they were an IDE company (IIRC), they have a lot of data, and the scope is limited to a model specialized for their specific case.
At the very least I would expect them to succeed in specializing a frontier model for their use case by feeding it their pipeline of data (whether they should have that data to begin with is another question).
The blog post doesn't say much about the model itself, but there are a few candidates to fine-tune from.
Cynical take: describing yourself as a full-stack AI IDE company sounds very investable in a "what if they're right" kind of way. They could plausibly ask for higher valuations, etc.
Optimistic take: fine-tuning a model for their use case (incomplete code snippets with a very specific data model of context) should work. Or, going by their claims, already has. It certainly sounds plausible that fine-tuning a frontier model would make it better for their needs. Whether it's reasonable to go beyond fine-tuning and consider pre-training etc., I don't know. If I remember correctly, they were a model company before Windsurf, so they have the skill set.
Bonus take: doesn't this mean they're basically training on user data gathered at scale?
First, most of the major players already have their own models or have been developing them for some time, so your take feels a bit reductive. Take Windsurf pre-acquisition, for example: their risk was being too tightly coupled to third-party vendors. It's only logical to assume that building task- or language-specific models will ultimately help reduce costs and offer more control.
As for the other point: in my experience, trying to fully leverage LLMs actually makes me more prescriptive in my designs. I spend more time thinking through architecture and making my code modular, more so than when I wasn’t using an LLM. I’m sure others may design less or take shortcuts, but for me it’s pushed the opposite behavior. Is it the “right” way? I’m not sure, but I’m enjoying it and staying productive.
OAI is trying frantically to build a moat without doing any digging.
I do think that’s an overly cynical way to look at this though.
- OpenAI is buying Windsurf and probably did diligence on these models before it decided to invest.
- Windsurf may have collected valuable data from its users that is helpful in training a coding-focused AI model. That data would give OpenAI a six-month lead, which is probably worth the $3B.
- Even if Windsurf's frontier models are not better than other models for coding, if they excel in a few key areas it would justify significant investment in their methodology (see points above).
- There are still areas of coding where even the top frontier models falter that would seemingly be ripe for improvement via more careful training. Notably, making the model better at working within a particular framework and version, programming-language version, etc. Also better support for more obscure languages and libraries/versions, and the ability to "lock in" on the versions the developer is using. I've wasted a lot of time trying to convince OpenAI models to use OpenAI's latest Python API -- even when given docs and explicit constraints to use the new API, OpenAI frontier models routinely (incorrectly) rewrite my code to use old API conventions and even methods that have been removed (see the sketch after this list).
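To make that concrete, this is the migration the models keep regressing on (from memory, so double-check against the SDK docs): the v1 Python SDK replaced module-level calls with a client object.

    # Old (openai < 1.0) -- removed, but models keep emitting it:
    import openai
    openai.api_key = "sk-..."
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp["choices"][0]["message"]["content"])

    # New (openai >= 1.0) -- what I actually ask for:
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)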
Consider that the basic competency of doing a frontier coding model well is likely one of the biggest opportunities in AI right now (second to reasoning and in my opinion tied with image analysis and production). An LLM that can both reason and code accurately could read a chapter in a textbook and code a 3D animation illustrating all of the concepts as a one-shot exercise. We are far from that at present even in OpenAI's best stuff.
It is a bit of a shame that we’ll never get to see what they could do on their own. But I hope their clearly very talented employees do very well out of this.
> you can't do what many of us do: have three subscriptions and use each for its best
I don't think this has anything to do with whether or not AI is in the editor; it's more the difference between a subscription (Cursor) and a BYOK approach (VS Codium + Cline, Zed, etc.). Most BYOK plug-ins will let you set up multiple profiles against various providers so that you can choose the best LLM for the given problem you're trying to solve (roughly the pattern sketched below).
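The "profiles" idea boils down to something like this, using the OpenAI-compatible endpoints many providers expose (the OpenRouter URL is from memory and the profile names are made up, so treat it as illustrative):

    # One OpenAI-compatible client per provider, selected per task.
    from openai import OpenAI

    PROFILES = {
        "openai": OpenAI(),  # reads OPENAI_API_KEY from the environment
        "openrouter": OpenAI(  # hypothetical second profile
            base_url="https://openrouter.ai/api/v1",
            api_key="sk-or-...",
        ),
    }

    def ask(profile: str, model: str, prompt: str) -> str:
        # Route the same request through whichever provider fits the task.
        resp = PROFILES[profile].chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content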
Note: I'm not saying that's a bad thing! It's significantly more convenient for many use cases, so I can see why it's a default. But the incentive being created is to accept first, analyze later.
We are paying for more "manageable" AI agents to get stuff done, not a chaotic "genius-hacker" to hack together quick prototypes.
Then there were the MS Access and Excel amateur efforts. I worked at a company that for years had a very profitable business replacing in-house MS Access spaghetti with our well-designed application.
Aaaand..... here we go.... deja vu all over again....