
579 points by paulpauper | 2 comments
wg0 | No.43609507
Unlike many, I find the author's complaints spot on.

Once all the AI batch startups have sold subscriptions to that cohort, there will be no further market growth, because businesses outside it don't want to roll the dice on a probabilistic model that has no real understanding of anything and is instead a clever imitation machine trained on the content it has seen. At that point the AI bubble will burst, and more startups will start packing up by the end of 2026, 2027 at the latest.

replies(1): >>43612749 #
consumer451 | No.43612749
I would go even further than TFA. In my personal experience using Windsurf daily, Sonnet 3.5 is still my preferred model. 3.7 makes many more changes that I did not ask for, often breaking things. This is an issue with many models, but it got worse with 3.7.
replies(3): >>43612845 #>>43612928 #>>43616646 #
1. cootsnuck | No.43612928
Yeah, I've experienced this too with 3.7, though not always; it has been helpful for me more often than not. But yeah, 3.5 "felt" better to me.

Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.

It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.

Would be interesting to see some methods come out for individuals to measure their own personal success rate / productivity / whatever with these different models. And then have a way for people to compare them with each other, so we can figure out who is working well with these models, who isn't, and where the difference comes from.
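
Even something as simple as a per-task log you fill in yourself and summarize per model would be a start. To be clear, this is just a sketch; the file name, the fields, and what counts as "success" are placeholders I made up, not anything standard:

    // track.ts -- append one record per AI-assisted task, then summarize per model.
    // File name, fields, and the definition of "success" are placeholders.
    import * as fs from "fs";

    interface Session {
      model: string;    // e.g. "sonnet-3.5" vs "sonnet-3.7"
      tool: string;     // e.g. "windsurf", "cursor"
      task: string;     // short note on what you asked for
      success: boolean; // did it do the thing without breaking something else?
      minutes: number;  // rough time spent, including cleaning up after it
    }

    const LOG = "ai-sessions.jsonl";

    export function record(s: Session): void {
      fs.appendFileSync(LOG, JSON.stringify({ ...s, at: new Date().toISOString() }) + "\n");
    }

    export function summarize(): void {
      if (!fs.existsSync(LOG)) return;
      const byModel = new Map<string, { ok: number; total: number }>();
      for (const line of fs.readFileSync(LOG, "utf8").trim().split("\n")) {
        const s: Session = JSON.parse(line);
        const agg = byModel.get(s.model) ?? { ok: 0, total: 0 };
        agg.total++;
        if (s.success) agg.ok++;
        byModel.set(s.model, agg);
      }
      for (const [model, a] of byModel) {
        console.log(`${model}: ${a.ok}/${a.total} tasks succeeded`);
      }
    }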

replies(1): >>43614303 #
2. consumer451 | No.43614303
> Would be interesting to see some methods come out for individuals to measure their own personal success rate / productivity / whatever with these different models. And then have a way for people to compare them with each other, so we can figure out who is working well with these models, who isn't, and where the difference comes from.

This would be so useful. I have thought about this missing piece a lot.

Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.

This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully will have a popular blog/site on their hands.
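
Roughly what I have in mind: since these editors are Electron apps, Playwright's experimental Electron support can in principle attach to them. Big caveat: the executable path and the selectors below are placeholders I invented; each tool would need its own, and they would break with every UI update.

    // bench.ts -- sketch of driving an Electron-based editor with Playwright.
    // executablePath, args, and the data-testid selectors are placeholders, not real values.
    import { _electron as electron } from "playwright";

    export async function runPrompt(prompt: string): Promise<string> {
      const app = await electron.launch({
        executablePath: "/path/to/editor-binary", // placeholder: Windsurf or Cursor install
        args: ["/path/to/throwaway-test-repo"],   // placeholder: a project the model is allowed to break
      });
      const window = await app.firstWindow();

      // Placeholder selectors for the tool's chat panel.
      await window.locator('[data-testid="chat-input"]').fill(prompt);
      await window.keyboard.press("Enter");
      const response = window.locator('[data-testid="chat-response"]').last();
      await response.waitFor();
      const text = await response.innerText();

      await app.close();
      // A real harness would then run the repo's test suite and log pass/fail per
      // model and per tool, feeding a log like the tracking sketch upthread.
      return text;
    }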