
579 points by paulpauper | 5 comments
1. ants_everywhere No.43604454
There are real and obvious improvements in the past few model updates, and I'm not sure where the disconnect is.

Maybe it's that I do have PhD-level questions to ask them, and they've gotten much better at answering those.

But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.

Or perhaps they have a time-sensitive task and can't take advantage of modern reasoning models, whose thinking step makes the feedback loop slower. Or maybe their code base is getting more complicated, so it's harder to reason about.

Or perhaps they're giving the LLMs a poorly defined task that older models papered over with assumptions, while newer models recognize the ambiguity and so find the space of solutions harder to navigate.

Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard, and the more you know about it the harder it gets. Adversaries are also bound to be using AI and increasing in sophistication, which would lower efficacy (although you could tease this effect out by trying older models against the newer threats).
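A rough sketch of that comparison, with hypothetical model names, made-up threat samples, and a stubbed-out scoring call standing in for a real client:

    # 2x2 comparison: older/newer models against older/newer threat samples.
    # Model names, samples, and flags_threat() are placeholders, not a real API.
    from itertools import product

    MODELS = ["model-2023", "model-2025"]
    THREAT_SETS = {
        "threats-2023": ["sample-a", "sample-b"],
        "threats-2025": ["sample-c", "sample-d"],
    }

    def flags_threat(model: str, sample: str) -> bool:
        """Stand-in for a real model call; returns True when the threat is flagged."""
        return hash((model, sample)) % 2 == 0  # dummy result, swap in a real client

    def detection_rate(model: str, samples: list[str]) -> float:
        return sum(flags_threat(model, s) for s in samples) / len(samples)

    for model, (set_name, samples) in product(MODELS, THREAT_SETS.items()):
        print(f"{model} on {set_name}: {detection_rate(model, samples):.2f}")

If newer threats lower the score for both models, the adversaries got better; if only the newer model drops on the same threats, the model itself regressed.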

replies(4): >>43604490 #>>43604882 #>>43610080 #>>43620429 #
2. pclmulqdq No.43604490
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while precisely describing the problem has produced better and better results.

In other words, the lazy prompt-engineering hacks are becoming less effective. Domain expertise is becoming more effective.
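For example, this is the kind of contrast I mean, with made-up wording on both sides (neither string is from my actual tests):

    # Persona-style "expert" prompt: lots of framing, little information.
    persona_prompt = (
        "You are a world-class expert on database performance. "
        "Make my query faster."
    )

    # Precise prompt: states the schema, the symptom, and the constraint.
    precise_prompt = (
        "Postgres 16, table orders(id bigint PK, customer_id bigint, "
        "created_at timestamptz), ~120M rows, index on (customer_id). "
        "SELECT * FROM orders WHERE customer_id = $1 "
        "ORDER BY created_at DESC LIMIT 20 takes ~800 ms. "
        "Suggest an index or rewrite that avoids the sort."
    )

The second prompt gives the model something concrete to reason about, which is exactly where the newer models seem to have improved.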

replies(1): >>43604593 #
3. ants_everywhere No.43604593
Yes, that would explain the effect, I think. I'll try that out this week.
4. DebtDeflation No.43610080
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental.

Yes, we got multimodal, but that was part of GPT-4; they just didn't release it initially, and until very recently it mostly handed off to another model. Yes, we got reasoning models, but people had been using CoT for a while, so it was only a matter of time before RL was used to train it into the models.

Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router that picks the best existing model to hand a prompt off to.
5. stafferxrr No.43620429
It's like how I'm not impressed by the models' progress in chemistry knowledge.

Why? Because I know so little about chemistry myself that I wouldn't even know what to ask the model in order to be impressed by the answer.

For the model to be useful at all, I would have to learn basic chemistry myself.

Many people, I suspect, are in this same situation across all subjects: they really don't know much of anything and are therefore unimpressed by the models' responses, in the same way I'm not impressed by the chemistry responses.