579 points by paulpauper | 1 comment
ants_everywhere No.43604454
There have been real and obvious improvements over the past few model updates, and I'm not sure what the disconnect is.

Maybe it's that I do have PhD-level questions to ask them, and they've gotten much better at answering those.

But I suspect these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model, and it works less well with later models.

Or perhaps they have a time-sensitive task and can't take advantage of modern reasoning LLMs, which have a slow, thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.

Or perhaps they're giving the LLMs a poorly defined task that older models simply made assumptions about, whereas newer models recognize the ambiguity and so find the space of solutions harder to navigate.

Since this is ultimately from a company doing AI scanning for security, I would think that last factor plays a role to some extent. Security is insanely hard, and the more you know about it, the harder it gets. Adversaries are also bound to be using AI and are growing in sophistication, which would lower measured efficacy (although you could tease this effect out by trying older models against the newer threats).

replies(4): >>43604490 #>>43604882 #>>43610080 #>>43620429 #
1. stafferxrr No.43620429
It's like how I'm not impressed by the models' progress in chemistry knowledge.

Why? Because I know so little about chemistry myself that I wouldn't even know what to ask the model in order to be impressed by the answer.

For the model to be useful to me at all, I would first have to learn basic chemistry myself.

Many people, I suspect, are in this same situation across all subjects: they don't know much of anything, and are therefore unimpressed by the model's responses in the same way I'm unimpressed by its chemistry responses.