
340 points | agomez314 | 3 comments
Ozzie_osman | No.35247111
Looking at human labor, we have generalists (e.g., a college grad with a general major) who can do a broad range of tasks but can't do very specialized ones, and then experts who can do specialized tasks with very high accuracy (and are much more expensive).

My guess is LLMs will evolve the same way. You will have general base models like GPT-4 (I'm assuming we will solve the hallucination problem), then folks will build highly specialized "expert" LLMs for specific domains.

You could totally imagine a base LLM delegating to the expert LLMs via some agent/Toolformer-style setup, too.
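For what it's worth, here's a minimal sketch of what that delegation could look like, assuming each model is just a prompt-in/text-out callable and using a crude "ask the base model to classify, then forward" router. The names here are made up for illustration, not any real API:

    # Hypothetical sketch of a base LLM delegating to expert LLMs.
    # `base_model` and the experts are stand-ins for whatever
    # completion API you actually use.

    from typing import Callable, Dict

    # Each "model" is just a function from prompt -> completion here.
    Model = Callable[[str], str]

    def make_router(base_model: Model, experts: Dict[str, Model]) -> Model:
        """Ask the base model which domain a query belongs to, then
        forward the query to the matching expert (or answer directly)."""
        def route(prompt: str) -> str:
            domains = ", ".join(experts)
            decision = base_model(
                f"Classify this request into one of [{domains}] "
                f"or 'general':\n{prompt}\nAnswer with one word."
            ).strip().lower()
            expert = experts.get(decision)
            return expert(prompt) if expert else base_model(prompt)
        return route

    # Usage with toy stand-in models:
    base = lambda p: "legal" if "contract" in p else "general answer"
    experts = {"legal": lambda p: "expert legal answer",
               "medical": lambda p: "expert medical answer"}
    ask = make_router(base, experts)
    print(ask("Review this contract clause for me"))  # -> expert legal answer

In practice you'd probably let the base model pick the tool itself (Toolformer-style) rather than hard-code a classification prompt, but the shape is the same.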

1. thwayunion | No.35247282
> I'm assuming we will solve the hallucination problem

It's unclear what this would even mean: "hallucination" carries a surprising number of different definitions, and commentators are rarely precise about which one they intend.

But, color me skeptical. We will never solve the problem of a token prediction engine being able to generate a sequence of tokens that the vast majority of humans interpret as not corresponding to a true statement. Perhaps in very particular and constrained domains we can build systems that, through a variety of mechanisms, provide trustworthy automation despite the ever-present risk of hallucination. Machine-checked mathematical proofs are an obvious case: the model can hallucinate because the overall system can gate-keep truth. Doing this in any other domain will, of course, be harder.
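To make the gate-keeping idea concrete, a rough generate-then-verify sketch. `generate_candidates` is a stand-in for sampling an LLM, and the toy arithmetic checker stands in for a proof assistant or other trusted verifier; nothing here is a real API:

    # Toy generate-then-verify loop: the generator may "hallucinate",
    # but only claims the checker can independently verify get through.

    import random
    from typing import Iterable, Optional

    def generate_candidates(question: str, n: int = 5) -> Iterable[str]:
        """Stand-in 'LLM': proposes answers, some of them wrong."""
        for _ in range(n):
            yield str(17 * 23 + random.choice([0, 1, -1, 10]))

    def checker(question: str, answer: str) -> bool:
        """Independent verifier that does not trust the generator."""
        return answer == str(17 * 23)

    def gated_answer(question: str) -> Optional[str]:
        for candidate in generate_candidates(question):
            if checker(question, candidate):
                return candidate    # only verified output escapes
        return None                 # abstain rather than hallucinate

    print(gated_answer("What is 17 * 23?"))  # "391", or None if every sample fails

The point is that trust lives in the checker, not the generator, which is exactly why this works for proofs and is so hard to replicate in open-ended domains.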

In other words: we may be able to mitigate and systematically manage the risk for particular types of tasks, but the problem of generating untrue statements is fundamental to the technology and will always require effort to manage and mitigate. In that sense, the whole conversation around hallucination is reminiscent of the frame problem.

2. textninja | No.35258861
> But, color me skeptical

That is not a creative color.

> We will never solve the problem of a token prediction engine being able to generate a sequence of tokens that the vast majority of humans interpret as not corresponding to a true statement.

I think we already solved that problem by making sure the vast majority of humans never agree about anything.

It is true that we probably won't get a machine trained on human output to ever be completely accurate (GIGO), but with the right systems and sensors we can at least get probabilistic accuracy. Let's not forget how human consensus gets shaken up every few centuries.
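One concrete version of "probabilistic accuracy" is self-consistency style majority voting: sample several times and only report an answer when most samples agree. A minimal sketch, with a noisy stand-in sampler in place of a real model call:

    # Majority-vote ("self-consistency") sketch: sample several answers
    # and only trust one if enough of the samples agree.

    import random
    from collections import Counter
    from typing import Callable, Optional

    def majority_answer(sample: Callable[[], str],
                        n: int = 9,
                        threshold: float = 0.6) -> Optional[str]:
        """Return the most common answer if it clears the agreement
        threshold, otherwise abstain (return None)."""
        counts = Counter(sample() for _ in range(n))
        answer, votes = counts.most_common(1)[0]
        return answer if votes / n >= threshold else None

    # Usage with a noisy stand-in that is right ~80% of the time:
    noisy = lambda: "Paris" if random.random() < 0.8 else "Lyon"
    print(majority_answer(noisy))  # usually "Paris", occasionally None

It doesn't make the model truthful, it just turns "sometimes wrong" into an error rate you can measure and tune against.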

3. thwayunion | No.35269449
> I think we already solved that problem by making sure the vast majority of humans never agree about anything.

Hah! clever :)

I guess the real bar is "behaves the way the skip-level manager of the ICs you're trying to automate expects, modulo the normal amount of filtering/butt-covering from the front-line managers".

Which, TBF, in many orgs is an almost vacuous bar.