
340 points agomez314 | 9 comments
thwayunion ◴[] No.35245821[source]
Absolutely correct.

We've already seen this play out with self-driving cars. Passing a driver's test was possible by 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #
1. dcolkitt ◴[] No.35246141[source]
I'd also add that almost all standardized tests cover introductory material designed for millions of people. That kind of information is likely to be highly represented in the training corpus. Most jobs, by contrast, require highly specialized domain knowledge that's probably not well represented in the corpus, and probably too expansive to fit into the context window.

Therefore standardized tests are probably "easy mode" for GPT, and we shouldn't over-generalize from its performance there to its ability to add value in actually economically useful jobs. Fine-tuning is maybe a possibility, but it's expensive and fragile, and I don't think it's likely that every single job will get a fine-tuned version of GPT.

replies(2): >>35246365 #>>35246438 #
2. Tostino ◴[] No.35246365[source]
From what I've gathered, fine-tuning should be used to train the model on a task, such as: "the user asks a question, please provide an answer or follow up with more questions for the user if there are unfamiliar concepts."

Fine-tuning should not be used to try to impart knowledge that didn't exist in the original training set; it's just the wrong tool for the job.
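
To make that concrete, here's a rough sketch of what task-style fine-tuning records look like. The prompt/completion JSONL layout and the example texts are purely illustrative, not anything OpenAI-specific:

    # Illustrative only: records like these teach a behaviour (answer, or ask a
    # clarifying question), not new facts. The format loosely follows the
    # prompt/completion JSONL that fine-tuning APIs commonly accept.
    import json

    records = [
        {"prompt": "User: What does the refund policy cover?\nAssistant:",
         "completion": " Which product did you buy? The policy differs by product."},
        {"prompt": "User: How do I reset my password?\nAssistant:",
         "completion": " Go to Settings -> Security -> Reset password and follow the emailed link."},
    ]

    with open("finetune.jsonl", "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")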

Knowledge graphs and vector similarity search seem like the way forward: build a corpus of information that we can search and include in the context window for the specific question a user is asking, without changing the model at all. That also makes it easy to keep only the relevant information in the context window when the user changes the immediate task/goal.
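
A minimal sketch of that retrieval flow, where embed() stands in for whatever embedding model you use and the index is a naive in-memory stand-in for a real vector database (all names here are made up):

    # Sketch: retrieve the most similar chunks, then stuff them into the prompt.
    import numpy as np

    class VectorIndex:
        def __init__(self):
            self.items = []  # (vector, text) pairs

        def add(self, vector, text):
            self.items.append((np.asarray(vector), text))

        def search(self, query_vec, k=3):
            q = np.asarray(query_vec)
            def score(item):
                v, _ = item
                return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            return [text for _, text in sorted(self.items, key=score, reverse=True)[:k]]

    def build_prompt(question, index, embed):
        # embed() is a placeholder for any embedding model you plug in.
        context = "\n\n".join(index.search(embed(question)))
        return ("Answer using only the context below.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}")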

Edit: You could think of it a bit like a Von Neumann architecture: the LLM is the CPU, and the external knowledge graph or vector database is RAM/disk. You don't expect the CPU to hold all the context necessary to complete every task your computer does; it just needs enough to hold the complete context of the task it is working on right now.

replies(2): >>35247310 #>>35248711 #
3. kolbe ◴[] No.35246438[source]
To add to that, these parlor tricks are nothing new. Watson won Jeopardy in 2011 and never produced anything useful. Doing well on the SAT is just another sleight-of-hand trick to distract us from the fact that it doesn't really do anything beyond aggregating online information.
replies(1): >>35248521 #
4. fud101 ◴[] No.35247310[source]
>From what I've gathered, fine-tuning should be used to train the model on a task, such as: "the user asks a question, please provide an answer or follow up with more questions for the user if there are unfamiliar concepts."

That isn't what fine-tuning usually means in this context. It usually means continuing to train the model on new data, using the existing weights as the starting point.

replies(1): >>35247858 #
5. Tostino ◴[] No.35247858{3}[source]
I may not have been clear: I was talking about the RLHF dataset/training that OpenAI fine-tuned their models on, which includes a whole bunch of question/answer-format data to help the fine-tuned models handle that type of query better (as well as constraining the model with a reward mechanism). I'm not saying fine-tuned models won't contain some representation of the information in the dataset you used to fine-tune them. I'm just saying that, from what I've researched, it is often not the magic trick many people think it is.

I've seen plenty of discussion about "fine-tuning" on a different dataset of, say, company documents, the database schema of an internal application, or summarized logs of your previous conversations with the bot.

Those seem like pretty bad targets IMO.

replies(1): >>35248810 #
6. WalterSear ◴[] No.35248521[source]
The issue at hand is that a huge number of people make a living by aggregating online information. They might convey this to others via speech, but the 'human touch' isn't always adding anything to the interaction.
7. visarga ◴[] No.35248711[source]
There can be footguns in the retrieval approach. Yes, you keep the model fixed, only add new data to your index, and let the model query the index. But when the model gets two snippets from different documents, it might combine information across them even when that doesn't make sense. The model lacks context when it just retrieves arbitrary things based on search.
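
A toy illustration of that failure mode (the documents and snippets are made up):

    # Two snippets retrieved from *different* documents get pasted into one prompt
    # with no provenance, inviting the model to merge them.
    snippets = [
        "Acme's refund window is 30 days.",        # from policy_2021.pdf
        "Refunds are processed within 14 days.",   # from faq_2023.html
    ]
    prompt = "Context:\n" + "\n".join(snippets) + "\n\nQ: How long do Acme refunds take?"
    # Without knowing these came from different documents (and years), the model
    # may blend them into one confident but wrong answer.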
replies(1): >>35289798 #
8. visarga ◴[] No.35248810{4}[source]
You're right, the RLHF fine-tuning is not adding any information to the model. It just steers the model towards our intentions.

But regular fine-tuning is just language modelling. You can fine-tune a GPT-3 on any collection of texts to refresh information in the public model that has been stale since 2021.
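
Roughly what that looks like, using an open checkpoint as a stand-in (GPT-3 itself is only tunable through OpenAI's API, so this sketch uses GPT-2 via Hugging Face; the file name is a placeholder):

    # Continue causal-LM training on a new text corpus to refresh the model's knowledge.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ds = load_dataset("text", data_files={"train": "new_docs.txt"})["train"]
    ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tuned", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()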

9. Tostino ◴[] No.35289798{3}[source]
Yeah, honestly I see using a regular search index as a downside rather than a benefit with this tech. Conflicting info or low-quality blogspam seems to trip these LLMs up pretty badly.

Using a curated search index seems like a much better use case, especially for private data (company info, docs, db schemas, code, chat logs, etc.).