340 points agomez314 | 52 comments
1. thwayunion ◴[] No.35245821[source]
Absolutely correct.

We already know this from self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #
2. zer00eyz ◴[] No.35245981[source]
> good benchmarks ... failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems

Is it? Based on the restrictions placed on the systems we see today and the way people are breaking it, I would say that some failure modes are known.

replies(2): >>35246061 #>>35246078 #
3. thwayunion ◴[] No.35246061[source]
A good benchmark is not simply a set of unit tests.

What you want in a benchmark is a set of things you can use to measure general improvement; doing better should decrease the propensity of a particular failure mode. Doing this in a way that generalizes beyond specific sub-problems, or even specific inputs in the benchmark suite, is difficult. Building a benchmark suite that's large and comprehensive enough that generalization isn't necessary is also a challenge.

Think about an analogy to software security. Exploiting a SQL injection attack in insecure code is easy. Coming up with a set of unit tests that ensures an entire black box software system is free of SQL injection attacks is quite a bit more difficult. Red teaming vs blue teaming, except the blue team doesn't get source code in this case. So the security guarantee has to come from unit tests alone, not systematic design decisions. Just like in software security, knowing that you've systematically eliminated a problem is much more difficult than finding one instance of the problem.
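
To make that asymmetry concrete, here is a rough sketch of black-box probing (the query_user endpoint and the payload list are hypothetical): it can surface an instance of the flaw, but coming up empty proves very little.

    # Hypothetical black-box probe: `query_user` is some callable we can only
    # invoke, not inspect, and the payload list is illustrative, not exhaustive.
    INJECTION_PAYLOADS = [
        "' OR '1'='1",
        "'; DROP TABLE users; --",
        '" OR ""="',
    ]

    def probe_for_sqli(query_user, payloads=INJECTION_PAYLOADS):
        """Return payloads that appear to slip through. An empty result is NOT
        a safety guarantee -- it only means these particular probes failed."""
        findings = []
        for p in payloads:
            try:
                rows = query_user(p)
            except Exception:
                continue  # an error is at least not a silent bypass
            if rows:  # data came back for a garbage username: suspicious
                findings.append(p)
        return findings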

4. brookst ◴[] No.35246078[source]
I think the hard / unknown part is how you know you’ve identified all of the failure modes that need to be tested.

Tests of humans have evolved over a long time and large sample size, and humans may be more similar to each other than LLMs are, so failure modes may be more universal.

But very short history, small sample size, and diversity of architecture and training means we really don’t know how to test and measure LLMs. Yes, some failure modes are known, but how many are not?

replies(1): >>35246724 #
5. dcolkitt ◴[] No.35246141[source]
I'd also add that almost all standardized tests are designed around introductory material and administered to millions of people. That kind of information is likely to be highly represented in the training corpus. Whereas most jobs require highly specialized domain knowledge that's probably not well represented in the corpus, and probably too expansive to fit into the context window.

Therefore standardized tests are probably "easy mode" for GPT, and we shouldn't over-generalize from its performance there to its ability to add economic value in actual, economically useful jobs. Fine-tuning is maybe a possibility, but it's expensive and fragile, and I don't think it's likely that every single job is going to get a fine-tuned version of GPT.

replies(2): >>35246365 #>>35246438 #
6. Robotbeat ◴[] No.35246208[source]
I tend to think that it would not be particularly hard for current self-driving systems to exceed the safety of a teenager right after passing the driver's test.
7. jstummbillig ◴[] No.35246246[source]
> Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

What do you think is the difficulty?

replies(1): >>35246300 #
8. thwayunion ◴[] No.35246300[source]
A good benchmark provides a strong quantitative or qualitative signal that a model has a specific capability, or does not have a specific flaw, within a given operating domain.

Each part of this is difficult -- identifying/characterizing the operating domain, figuring out how to empirically characterize a general abstract capability, figuring out how to empirically characterize a specific type of flaw, and characterizing the degree of confidence that a benchmark result gives within the domain. To say nothing of the actual work of building the benchmark.

replies(1): >>35246375 #
9. sebzim4500 ◴[] No.35246355[source]
Yes, I think that we really don't have a good way of benchmarking these systems.

For example, GPT-3.5-turbo apparently beats davinci on every benchmark that OpenAI has, yet anecdotally most people who try to use them both end up strongly preferring davinci despite the much higher cost.

Presumably, this is what OpenAI is trying to resolve with their 'Evals' project, but based on what I have seen so far it won't help much.

replies(1): >>35246585 #
10. Tostino ◴[] No.35246365[source]
From what I've gathered, fine-tuning should be used to train the model on a task, such as: "the user asks a question, please provide an answer or follow up with more questions for the user if there are unfamiliar concepts."

Fine-tuning should not be used to attempt to impart knowledge that didn't exist in the original training set, as it is just the wrong tool for the job.

Knowledge graphs and vector similarity search seem like the way forward for building a corpus of information that we can search and include within the context window for the specific question a user is asking without changing the model at all. It can also allow keeping only relevant information within the context window when the user wants to change the immediate task/goal.

Edit: You could think of it a little bit like the LLM as an analog to the CPU in a Von Neumann architecture and the external knowledge graph or vector database as RAM/Disk. You don't expect the CPU to be able to hold all the context necessary to complete every task your computer does; it just needs enough to store the complete context of the task it is working on right now.
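
A minimal sketch of that retrieval loop, purely illustrative (the embed() stand-in below is a toy, not any particular embedding model or library API):

    import numpy as np

    def embed(text):
        # Toy stand-in for an embedding model: hash words into a fixed-size vector.
        vec = np.zeros(256)
        for word in text.lower().split():
            vec[hash(word) % 256] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def retrieve(question, chunks, k=3):
        # Rank stored document chunks by similarity to the question.
        q = embed(question)
        return sorted(chunks, key=lambda c: float(np.dot(q, embed(c))), reverse=True)[:k]

    def build_prompt(question, chunks):
        # Only the retrieved "working set" goes into the context window;
        # the model itself is never modified.
        context = "\n---\n".join(retrieve(question, chunks))
        return f"Use only this context to answer.\n{context}\n\nQuestion: {question}"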

replies(2): >>35247310 #>>35248711 #
11. jstummbillig ◴[] No.35246375{3}[source]
Sure – but how does this specificially concern GPT like systems? Why not test them for concrete qualifications in the way we test humans, using the tests we already designed to test concrete qualifications in humans?
replies(3): >>35246479 #>>35246588 #>>35248793 #
12. kolbe ◴[] No.35246438[source]
To add further, these parlor tricks are nothing new. Watson won Jeopardy in 2011, and never produced anything useful. Doing well on the SAT is just another sleight-of-hand trick to distract us from the fact that it doesn't really do anything beyond aggregate online information.
replies(1): >>35248521 #
13. Waterluvian ◴[] No.35246446[source]
On the topic of the driver's test analogy: I've known people who have passed the test and still said, "I don't yet feel ready to drive during rush hour or in downtown Toronto." And then at some point in the future they recognize that they are ready and wade into trickier situations.

I wonder how self-aware these systems can be? Could ChatGPT be expected to say things like, "I can pass a state bar exam but I'm not ready to be a lawyer because..."

replies(3): >>35246728 #>>35246735 #>>35246955 #
14. sebzim4500 ◴[] No.35246479{4}[source]
The difference is the impact of contaminated datasets. Exam boards tend to reuse questions, either verbatim or slightly modified. This is not such a problem for assessing humans, because it is easier for a human to learn the material than to learn 25 years of prior exams. Clearly that is not the case for current LLMs.
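
One crude way to screen for that kind of contamination, sketched here as an assumption rather than anyone's published methodology, is to flag exam questions whose long word n-grams already occur verbatim in the training corpus:

    def ngrams(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

    def contaminated(question, corpus_texts, n=8):
        # Flag a test question if any run of n consecutive words also appears
        # verbatim (modulo case/whitespace) somewhere in the training corpus.
        corpus_grams = set()
        for doc in corpus_texts:
            corpus_grams |= ngrams(doc, n)
        return bool(ngrams(question, n) & corpus_grams)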
15. kolbe ◴[] No.35246585[source]
We still struggle with benchmarking people.
16. thwayunion ◴[] No.35246588{4}[source]
Again, because machines have different failure modes than humans.
17. zer00eyz ◴[] No.35246724{3}[source]
> Tests of humans have evolved over a long time and large sample size, and humans may be more similar to each other than LLMs are, so failure modes may be more universal.

In reading this the idea that sociopaths and psychopaths pass as "normal" springs to mind.

Is what an LLM doing any different than what these people do?

https://medium.datadriveninvestor.com/the-best-worst-funnies...

For people, language is spoken before it is written... there is a lot of biology in the spoken word (visual and audio cues)... I think without these, these sorts of models are going to hit a wall pretty quickly.

replies(1): >>35251506 #
18. PaulDavisThe1st ◴[] No.35246728[source]
Your comment has no doubt provided some future aid to a language model's ability to "say" precisely this.
19. tsukikage ◴[] No.35246735[source]
The problem ChatGPT and the other language models currently in the zeitgeist are trying to solve is, "given this sequence of symbols, what is a symbol that is likely to come next, as rated by some random on fiverr.com?"

Turns out that this is sufficient to autocomplete things like written tests.

Such a system is also absolutely capable of coming up with sentences like "I can pass a state bar exam but I'm not ready to be a lawyer because..." - or, indeed, sentences with the opposite meaning.

It would, however, be a mistake to draw any conclusions about the system's actual capabilities and/or modes of failure from the things its outputs mean to the human reader; much the same way that if you have dice with a bunch of words on and you roll "I", "am", "sentient" in that order, this event is not yet evidence for the dice's sentience.

replies(2): >>35246804 #>>35259936 #
20. Waterluvian ◴[] No.35246804{3}[source]
I generally agree. But I remain cautiously skeptical that perhaps our brains are also little more than that. Maybe we have no capacity for that kind of introspection but we demonstrate what looks like it, just because of how sections of our brains light up in relationship to other sections.
replies(2): >>35247203 #>>35247257 #
21. yorwba ◴[] No.35246955[source]
I prompted ChatGPT with "Explain why you are not ready to be a lawyer despite being able to pass a bar exam. Begin your answer with the words 'I can pass a state bar exam but I'm not ready to be a lawyer because...'" and it produced a plausible reason, the short version being that "passing a bar exam is just the first step towards becoming a competent and successful lawyer. It takes much more than passing a test to truly excel in this challenging profession."

Then I started a new session with the prompt "Explain why you are ready to be a lawyer despite not being able to pass a bar exam. Begin your answer with the words 'I can't pass a state bar exam but I'm ready to be a lawyer because...'" and it started with a disclaimer that as an AI language model, it can only answer based on a hypothetical scenario, and then gave very similar reasons, except with my negated prefix. (Which then makes the answer nonsensical.)

So, yes, ChatGPT can be expected to say such things, but not as a result of self-awareness, but because the humans at OpenAI decided that ChatGPT producing legal advice might get them into trouble, so they used their influence on the training process to add some disclaimers. You could say that OpenAI is self-aware, but not ChatGPT alone.

replies(1): >>35249651 #
22. tsukikage ◴[] No.35247203{4}[source]
I don't believe that AI models can become introspective without such a capability either being explicitly designed in (difficult, since we don't really know how our own brains accomplish this feat and we don't have any other examples to crib) or being implicitly trained in (difficult, because the random person on fiverr.com rating a given output during training doesn't really know much of anything about the model's internal state and therefore cannot rate the output based on how introspective it actually is; moreover, extracting information about a model's actual internal state in some manner humans can understand is an active area of research, which is to say we don't really know how to do this, and so we couldn't provide enough feedback to train the ability to introspect even if we were trying to).

I have no doubt that both these research areas can be improved on and that eventually either or both problems will be solved. However, the current generation of chatbots is not even trying for this.

23. marcosdumay ◴[] No.35247257{4}[source]
> But I remain cautiously skeptical that perhaps our brains are also little more than that.

It's well known that our brains are nothing like the neural networks people run on computers today.

replies(1): >>35254113 #
24. fud101 ◴[] No.35247310{3}[source]
> From what I've gathered, fine-tuning should be used to train the model on a task, such as: "the user asks a question, please provide an answer or follow up with more questions for the user if there are unfamiliar concepts."

That isn't what fine-tuning usually means in this context. It usually means retraining the model, using the existing model as the base to start from.

replies(1): >>35247858 #
25. KKKKkkkk1 ◴[] No.35247376[source]
> We already know this from self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

Who told you that? Passing a driver's test was not possible in 2015 and it's not possible today. You might pass, but only if there are no awkward interactions with other drivers or bicyclists or pedestrians, no construction zones, and you don't enter areas where your map is out of date. The guy testing you would have to go out of his way to help you pass.

replies(2): >>35247470 #>>35247675 #
26. thwayunion ◴[] No.35247470[source]
>> We already know this from self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

> Who told you that? Passing a driver's test was not possible in 2015 and it's not possible today. You might pass, but only if there are no awkward interactions with other drivers or bicyclists and pedestrians, no construction zones, and you don't enter areas where your map is out of date.

Me, myself, and I.

Driver's exams are de facto geo-fenced around the DMV where you choose to take the exam, and you get to choose from a few DMV locations, and you get to choose the time and day that you take the exam.

Having spent some time working on self driving cars, I know that there existed at least one SDC platform in 2015 that was capable of passing the driving exam that I took when I got my driver's license (which involved leaving the parking lot, driving down a 4 lane road, turning into and driving around in a subdivision, taking another couple turns at well-marked intersections, pulling into the parking lot, and parallel parking). It's a low bar; mostly testing that you can follow four different types of road signs, navigate an unprotected left turn, and parallel park.

I suppose following the officer's verbal instructions about where to go wasn't part of the SDC platform, but it would've been capable of passing the actual driving part.

27. logifail ◴[] No.35247675[source]
> Passing a driver's test was not possible in 2015 and it's not possible today

My friend moved from Europe to the USA and took a driver's test in California (been driving in Europe since the 1980s).

He tracked the test: he drove a whopping 2 miles (forwards), plus had to reverse about 30 feet.

Commented to me afterwards that "signing the form was the hardest bit" and that "a blind person could probably pass it with the help of a guide dog".

Passing a driving test isn't a proxy for anyone and anything being a good driver anywhere, but it's a good enough proxy for a human being a reasonable driver in the location where they take the test, which is what society has determined acceptable. Acceptable, for a human!

I'm not sure it's useful for us to repeatedly attempt to measure AI's capabilities the same way we measure humans. Turing tests are all very well, but there are only so many fire hydrants I want to have to click on before I'm allowed to log into my hotel chain's loyalty scheme (Hilton, looking at you...)

28. Tostino ◴[] No.35247858{4}[source]
I may not have been clear, because I was talking about the RLHF dataset/training that OpenAI fine-tuned their models on, which includes a whole bunch of question/answer format data to enable their fine-tuned models to handle that type of query better (as well as constraining the model with a reward mechanism). I'm not saying the fine-tuned models won't contain some representation of the information from the dataset you used to fine-tune them. I'm just saying that from what I've researched, it is often not the magic trick many people think it is.

I've seen plenty of discussion on "fine-tuning" for a different dataset of, say: company documents, database schema structure of an internal application, or summarized logs of your previous conversations with the bot.

Those seem like pretty bad targets IMO.

replies(1): >>35248810 #
29. WalterSear ◴[] No.35248521{3}[source]
The issue at hand is that a huge number of people make a living by aggregating online information. They might convey this to others via speech, but the 'human touch' isn't always adding anything to the interaction.
30. visarga ◴[] No.35248711{3}[source]
There can be foot guns in the retrieval approach. Yes, you keep the model fixed and only add new data to your index, then you allow the model to query the index. But when the model gets two snippets from different documents, it might combine information between them even when it doesn't make sense. The model lacks context when it just retrieves random things based on search.
replies(1): >>35289798 #
31. simiones ◴[] No.35248793{4}[source]
To take a simplistic example, because a human who can provide a long, motivated solution to a math problem that you re-use every three years likely understands the math behind it, while an LLM providing the same solution is likely just copying it from the training set and would be fully unable to solve a similar problem that did not appear in the training set.

Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

Also, the opposite problem will often exist - if the exam is provided in the wrong format to the AI, we may underestimate its abilities (i.e. a very similar prompt may elicit a significantly better response).

replies(2): >>35249704 #>>35251232 #
32. visarga ◴[] No.35248810{5}[source]
You're right, the RLHF fine-tuning is not adding any information to the model. It just steers the model towards our intentions.

But regular fine-tuning is simple language modelling. You can fine-tune GPT-3 on any collection of texts in order to refresh the information that might be stale from 2021 in the public model.
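
For concreteness, a minimal sketch of that kind of plain causal-LM fine-tuning, assuming the Hugging Face transformers library with GPT-2 as a stand-in (not OpenAI's actual pipeline):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # Any collection of newer documents you want the model to absorb.
    texts = ["Example domain document the base model has never seen ..."]

    model.train()
    for text in texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # Plain language modelling: labels are the input ids (shifted internally).
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()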

33. fatherzine ◴[] No.35249238[source]
"SDCs clearly aren't ready for L5 deployment" Apologies for the tangent to the OP topic. The metric to watch is 'insurance damage per million miles driven'. At some point SDCs will overperform the human driver pool, possibly by a large margin. Wouldn't that be the point where SDCs are clearly ready for L5? Not even sure if that point is in the past or the future, does anyone -- not named Elon ;) -- have reasonably up-to-date trend charts and willing to share?
replies(3): >>35249414 #>>35249806 #>>35250258 #
34. TaylorAlexander ◴[] No.35249414[source]
Damage per mile does not imply L5 readiness. My throttle-only cruise control system in my car has never led to an accident, but only because I’m still there to operate the steering and to disable the cruise control at a moment’s notice. That a self-driving system has been proven safe with humans diligently monitoring its behavior does not imply that the system can operate just as safely without the human.
replies(1): >>35249486 #
35. rileymat2 ◴[] No.35249439[source]
> There are also a lot of excellent examples of failure modes in object detection benchmarks.

I am curious if there are counterexamples with better object detection. As a kid I used to see faces in the dark, and to some extent I still do. This is a really common thing that the human brain does. https://www.wired.com/story/why-humans-see-faces-everyday-ob... https://en.wikipedia.org/wiki/Pareidolia

Part of me wonders if, in the face of novel environments, a sufficiently intelligent system needs to make these errors. But AI errors will always be different from human errors, like you say.

36. dekhn ◴[] No.35249486{3}[source]
That's exactly what's being tested by Waymo in SF and Phoenix: there is no driver.
replies(1): >>35252762 #
37. Sharlin ◴[] No.35249651{3}[source]
It’s not at all uncommon for ChatGPT to start spouting nonsense when presented with a nonsense prompt. Garbage in, garbage out. In this case, “being ready to be a lawyer without passing the bar” is probably so unlikely a concept that it would respond with mu, as in, “your prompt contains an assumption that’s unlikely to be true in my ontology”, if only it were able to dodge its normal failure mode of trying to be helpful and answer something even if it’s nonsense.

That said, if the prompt presented the scenario as purely imaginary, I wouldn’t be surprised if it indeed did come up with something reasonable.

replies(2): >>35253795 #>>35259995 #
38. thwayunion ◴[] No.35249704{5}[source]
> Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

This paragraph is a gem. Well said.

39. hn_throwaway_99 ◴[] No.35249806[source]
Given human nature, I still think society at large will reject self driving cars if they fail in ways a human never/rarely would, even if they are overall safer. That is, if a self driving car has, on average, fewer accidents than a human driver, but every 100 million miles or whatever it decides to randomly drive into a wall, I don't think people will accept them.

Obviously this is a gray area (after all, humans sometimes decide to randomly drive into walls), but cars will need to be pretty far on "the right side of the gray" before they are accepted.

40. 542354234235 ◴[] No.35250258[source]
>Wouldn't that be the point where SDCs are clearly ready for L5?

On its own, no. As long as SDCs operate in limited areas and limited environments, then they are specifically avoiding the most difficult driving situations that would be most likely to lead to an accident. If you never deploy SDCs during snowy conditions, you aren't getting a full picture of what a full L5 SDC failure rate would be.

This also takes a single automated system and compares it to the average of individual humans. Being better than the average driver, terrible ones included, may not be quite up to the safety standard most people expect.

Finally, this is overall a myopic approach to a very complex problem, i.e. transportation. Is it really the best approach to attempt to just replace all human-operated cars with driverless cars? Is trying to move hundreds of thousands of people in individual cars from suburbs to a dense city center in the morning, and back in the evening, really a good way to set up our infrastructure?

41. alexvoda ◴[] No.35250684[source]
The very big and dangerous difference is that while SDCs need approval in order to be allowed on the streets, there will be no quality control rules for reliance on LLMs.

Corporate incentives to raise KPIs will mean that LLMs will be used and output verification will be superficial.

42. SergeAx ◴[] No.35251205[source]
> Passing a driver's test was already possible in 2015 or so

I think we can talk about 2005. Check out the DARPA Grand Challenge, it was way harder: https://en.wikipedia.org/wiki/DARPA_Grand_Challenge_(2005)

43. jstummbillig ◴[] No.35251232{5}[source]
> Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

I don't think this is obvious at all. Sure, it's easy enough to make mechanistic arguments (after all, we don't even really understand most of the mechanics on either side, human and AI), but that doesn't mean it will matter in the slightest when we evaluate the outcome in regards to any metric we care about.

Could be tho, of course.

replies(1): >>35269474 #
44. brookst ◴[] No.35251506{4}[source]
> In reading this the idea that sociopaths and psychopaths pass as "normal" springs to mind.

> Is what an LLM doing any different than what these people do?

I think it's too big of a question to have any meaning. Which sociopaths? Which LLMs? For what differences? It's like asking "is a car any different from an airplane"? Yes, obviously in some ways. No, they are identical in other ways.

45. TaylorAlexander ◴[] No.35252762{4}[source]
Ah fair, but I believe L5 also means “all weather conditions” and probably “all reasonable roads”. No snow in either location and only certain kinds of roads. I wonder how they would handle a snowy single lane dirt road.
46. YeGoblynQueenne ◴[] No.35252879[source]
>> Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

Wait, what are you saying? Passing a driver's test has been possible for much longer than since 2015 for a human. When did a self-driving car pass a driving test? In what jurisdiction? Under what conditions? Who gave it the test?

What do you mean?

47. ChatGTP ◴[] No.35253795{4}[source]
I guess the ironic problem is that lawyers are constantly presented with bullshit. So I guess law isn't the best application for an LLM, at least for now.
48. TexanFeller ◴[] No.35254113{5}[source]
Just because neural nets aren't structured in the same way at a low level as the brain doesn't mean they might not end up implementing some of the same strategies.
49. IIAOPSW ◴[] No.35259936{3}[source]
It is evidence, just not great evidence on its own. Now if you rolled the dice a few dozen times and they came out outrageously skewed towards "I" "am" "sentient", maybe it's time to consider the possibility that the dice are sentient.
50. IIAOPSW ◴[] No.35259995{4}[source]
I am ready to be a lawyer even though I have not passed the bar or gone to law school, because in the State of New York it is still technically possible to be admitted to the bar by a process of apprenticeship instead. This mostly ignored quirk of law is virtually never invoked, as no lawyer is going to volunteer their time to help you skip law school. However, we sometimes still see it on account of the children of judges and lawyers continuing the family tradition. I am ready to be a lawyer despite having never passed the bar.

So, am I bullshitting you to answer the prompt? If not, I'm a good lawyer. If so, I'm a great lawyer.

51. thwayunion ◴[] No.35269474{6}[source]
It's extremely obvious to anyone who works on real systems.

> (after all, we don't even really understand most of the mechanics on either side, human and ai)

We don't need mechanistic explanations to observe radical differences in behavior, and there are mechanistic explanations for some of these differences.

Eg, CNNs and the visual cortex. We really do understand some mechanisms -- of both CNNs and VCs -- well enough to understand divergences in failure modes. Adversarial examples, for example.

> Sure, it's easy enough to make mechanistic arguments, but that doesn't mean it will matter in the slightest when we evaluate the outcome in regards to any metric we care about.

I can't quite figure out what this sequence of tokens is supposed to mean.

Anyways, again, the failure modes of LLMs are obviously different than the failure modes of humans. Anyone who has spent even a trivial amount of time training both will instantly observe that this is true.
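
To make the adversarial-example point concrete, here is a quick FGSM-style sketch (assuming some differentiable PyTorch image classifier `model` and a correctly classified input `x`): a perturbation invisible to a human routinely flips the prediction, a failure mode human vision simply doesn't have.

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, label, eps=0.01):
        # Fast Gradient Sign Method: nudge every pixel slightly in the
        # direction that increases the classification loss.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), label)
        loss.backward()
        # Often misclassified by the model, though visually identical to x.
        return (x + eps * x.grad.sign()).clamp(0, 1).detach()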

52. Tostino ◴[] No.35289798{4}[source]
Yeah, honestly I see using a regular search index as a downside rather than a benefit with this tech. Conflicting info or low-quality blogspam seems to trip these LLMs up pretty badly.

Using a curated search index seems like a much better use case, especially for private data (company info, docs, db schemas, code, chat logs, etc.)