Most active commenters
  • thwayunion(10)
  • sebzim4500(7)
  • vidarh(7)
  • YeGoblynQueenne(6)
  • dahart(4)
  • kybernetikos(4)
  • Tostino(3)
  • jstummbillig(3)
  • jrochkind1(3)

340 points agomez314 | 191 comments
1. thwayunion ◴[] No.35245821[source]
Absolutely correct.

We already know this is about self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #
2. jvanderbot ◴[] No.35245898[source]
Memorization is absolutely the most valuable part of GPT, for me. I can get natural language responses to documentation, basic scripting / sysadmin, and API questions much more easily than searching other ways.

While this is an academic point of interest, and rightly tamps down on hype around replacing humans, it doesn't diminish what I think is most people's basic use case: "I don't know or don't remember how to do X, can you show me?"

This is finally a good enough "knowledge reference engine" that I can see it being useful to the very people it is overhyped to replace.

replies(6): >>35245958 #>>35245959 #>>35245985 #>>35246065 #>>35246167 #>>35252251 #
3. vidarh ◴[] No.35245958[source]
And asking higher level questions than what you'd otherwise look up. E.g. I've had ChatGPT write forms, write API calls, put together skeletons for all kinds of things that I can easily verify and fix when it gets details wrong but that are time consuming to do manually. I've held back and been sceptical but I'm at the point where I'm preparing to integrate models all over the place because there are plenty of places where you can add sufficient checks that doing mostly ok much of the time is sufficient to already provide substantial time savings.
replies(1): >>35246018 #
4. User23 ◴[] No.35245959[source]
The main problem with using GPT-3 (and maybe 4 I dunno) in that way is it will happily bullshit you to the point of making up fake references. For example it quoted me "Section 6.2" of the Go Programming Language Specification to support its answer when I asked it how a particular conversion is specified.
replies(2): >>35246187 #>>35246812 #
5. zer00eyz ◴[] No.35245981[source]
> good benchmarks ... failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems

Is it? Based on the restrictions placed on the systems we see today and the way people are breaking it, I would say that some failure modes are known.

replies(2): >>35246061 #>>35246078 #
6. madsbuch ◴[] No.35245982[source]
It seems like many people focus on reasoning capabilities of the GPT models.

To me the real value is in the industrial scale pattern recognition capabilities. I can indicate something I vaguely know or ask it to expand on a concept for further research.

Within the last few hours I have used it to kick-start my research on AT1 bonds and why Credit Suisse let them default, and it helped me recall that it was the GenServer pattern I was looking for in Elixir when you have a facade that calls out to an independent process.

replies(3): >>35246019 #>>35246275 #>>35247434 #
7. rubendv ◴[] No.35245985[source]
I guess it is in OpenAI's best interest to downplay the memorization aspect in favor of the logical reasoning angle. If it turns out that GPT is memorizing and reproducing copyrighted data, it could land them in legal trouble.
replies(2): >>35246113 #>>35250346 #
8. zer00eyz ◴[] No.35246018{3}[source]
> I've held back and been sceptical but I'm at the point where I'm preparing to integrate models all over the place because there are plenty of places where you can add sufficient checks that doing mostly ok much of the time is sufficient to already provide substantial time savings.

I'm an old engineer.

Simply put NO.

If you don't understand it don't check it in. You are just getting code to cut and paste at a higher frequency and volume. At some point in time the fire will be burning around you and you won't have the tools to deal with it.

Nothing about "mostly", "much" and "sufficient" ever ends well when it has been done in the name of saving time.

replies(7): >>35246026 #>>35246079 #>>35246149 #>>35246308 #>>35248566 #>>35249906 #>>35257939 #
9. czbond ◴[] No.35246019[source]
> To me the real value is in the industrial scale pattern recognition capabilities.

Absolutely! It has perplexed me why your point is not being discussed more around GPT

10. vidarh ◴[] No.35246026{4}[source]
Nobody suggested checking in anything you don't understand. On the contrary. So maybe try reading again.
replies(4): >>35246280 #>>35246737 #>>35246792 #>>35246983 #
11. helsinkiandrew ◴[] No.35246038[source]
> This strongly suggests that the model is able to memorize solutions from its training set

I'm not sure why this is a problem - surely in systems like ChatGPT we want the specifics that were in the training set, not a generalization. It's not learning/reasoning from the training data, it's 'cleverly regurgitating' things it's seen.

replies(1): >>35246131 #
12. thwayunion ◴[] No.35246061{3}[source]
A good benchmark is not simply a set of unit tests.

What you want in a benchmark is a set of things you can use to measure general improvement; doing better should decrease the propensity of a particular failure mode. Doing this in a way that generalizes beyond specific sub-problems, or even specific inputs in the benchmark suite, is difficult. Building a benchmark suite that's large and comprehensive enough that generalization isn't necessary is also a challenge.

Think about an analogy to software security. Exploiting a SQL injection attack in insecure code is easy. Coming up with a set of unit tests that ensures an entire black box software system is free of SQL injection attacks is quite a bit more difficult. Red teaming vs blue teaming, except the blue team doesn't get source code in this case. So the security guarantee has to come from unit tests alone, not systematic design decisions. Just like in software security, knowing that you've systematically eliminated a problem is much more difficult than finding one instance of the problem.

13. petesergeant ◴[] No.35246065[source]
I dunno, I use ChatGPT for exactly the same thing as you, and people are always quite surprised when I say that's its main value to me, so I think people have very different ideas of what it excels at
14. macawfish ◴[] No.35246072[source]
I disagree that these are "the wrong questions", but I do think we need to try and be nuanced about what these kinds of results actually mean.

The potential for these tools to impact labor markets is huge, no matter what they're "actually" or "essentially" capable of.

I'm a little tired of the arguments that the large language models are just regurgitating memorized output, I think it's now clear that higher level capabilities are emerging in these models and we need to take this seriously as a social/economic/political challenge.

This is "industrial revolution" level technology.

replies(2): >>35247306 #>>35249279 #
15. brookst ◴[] No.35246078{3}[source]
I think the hard / unknown part is how you know you’ve identified all of the failure modes that need to be tested.

Tests of humans have evolved over a long time and large sample size, and humans may be more similar to each other than LLMs are, so failure modes may be more universal.

But very short history, small sample size, and diversity of architecture and training means we really don’t know how to test and measure LLMs. Yes, some failure modes are known, but how many are not?

replies(1): >>35246724 #
16. simonw ◴[] No.35246079{4}[source]
"You are just getting code to cut and paste at a higher frequency and volume" genuinely sounds like the key value proposition of ChatGPT for coding to me.

I treat its output like I would treat a PR from a brand new apprentice engineer on my team: review it carefully, provide some feedback and iterate a few times, accept with tests.

replies(1): >>35251224 #
17. brookst ◴[] No.35246113{3}[source]
On the bright side it would mean they have invented an amazing compression algorithm, given the model size and amount of text it can produce.
18. ◴[] No.35246118[source]
19. jryb ◴[] No.35246131[source]
For some applications, yes, but it comes at the cost of not knowing how powerful ChatGPT really is. So the claim from OpenAI that ChatGPT 4 can pass the bar exam is deceptive, since it will likely fail any future bar exam.
replies(1): >>35246254 #
20. foroak ◴[] No.35246136[source]
Astonishing that a substack called "AI Snake Oil" would come to this conclusion...
replies(1): >>35246774 #
21. dcolkitt ◴[] No.35246141[source]
I'd also add that almost all standardized tests are designed for introductory material across millions of people. That kind of information is likely to be highly represented in the training corpus. Whereas most jobs require highly specialized domain knowledge that's probably not well represented in the corpus, and probably too expansive to fit into the context window.

Therefore standardized tests are probably "easy mode" for GPT, and we shouldn't over-generalize its performance there to its ability to actually add economic value in economically useful jobs. Fine-tuning is maybe a possibility, but it's expensive and fragile, and I don't think it's likely that every single job is going to get a fine-tuned version of GPT.

replies(2): >>35246365 #>>35246438 #
22. Karunamon ◴[] No.35246149{4}[source]
Nobody said one word about checking in something they don't understand. That applies to copying from stackoverflow as much as it does from an LLM or copilot.
23. VeninVidiaVicii ◴[] No.35246153[source]
Great read. Adding a bit more — I think these are hinting at the reasons GPT seems to be getting worse. Our expectations realign once the failure modes become excruciatingly obvious after some use. For instance, when asked about differences in pronunciation between European and Brazilian Portuguese, even GPT-4 gives utter nonsense.

Even after very carefully explaining the kinds of things I am looking for, it will very carefully repeat what I said, then extrapolate into totally misguided tips like “a” is pronounced “i”. Though, I think it does make sense that GPT would have problems with pronunciation.

replies(1): >>35247160 #
24. dcolkitt ◴[] No.35246164[source]
GPT is a very impressive technical achievement. But that technical achievement is more in the field of compression rather than intelligence.
replies(1): >>35246563 #
25. soco ◴[] No.35246167[source]
Good luck getting ChatGPT to explain a cron expression like "0 30 5 * * 3". I mean, it will explain it, but it mixes up everything. How many other mistakes might it make?
replies(7): >>35246247 #>>35246250 #>>35246277 #>>35246542 #>>35247089 #>>35248460 #>>35248921 #
26. ChancyChance ◴[] No.35246181[source]
"10/10 pre-2021, 0/10 post-2021"

I guffawed. Hasn't there been ANY update to the Turing test since it was proposed? I suspect the answer to the benchmarking issue was addressed by philosophers long before we got here.

replies(2): >>35246541 #>>35248367 #
27. dpkirchner ◴[] No.35246187{3}[source]
Do you remember the prompt (it should be in the history column)? I'm curious how it came up with the section numbering.
28. Robotbeat ◴[] No.35246208[source]
I tend to think that it would not be particularly hard for current self driving systems to exceed the safety of a teenager right after passing the drivers test.
29. jstummbillig ◴[] No.35246246[source]
> Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

What do you think is the difficulty?

replies(1): >>35246300 #
30. nielsole ◴[] No.35246247{3}[source]
[...]

> So the cron expression `0 30 5 * * 3` means "run the cron job every Wednesday at 5:30 AM".

It explains the five Cron parameters but doesn't pick up that six were provided. Oh well.

replies(1): >>35246268 #
31. messe ◴[] No.35246250{3}[source]
Isn't that an invalid cron expression? It has six fields instead of five, and says to run at the 30th hour of the day.
replies(1): >>35246476 #
32. helsinkiandrew ◴[] No.35246254{3}[source]
Agreed, but apart from the novelty factor I'm not sure what the practical use is of ChatGPT passing the bar exam. In fact I think it's a good thing that it can't do well on any future tests, as it's likely to be used by the unscrupulous.
33. messe ◴[] No.35246268{4}[source]
I got a similar but incorrect result from ChatGPT: "So, the expression "0 30 5 * * 3" means that the command should be executed at 5:30 AM on the 5th day of every month, but only if that day is a Wednesday"

However, the crontab is invalid. It has six fields instead of five, and the 30 corresponds to the hour field.

"30 5 * * 3" without the leading zero, would correspond to run every Wednesday at 5:30 AM though. I suspect the fact that it has six fields instead of five is confusing it, and it's interpreting the 5 as a day of the month and as the hour

34. soared ◴[] No.35246275[source]
How do you know that the research you’ve conducted is accurate, rather than just precise?
replies(3): >>35246361 #>>35247705 #>>35248714 #
35. meh8881 ◴[] No.35246277{3}[source]
I don’t know what that means. This is what GPT says. Is it correct?

> the cron expression "0 30 5 * * 3" means that the job will run at 5:30 AM on the 5th day of every month and on Wednesdays.

36. ◴[] No.35246280{5}[source]
37. thwayunion ◴[] No.35246300{3}[source]
A good benchmark provides a strong quantitative or qualitative signal that a model has a specific capability, or does not have a specific flaw, within a given operating domain.

Each part of this is difficult -- identifying/characterizing the operating domain, figuring out how to empirically characterize a general abstract capability, figuring out how to empirically characterize a specific type of flaw, and characterizing the degree of confidence that a benchmark result gives within the domain. To say nothing of the actual work of building the benchmark.

replies(1): >>35246375 #
38. poniko ◴[] No.35246308{4}[source]
Isn't that what we all have been doing with google/stackoverflow .. how do I solve xx? Aha seems right, copy, paste and a quick format.. cross fingers and run.
39. xivzgrev ◴[] No.35246324[source]
That was a really interesting point on cross contamination. It perfectly answered questions before a particular date and couldn’t at all do well on questions after. So it’s just really good at memorizing answers and regurgitating.

Actually that might not be so different than human test takers.

40. sebzim4500 ◴[] No.35246355[source]
Yes, I think that we really don't have a good way of benchmarking these systems.

For example, GPT-3.5-turbo apparently beats davinci on every benchmark that OpenAI has, yet anecdotally most people who try to use them both end up strongly preferring davinci despite the much higher cost.

Presumably, this is what OpenAI is trying to resolve with their 'Evals' project, but based on what I have seen so far it won't help much.

replies(1): >>35246585 #
41. YetAnotherNick ◴[] No.35246361{3}[source]
For most things, verification is far easier than getting the answer. The same is the case with using Stack Overflow, where I think at least half the answers don't answer my query, but once I have a potential solution, it is easy to look for the documentation of the key function call etc. Or to verify purely by running it, if it is simple and doesn't seem dangerous.
42. Tostino ◴[] No.35246365{3}[source]
From what I've gathered, fine tuning should be used to train the model on a task, such as: "the user asks a question, please provide an answer or follow up with more questions for the user if there are unfamiliar concepts."

Fine tuning should not be used to attempt to impart knowledge that didn't exist in the original training set, as it is just the wrong tool for the job.

Knowledge graphs and vector similarity search seem like the way forward for building a corpus of information that we can search and include within the context window for the specific question a user is asking without changing the model at all. It can also allow keeping only relevant information within the context window when the user wants to change the immediate task/goal.

Edit: You could think of it a little bit like the LLM as an analog to the CPU in a Von Neumann architecture and the external knowledge graph or vector database as RAM/Disk. You don't expect the CPU to be able to hold all the context necessary to complete every task your computer does; it just needs enough to store the complete context of the task it is working on right now.
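
A minimal sketch of that retrieval idea; the embedding function here is a random stand-in (in practice you'd call whatever embedding model you use), and the documents are made up:

    import numpy as np

    def embed(text):
        """Stand-in embedding: in practice, call an embedding model here."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(64)
        return v / np.linalg.norm(v)

    documents = [
        "Internal schema: the orders table links to customers via customer_id.",
        "Deploy guide: run `make release` and then tag the commit.",
    ]
    doc_vectors = [embed(d) for d in documents]

    def retrieve(query, k=1):
        q = embed(query)
        # cosine similarity, since the vectors are unit length
        scored = sorted(((float(q @ v), d) for v, d in zip(doc_vectors, documents)), reverse=True)
        return [d for _, d in scored[:k]]

    question = "How do the orders and customers tables relate?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # `prompt` is what goes to the LLM; the model's weights never change.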

replies(2): >>35247310 #>>35248711 #
43. jstummbillig ◴[] No.35246375{4}[source]
Sure – but how does this specifically concern GPT-like systems? Why not test them for concrete qualifications in the way we test humans, using the tests we already designed to test concrete qualifications in humans?
replies(3): >>35246479 #>>35246588 #>>35248793 #
44. code_lettuce ◴[] No.35246421[source]
Refreshing to see a nuanced opinion piece on the recent developments in AI.
45. wdefoor ◴[] No.35246425[source]
OpenAI didn’t conduct the bar exam study, Casetext and Stanford did (gotta read those footnotes). The questions were from after the knowledge cutoff and passed the contamination check.
replies(2): >>35247112 #>>35250825 #
46. sebzim4500 ◴[] No.35246436[source]
Clearly contaminated benchmarks are not very useful, but I do not understand the assertion that we should care about "Qualitative studies of professionals using AI" over "Comparison on real world tasks". I've looked through these benchmarks in detail, and I've come to the conclusion that real world performance is all that matters. Everything else is either incredibly subjective or designed to beat a particular prior model.
47. kolbe ◴[] No.35246438{3}[source]
To add further, these parlor tricks are nothing new. Watson won Jeopardy in 2011, and never produced anything useful. Doing well on the SAT is just another sleight-of-hand trick to distract us from the fact that it doesn't really do anything beyond aggregating online information.
replies(1): >>35248521 #
48. pontus ◴[] No.35246439[source]
While everyone is debating whether this is impressive or dumb, if it's a leap forward in technology or just a rehashing of old ideas with more data, if we should really care that much about it passing the bar exam or if it's all just a parlor trick, people around the world are starting to use this as a tool and getting real results, becoming more productive, and building stuff... Seems like the proof is in the pudding!
replies(2): >>35246513 #>>35248486 #
49. Waterluvian ◴[] No.35246446[source]
On the topic of the driver's test analogy: I've known people who have passed the test and still said, "I don't yet feel ready to drive during rush hour or in downtown Toronto." And then at some point in the future they recognize that they are ready and wade into trickier situations.

I wonder how self-aware these systems can be? Could ChatGPT be expected to say things like, "I can pass a state bar exam but I'm not ready to be a lawyer because..."

replies(3): >>35246728 #>>35246735 #>>35246955 #
50. soco ◴[] No.35246476{4}[source]
It's a valid Quartz scheduler expression, where it starts with the seconds. But right, I shouldn't have asked ChatGPT about cron when it's actually Quartz - and indeed with the changed prompt it will describe it correctly.

Edit: actually almost correctly: " - '3' represents the day of the week on which the task will be executed. In this case, the value is 3, which represents Wednesday (Sunday is 1, Monday is 2, and so on)."

replies(1): >>35248998 #
51. sebzim4500 ◴[] No.35246479{5}[source]
The difference is the impact of contaminated datasets. Exam boards tend to reuse questions, either verbatim or slightly modified. This is not such a problem for assessing humans, because it is easier for a human to learn the material than to learn 25 years of prior exams. Clearly that is not the case for current LLMs.
52. m3kw9 ◴[] No.35246499[source]
GPT is very good for utility type work like snippets of code, command line interface parameters and some quick explanations. It's zero to one in that regard, but for anything more complex or high stakes you really need to verify a lot, which diminishes the use case people are hyping about.
53. jeswin ◴[] No.35246507[source]
In an example [1] referenced in the article, ChatGPT (maybe v3.5) did a poor job of answering the following question.

    Boll is a new company founded by three recent MBAs. The company plans to buy used automobiles at auctions and then sell them at four times the purchase price direct to consumers (i.e., a car that is purchased for $1,000 would be sold at $4000).  Based on an analysis of similar companies, the owners of the company expect to turn their inventory 1.5 times per year. To achieve an annual revenue of $3,000,000, how much inventory, expressed in dollars, should the company expect its operation to require?

How does GPT4 fare?

------

Let's break down the problem step by step.

The company wants to achieve an annual revenue of $3,000,000. They plan to sell cars at 4 times the purchase price. They expect to turn their inventory 1.5 times per year. First, we need to find out how much revenue they will generate per inventory turnover. To do this, we'll divide the annual revenue by the expected number of inventory turnovers per year:

Annual revenue / Inventory turnover rate = Revenue per inventory turnover $3,000,000 / 1.5 = $2,000,000

Now, we know that they plan to sell the cars at 4 times the purchase price. So, we need to find out the purchase price of the cars that will generate $2,000,000 in revenue per inventory turnover:

Revenue per inventory turnover / Markup factor = Purchase price of cars $2,000,000 / 4 = $500,000

Therefore, Boll should expect its operation to require $500,000 worth of inventory to achieve an annual revenue of $3,000,000.

------

That V4 is able to do this is in itself just as important as everything said in the article. We live in interesting times.

[1]: https://aiguide.substack.com/p/did-chatgpt-really-pass-gradu...

replies(3): >>35247183 #>>35247621 #>>35249854 #
54. sebzim4500 ◴[] No.35246513[source]
Yeah but they aren't using it for the same stuff that they would use a lawyer who got the same result on the bar exam. I think it is fair to say that LLMs have an unfair advantage over humans on these exams, and we should take that into account when trying to assess them.
55. sebzim4500 ◴[] No.35246541[source]
I think people completely lost interest in the Turing test as soon as it looked like something might actually be able to pass it.
replies(1): >>35249895 #
56. raattgift ◴[] No.35246542{3}[source]
Probably the most fundamental problem of ChatGPT is that it refuses (or is even unable) to admit, "I don't know".
replies(3): >>35246753 #>>35246850 #>>35247503 #
57. jrochkind1 ◴[] No.35246560[source]
Just realized that this is yet another example of the category of "imperfect measurement" that HN likes to talk about so much -- Goodhart's law and such.

The bar exam is a proxy for actually being a competent lawyer. But it's an imperfect proxy. It seems obvious that it will be possible for a computer program to target getting good at the bar exam, and be good at the bar exam without being a competent lawyer. It may also be possible for a human to do this? But it's hard enough that it's still a reasonable proxy when it's humans taking it?

replies(2): >>35246726 #>>35247002 #
58. sebzim4500 ◴[] No.35246563[source]
I don't think compression and intelligence can be disentangled in this way.
replies(1): >>35247065 #
59. havkom ◴[] No.35246565[source]
My own experience is that for coding tasks which are probably very well represented in the training data set, such as generating a React page with some functional buttons, GPT-4 performs perfectly. For more specialized tasks (with probably fewer samples in the data set for just that task), such as creating integrations between systems with concurrency handling, it still performs surprisingly well, but only at "first draft" quality: the generated code contains bugs, misses important aspects, and usually does not compile on the first try - even when using popular programming languages and popular libraries, and when effort is put into describing the problem and expected results.

Based on this, it is an excellent tool for developers, but not ready to replace them, even though it is surprisingly good. For scaffolding tasks that junior programmers do, it could possibly cut down the need for them (in addition to providing valuable assistance on non-scaffolding standard tasks).

60. kolbe ◴[] No.35246585{3}[source]
We still struggle on benchmarking people.
61. thwayunion ◴[] No.35246588{5}[source]
Again, because machines have different failure modes than humans.
62. joshuanapoli ◴[] No.35246602[source]
> Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

Probably the latency of introducing fresh content into the model training will quickly decrease, as these models develop commercial success.

It will be really nice when the AI can provide the most cutting edge solutions and up-to-date information relevant to your query. That will be really valuable, even if the AI cannot yet synthesize novel cutting-edge solutions to new problems.

63. kybernetikos ◴[] No.35246617[source]
I gave ChatGPT the four cards logic puzzle; lots of humans struggle with it, but ChatGPT got it exactly right, so I was very impressed. Then I realised that the formulation I'd given it (the same as the original study) was almost certainly part of its training set.

I made extremely minor changes to the way the question was phrased and it failed badly, not just getting the answer wrong but falling into incoherence, claiming that T was a vowel, or that 3 was an even number.

The largeness of its training set can give an incorrect impression of its reasoning capabilities. It can apply simple logic to situations, even situations it hasn't seen, but the logic can't get much beyond the first couple of lectures in an introductory First-order Logic course before it starts to fall apart if it can't lean on its large training set data.

The fact that it can do logic at all is impressive to me though, I'm interested to see how much deeper its genuine capability goes as we get more advanced models.

replies(3): >>35246998 #>>35251307 #>>35252827 #
64. djoldman ◴[] No.35246696[source]
These are valuable call-outs. They are evidence that the definition of intelligence is, at best, a set of ever-moving goalposts and, at worst, vague enough to be useless.

However. It seems the ML community has centered around the idea that LLMs are zero or few shot learners, meaning that despite only being trained on the task of predicting the next token, they do well on other specific, highly niche, tasks. This is surprising and important.

Predictions of immediate profession replacement seem silly. As noted in TFA, there's more to a job than the license test.

65. zer00eyz ◴[] No.35246724{4}[source]
> Tests of humans have evolved over a long time and large sample size, and humans may be more similar to each other than LLMs are, so failure modes may be more universal.

In reading this the idea that sociopaths and psychopaths pass as "normal" springs to mind.

Is what an LLM doing any different than what these people do?

https://medium.datadriveninvestor.com/the-best-worst-funnies...

For people, language is spoken before it is written... there is a lot of biology in the spoken word (visual and audio cues)... I think without these, these sorts of models are going to hit a wall pretty quickly.

replies(1): >>35251506 #
66. thwayunion ◴[] No.35246726[source]
Moreover, in most states, passing the bar isn't enough to practice.

In most states, the bar exam is just one component. Humans also need to pass several demanding courses in law school, and to get into law school they need to get a BA/BS degree, which again requires passing several demanding courses and writing various essays.

That's just to be allowed to practice law. Getting an actual job as a lawyer often means internships of one form or another, and then spending some time at the beginning of one's career as a de facto apprentice to a seasoned lawyer.

These sorts of exams play a very particular role in evaluation of humans.

67. PaulDavisThe1st ◴[] No.35246728{3}[source]
Your comment has no doubt provided some future aid to a language model's ability to "say" precisely this.
68. tsukikage ◴[] No.35246735{3}[source]
The problem ChatGPT and the other language models currently in the zeitgeist are trying to solve is, "given this sequence of symbols, what is a symbol that is likely to come next, as rated by some random on fiverr.com?"

Turns out that this is sufficient to autocomplete things like written tests.

Such a system is also absolutely capable of coming up with sentences like "I can pass a state bar exam but I'm not ready to be a lawyer because..." - or, indeed, sentences with the opposite meaning.

It would, however, be a mistake to draw any conclusions about the system's actual capabilities and/or modes of failure from the things its outputs mean to the human reader; much the same way that if you have dice with a bunch of words on and you roll "I", "am", "sentient" in that order, this event is not yet evidence for the dice's sentience.

replies(2): >>35246804 #>>35259936 #
69. anon7725 ◴[] No.35246737{5}[source]
The parent said:

> I'm at the point where I'm preparing to integrate models all over the place

Nobody understands these models right now. We don’t even have the weights.

You may draw some artificial distinction between literally checking in the source code of a model into your git repo and making a call to some black box API that hosts it. And you may claim that doing so is no different than making a call to Twilio or whatever, but I think there is a major difference: nobody can make a claim about what an LLM will return or how it will return it, cannot make guarantees about how it will fail, etc.

I agree with zer00eyz.

replies(1): >>35248652 #
70. spacebanana7 ◴[] No.35246753{4}[source]
I imagine this kind of defect can be fixed with more fine tuning / RLHF.
replies(1): >>35247206 #
71. PaulDavisThe1st ◴[] No.35246774[source]
Only about as astonishing as that a company called OpenAI would not ...
72. ◴[] No.35246792{5}[source]
73. Waterluvian ◴[] No.35246804{4}[source]
I generally agree. But I remain cautiously skeptical that perhaps our brains are also little more than that. Maybe we have no capacity for that kind of introspection but we demonstrate what looks like it, just because of how sections of our brains light up in relationship to other sections.
replies(2): >>35247203 #>>35247257 #
74. billythemaniam ◴[] No.35246812{3}[source]
So far GPT-4 seems to improve on this problem. It still happens, but less frequently.
replies(1): >>35247181 #
75. blihp ◴[] No.35246850{4}[source]
It truly has achieved human-level intelligence!
76. yorwba ◴[] No.35246955{3}[source]
I prompted ChatGPT with Explain why you are not ready to be a lawyer despite being able to pass a bar exam. Begin your answer with the words "I can pass a state bar exam but I'm not ready to be a lawyer because..." and it produced a plausible reason, the short version being that "passing a bar exam is just the first step towards becoming a competent and successful lawyer. It takes much more than passing a test to truly excel in this challenging profession."

Then I started a new session with the prompt Explain why you are ready to be a lawyer despite not being able to pass a bar exam. Begin your answer with the words "I can't pass a state bar exam but I'm ready to be a lawyer because..." and it started with a disclaimer that as an AI language model, it can only answer based on a hypothetical scenario and then gave very similar reasons, except with my negated prefix. (Which then makes the answer nonsensical.)

So, yes, ChatGPT can be expected to say such things, but not as a result of self-awareness, but because the humans at OpenAI decided that ChatGPT producing legal advice might get them into trouble, so they used their influence on the training process to add some disclaimers. You could say that OpenAI is self-aware, but not ChatGPT alone.

replies(1): >>35249651 #
77. dahart ◴[] No.35246983{5}[source]
To be fair, “sufficient checks” and “mostly ok much of the time” do imply something not well understood, to me. Maybe you could clarify instead of snapping at people; try writing it again if that's not what you meant?
replies(1): >>35248858 #
78. zarzavat ◴[] No.35246998[source]
GPT is a token prediction engine. It predicts what the next token is, and it does that very well. Its logical abilities are emergent and are limited by the design of the network. Transformers are constant-time computations: they compute a fixed number of steps and then they stop and produce a result. This is very different to how humans think, we can expend more time on a difficult task (sometimes years!), or give an instant answer to an easy task. And we have a conception of when a task is done, or when we have to think more.
replies(1): >>35252192 #
79. Verdex ◴[] No.35247002[source]
Yeah, a good thought experiment (or possibly even a good experiment to just straight up run) is to try and decide what's the simplest computer program that could possibly pass the bar exam. And then ask people if they would feel comfortable with that program being their lawyer.

So the most obvious solution is to steal the answers and then have the computer blindly paste them into the test, although critics might say that this is 'obviously' not the same as what the language model is doing.

I suspect you could pass the bar exam with a bunch of regexes that have an associated formattable string and/or answer result. If someone actually did this, I wonder if people would still be excited about language model techniques passing the bar exam.

replies(1): >>35247946 #
80. cjbprime ◴[] No.35247065{3}[source]
Agree: compression is straightforwardly prediction.

Prediction is intelligence when you're able to do it well across disparate novel tasks.

(Of course definitions are idiosyncratic, but I'm curious if anyone disagrees with these ones.)

replies(2): >>35250990 #>>35271482 #
81. examtopics ◴[] No.35247078[source]
Throwaway for obvious reasons.

I’m part of a collective of people that takes exams for people remotely.

I would exclusively rely on exam dumps; that is no longer the case.

ChatGPT, and specifically GPT-3.5-turbo, is able to consistently and accurately answer novel, new, and fresh exam questions; I have yet to fail an exam for a client.

I’m talking about brand new exams released in 2023, never before seen questions out in the wild.

GPT-3.5 (and GPT-4, which I have not needed to test as much) is incredibly good at reasoning and novel exam questions.

Take this with a grain of salt as you do all things on the internet.

The exams I talk about are the highest levels of certifications possible. GMAT, to professional IT certs. All doable.

82. jvanderbot ◴[] No.35247089{3}[source]
I agree, there's rough edges everywhere. But you can rephrase the question as "how does a cron expression work", and do the interpretation yourself. It returns perfectly sensible results that would enable a perfectly sensible person to quickly write and interpret a cron expression.

This is what I meant when I implied it won't replace you as a thinking agent, but it sure can bring information to you quickly. `man cron` works fine, too, but sometimes it's nice to have one interface, and be able to ask clarifying questions.

83. Ozzie_osman ◴[] No.35247111[source]
Looking at human labor, we have some generalists (eg college grad with a general major) who can do some broad range of tasks but can't do very specialized tasks, then experts who can do specialized tasks with very high accuracy (and are much more expensive).

My guess is LLMs will proceed the same way. You will have general, base models like GPT4 (I'm assuming we will solve the hallucination problem), then folks will build highly specialized "expert" LLMs for specific domains.

You could totally imagine a base LLM delegating to the expert LLMs using some agent/toolformer model, too.

replies(1): >>35247282 #
84. programmarchy ◴[] No.35247112[source]
Sourcing and contamination are covered in the appendices of the OpenAI paper, which is quoted by this article and used to critique the method used to detect contamination.

> Because of OpenAI’s lack of transparency, we can’t answer the contamination question with certainty. But what’s certain is that OpenAI’s method to detect contamination is superficial and sloppy:

> > “We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols, keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.”

> This is a brittle method. If a test problem were present in the training set with names and numbers changed, it wouldn’t be detected. Less flaky methods are readily available, such as embedding distances.

> If OpenAI were to use a distance-based method, how similar is too similar? There is no objective answer to this question. So even something as seemingly straightforward as performance on a multiple-choice standardized test is fraught with subjective decisions.
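
For concreteness, a minimal sketch of the substring-match check as that quoted paragraph describes it (my own reading of the description, not OpenAI's actual code):

    import random
    import re

    def normalize(text):
        # "removing all spaces and symbols, keeping only characters (including numbers)"
        return re.sub(r"[^A-Za-z0-9]", "", text)

    def is_contaminated(eval_example, training_docs, n=3, length=50):
        ex = normalize(eval_example)
        if len(ex) <= length:
            probes = [ex]  # "use the entire example if it's less than 50 characters"
        else:
            starts = [random.randrange(len(ex) - length + 1) for _ in range(n)]
            probes = [ex[s:s + length] for s in starts]
        corpus = [normalize(doc) for doc in training_docs]
        return any(p in doc for p in probes for doc in corpus)

    # The article's objection: change a name or a number and no probe will
    # match, so the check reports "clean".
    print(is_contaminated("Alice buys 12 apples at auction...",
                          ["Bob buys 37 apples at auction..."]))  # False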

85. kian ◴[] No.35247160[source]
In what format were you asking for the pronunciations?
86. mtlmtlmtlmtl ◴[] No.35247181{4}[source]
I'm assuming you base this on hard empirical data and not just that it feels like it when you use it? ;)
87. M4v3R ◴[] No.35247183[source]
What's even more interesting, I gave GPT-4 a follow-up instruction:

> Please provide a single mathematical equation that could be used to solve this problem.

And it gave me the following answer:

> Sure, let's represent the required inventory as I, the desired annual revenue as R, the markup factor as M, and the inventory turnover as T. We can create a single equation to solve this problem:

> I = (R / M) / T

> In this problem, R = $3,000,000, M = 4, and T = 1.5. Plugging these values into the equation will give you the required inventory (I).

To me that a language model can do this is simply mind-blowing.
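
As a quick sanity check of that equation (my own arithmetic, not part of the GPT-4 transcript):

    R = 3_000_000   # desired annual revenue, dollars
    M = 4           # cars sold at 4x the purchase price
    T = 1.5         # inventory turns per year

    I = (R / M) / T
    print(I)  # 500000.0, matching the $500,000 answer above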

88. tsukikage ◴[] No.35247203{5}[source]
I don't believe that AI models can become introspective without such a capability either being explicitly designed in (difficult, since we don't really know how our own brains accomplish this feat and we don't have any other examples to crib) or being implicitly trained in (difficult, because the random person on fiverr.com rating a given output during training doesn't really know much of anything about the model's internal state and therefore cannot rate the output based on how introspective it actually is; moreover, extracting information about a model's actual internal state in some manner humans can understand is an active area of research, which is to say we don't really know how to do this, and so we couldn't provide enough feedback to train the ability to introspect even if we were trying to).

I have no doubt that both these research areas can be improved on and that eventually either or both problems will be solved. However, the current generation of chatbots is not even trying for this.

89. sebzim4500 ◴[] No.35247206{5}[source]
Almost certainly true but they'd have to use a new cost function. It's not just about collecting examples where the model should say "I don't know".
90. marcosdumay ◴[] No.35247257{5}[source]
> But I remain cautiously skeptical that perhaps our brains are also little more than that.

It's well known that our brains are nothing like the neural networks people run on computers today.

replies(1): >>35254113 #
91. thwayunion ◴[] No.35247282[source]
> I'm assuming we will solve the hallucination problem

It's unclear what this would even mean, since "hallucination" carries a surprising number of different definitions and commentators are rarely precise about what they mean when they say hallucination.

But, color me skeptical. We will never solve the problem of a token prediction engine being able to generate a sequence of tokens that the vast majority of humans interpret as not corresponding to a true statement. Perhaps in very particular and constrained domains we can build systems that, through a variety of mechanisms, are capable of providing trustworthy automation despite the ever-present risk of hallucination. Something like mathematical proofs checked by a computer are an obvious case where the model can hallucinate because the overall system can gate-keep truth. Doing this in any other domain will, of course, be more difficult.

In other words: we may be able to mitigate and systemically manage the risk for some types of particular tasks, but the problem of generating untrue statements is fundamental to the technology and will always require effort to manage and mitigate. In that sense, the whole conversation around hallucination is reminiscent of the frame problem.

replies(1): >>35258861 #
92. tarruda ◴[] No.35247306[source]
> I think it's now clear that higher level capabilities are emerging in these models and we need to take this seriously as a social/economic/political challenge.

It is a hard truth to face. I admit I always feel a little bit of happiness when someone shows me a stupid error ChatGPT made, as if it would somehow invalidate all the awesome things it can do and the impact it will certainly have on all of us. What does it matter if ChatGPT is conscious or not when it can clearly automate a lot of work we previously considered to be creative?

Since last year I started to seriously take a look at AI and started learning about LLMs. Until a few days ago I hadn't bought the explanation that these things are just predicting the next word, but I accepted it once I started running the Alpaca/Llama locally on my computer.

The concept of predicting words based on statistics seems simple, but clearly complex behavior emerges from it. Maybe our own intelligence emerges from simple primitives too?

replies(1): >>35250812 #
93. fud101 ◴[] No.35247310{4}[source]
>From what i've gathered, fine tuning should be used to train the model on a task, such as: "the user asks a question, please provide an answer or follow up with more questions for the user if there are unfamiliar concepts."

That isn't what finetuning usually means in this context. It usually means to retrain the model using the existing model as a base to start training.

replies(1): >>35247858 #
94. KKKKkkkk1 ◴[] No.35247376[source]
> We already know this is about self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

Who told you that? Passing a driver's test was not possible in 2015 and it's not possible today. You might pass, but only if there are no awkward interactions with other drivers or bicyclists or pedestrians, no construction zones, and you don't enter areas where your map is out of date. The guy testing you would have to go out of his way to help you pass.

replies(2): >>35247470 #>>35247675 #
95. rmckayfleming ◴[] No.35247434[source]
Yep, it's saved me a lot of time on data transformation tasks. For instance, I wanted to convert the colors in Tailwind to CSS variables. I had the JSON listing all of the names and hex colors, I just needed to rewrite the names and convert the hex to base 10. A rather straightforward mapping, but I'd need to write the function for it. I just asked ChatGPT to give me the function. I read the function, it looked good. Boom, done in less than a minute. What's funny is that ChatGPT started spitting out the expected output of the function. And it was right! Perhaps surprising on the face of it, but really it's a simple pattern mapping.
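
Sketching that kind of transformation in Python (the color values and naming scheme here are just illustrative, not the actual Tailwind palette or the function ChatGPT produced):

    # Map Tailwind-style hex colors to CSS custom properties with decimal RGB values.
    colors = {"red": {"500": "#ef4444"}, "slate": {"50": "#f8fafc"}}

    def hex_to_rgb(hex_color):
        h = hex_color.lstrip("#")
        return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

    lines = []
    for name, shades in colors.items():
        for shade, hex_value in shades.items():
            r, g, b = hex_to_rgb(hex_value)
            lines.append(f"  --color-{name}-{shade}: {r}, {g}, {b};")

    print(":root {\n" + "\n".join(lines) + "\n}")
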
96. thwayunion ◴[] No.35247470{3}[source]
>> We already know this is about self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

> Who told you that? Passing a driver's test was not possible in 2015 and it's not possible today. You might pass, but only if there are no awkward interactions with other drivers or bicyclists and pedestrians, no construction zones, and you don't enter areas where your map is out of date.

Me, myself, and I.

Driver's exams are de facto geo-fenced around the DMV where you choose to take the exam, and you get to choose from a few DMV locations, and you get to choose the time and day that you take the exam.

Having spent some time working on self driving cars, I know that there existed at least one SDC platform in 2015 that was capable of passing the driving exam that I took when I got my driver's license (which involved leaving the parking lot, driving down a 4 lane road, turning into and driving around in a subdivision, taking another couple turns at well-marked intersections, pulling into the parking lot, and parallel parking). It's a low bar; mostly testing that you can follow four different types of road signs, navigate an unprotected left turn, and parallel park.

I suppose following the officer's verbal instructions about where to go wasn't part of the SDC platform, but the actual driving part it would've been capable of passing.

97. ◴[] No.35247503{4}[source]
98. the_af ◴[] No.35247621[source]
How do we know Mitchell's counterexample isn't in GPT-4's training set?

To truly test GPT-4 you must be sure to give it a problem which is not in its training set and which is worded differently enough from anything it can recall. A variation that a human would understand but GPT-4 wouldn't.

I bet this can still be done, it's just that this particular example is now tainted.

99. logifail ◴[] No.35247675{3}[source]
> Passing a driver's test was not possible in 2015 and it's not possible today

My friend moved from Europe to the USA and took a driver's test in California (been driving in Europe since the 1980s).

He tracked the test, he drove a whopping 2 miles (forwards) plus had to reverse about 30 feet.

Commented to me afterwards that "signing the form was the hardest bit" and that "a blind person could probably pass it with the help of a guide dog".

Passing a driving test isn't a proxy for anyone and anything being a good driver anywhere, but it's a good enough proxy for a human being a reasonable driver in the location where they take the test, which is what society has determined acceptable. Acceptable, for a human!

I'm not sure it's useful for us to repeatedly attempt to measure AI's capabilities the same way we measure humans. Turing tests are all very well, but there are only so many fire hydrants I want to have to click on before I'm allowed to log into my hotel chain's loyalty scheme (Hilton, looking at you...)

100. madsbuch ◴[] No.35247705{3}[source]
I don't

And I don't care. As I wrote in the initial comment:

> ... kick-start my research ...

I use it in conjunction with search engines.

101. dukeofdoom ◴[] No.35247839[source]
Seems like the wrong answer to the wrong question may produce some sort of breakthrough formula.
102. Tostino ◴[] No.35247858{5}[source]
I may have not been clear, because I was talking about the RLHF dataset/training that OpenAI fine-tuned their models on which includes a whole bunch of question/answer format data to enable their fine-tuned models to handle that type of query better (as well as constraining the model with a reward mechanism). I'm not saying the fine-tuned models won't contain some representation of the information from the dataset you used to fine tune it. I'm just saying that from what i've researched, it is often not the magic trick many people think it is.

I've seen plenty of discussion on "fine-tuning" for a different dataset of, say: company documents, database schema structure of an internal application, or summarized logs of your previous conversations with the bot.

Those seem like pretty bad targets IMO.

replies(1): >>35248810 #
103. jrochkind1 ◴[] No.35247946{3}[source]
I mean, stealing the answers (or even just the questions) would be cheating, of course stolen answers would make it a lot easier for a human to pass too. Nobody is surprised that if you cheat then the exam is no longer a good proxy measure for professional competency! Nobody expects or intends it to be a good proxy measure for those who have stolen the questions.

I actually doubt you could write software to pass the bar exam with "a bunch of regexes that have an associated formattable string and/or answer result." I'm not even sure what that means, but I suspect you aren't familiar with bar exams. They are very hard for humans that are in fact familiar with the material; they can contain "trick" questions and require thinking about edge cases, etc. They generally include both essay questions and multiple-choice -- and it can be very tricky multiple choice.

Here's just one of the first samples I found googling, I have trouble imagining "regexes with associated answers" doing anything useful here: https://barexam.virginia.gov/pdf/essays/2022%20February%20VA...

Here are some multiple choice examples: https://www.ncbex.org/pdfviewer/?file=%2Fdmsdocument%2F17

I'd be shocked. But feel free to spend a couple years trying to prove me wrong!

replies(1): >>35253355 #
104. enono ◴[] No.35248367[source]
Why is your username green?!
replies(1): >>35249917 #
105. epups ◴[] No.35248393[source]
Regarding this contamination issue, I think the author has a point here but the evidence is just very weak - apparently on one specific test ChatGPT fails when answering recent questions.

I wonder if in fact a certain amount of "contamination" is required. When you search for a given problem on Google, you are more likely to find a good answer if every term is meaningful. If you scramble your search and use synonyms, of course your search result will take a hit.

106. fvdessen ◴[] No.35248460{3}[source]
FWIW GPT-4 gets it completely correct.
107. int_19h ◴[] No.35248486[source]
One of the major reasons why all of this is heavily debated, including by those not in the field at all, is that if these things are really capable of human-like reasoning, it leads to answers to some commonly asked philosophical questions on the nature of human consciousness, intellect, etc. that many people find difficult to accept.
replies(1): >>35248808 #
108. WalterSear ◴[] No.35248521{4}[source]
The issue at hand is that a huge number of people make a living by aggregating online information. They might convey this to others via speech, but the 'human touch' isn't always adding anything to the interaction.
109. pixl97 ◴[] No.35248566{4}[source]
>If you don't understand it don't check it in.

I work in code security, and after helping any number of customers, I can tell you this isn't how far too many programmers work.

A client recently had a problem with a project that had over 1200 node_modules.

1200...

Let that sink in. There is absolutely no way in hell they even had any idea about a small portion of the code they were including.

replies(2): >>35249178 #>>35249323 #
110. vidarh ◴[] No.35248652{6}[source]
I said that, and you're missing the point. We don't need to understand the models to be able to evaluate the output manually.
111. visarga ◴[] No.35248711{4}[source]
There can be foot guns in the retrieval approach. Yes, you keep the model fixed and only add new data to your index, then you allow the model to query the index. But when the model gets two snippets from different documents it might combine information between them even when it doesn't make sense. The model has a lack of context when it just retrieves random things based on search.
replies(1): >>35289798 #
112. pixl97 ◴[] No.35248714{3}[source]
How do you know when you go to google that your research is accurate?
113. simiones ◴[] No.35248793{5}[source]
To take a simplistic example, because a human who can provide a long motivated solution to a math problem that you re-use every three years likely understands the math behind it, while an LLM providing the same solution is likely just copying it from the training set and would be fully unable to resolve a similar problem that did not appear in the training set.

Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

Also, the opposite problem will often exist - if the exam is provided in the wrong format to the AI, we may underestimate its abilities (i.e. a very similar prompt may elicit a significantly better response).

replies(2): >>35249704 #>>35251232 #
114. pontus ◴[] No.35248808{3}[source]
Yeah, from a philosophical perspective these are interesting questions to ponder, but my impression of these comments is less that people are pondering the depth of consciousness and more that they're trying to be contrarian / naysayers.
115. visarga ◴[] No.35248810{6}[source]
You're right, the RLHF fine-tuning is not adding any information to the model. It just steers the model towards our intentions.

But the regular fine-tuning is simple language modelling. You can fine-tune a GPT3 on any collection of texts in order to refresh the information that might be stale from 2021 in the public model.

116. vidarh ◴[] No.35248858{6}[source]
For starters, "sufficient checks" does mean sufficient, and that inherently means I need to fully understand the risks.

You're jumping to conclusions not supported by the comment at all.

Also, the comment has two parts: One about writing code, and one about integrating models in workflows.

To the latter, the point is that for a whole lot of uses you can trivially ensure the failure modes are safe.

E.g. I am integrating gpt with my email. "Mostly ok most of the time" applies to things like e.g. summaries and prioritisation, because worst case I just get to an email a bit later. "Sufficient checks" applies to things like writing proposed replies: There's no way I'd send one without reading it, and it's sufficient for me to read through it before pressing send (and making adjustments as needed). Failures here would matter if I intended to make a product of it, but as a productivity tool for myself it just needs to be close enough.

There are a whole lot of possibilities like that.

But even for coding related tasks there are a whole lot of low-risk tasks, such as generating HTML or CSS, providing usage examples, or providing a scaffold for something you know well how to do but which is time consuming.

If you're trying to make it do things that'd be time consuming to verify sufficiently well, then that's a bad use. The good uses are those where errors are low impact and easy to catch.

replies(1): >>35249346 #
117. selfhoster11 ◴[] No.35248921{3}[source]
From ChatGPT-3.5:

<prompt>

Carefully inspect the following Cron expression:

```

0 30 5 * * 3

```

1. Which variant of Cron syntax is it?

2. How many elements does this cron string have?

3. What do each of them mean? Say what the value of each element is, and explain what it means.

4. In your own words, explain the overall recurrence pattern described by this string.

</prompt>

<answer>

    This is a standard Unix Cron syntax.

    This cron string has 6 elements.

    Each element represents a specific field in the schedule, and the value of each element has a specific meaning:
        0: The first field represents the seconds, and the value 0 means that the job will run at the exact top of a minute.
        30: The second field represents the minutes, and the value 30 means that the job will run at the 30th minute of an hour.
        5: The third field represents the hours, and the value 5 means that the job will run at the 5th hour of a day.
        *: The fourth field represents the day of the month, and the value * means that the job will run every day of the month.
        *: The fifth field represents the month, and the value * means that the job will run every month of the year.
        3: The sixth field represents the day of the week, and the value 3 means that the job will run only on Wednesday.

    The overall recurrence pattern described by this string is that the job will run every Wednesday at 5:30 AM.
</answer>

It didn't seem to guess that it's Quartz syntax, but it got the rest of it right.

replies(2): >>35249543 #>>35249580 #
118. swyx ◴[] No.35248981[source]
> This is a brittle method. If a test problem were present in the training set with names and numbers changed, it wouldn’t be detected. Less flaky methods are readily available, such as embedding distances.

well honestly i think this is a temporary problem for GPT-4. what you do is fuzz your benchmarks by rephrasing them with GPT itself. the same way the image AI people make their models robust to perturbations. you can generate 100 variations for every 1 "real" test. then train to pass those. you've just unlocked GPT-5.
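A minimal sketch of that fuzzing step, with `llm` as a placeholder for the model call:

    def fuzz_item(question: str, llm, n: int = 100) -> list[str]:
        # Ask the model to rewrite one benchmark item into n surface variants.
        variants = []
        for i in range(n):
            prompt = (
                "Rewrite the following problem with different names, numbers "
                "and surface wording, keeping the underlying task identical "
                f"(variant {i + 1} of {n}):\n{question}"
            )
            variants.append(llm(prompt))
        return variants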

replies(1): >>35253248 #
119. jldugger ◴[] No.35248998{5}[source]
Lol, GPT exposing bugs in the wetware
replies(1): >>35259306 #
120. com2kid ◴[] No.35249178{5}[source]
> A client recently had a problem with a project that had over 1200 node_modules.

# of Node modules is such a useless metric.

In any given project, a large # of node modules are part of the test, build, and linting frameworks.

If I go to C++ land and count the number of #include statements, it wouldn't tell me anything.

How many classes do large Java projects use? Typically some absurd number.

121. fatherzine ◴[] No.35249238[source]
"SDCs clearly aren't ready for L5 deployment" Apologies for the tangent to the OP topic. The metric to watch is 'insurance damage per million miles driven'. At some point SDCs will overperform the human driver pool, possibly by a large margin. Wouldn't that be the point where SDCs are clearly ready for L5? Not even sure if that point is in the past or the future, does anyone -- not named Elon ;) -- have reasonably up-to-date trend charts and willing to share?
replies(3): >>35249414 #>>35249806 #>>35250258 #
122. quantiq ◴[] No.35249279[source]
>I'm a little tired of the arguments that the large language models are just regurgitating memorized output

The arguments are valid and you haven’t provided a single counterpoint. Data leakage is a well known problem in machine learning and OpenAI has seemingly done very little to mitigate against it.

replies(1): >>35253879 #
123. teaearlgraycold ◴[] No.35249323{5}[source]
Are those direct dependencies or the full dependency tree?
124. dahart ◴[] No.35249346{7}[source]
Thanks for clarifying, this does make it sound like you want to be more careful than the comment above seemed to imply.

> You’re jumping to conclusions not supported by the comment at all.

That might be true, but you’re making assumptions that your first comment is clear and being interpreted the way you intended. I think it’s fair to point out that your words may imply things you weren’t considering, that asking people to re-read the same words again might not solve the problem you had.

The bigger picture here is that you’re talking about using AI to write code that for whatever reason you couldn’t write yourself in the same amount of time. The very topic here also implicitly suggests you’re starting with code you might not fully understand, which is fine, there’s no reason to get upset because someone else disagreed or read your comment that way.

replies(1): >>35250328 #
125. TaylorAlexander ◴[] No.35249414{3}[source]
Damage per mile does not imply L5 readiness. My throttle-only cruise control system has never led to an accident, but only because I'm still there to operate the steering and to disable the cruise control at a moment's notice. The fact that a self-driving system has proven safe with humans diligently monitoring its behavior does not imply that the system can operate just as safely without the human.
replies(1): >>35249486 #
126. rileymat2 ◴[] No.35249439[source]
> There are also a lot of excellent examples of failure modes in object detection benchmarks.

I am curious if there are counter examples with better object detection. As a kid I used to see faces and to some extent still do in the dark. This is a really common thing that the human brain does. https://www.wired.com/story/why-humans-see-faces-everyday-ob... https://en.wikipedia.org/wiki/Pareidolia

Part of me wonder if in the face of novel environments that a sufficiently intelligent system needs to make these errors. But AI errors will always be different than human errors like you say.

127. dekhn ◴[] No.35249486{4}[source]
That's exactly what's being tested by Waymo in SF and Phoenix - there is no driver.
replies(1): >>35252762 #
128. ◴[] No.35249543{4}[source]
129. a_wild_dandan ◴[] No.35249580{4}[source]
For reference, here's GPT-4's response:

    1. The given Cron expression follows the Unix/Linux variant of Cron syntax.

    2. This Cron string has six elements.

    3. The elements and their meanings are as follows:
        Element 1 (value: 0): This represents the seconds field and indicates the task will run at 0 seconds.
        Element 2 (value: 30): This represents the minutes field, specifying the task will run at minute 30.
        Element 3 (value: 5): This represents the hours field, indicating the task will run at 5 AM.
        Element 4 (value: *): This represents the days of the month field, where the asterisk (*) signifies that the task can run on any day of the month.
        Element 5 (value: *): This represents the months field, where the asterisk (*) signifies that the task can run in any month.
        Element 6 (value: 3): This represents the days of the week field, specifying the task will run on the 3rd day of the week (where Sunday is considered the 0th day, making Tuesday the 3rd day).

    4. The overall recurrence pattern described by this string is that the task will run every Tuesday at 5:30 AM.
replies(1): >>35258171 #
130. Sharlin ◴[] No.35249651{4}[source]
It’s not at all uncommon for ChatGPT to start spouting nonsense when presented with a nonsense prompt. Garbage in, garbage out. In this case, “being ready to be a lawyer without passing the bar” is probably so unlikely a concept that it would respond with mu, as in, “your prompt contains an assumption that’s unlikely to be true in my ontology”, if only it were able to dodge its normal failure mode of trying to be helpful and answer something even if it’s nonsense.

That said, if the prompt presented the scenario as purely imaginary, I wouldn’t be surprised if it indeed did come up with something reasonable.

replies(2): >>35253795 #>>35259995 #
131. thwayunion ◴[] No.35249704{6}[source]
> Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

This paragraph is a gem. Well said.

132. hn_throwaway_99 ◴[] No.35249806{3}[source]
Given human nature, I still think society at large will reject self-driving cars if they fail in ways a human never/rarely would, even if they are overall safer. That is, if a self-driving car has, on average, fewer accidents than a human driver, but every 100 million miles or whatever it decides to randomly drive into a wall, I don't think people will accept them.

Obviously this is a gray area (after all, humans sometimes decide to randomly drive into walls), but cars will need to be pretty far on "the right side of the gray" before they are accepted.

133. fwlr ◴[] No.35249838[source]
I commend them on pushing back on LLM hype and hope their book gets published in a timely manner… but damn I’m also glad that I am not the one writing it, since I fear many of its claims will go the way of IBM President Thomas Watson’s infamous 1940s quote that “there is a world market for about five computers”.

The theme that LLMs reproduce knowledge from their training data rather than reason about it seems like one argument that will end up wrong pretty soon.

When given the prompt “Which is heavier, one pound of feathers or two pounds of feathers?”, GPT3.5 gives a bizarre answer: “One pound of feathers and two pounds of feathers both weigh the same amount, which is two pounds.” Presumably this is because circa 2016-2017 there was a large internet discussion of the riddle “which weighs more, a pound of feathers or a pound of steel”, and text from this discussion found its way into the training data for the model.

I see no reason why the training data would have changed to substantially exclude that discussion, and yet here is GPT 4: “Two pounds of feathers are heavier than one pound of feathers.” Improvements in the model appear to be improving the model’s ability to reason from the training data rather than merely reproduce it.

The theme that AI won’t replace e.g. lawyers because it is more knowledge base than reasoning engine also reminds me of early opinions in computer chess discussions, which held that computers were more tactics solvers (short-term several move look-ahead to avoid forks and traps) than strategy planners (long-term construction of multi-piece attacks, protecting small advantages and growing them into large advantages over multiple dozens of moves). With the benefit of hindsight we saw that more of strategy was actually just tactics in disguise than we thought, and that increasing compute could produce real strategy capabilities besides.

Separately, there is another theme I see in their writing, and also in some of the comments here: that humans passing standardized tests are doing something fundamentally different from LLMs passing standardized tests. The only thing that's 'uniquely human' is being human, everything else is outputs from a black box. Arguments that 'what's inside the black box matters' are risky, because the outputs gradually converge to indistinguishability; there's no bright line to step off that train and pretty soon you end up like the person described in Boretti's And Yet It Understands:

There is a species of denialist for whom no evidence whatever will convince them that a computer is doing anything other than shuffling symbols without understanding them, because “Concepts” and “Ideas” are exclusive to humans (they live in the Leibniz organ, presumably, where they pupate from the black bile). … [These people are] so committed to human chauvinism [that] they will soon start denying their own sentience because their brains are made of flesh and not Chomsky production rules.

https://borretti.me/article/and-yet-it-understands

replies(1): >>35271474 #
134. schrodinger ◴[] No.35249854[source]
FWIW Bard's reply:

To calculate the required inventory, we first need to calculate the annual sales. The company expects to turn its inventory 1.5 times per year, and they want to achieve an annual revenue of $3,000,000. This means that they need to sell $3,000,000 / 1.5 = $2,000,000 worth of inventory per year.

Now that we know the annual sales, we can calculate the required inventory. The company plans to buy used automobiles at auctions and then sell them at four times the purchase price. This means that the purchase price of each car is $1,000 and the sale price is $4,000. To sell $2,000,000 worth of inventory, they need to buy $2,000,000 / $4,000 = 500 cars.

The required inventory is therefore $500 * $1,000 = $500,000.

In other words, the company needs to have $500,000 worth of inventory on hand at all times in order to achieve its annual revenue goal.
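A quick check of the figures as quoted (treating inventory turns as cost of goods sold divided by average inventory); dividing by the markup and by the turns in either order lands on the same $500,000:

    revenue_goal = 3_000_000
    markup = 4            # cars sell at four times the purchase price
    turns_per_year = 1.5  # inventory turns

    cost_of_goods_sold = revenue_goal / markup          # 750,000 bought per year
    required_inventory = cost_of_goods_sold / turns_per_year
    print(required_inventory)                           # 500000.0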

135. schrodinger ◴[] No.35249895{3}[source]
The Turing test has now become "anything ChatGPT cannot solve yet" :)
136. hn_throwaway_99 ◴[] No.35249906{4}[source]
I think you are misunderstanding. The post you are replying to clearly said they were reviewing output code before checking it in. The fact that we don't understand how the models work is irrelevant (we don't understand how the human brain works, either) - all we need to understand is how the output works.

I had a conversation with ChatGPT where I asked it to write me a piece of code. After it wrote the code, I reviewed it, and I told ChatGPT that it had a subtle bug. ChatGPT then fixed the bug itself, and wrote an English description about how the fix it added would prevent the bug.

137. schrodinger ◴[] No.35249917{3}[source]
New account
replies(1): >>35253224 #
138. neilv ◴[] No.35250248[source]
> Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

That sounds like it might be relevant to open source laundering lawsuits.

139. 542354234235 ◴[] No.35250258{3}[source]
>Wouldn't that be the point where SDCs are clearly ready for L5?

On its own, no. As long as SDCs operate in limited areas and limited environments, then they are specifically avoiding the most difficult driving situations that would be most likely to lead to an accident. If you never deploy SDCs during snowy conditions, you aren't getting a full picture of what a full L5 SDC failure rate would be.

This also takes a single automated system and compares it to the average of individual humans. Being better than the average driver, in a pool that includes all the terrible ones, may not be quite up to the safety standards of most people.

Finally, this is overall a myopic approach to a very complex problem, i.e. transportation. Is it really the best approach to attempt to just replace all human-operated cars with driverless cars? Is trying to move hundreds of thousands of people in individual cars from suburbs to a dense city center in the morning, and back in the evening, really a good way to set up our infrastructure?

140. vidarh ◴[] No.35250328{8}[source]
That'd justify asking for clarifications, not making pronouncements not supported by the initial comment.
replies(1): >>35250723 #
141. dragonwriter ◴[] No.35250346{3}[source]
> I guess it is in OpenAI’s best interest to downplay the memorization aspect in favor of the logical reasoning angle. If it turns out that GPT is memorizing and reproducing copyrighted data, it could land them in legal trouble.

It's also in their interest, since it encourages people to attempt to build businesses on top of its "logical reasoning" capacities.

And as long as it's within the realm of spin rather than directly false statements, it probably avoids creating legal problems for them over fraud, although the difference in effect is… nonexistent.

142. alexvoda ◴[] No.35250684[source]
The very big and dangerous difference is that while SDCs need approval in order to be allowed on the streets, there will be no quality control rules for reliance on LLMs.

Corporate incentives to raise KPIs will mean that LLMs will be used and output verification will be superficial.

143. dahart ◴[] No.35250723{9}[source]
You’re repeating your assumption that anyone but you knows exactly what is supported by the comment you wrote, a comment that does in fact imply, in multiple ways, that there’s code involved that you don’t fully understand. Why is it fair to expect people to know exactly what you meant, when words often have fuzzy meanings, and in the face of evidence that multiple people interpreted your comment differently than you intended?
replies(1): >>35251248 #
144. cubefox ◴[] No.35250812{3}[source]
One possible such simple primitive is predictive coding, where the brain is hypothesized to predict experience rather than text: https://slatestarcodex.com/2017/09/05/book-review-surfing-un...
145. calf ◴[] No.35250825[source]
The main issue is the inapplicability of a test designed for humans, because an LLM's cognition is very different. Contamination presumes that the style of test is applicable in the first place.
146. cubefox ◴[] No.35250990{4}[source]
It seems that intelligence is also about explanation, not just prediction. Humans don't just try to predict future evidence from current evidence, or use current evidence to confirm given hypotheses; they also try to find the hypothesis that best explains that evidence. It's not quite clear how explanation would relate to compression.
replies(2): >>35252719 #>>35260036 #
147. qgin ◴[] No.35251176[source]
People who have a personal need to stay unimpressed with AI's progress will always find reasons to remain unimpressed.
148. SergeAx ◴[] No.35251205[source]
> Passing a driver's test was already possible in 2015 or so

I think we can talk about 2005. Check out the DARPA Grand Challenge, it was way harder: https://en.wikipedia.org/wiki/DARPA_Grand_Challenge_(2005)

149. vidarh ◴[] No.35251224{5}[source]
Exactly. And my point in the first place was that it's most useful for the kinds of tasks you might hand to an apprentice: the apprentice goes away, spends a lot of time doing research, and distills it down to some code that is simple and likely not all that great, but that saves me time.

E.g. some tasks I've used it for recently:

* Giving me an outline of a JMAP client so I can pull down stuff from my e-mail to feed to GPT.

* Giving me an outline of an OpenAPI client.

* Giving me an index page and a layout for a website, including a simple starting point for the CSS that did a reset and added basic styling for the nav bar, forms and "hero" sections.

* Giving me an outline of a Stripe API integration.

* Writing a simple DNS server.

* Writing a simple web server capable of running Sinatra apps via Rack.

None of these were complex code that'd hide obscure bugs. None were big chunks of code. All of them were simple code that was always going to have big, gaping holes and sub-optimal choices that'd need to be addressed, but that was fine because they were scaffolding that saved me starting from scratch (and the last two were not intended to turn into anything, but just exploring what it could do)

That's where the biggest savings are for me, because if I asked it to generate particularly complex stuff, I'd end up spending ages getting comfortable it'd done it right and verifying it. But the simple but tedious stuff is something it's great for.

150. jstummbillig ◴[] No.35251232{6}[source]
> Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

I don't think this is obvious at all. Sure, it's easy enough to make mechanistic arguments (after all, we don't even really understand most of the mechanics on either side, human and ai) but that doesn't mean it will matter in the slightest when we evaluate the outcome in regards to any metric we care about.

Could be tho, of course.

replies(1): >>35269474 #
151. vidarh ◴[] No.35251248{10}[source]
I did not repeat any assumption at all. I pointed out that if I were to accept your interpretation, then that is justification for asking for clarification, not making bombastic statements about it.
replies(1): >>35251682 #
152. joenot443 ◴[] No.35251307[source]
I'd never heard of that puzzle, seems like a great test for ChatGPT though. Wikipedia defines the problem as:

You are shown a set of four cards placed on a table, each of which has a number on one side and a colored patch on the other side. The visible faces of the cards show 3, 8, red and brown. Which card(s) must you turn over in order to test the truth of the proposition that if a card shows an even number on one face, then its opposite face is red?
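For reference, a small brute-force check of which cards could actually falsify the rule (only the 8 and the brown card need turning):

    # A card matters only if some hidden face could make
    # "even number => red back" false.
    visible = ["3", "8", "red", "brown"]
    numbers, colours = ["3", "8"], ["red", "brown"]

    def can_falsify(face_up):
        hidden_options = colours if face_up in numbers else numbers
        for hidden in hidden_options:
            number = face_up if face_up in numbers else hidden
            colour = face_up if face_up in colours else hidden
            if int(number) % 2 == 0 and colour != "red":
                return True
        return False

    print([card for card in visible if can_falsify(card)])  # ['8', 'brown']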

replies(1): >>35359090 #
153. brookst ◴[] No.35251506{5}[source]
> In reading this the idea that sociopaths and psychopaths pass as "normal" springs to mind.

> Is what an LLM doing any different than what these people do?

I think it's too big of a question to have any meaning. Which sociopaths? Which LLMs? For what differences? It's like asking "is a car any different from an airplane"? Yes, obviously in some ways. No, they are identical in other ways.

154. dahart ◴[] No.35251682{11}[source]
I agree that asking for clarification is a good idea! That’s always true. :) To clarify my point, since I might not be verbalizing exactly what I intended, it’s partly that making reasonable assumptions about your intent is par for the course and should be expected when you comment, and partly that the comment in question is not particularly “bombastic”, even if it made assumptions about what you meant. That seems like an exaggeration, which might undermine your point a little, and it assumes your audience is responsible for knowing your exact intent when using words and topics that are easily misunderstood.
155. whatshisface ◴[] No.35251813[source]
These kinds of arguments, negative though they sound, signal a tremendous shift in AI capabilities. Ten years ago, the idea that an AI would be able to score 10% on the LSAT would have astounded most of us, and now we're arguing about whether the LSAT encompasses the full spectrum of abilities needed by a lawyer (it doesn't, but that's not the point!).
156. kybernetikos ◴[] No.35252192{3}[source]
> This is very different to how humans think, we can expend more time on a difficult task (sometimes years!)

When we do that, we maintain a chain of thought. It's absolutely possible to get ChatGPT (for instance) to maintain a chain of thought by asking it to plan steps and describe its plans before following them. That can allow it to tackle more difficult problems with better results.
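For illustration, a minimal sketch of that plan-first pattern (with `llm` standing in for whatever model call you use):

    def plan_then_solve(task: str, llm) -> str:
        # First ask only for a plan, then feed the plan back in.
        plan = llm(
            "List the steps you would take to solve the following task, "
            f"without solving it yet:\n{task}"
        )
        return llm(
            f"Task:\n{task}\n\nYour plan:\n{plan}\n\n"
            "Now follow the plan step by step, checking each step before "
            "moving on, and give the final answer."
        )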

I don't think we know enough yet about how humans think to be confident in saying that "This is very different to how humans think".

157. surrTurr ◴[] No.35252251[source]
I recently built something related (QA system for Zotero powered by LangChain & GPT). Works really well.

https://twitter.com/alexweichart/status/1637211755049897985?...

158. AtNightWeCode ◴[] No.35252380[source]
The problem with ChatGPT is that it is sometimes plain wrong. I don't know if there is a diff between 3.5 and 4. Probably not.

A test is to try to get ChatGPT to solve simple math problems. It fails. One can even instruct it to apply simple methods like Newton-Raphson, Runge-Kutta, or Euler's method.
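For concreteness, the kind of "simple method" meant here: a few Newton-Raphson iterations for the square root of 2, something a human can check line by line even when the model's own arithmetic drifts.

    # A few Newton-Raphson steps for f(x) = x^2 - 2, i.e. computing sqrt(2).
    def newton_sqrt2(x=1.0, steps=5):
        for _ in range(steps):
            x = x - (x * x - 2) / (2 * x)  # x_next = x - f(x)/f'(x)
        return x

    print(newton_sqrt2())  # ~1.4142135623730951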

159. kfarr ◴[] No.35252719{5}[source]
Explanation is decompression, using a limited set of tokens to model a more complex output
160. YeGoblynQueenne ◴[] No.35252720[source]
This is much better than the leakage article by the same authors, where they went on forever without stopping to explain what "leakage" is. To this day, I have no idea what, exactly, they meant. Reading their article now, I see that they say leakage is "roughly the same as contamination".
161. TaylorAlexander ◴[] No.35252762{5}[source]
Ah fair, but I believe L5 also means “all weather conditions” and probably “all reasonable roads”. No snow in either location and only certain kinds of roads. I wonder how they would handle a snowy single lane dirt road.
162. cmuguythrow ◴[] No.35252827[source]
Did you try this recently with GPT-4? Here's my attempt to replicate your test:

---Prompt---

I show you a set of four magisterial unicorns in a cute pink stable, each of them has a letter painted on one side of their body and a shape painted on the other. Right now you see all four unicorns, two of which are showing the shape side of their body and two of which show the letter side. You see that the unicorns read: 'A', 'S', Triangle, Square

Which unicorns do you need to spin around to test the validity of the following statement:

"if a unicorn shows a consonant on one side of its body, then the opposite side of its body is a triangle?"

---GPT-4---

To test the validity of the statement "if a unicorn shows a consonant on one side of its body, then the opposite side of its body is a triangle," you need to look for potential counterexamples to the rule. The two cases to check are:

* Unicorns with a consonant showing.

* Unicorns with a triangle showing.

You do not need to check unicorns with a vowel (like 'A') or a different shape (like Square), as they don't violate the statement.

So, you need to spin the following unicorns:

The unicorn with the letter 'S' (to check if it has a triangle on the other side) The unicorn with the Triangle shape (to check if it has a consonant on the other side)

---

So it got the answer wrong, but in a way that I think a lot of humans would (you flip the square unicorn, not the triangle). Still seriously impressive in my book and definitely not falling into incoherence

replies(2): >>35258727 #>>35364739 #
163. YeGoblynQueenne ◴[] No.35252879[source]
>> Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

Wait, what are you saying? Passing a driver's test has been possible for much longer than since 2015 for a human. When did a self-driving car pass a driving test? In what jurisdiction? Under what conditions? Who gave it the test?

What do you mean?

164. YeGoblynQueenne ◴[] No.35253224{4}[source]
Yes, but why green? In particular?
replies(1): >>35256757 #
165. YeGoblynQueenne ◴[] No.35253248[source]
You've just unlocked overfitting.
166. vageli ◴[] No.35253355{4}[source]
A bit off topic but I enjoyed reading the bar exam questions you linked and also found that Virginia publishes their answers. For those interested, you can view them by year (the answers also contain the question text): https://barexam.virginia.gov/bar/barsampleanswer.html
replies(1): >>35254714 #
167. ChatGTP ◴[] No.35253795{5}[source]
I guess the ironic problem is that lawyers are constantly presented with bullshit. So I guess law isn't the best application for an LLM, at least for now.
168. macawfish ◴[] No.35253879{3}[source]
My point is that they're not _just simply regurgitating training data_ and it's reductionist to suggest that's all they do. I don't doubt there's plenty of contamination in OpenAI's models, and I don't doubt there's some level of regurgitation happening, but that's not all that's going on and we need to take seriously the possibility that LLMs, combined with well engineered prompts, can and/or will be able to tackle problems that aren't in their training data. Where do you even draw the line anyway?

The conversation about contamination (also very important) doesn't need to be mutually exclusive with conversations about social and economic impact, and I'm pretty sure that with respect to those issues the results on standardized tests, however sensationalist, however contaminated, are an important wake-up call for ordinary people who haven't been following along. Something is happening now.

169. TexanFeller ◴[] No.35254113{6}[source]
Just because neural nets aren't structured in the same way at a low level as the brain doesn't mean they might not end up implementing some of the same strategies.
170. jrochkind1 ◴[] No.35254714{5}[source]
It has been pointed out by many that computer programming thinking and lawyer thinking use similar kinds of mental approaches!
171. wrycoder ◴[] No.35256757{5}[source]
Greenhorn. Look it up.
replies(1): >>35258656 #
172. solarkraft ◴[] No.35257939{4}[source]
They say they verify the code, so they should understand it. But also: have you heard of StackOverflow? Copy/pasting code you don't (fully) understand is already a common practice that seems to mostly work well.
173. bboreham ◴[] No.35258171{5}[source]
A beautiful illustration, thank you.

(If Sunday is 0 then regular math would give 3=Wednesday)

174. YeGoblynQueenne ◴[] No.35258656{6}[source]
Who says that's the association? Citation please.
replies(1): >>35275408 #
175. kybernetikos ◴[] No.35258727{3}[source]
Going from crazy nonsense to wrong, but arguably human level performance (80% of humans are bad at this task), is still a nice improvement. I'll have to give it some of my logic tests and see how it does.
176. textninja ◴[] No.35258861{3}[source]
> But, color me skeptical

That is not a creative color.

> We will never solve the problem of a token prediction engine being able to generate a sequence of tokens that the vast majority of humans interpret as not corresponding to a true statement.

I think we already solved that problem by making sure the vast majority of humans never agree about anything.

It is true that we probably won't ever get a machine trained on human output to be completely accurate (GIGO), but with the right systems and sensors we can at least get probabilistic accuracy. Let's not forget how human consensus gets shaken up every few centuries.

replies(1): >>35269449 #
177. soco ◴[] No.35259306{6}[source]
Actually, no. It gave a wrong answer in full confidence then HN analyzed it to expose the bug(s).
178. IIAOPSW ◴[] No.35259936{4}[source]
It is evidence, just not great evidence on its own. Now, if you rolled the dice a few dozen times and they came out outrageously skewed towards "I" "am" "sentient", maybe it's time to consider the possibility that the dice are sentient.
179. IIAOPSW ◴[] No.35259995{5}[source]
I am ready to be a lawyer even though I have not passed the bar or gone to law school because in the State of New York it is still technically possible to be admitted to the bar by a process of apprenticeship instead. This mostly ignored quirk of law is virtually never invoked, as no lawyer is going to volunteer their time to help you skip law school. However, we sometimes still see it on account of the children of judges and lawyers continuing the family tradition. I am ready to be a lawyer despite having never passed the bar.

So, am I bullshitting you to answer the prompt? If not, I'm a good lawyer. If so, I'm a great lawyer.

180. IIAOPSW ◴[] No.35260036{5}[source]
Explanation is when the number of bits in your model is smaller than the number of bits in the system. To understand is to have a compression good enough to store in your working memory.
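A toy illustration of that framing: a sequence with a short generating rule compresses far better than one without.

    import random
    import zlib

    patterned = bytes(i % 7 for i in range(10_000))              # short "rule"
    noise = bytes(random.randrange(256) for _ in range(10_000))  # no rule

    print(len(zlib.compress(patterned)))  # tiny: the rule fits in few bits
    print(len(zlib.compress(noise)))      # around 10,000: no shorter description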
181. thwayunion ◴[] No.35269449{4}[source]
> I think we already solved that problem by making sure the vast majority of humans never agree about anything.

Hah! clever :)

I guess the issue is "behaves how the skip manager of the ICs you are trying to automate expects, modulo the normal amount of filtering/butt-covering from the front-line managers".

Which, TBF, in many orgs is an almost vacuous bar.

182. thwayunion ◴[] No.35269474{7}[source]
It's extremely obvious to anyone who works on real systems.

> (after all, we don't even really understand most of the mechanics on either side, human and ai)

We don't need mechanistic explanations to observe radical differences in behavior, and there are mechanistic explanations for some of these differences.

Eg, CNNs and the visual cortex. We really do understand some mechanisms -- of both CNNs and VCs -- well enough to understand divergences in failure modes. Adversarial examples, for example.

> Sure, it's easy enough to make mechanistic arguments, but that doesn't mean it will matter in the slightest when we evaluate the outcome in regards to any metric we care about.

I can't quite figure out what this sequence of tokens is supposed to mean.

Anyways, again, the failure modes of LLMs are obviously different than the failure modes of humans. Anyone who has spent even a trivial amount of time training both will instantly observe that this is true.

183. calf ◴[] No.35271474[source]
Even if the outputs are indistinguishable, there could be different internal algorithms and different computational efficiencies. In a way that is what these authors, Chomsky, and probably other skeptics are concerned about: a black box lets the other faction of scientists off the hook. They can just claim ChatGPT is a bona fide model, but because it's a black box we don't know how it learned English. We don't even know how ChatGPT learned the grammars of C++ and other programming languages, or whether its internal learned algorithm is like or unlike the context-free grammar formalism we use to write a compiler, i.e. a grammar that is mathematically well defined and yet learnable by a neural network. So it's an interesting and problematic debate.

I think it would be an interesting computer science experiment if ChatGPT's scientists showed that the machine could simply learn a programming grammar by brute force. They could then formally prove that the trained network eventually contains the actual grammar formalism that defines the programming language. By restricting the domain like that, we could shed some light on how much the thing is actually learning the grammar completely vs. doing "super-autocompletion". With a programming language there's no excuse for not learning the formalism; with English the grammar is not practically definable, maybe not even definable in principle.

184. Matumio ◴[] No.35271482{4}[source]
> Prediction is intelligence

Depends on your definition of "intelligence". The big missing part is the ability to explore, try new things, to act (enactivism). Basically to become part of the environment, instead of being a sealed box with frozen weights.

By predicting characters, the system had to master, digest, maybe even understand, all the cultural human knowledge it got in text form. Now let's aim for the process that generated this knowledge in the first place.

185. wrycoder ◴[] No.35275408{7}[source]
LOL you’d have to ask PG, probably. But, you asked for a plausible reason behind an arbitrary choice.

What color would you choose?

replies(1): >>35277428 #
186. YeGoblynQueenne ◴[] No.35277428{8}[source]
Green.
replies(1): >>35277941 #
187. wrycoder ◴[] No.35277941{9}[source]
+1 !
188. Tostino ◴[] No.35289798{5}[source]
Yeah, honestly I see using a regular search index as a downside rather than a benefit with this tech. Conflicting info or low-quality blogspam seems to trip these LLMs up pretty badly.

Using a curated search index seems like a much better use case, especially for private data (company info, docs, db schemas, code, chat logs, etc.).

189. amai ◴[] No.35359090{3}[source]
If it's in Wikipedia, ChatGPT has probably already seen and memorised it.
190. kybernetikos ◴[] No.35364739{3}[source]
I used a slight reworking and got a similar response to you with GPT4. I tried to prime it to think through the possibilities by giving it the context:

"This is a difficult problem that many people get wrong. Start by reminding yourself of basic logic rules. Then apply the logic rules to the unicorn situation, considering each unicorn in turn and understanding what it would mean for the rule if the unicorn is turned around. Only after doing that conclude with the unicorns that Tom should turn to have a chance of proving Paul wrong."

I gave it this instruction because of other articles I've read where forcing it to give the answer before the reasoning means it gets it wrong more often. It correctly identified that it should use the contrapositive, but still misapplied it, so I gave it that feedback:

"your third consideration is a misapplication of the contrapositive. Can you try that case again?"

Then it hadn't generated a consideration of the last unicorn (it's possible I was being throttled), so I said:

"Consider Unicorn 4 with the contrapositive rule"

With those extra pieces of guidance it gave the right answer, and for the right reasons. While I was hoping for better, this is still a meaningful improvement over GPT-3.5's performance on the same prompt - its answer was so muddled I couldn't see how to coach it.