340 points agomez314 | 52 comments
1. jvanderbot ◴[] No.35245898[source]
Memorization is absolutely the most valuable part of GPT, for me. I can get natural language responses to documentation, basic scripting / sysadmin, and API questions much more easily than searching other ways.

While this is an academic point of interest, and rightly tamps down the hype around replacing humans, it doesn't undermine what I think is most people's basic use case: "I don't know or don't remember how to do X, can you show me?"

This is finally a good enough "knowledge reference engine" that I can see it being useful to the very people it is overhyped to replace.

replies(6): >>35245958 #>>35245959 #>>35245985 #>>35246065 #>>35246167 #>>35252251 #
2. vidarh ◴[] No.35245958[source]
And asking higher level questions than what you'd otherwise look up. E.g. I've had ChatGPT write forms, write API calls, put together skeletons for all kinds of things that I can easily verify and fix when it gets details wrong, but that are time-consuming to do manually. I've held back and been sceptical but I'm at the point where I'm preparing to integrate models all over the place because there are plenty of places where you can add sufficient checks that doing mostly ok much of the time is sufficient to already provide substantial time savings.
replies(1): >>35246018 #
3. User23 ◴[] No.35245959[source]
The main problem with using GPT-3 (and maybe 4, I dunno) in that way is that it will happily bullshit you to the point of making up fake references. For example, it quoted me "Section 6.2" of the Go Programming Language Specification to support its answer when I asked it how a particular conversion is specified.
replies(2): >>35246187 #>>35246812 #
4. rubendv ◴[] No.35245985[source]
I guess it is in OpenAI's best interest to downplay the memorization aspect in favor of the logical reasoning angle. If it turns out that GPT is memorizing and reproducing copyrighted data, it could land them in legal trouble.
replies(2): >>35246113 #>>35250346 #
5. zer00eyz ◴[] No.35246018[source]
> I've held back and been sceptical but I'm at the point where I'm preparing to integrate models all over the place because there are plenty of places where you can add sufficient checks that doing mostly ok much of the time is sufficient to already provide substantial time savings.

I'm an old engineer.

Simply put: NO.

If you don't understand it don't check it in. You are just getting code to cut and paste at a higher frequency and volume. At some point in time the fire will be burning around you and you won't have the tools to deal with it.

Nothing about "mostly", "much" and "sufficient" ever ends well when it's done in the name of saving time.

replies(7): >>35246026 #>>35246079 #>>35246149 #>>35246308 #>>35248566 #>>35249906 #>>35257939 #
6. vidarh ◴[] No.35246026{3}[source]
Nobody suggested checking in anything you don't understand. On the contrary. So maybe try reading again.
replies(4): >>35246280 #>>35246737 #>>35246792 #>>35246983 #
7. petesergeant ◴[] No.35246065[source]
I dunno, I use ChatGPT for exactly the same thing as you, and people are always quite surprised when I say that's its main value to me, so I think people have very different ideas of what it excels at
8. simonw ◴[] No.35246079{3}[source]
"You are just getting code to cut and paste at a higher frequency and volume" genuinely sounds like the key value proposition of ChatGPT for coding to me.

I treat its output like I would treat a PR from a brand new apprentice engineer on my team: review it carefully, provide some feedback and iterate a few times, accept with tests.

replies(1): >>35251224 #
9. brookst ◴[] No.35246113[source]
On the bright side it would mean they have invented an amazing compression algorithm, given the model size and amount of text it can produce.
10. Karunamon ◴[] No.35246149{3}[source]
Nobody said one word about checking in something they don't understand. That applies to copying from stackoverflow as much as it does from an LLM or copilot.
11. soco ◴[] No.35246167[source]
Good luck getting ChatGPT to explain a cron expression like "0 30 5 * * 3". I mean, it will explain it, but it mixes everything up. How many other mistakes might it make?
replies(7): >>35246247 #>>35246250 #>>35246277 #>>35246542 #>>35247089 #>>35248460 #>>35248921 #
12. dpkirchner ◴[] No.35246187[source]
Do you remember the prompt (it should be in the history column)? I'm curious how it came up with the section numbering.
13. nielsole ◴[] No.35246247[source]
[...]

> So the cron expression `0 30 5 * * 3` means "run the cron job every Wednesday at 5:30 AM".

It explains the five Cron parameters but doesn't pick up that six were provided. Oh well

replies(1): >>35246268 #
14. messe ◴[] No.35246250[source]
Isn't that an invalid cron expression? It has six fields instead of five, and says to run at the 30th hour of the day.
replies(1): >>35246476 #
15. messe ◴[] No.35246268{3}[source]
I got a similar but incorrect result from ChatGPT: "So, the expression "0 30 5 * * 3" means that the command should be executed at 5:30 AM on the 5th day of every month, but only if that day is a Wednesday"

However, the crontab is invalid. It has six fields instead of five, and the 30 corresponds to the hour field.

"30 5 * * 3" without the leading zero, would correspond to run every Wednesday at 5:30 AM though. I suspect the fact that it has six fields instead of five is confusing it, and it's interpreting the 5 as a day of the month and as the hour

16. meh8881 ◴[] No.35246277[source]
I don’t know what that means. This is what GPT says. Is it correct?

> the cron expression "0 30 5 * * 3" means that the job will run at 5:30 AM on the 5th day of every month and on Wednesdays.

17. ◴[] No.35246280{4}[source]
18. poniko ◴[] No.35246308{3}[source]
Isn't that what we all have been doing with google/stackoverflow .. how do I solve xx? Aha seems right, copy, paste and a quick format.. cross fingers and run.
19. soco ◴[] No.35246476{3}[source]
It's a valid Quartz scheduler expression, where the first field is the seconds. But right, I shouldn't have asked ChatGPT about cron when it's actually Quartz - and indeed with the changed prompt it describes it correctly.

Edit: actually almost correctly: " - '3' represents the day of the week on which the task will be executed. In this case, the value is 3, which represents Wednesday (Sunday is 1, Monday is 2, and so on)."
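
To spell out that off-by-one (a quick sketch, nothing authoritative; Quartz counts Sunday as 1, standard Unix cron counts Sunday as 0):

    # Day-of-week off-by-one between Quartz and Unix cron.
    days = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]
    dow = 3
    print("Quartz (Sunday = 1):", days[dow - 1])  # Tuesday
    print("Unix cron (Sunday = 0):", days[dow])   # Wednesday

So under Quartz this expression actually fires on Tuesdays at 5:30 AM, not Wednesdays.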

replies(1): >>35248998 #
20. raattgift ◴[] No.35246542[source]
Probably the most fundamental problem of ChatGPT is that it refuses (or is even unable) to admit, "I don't know".
replies(3): >>35246753 #>>35246850 #>>35247503 #
21. anon7725 ◴[] No.35246737{4}[source]
The parent said:

> I'm at the point where I'm preparing to integrate models all over the place

Nobody understands these models right now. We don’t even have the weights.

You may draw some artificial distinction between literally checking in the source code of a model into your git repo and making a call to some black box API that hosts it. And you may claim that doing so is no different than making a call to Twilio or whatever, but I think there is a major difference: nobody can make a claim about what an LLM will return or how it will return it, or make guarantees about how it will fail, etc.

I agree with zer00eyz.

replies(1): >>35248652 #
22. spacebanana7 ◴[] No.35246753{3}[source]
I imagine this kind of defect can be fixed with more fine-tuning / RLHF
replies(1): >>35247206 #
23. ◴[] No.35246792{4}[source]
24. billythemaniam ◴[] No.35246812[source]
So far GPT-4 seems to improve on this problem. It still happens, but less frequently.
replies(1): >>35247181 #
25. blihp ◴[] No.35246850{3}[source]
It truly has achieved human-level intelligence!
26. dahart ◴[] No.35246983{4}[source]
To be fair, "sufficient checks" and "mostly ok much of the time" do imply something not well understood, to me. Maybe you could clarify instead of snapping at people; try writing again, if that's not what you meant?
replies(1): >>35248858 #
27. jvanderbot ◴[] No.35247089[source]
I agree, there's rough edges everywhere. But you can rephrase the question as "how does a cron expression work", and do the interpretation yourself. It returns perfectly sensible results that would enable a perfectly sensible person to quickly write and interpret a cron expression.

This is what I meant when I implied it won't replace you as a thinking agent, but it sure can bring information to you quickly. `man cron` works fine, too, but sometimes it's nice to have one interface, and be able to ask clarifying questions.

28. mtlmtlmtlmtl ◴[] No.35247181{3}[source]
I'm assuming you base this on hard empirical data and not just that it feels like it when you use it? ;)
29. sebzim4500 ◴[] No.35247206{4}[source]
Almost certainly true but they'd have to use a new cost function. It's not just about collecting examples where the model should say "I don't know".
30. ◴[] No.35247503{3}[source]
31. fvdessen ◴[] No.35248460[source]
FWIW GPT-4 gets it completely correct.
32. pixl97 ◴[] No.35248566{3}[source]
>If you don't understand it don't check it in.

I work in code security, and after helping any number of customers, I can tell you that far too many programmers don't work that way.

A client recently had a problem with a project that had over 1200 node_modules.

1200...

Let that sink in. There is absolutely no way in hell they had any real idea about more than a small portion of the code they were including.

replies(2): >>35249178 #>>35249323 #
33. vidarh ◴[] No.35248652{5}[source]
I said that, and you're missing the point. We don't need to understand the models to be able to evaluate the output manually.
34. vidarh ◴[] No.35248858{5}[source]
For starters, "sufficient checks" does mean sufficient, and that inherently means I need to fully understand the risks.

You're jumping to conclusions not supported by the comment at all.

Also, the comment has two parts: One about writing code, and one about integrating models in workflows.

To the latter, the point is that for a whole lot of uses you can trivially ensure the failure modes are safe.

E.g. I am integrating gpt with my email. "Mostly ok most of the time" applies to things like e.g. summaries and prioritisation, because worst case I just get to an email a bit later. "Sufficient checks" applies to things like writing proposed replies: There's no way I'd send one without reading it, and it's sufficient for me to read through it before pressing send (and making adjustments as needed). Failures here would matter if I intended to make a product of it, but as a productivity tool for myself it just needs to be close enough.

There are a whole lot of possibilities like that.

But even for coding-related tasks there are a whole lot of low-risk tasks, such as generating HTML or CSS, providing usage examples, or providing a scaffold for something you know well how to do but which is time-consuming.

If you're trying to make it do things that'd be time consuming to verify sufficiently well, then that's a bad use. The good uses are those where errors are low impact and easy to catch.
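
As a toy illustration of what I mean by a safe failure mode (draft_reply below is a stand-in for whatever model call you'd use, not a real API):

    # Toy sketch: the model only ever produces a draft, and nothing
    # is sent until a human reads and explicitly approves it.
    def draft_reply(email_body: str) -> str:
        # Placeholder for the actual model call.
        return "Thanks for the update. I'll get back to you this week."

    def reply_with_review(email_body: str, send) -> None:
        draft = draft_reply(email_body)
        print(draft)
        if input("Send this? [y/N] ").strip().lower() == "y":
            send(draft)
        # Worst case of a bad draft: I edit it or throw it away.

    reply_with_review("Can we move the call to Thursday?", send=print)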

replies(1): >>35249346 #
35. selfhoster11 ◴[] No.35248921[source]
From ChatGPT-3.5:

<prompt>

Carefully inspect the following Cron expression:

```

0 30 5 * * 3

```

1. Which variant of Cron syntax is it?

2. How many elements does this cron string have?

3. What do each of them mean? Say what the value of each element is, and explain what it means.

4. In your own words, explain the overall recurrence pattern described by this string.

</prompt>

<answer>

    This is a standard Unix Cron syntax.

    This cron string has 6 elements.

    Each element represents a specific field in the schedule, and the value of each element has a specific meaning:
        0: The first field represents the seconds, and the value 0 means that the job will run at the exact top of a minute.
        30: The second field represents the minutes, and the value 30 means that the job will run at the 30th minute of an hour.
        5: The third field represents the hours, and the value 5 means that the job will run at the 5th hour of a day.
        *: The fourth field represents the day of the month, and the value * means that the job will run every day of the month.
        *: The fifth field represents the month, and the value * means that the job will run every month of the year.
        3: The sixth field represents the day of the week, and the value 3 means that the job will run only on Wednesday.

    The overall recurrence pattern described by this string is that the job will run every Wednesday at 5:30 AM.
</answer>

It didn't seem to guess that it's Quartz syntax, but it got the rest of it right.

replies(2): >>35249543 #>>35249580 #
36. jldugger ◴[] No.35248998{4}[source]
Lol, GPT exposing bugs in the wetware
replies(1): >>35259306 #
37. com2kid ◴[] No.35249178{4}[source]
> A client recently had a problem with a project that had over 1200 node_modules.

# of Node modules is such a useless metric.

In any given project, a large # of node modules are part of the test, build, and linting frameworks.

If I went to C++ land and counted the number of #include statements, it wouldn't tell me anything.

How many classes do large Java projects use? Typically some absurd number.

38. teaearlgraycold ◴[] No.35249323{4}[source]
Are those direct dependencies or the full dependency tree?
39. dahart ◴[] No.35249346{6}[source]
Thanks for clarifying, this does make it sound like you want to be more careful than the comment above seemed to imply.

> You’re jumping to conclusions not supported by the comment at all.

That might be true, but you’re making assumptions that your first comment is clear and being interpreted the way you intended. I think it’s fair to point out that your words may imply things you weren’t considering, that asking people to re-read the same words again might not solve the problem you had.

The bigger picture here is that you're talking about using AI to write code that for whatever reason you couldn't write yourself in the same amount of time. The very topic here also implicitly suggests you're starting with code you might not fully understand, which is fine; there's no reason to get upset because someone else disagreed or read your comment that way.

replies(1): >>35250328 #
40. ◴[] No.35249543{3}[source]
41. a_wild_dandan ◴[] No.35249580{3}[source]
For reference, here's GPT-4's response:

    1. The given Cron expression follows the Unix/Linux variant of Cron syntax.

    2. This Cron string has six elements.

    3. The elements and their meanings are as follows:
        Element 1 (value: 0): This represents the seconds field and indicates the task will run at 0 seconds.
        Element 2 (value: 30): This represents the minutes field, specifying the task will run at minute 30.
        Element 3 (value: 5): This represents the hours field, indicating the task will run at 5 AM.
        Element 4 (value: *): This represents the days of the month field, where the asterisk (*) signifies that the task can run on any day of the month.
        Element 5 (value: *): This represents the months field, where the asterisk (*) signifies that the task can run in any month.
        Element 6 (value: 3): This represents the days of the week field, specifying the task will run on the 3rd day of the week (where Sunday is considered the 0th day, making Tuesday the 3rd day).

    4. The overall recurrence pattern described by this string is that the task will run every Tuesday at 5:30 AM.
replies(1): >>35258171 #
42. hn_throwaway_99 ◴[] No.35249906{3}[source]
I think you are misunderstanding. The post you are replying to clearly said they were reviewing output code before checking it in. The fact that we don't understand how the models work is irrelevant (we don't understand how the human brain works, either) - all we need to understand is how the output works.

I had a conversation with ChatGPT where I asked it to write me a piece of code. After it wrote the code, I reviewed it, and I told ChatGPT that it had a subtle bug. ChatGPT then fixed the bug itself, and wrote an English description about how the fix it added would prevent the bug.

43. vidarh ◴[] No.35250328{7}[source]
That'd justify asking for clarifications, not making pronouncements not supported by the initial comment.
replies(1): >>35250723 #
44. dragonwriter ◴[] No.35250346[source]
> I guess it is in OpenAI’s best interest to downplay the memorization aspect in favor of the logical reasoning angle. If it turns out that GPT is memorizing and reproducing copyrighted data, it could land them in legal trouble.

It's also in their interest, since it encourages people to attempt to build businesses on top of its "logical reasoning" capacities.

And as long as it's within the realm of spin rather than direct false statements, it probably avoids creating legal problems for them for fraud, although the difference in effect is…nonexistent.

45. dahart ◴[] No.35250723{8}[source]
You’re repeating your assumption that anyone but you knows exactly what is supported by the comment you wrote, which does in fact imply, in multiple ways, that there’s code involved that you don’t fully understand. Why is it fair to expect people to know exactly what you meant, when words often have fuzzy meanings, and in the face of evidence that multiple people interpreted your comment differently than intended?
replies(1): >>35251248 #
46. vidarh ◴[] No.35251224{4}[source]
Exactly. And my point in the first place was that it's most useful for those kinds of tasks you might hand to an apprentice where the apprentice might go away, spend a lot of time doing research and distill it down to some code that is simple, likely not all that great, but saves me time.

E.g. some tasks I've used it for recently:

* Giving me an outline of a JMAP client so I can pull down stuff from my e-mail to feed to GPT.

* Giving me an outline of an OpenAPI client.

* Giving me an index page and a layout for a website, including a simple starting point for the CSS that did a reset and added basic styling for the nav bar, forms and "hero" sections.

* Giving me an outline of a Stripe API integration.

* Writing a simple DNS server.

* Writing a simple web server capable of running Sinatra apps via Rack.

None of these were complex code that'd hide obscure bugs. None were big chunks of code. All of them were simple code that was always going to have big, gaping holes and sub-optimal choices that'd need to be addressed, but that was fine because they were scaffolding that saved me starting from scratch (and the last two were not intended to turn into anything, but just exploring what it could do).

That's where the biggest savings are for me, because if I asked it to generate particularly complex stuff, I'd end up spending ages getting comfortable it'd done it right and verifying it. But the simple but tedious stuff is something it's great for.

47. vidarh ◴[] No.35251248{9}[source]
I did not repeat any assumption at all. I pointed out that if I were to accept your interpretation, then that is justification for asking for clarification, not making bombastic statements about it.
replies(1): >>35251682 #
48. dahart ◴[] No.35251682{10}[source]
I agree that asking for clarification is a good idea! That’s always true. :) To clarify my point, since I might not be verbalizing exactly what I intended, it’s partly that making reasonable assumptions about your intent is par for the course and should be expected when you comment, and partly that the comment in question is not particularly “bombastic”, even if it made assumptions about what you meant. That seems like an exaggeration, which might undermine your point a little, and it assumes your audience is responsible for knowing your exact intent when using words and topics that are easily misunderstood.
49. surrTurr ◴[] No.35252251[source]
I recently built something related (QA system for Zotero powered by LangChain & GPT). Works really well.

https://twitter.com/alexweichart/status/1637211755049897985?...

50. solarkraft ◴[] No.35257939{3}[source]
They say they verify the code, so they should understand it. But also: have you heard of StackOverflow? Copy/pasting code you don't (fully) understand is already a common practice that seems to mostly work well.
51. bboreham ◴[] No.35258171{4}[source]
A beautiful illustration, thank you.

(If Sunday is 0 then regular math would give 3=Wednesday)

52. soco ◴[] No.35259306{5}[source]
Actually, no. It gave a wrong answer in full confidence then HN analyzed it to expose the bug(s).