For starters, "sufficient checks" does mean sufficient, and that inherently means I need to fully understand the risks.
You're jumping to conclusions not supported by the comment at all.
Also, the comment has two parts: one about writing code, and one about integrating models into workflows.
As to the latter, the point is that for a whole lot of uses you can trivially ensure the failure modes are safe.
E.g., I'm integrating GPT with my email. "Mostly ok most of the time" applies to things like summaries and prioritisation, because worst case I just get to an email a bit later. "Sufficient checks" applies to things like writing proposed replies: there's no way I'd send one without reading it, and it's sufficient for me to read through it before pressing send (and make adjustments as needed). Failures here would matter if I intended to make a product of it, but as a productivity tool for myself it just needs to be close enough.
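To make the reply-drafting bit concrete, the shape of it is roughly this (a minimal sketch in Python, assuming the OpenAI SDK; `send_email` is a hypothetical placeholder for whatever mail integration you'd actually wire in):

```python
from openai import OpenAI

client = OpenAI()

def draft_reply(original_email: str) -> str:
    """Ask the model for a proposed reply; the draft is never sent automatically."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model would do here
        messages=[
            {"role": "system", "content": "Draft a short, polite reply to the email below."},
            {"role": "user", "content": original_email},
        ],
    )
    return resp.choices[0].message.content

def send_email(body: str) -> None:
    """Hypothetical stand-in for whatever mail client integration you actually use."""
    raise NotImplementedError("wire this up to your mail client")

def review_and_send(original_email: str) -> None:
    draft = draft_reply(original_email)
    print("--- proposed reply ---")
    print(draft)
    # The "sufficient check": nothing goes out until a human has read the draft
    # and explicitly confirmed it. Worst case, you discard it and write your own.
    if input("Send this reply? [y/N] ").strip().lower() == "y":
        send_email(draft)
    else:
        print("Draft discarded.")
```

The point isn't the code, it's that the failure mode is capped at "I had to edit or discard a draft", which is exactly the kind of check that's cheap enough to actually do.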
There are a whole lot of possibilities like that.
But even for coding-related tasks there are a whole lot of low-risk uses, such as generating HTML or CSS, providing usage examples, or providing a scaffold for something you know well how to do but which is time-consuming to write by hand.
If you're trying to make it do things that'd be time-consuming to verify sufficiently well, then that's a bad use. The good uses are those where errors are low-impact and easy to catch.