340 points agomez314 | 23 comments
jvanderbot ◴[] No.35245898[source]
Memorization is absolutely the most valuable part of GPT, for me. I can get natural language responses to documentation, basic scripting / sysadmin, and API questions much more easily than searching other ways.

While this is a point of academic interest, and rightly tamps down on hype around replacing humans, it doesn't diminish what I think is most people's basic use case: "I don't know or don't remember how to do X, can you show me?"

This is finally a good enough "knowledge reference engine" that I can see it being useful to the very people it is overhyped to replace.

replies(6): >>35245958 #>>35245959 #>>35245985 #>>35246065 #>>35246167 #>>35252251 #
1. vidarh ◴[] No.35245958[source]
And asking higher-level questions than what you'd otherwise look up. E.g. I've had ChatGPT write forms, write API calls, and put together skeletons for all kinds of things that I can easily verify and fix when it gets details wrong, but that are time-consuming to do manually. I've held back and been sceptical but I'm at the point where I'm preparing to integrate models all over the place because there are plenty of places where you can add sufficient checks that doing mostly ok much of the time is sufficient to already provide substantial time savings.
replies(1): >>35246018 #
2. zer00eyz ◴[] No.35246018[source]
> I've held back and been sceptical but I'm at the point where I'm preparing to integrate models all over the place because there are plenty of places where you can add sufficient checks that doing mostly ok much of the time is sufficient to already provide substantial time savings.

I'm an old engineer.

Simply put: NO.

If you don't understand it, don't check it in. You are just getting code to cut and paste at a higher frequency and volume. At some point the fire will be burning around you and you won't have the tools to deal with it.

Nothing about "mostly", "much" and "sufficient" ever ends well when it has been done in the name of saving time.

replies(7): >>35246026 #>>35246079 #>>35246149 #>>35246308 #>>35248566 #>>35249906 #>>35257939 #
3. vidarh ◴[] No.35246026[source]
Nobody suggested checking in anything you don't understand. On the contrary. So maybe try reading again.
replies(4): >>35246280 #>>35246737 #>>35246792 #>>35246983 #
4. simonw ◴[] No.35246079[source]
"You are just getting code to cut and paste at a higher frequency and volume" genuinely sounds like the key value proposition of ChatGPT for coding to me.

I treat its output like I would treat a PR from a brand new apprentice engineer on my team: review it carefully, provide some feedback and iterate a few times, accept with tests.
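
One concrete way to "accept with tests", as a sketch in Python: slugify is a hypothetical stand-in for a model-generated helper (not from any real PR), and the acceptance tests are the reviewer's own, not the model's.

    # Hypothetical model-generated helper under review.
    def slugify(title: str) -> str:
        # Lowercase the title and join whitespace-separated words with hyphens.
        return "-".join(title.lower().split())

    # Acceptance tests written by the human reviewer; run with pytest.
    def test_slugify_basic():
        assert slugify("Hello World") == "hello-world"

    def test_slugify_collapses_whitespace():
        assert slugify("  a   b ") == "a-b"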

replies(1): >>35251224 #
5. Karunamon ◴[] No.35246149[source]
Nobody said one word about checking in something they don't understand. That applies to copying from Stack Overflow as much as it does from an LLM or Copilot.
6. ◴[] No.35246280{3}[source]
7. poniko ◴[] No.35246308[source]
Isn't that what we all have been doing with Google/Stack Overflow? "How do I solve XX?" Aha, seems right: copy, paste and a quick format... cross fingers and run.
8. anon7725 ◴[] No.35246737{3}[source]
The parent said:

> I'm at the point where I'm preparing to integrate models all over the place

Nobody understands these models right now. We don’t even have the weights.

You may draw some artificial distinction between literally checking in the source code of a model into your git repo and making a call to some black-box API that hosts it. And you may claim that doing so is no different than making a call to Twilio or whatever, but I think there is a major difference: nobody can make a claim about what an LLM will return or how it will return it, nor make guarantees about how it will fail, etc.

I agree with zer00eyz.

replies(1): >>35248652 #
9. ◴[] No.35246792{3}[source]
10. dahart ◴[] No.35246983{3}[source]
To be fair, “sufficient checks” and “mostly ok much of the time” do imply something not well understood, to me. Maybe you could clarify instead of snapping at people; try writing again, if that’s not what you meant?
replies(1): >>35248858 #
11. pixl97 ◴[] No.35248566[source]
>If you don't understand it don't check it in.

I work in code security, and after helping any number of customers, I can tell you that far too many programmers don't work that way.

A client recently had a problem with a project that had over 1200 node_modules.

1200...

Let that sink in. There is absolutely no way in hell they understood more than a small portion of the code they were including.

replies(2): >>35249178 #>>35249323 #
12. vidarh ◴[] No.35248652{4}[source]
I said that, and you're missing the point. We don't need to understand the models to be able to evaluate the output manually.
13. vidarh ◴[] No.35248858{4}[source]
For starters, "sufficient checks" does mean sufficient, and that inherently means I need to fully understand the risks.

You're jumping to conclusions not supported by the comment at all.

Also, the comment has two parts: One about writing code, and one about integrating models in workflows.

To the latter, the point is that for a whole lot of uses you can trivially ensure the failure modes are safe.

E.g. I am integrating GPT with my email. "Mostly ok most of the time" applies to things like summaries and prioritisation, because worst case I just get to an email a bit later. "Sufficient checks" applies to things like writing proposed replies: there's no way I'd send one without reading it, and it's sufficient for me to read through it before pressing send (and make adjustments as needed). Failures here would matter if I intended to make a product of it, but as a productivity tool for myself it just needs to be close enough.

There are a whole lot of possibilities like that.

But even for coding-related tasks there are a whole lot of low-risk tasks, such as generating HTML or CSS, providing usage examples, or providing a scaffold for something you know well how to do but which is time-consuming.

If you're trying to make it do things that'd be time consuming to verify sufficiently well, then that's a bad use. The good uses are those where errors are low impact and easy to catch.
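
A minimal Python sketch of that failure-mode split; the llm_* functions and send_email are hypothetical stand-ins (stubbed here so the sketch runs), not any real API:

    # Advisory outputs are used as-is; anything that acts gets a human gate.

    def llm_summarize(text: str) -> str:
        # Placeholder for a model call.
        return "summary: " + text[:60]

    def llm_draft_reply(text: str) -> str:
        # Placeholder for a model call.
        return "Thanks, I'll get back to you about: " + text[:40]

    def send_email(body: str) -> None:
        # Placeholder for a real mail-sending helper.
        print("sent:", body)

    def triage(email_body: str) -> str:
        # "Mostly ok most of the time" is fine here: a bad summary just
        # means the email gets read a bit later.
        return llm_summarize(email_body)

    def reply(email_body: str) -> None:
        draft = llm_draft_reply(email_body)
        print(draft)
        # The "sufficient check": nothing is sent without a human reading it.
        if input("Send this reply? [y/N] ").strip().lower() == "y":
            send_email(draft)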

replies(1): >>35249346 #
14. com2kid ◴[] No.35249178{3}[source]
> A client recently had a problem with a project that had over 1200 node_modules.

# of Node modules is such a useless metric.

In any given project, a large # of node modules are part of the test, build, and linting frameworks.

If I went to C++ land and counted the number of #include statements, it wouldn't tell me anything.

How many classes do large Java projects use? Typically some absurd number.

15. teaearlgraycold ◴[] No.35249323{3}[source]
Are those direct dependencies or the full dependency tree?
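
For illustration, a rough Python sketch run from a Node project's root that counts both; package.json and node_modules are the standard npm layout, and the installed count is approximate because scoped packages (e.g. @babel/core) sit one directory deeper:

    import json
    import os

    # Direct dependencies the project itself declares.
    with open("package.json") as f:
        pkg = json.load(f)
    direct = len(pkg.get("dependencies", {})) + len(pkg.get("devDependencies", {}))

    # Everything actually installed: transitive deps, build and test tooling.
    installed = sum(1 for name in os.listdir("node_modules")
                    if not name.startswith("."))

    print(f"direct: {direct}, installed (approx.): {installed}")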
16. dahart ◴[] No.35249346{5}[source]
Thanks for clarifying, this does make it sound like you want to be more careful than the comment above seemed to imply.

> You’re jumping to conclusions not supported by the comment at all.

That might be true, but you’re assuming that your first comment was clear and being interpreted the way you intended. I think it’s fair to point out that your words may imply things you weren’t considering, and that asking people to re-read the same words might not solve the problem.

The bigger picture here is that you’re talking about using AI to write code that for whatever reason you couldn’t write yourself in the same amount of time. The very topic here also implicitly suggests you’re starting with code you might not fully understand, which is fine; there’s no reason to get upset because someone else disagreed or read your comment that way.

replies(1): >>35250328 #
17. hn_throwaway_99 ◴[] No.35249906[source]
I think you are misunderstanding. The post you are replying to clearly said they were reviewing output code before checking it in. The fact that we don't understand how the models work is irrelevant (we don't understand how the human brain works, either) - all we need to understand is how the output works.

I had a conversation with ChatGPT where I asked it to write me a piece of code. After it wrote the code, I reviewed it, and I told ChatGPT that it had a subtle bug. ChatGPT then fixed the bug itself, and wrote an English description about how the fix it added would prevent the bug.

18. vidarh ◴[] No.35250328{6}[source]
That'd justify asking for clarifications, not making pronouncements not supported by the initial comment.
replies(1): >>35250723 #
19. dahart ◴[] No.35250723{7}[source]
You’re repeating your assumption that anyone but you knows exactly what is supported by the comment you wrote, which does in fact imply, in multiple ways, that there’s code involved that you don’t fully understand. Why is it fair to expect people to know exactly what you meant, when words often have fuzzy meanings, and in the face of evidence that multiple people interpreted your comment differently than you intended?
replies(1): >>35251248 #
20. vidarh ◴[] No.35251224{3}[source]
Exactly. And my point in the first place was that it's most useful for those kinds of tasks you might hand to an apprentice, where the apprentice might go away, spend a lot of time doing research, and distill it down to some code that is simple, likely not all that great, but saves me time.

E.g. some tasks I've used it for recently:

* Giving me an outline of a JMAP client so I can pull down stuff from my e-mail to feed to GPT.

* Giving me an outline of an OpenAPI client.

* Giving me an index page and a layout for a website, including a simple starting point for the CSS that did a reset and added basic styling for the nav bar, forms and "hero" sections.

* Giving me an outline of a Stripe API integration.

* Writing a simple DNS server.

* Writing a simple web server capable of running Sinatra apps via Rack.

None of these were complex code that'd hide obscure bugs. None were big chunks of code. All of them were simple code that was always going to have big, gaping holes and sub-optimal choices that'd need to be addressed, but that was fine because they were scaffolding that saved me starting from scratch (and the last two were not intended to turn into anything; they were just exploring what it could do).

That's where the biggest savings are for me, because if I asked it to generate particularly complex stuff, I'd end up spending ages getting comfortable it'd done it right and verifying it. But the simple but tedious stuff is something it's great for.
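
To give a flavour of the "simple DNS server" item, a minimal stdlib-only Python sketch: it answers every query with a fixed A record, and it assumes a bare query with a single question and no EDNS, which is exactly the kind of gaping hole such scaffolding leaves to be addressed. Test with: dig +noedns @127.0.0.1 -p 5353 example.test

    import socket
    import struct

    def build_response(query: bytes, ip: str = "127.0.0.1") -> bytes:
        # Header: echo the transaction ID; flags 0x8180 = standard response,
        # recursion available, no error; counts = 1 question, 1 answer.
        header = query[:2] + b"\x81\x80" + struct.pack(">HHHH", 1, 1, 0, 0)
        # Echo the question section back (assumes exactly one question, no EDNS).
        question = query[12:]
        # Answer: pointer to the name at offset 12, type A, class IN,
        # TTL 60 seconds, then the 4-byte address.
        answer = b"\xc0\x0c" + struct.pack(">HHIH", 1, 1, 60, 4) + socket.inet_aton(ip)
        return header + question + answer

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 5353))  # unprivileged port; real DNS uses 53
    while True:
        data, addr = sock.recvfrom(512)
        sock.sendto(build_response(data), addr)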

21. vidarh ◴[] No.35251248{8}[source]
I did not repeat any assumption at all. I pointed out that if I were to accept your interpretation, then that would be justification for asking for clarification, not for making bombastic statements about it.
replies(1): >>35251682 #
22. dahart ◴[] No.35251682{9}[source]
I agree that asking for clarification is a good idea! That’s always true. :) To clarify my point, since I might not be verbalizing exactly what I intended, it’s partly that making reasonable assumptions about your intent is par for the course and should be expected when you comment, and partly that the comment in question is not particularly “bombastic”, even if it made assumptions about what you meant. That seems like an exaggeration, which might undermine your point a little, and it assumes your audience is responsible for knowing your exact intent when using words and topics that are easily misunderstood.
23. solarkraft ◴[] No.35257939[source]
They say they verify the code, so they should understand it. But also: have you heard of Stack Overflow? Copy/pasting code you don't (fully) understand is already a common practice that seems to mostly work well.