467 points mraniki | 1 comment
phkahler ◴[] No.43534852[source]
Here is a real coding problem that I might be willing to make a cash-prize contest for. We'd need to nail down some rules. I'd be shocked if any LLM can do this:

https://github.com/solvespace/solvespace/issues/1414

Make a GTK 4 version of Solvespace. We have a single C++ file for each platform - Windows, Mac, and Linux-GTK3. There is also a Qt version on an unmerged branch for reference. The GTK3 file is under 2KLOC. You do not need to create a new version; just rewrite the GTK3 Linux version to GTK4. You may either ask it to port what's there or create the new one from scratch.

If you want to do this for free to prove how great the AI is, please document the entire session. Heck, make a YouTube video of it. The final test is whether I accept the PR or not - and I WANT this ticket done.

I'm not going to hold my breath.

replies(15): >>43534866 #>>43534869 #>>43535026 #>>43535180 #>>43535208 #>>43535218 #>>43535261 #>>43535424 #>>43535811 #>>43535986 #>>43536115 #>>43536743 #>>43536797 #>>43536869 #>>43542998 #
snickell ◴[] No.43535424[source]
This is the smoothest Tom Sawyer move I've ever seen IRL. I wonder how many people are now grinding out your GTK4 port with their favorite LLM/system to see if it can. It'll be interesting to see if anyone gets something working with current-gen LLMs.

UPDATE: naive (just fed it your description verbatim) cline + claude 3.7 was a total wipeout. It looked like it was making progress, then freaked out, deleted 3/4 of its port, and never recovered.

replies(2): >>43535670 #>>43535712 #
phkahler ◴[] No.43535712[source]
>> This is the smoothest Tom Sawyer move I've ever seen IRL

That made me laugh. True, but not really the motivation. I honestly don't think LLMs can code significant real-world things yet, and I'm not sure how else to prove that, since they can code some interesting things. All the talk about putting programmers out of work has me calling BS but also thinking "show me". This task seems like a good combination of simple requirements, not much documentation, a real-world existing problem, non-trivial code size, and limited scope.

replies(4): >>43536382 #>>43536563 #>>43536785 #>>43550128 #
snickell ◴[] No.43536382[source]
Yes, very much agree, an interesting benchmark. Particularly because it’s in a “tier 2” framework (gtkmm) in terms of amount of code available to train an LLM on. That tests the LLM's ability to plan and problem solve compared with, say, “convert to the latest version of react”, where the LLM has access to tens of thousands (more?) of similar ports in its training dataset and mostly just has to pattern match.
replies(1): >>43539349 #
phkahler ◴[] No.43539349[source]
>> Particularly because it’s in a “tier 2” framework (gtkmm) in terms of amount of code available to train an LLM on.

I asked GPT4 to write an empty GTK4 app in C++. I asked for a menu bar with File, Edit, View at the top and two GL drawing areas separated by a spacer. It produced what looked like usable code with a couple of lines I suspected were out of place. I did not try to compile it, so I don't know if it was a hallucination, but it did seem to know about gtkmm 4.

replies(1): >>43543091 #
1. snickell ◴[] No.43543091{3}[source]
It definitely knows what GTK4 is: when it freaked out on me and lost the code, it was using all gtkmm-4.0 headers and had the compiler error count down to 10 (most likely with tons of logic errors, but who knows).

But LLM performance varies (and this is a huge critique!) not just with what they theoretically know, but with how, erm, cross-linked it is with everything else, and that requires lots of training data on the topic.

Metaphorically, I think this is a little like the difference for humans in math between being able to list+define techniques to solve integrals vs being able to fluidly apply them without error.

I think a big and very valid critique of LLMs (compared to humans) is that they are stronger at "memory" than reasoning. They use their vast memory as a crutch to hide the weaknesses in their reasoning. This makes benchmarks like "convert from gtkmm3 to gtkmm4" both challenging AND very good benchmarks of what real programmers are able to do.

I suspect if we gave it a similarly sized 2kloc conversion problem with a popular web framework in TS or JS, it would one-shot it. But again, it's "cheating" to do this: it's leveraging having read a zillion conversions by humans and what they did.