
2025 AI Index Report

(hai.stanford.edu)
166 points by INGELRII | 1 comment
mrdependable ◴[] No.43645990[source]
I always see these reports about how much better AI is than humans now, but I can't even get it to help me with pretty mundane problem solving. Yesterday I gave Claude a file with a few hundred lines of code, told it what the input should be, and pointed out where the problem was showing up. I kept trying until I ran out of credits, and it still could not work backwards to tell me where things were going wrong. In the end I just did it myself, and it turned out to be a pretty obvious problem.

The strange part with these LLMs is that they get weirdly hung up on things. I try to steer them away from a certain type of output, and somehow they keep going back to it. It's the same problem I have with Google: if I modify my search to be more specific, it just ignores the parts of the query it doesn't like and gives me the same results.

replies(4): >>43646008 #>>43646119 #>>43646496 #>>43647128 #
namaria ◴[] No.43646496[source]
It's overfitting.

Some people say they find LLMs very helpful for coding; others say they are incredibly bad at it.

I often see people wondering whether some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that: the model is overfitting to the diffs it was trained on.

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/abs/2206.08896
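
If that's the setup, the training data is essentially pairs of (code before, human patch). A rough sketch of what building one such training example could look like, using Python's difflib (the field names and the commit-message framing are my own invention, not from the paper):

    import difflib

    def make_training_pair(before: str, after: str, message: str) -> dict:
        """Turn a before/after snapshot of one file into a (prompt, target) pair,
        where the target is the unified diff -- i.e. the 'plausible change to
        code' the model is trained to predict. Field names are illustrative."""
        patch = "".join(difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="a/main.py",
            tofile="b/main.py",
        ))
        return {"prompt": f"{message}\n\n{before}", "target": patch}

    if __name__ == "__main__":
        before = "def add(a, b):\n    return a - b\n"
        after = "def add(a, b):\n    return a + b\n"
        print(make_training_pair(before, after, "Fix add() to actually add")["target"])

A model trained on millions of pairs like that learns what edits humans tend to make, which is not the same thing as learning what your particular code needs.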

replies(2): >>43646676 #>>43651662 #
simonw ◴[] No.43646676[source]
... which explains why some models are better at code than others. The best coding models (like Claude 3.7 Sonnet) are likely that good because Anthropic spent an extraordinary amount of effort cultivating a really good training set for them.

I get the impression one of the most effective tricks is to load your training set up with as much code as possible that has comprehensive automated tests that pass already.
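
Nobody outside the labs knows what that curation actually looks like, but conceptually I picture something like this (pure speculation on my part, just to illustrate the filtering idea):

    import subprocess
    from pathlib import Path

    def tests_pass(repo: Path, timeout: int = 300) -> bool:
        """Run a candidate repo's test suite and report whether it passes.
        Purely illustrative -- a real pipeline would need sandboxing,
        dependency resolution, flaky-test handling, and so on."""
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q"],
                cwd=repo,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    def filter_corpus(candidates: list[Path]) -> list[Path]:
        # Keep only repos whose tests already pass, on the theory that a
        # passing suite is a cheap, automatic signal the code is worth imitating.
        return [repo for repo in candidates if tests_pass(repo)]

The appeal, presumably, is that the filter is behavioural rather than stylistic: you keep code because it demonstrably works, not because it looks clean.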

replies(2): >>43646863 #>>43646981 #
torginus ◴[] No.43646863[source]
I've often had what I thought was an obscure and intellectually challenging coding problem, and after prompting the LLM, it basically one-shotted it.

I was profoundly humbled by the experience, but then it occurred to me that what I thought was a unique problem had been solved by quite a few people before, and the model had plenty of references to pull from.

replies(1): >>43651191 #
zifpanachr23 ◴[] No.43651191{3}[source]
Do you have any examples?
replies(2): >>43651468 #>>43654028 #
torginus ◴[] No.43654028{4}[source]
Yeah, for the positive example, I described the syntax of a domain-specific language, and the AI basically one-shotted the parsing rules, which only needed minor fixes.
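
To give a flavour of that positive case (this is an invented stand-in, not my actual DSL): the prompt was essentially "here is the syntax, write the parser", and the model produced something of roughly this shape, which ran after minor fixes:

    import re

    # Made-up grammar standing in for the real DSL:
    # statements like "set speed = 3 + 4 * gain".
    TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

    def tokenize(src: str):
        for num, name, op in TOKEN.findall(src):
            if num:
                yield ("NUM", int(num))
            elif name:
                yield ("NAME", name)
            elif op.strip():
                yield ("OP", op)

    def parse_statement(src: str) -> dict:
        """Parse 'set <name> = <expr>' where <expr> uses + and * over numbers and names."""
        tokens = list(tokenize(src))
        assert tokens[0] == ("NAME", "set") and tokens[2] == ("OP", "=")
        expr, pos = parse_sum(tokens, 3)
        assert pos == len(tokens), "trailing tokens"
        return {"set": tokens[1][1], "expr": expr}

    def parse_sum(tokens, pos):
        left, pos = parse_product(tokens, pos)
        while pos < len(tokens) and tokens[pos] == ("OP", "+"):
            right, pos = parse_product(tokens, pos + 1)
            left = ("+", left, right)
        return left, pos

    def parse_product(tokens, pos):
        left, pos = parse_atom(tokens, pos)
        while pos < len(tokens) and tokens[pos] == ("OP", "*"):
            right, pos = parse_atom(tokens, pos + 1)
            left = ("*", left, right)
        return left, pos

    def parse_atom(tokens, pos):
        kind, value = tokens[pos]
        return value, pos + 1

    if __name__ == "__main__":
        print(parse_statement("set speed = 3 + 4 * gain"))
        # {'set': 'speed', 'expr': ('+', 3, ('*', 4, 'gain'))}

Tokenizer plus recursive-descent parser is a textbook pattern, which is presumably exactly why the model nailed it.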

As a counterexample, when working on any part of a codebase that is 100% application-specific business logic, built on our custom abstractions, the AI is usually so lost that it's basically not worth using: the chances of it writing correct and usable code are next to zero.