2025 AI Index Report (hai.stanford.edu)
166 points by INGELRII | 11 comments
mrdependable ◴[] No.43645990[source]
I always see these reports about how much better AI is than humans now, but I can't even get it to help me with pretty mundane problem solving. Yesterday I gave Claude a file with a few hundred lines of code, what the input should be, and told it where the problem was. I tried until I ran out of credits and it still could not work backwards to tell me where things were going wrong. In the end I just did it myself and it turned out to be a pretty obvious problem.

The strange part with these LLMs is that they get weirdly hung up on things. I try to direct them away from a certain type of output and somehow they keep going back to it. It's like the same problem I have with Google where if I try to modify my search to be more specific, it just ignores what it doesn't like about my query and gives me the same output.

replies(4): >>43646008 #>>43646119 #>>43646496 #>>43647128 #
namaria ◴[] No.43646496[source]
It's overfitting.

Some people say they find LLMs very helpful for coding, some people say they are incredibly bad.

I often see people wondering whether some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to the diffs it was trained on.

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/abs/2206.08896
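
To make "overfitting" concrete, here is a minimal toy sketch in Python (numpy only; my own example, not from the paper): a degree-7 polynomial pushed through 8 noisy samples reproduces the training points essentially exactly, while saying nothing reliable about the function in between.

    # Toy overfitting: as many free parameters as data points, so the fit
    # interpolates the noise rather than capturing the underlying trend.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 8)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, size=x_train.shape)

    model = np.polynomial.Polynomial.fit(x_train, y_train, deg=7)

    x_test = np.linspace(0, 1, 101)
    train_err = np.max(np.abs(model(x_train) - y_train))  # ~0: the training set is memorised
    test_err = np.sqrt(np.mean((model(x_test) - np.sin(2 * np.pi * x_test)) ** 2))

    print(f"max error on training points: {train_err:.2e}")
    print(f"RMSE against the true function: {test_err:.3f}")

The analogy I'm drawing: a model can look excellent whenever the query lands on or near its training points, and fall apart elsewhere.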

replies(2): >>43646676 #>>43651662 #
simonw ◴[] No.43646676[source]
... which explains why some models are better at code than others. The best coding models (like Claude 3.7 Sonnet) are likely that good because Anthropic spent an extraordinary amount of effort cultivating a really good training set for them.

I get the impression one of the most effective tricks is to load your training set up with as much code as possible that has comprehensive automated tests that pass already.
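
A hypothetical sketch of that curation step, to show what I mean - keep a candidate repo only if its own test suite passes. The repos/ directory, the use of pytest, and the pass/fail criterion are all assumptions of mine, not a description of any lab's actual pipeline.

    # Filter candidate training repos by whether their test suite passes.
    import subprocess
    from pathlib import Path

    def tests_pass(repo: Path, timeout: int = 600) -> bool:
        """Run the repo's tests and report whether they exited cleanly."""
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q"],
                cwd=repo,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    candidates = [p for p in Path("repos").iterdir() if p.is_dir()]
    training_set = [repo for repo in candidates if tests_pass(repo)]
    print(f"kept {len(training_set)} of {len(candidates)} candidate repos")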

replies(2): >>43646863 #>>43646981 #
namaria ◴[] No.43646981[source]
> ... which explains why some models are better at code than others.

No. It explains why models seem better at code in given situations. When your prompt maps to diffs in the training data that are useful to you, they seem great.

replies(1): >>43647037 #
simonw ◴[] No.43647037{3}[source]
I've been writing code with LLM assistance for over two years now and I've had plenty of situations where I am 100% confident the thing I am doing has never been done by anyone else before.

I've tried things like searching all of the public code on GitHub for every possible keyword relevant to my problem.

... or I'm writing code against libraries which didn't exist when the models were trained.

The idea that models can only write code if they've seen code that does the exact same thing in the past is uninformed in my opinion.

replies(2): >>43647176 #>>43647229 #
fergal_reid ◴[] No.43647229{4}[source]
Strongly agree.

This seems to be very hard for people to accept, per the other comments here.

Until recently I was willing to accept an argument that perhaps LLMs had mostly learned the patterns; e.g. to maybe believe 'well there aren't that many really different leetcode questions'.

But with recent models (e.g. sonnet-3.7-thinking) they are operating well on such large and novel chunks of code that the idea they've seen everything in the training set, or even, like, a close structural match, is becoming ridiculous.

replies(1): >>43647305 #
namaria ◴[] No.43647305[source]
All due respect to Simon, but I would love to see some of that groundbreaking code the LLMs are coming up with.

I am sure the functionality implemented is novel, but do you really think the training data cannot possibly have contained the patterns used to deliver these features? How is it that in the past few months or years people suddenly found the opportunity and motivation to write code that cannot in any way, shape, or form be represented by patterns in the diffs that have been pushed over the past 30 years?

replies(1): >>43647338 #
1. simonw ◴[] No.43647338{6}[source]
When I said "the thing I am doing has never been done by anyone else before" I didn't necessarily mean groundbreaking pushes-the-edge-of-computer-science stuff - I meant more pedestrian things like "nobody has ever published Python code to condense and uncondense JSON using this new format I just invented today": https://github.com/simonw/condense-json

I'm not claiming LLMs can invent new computer science. I'm saying it's not accurate to say "they can only produce code that's almost identical to what's in their training data".
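
To illustrate the kind of task I mean - this is a purely hypothetical sketch, not the actual condense-json API or format - here is one way to "condense" a JSON document by pooling repeated string values into a reference table, and to reverse it losslessly:

    import json

    def condense(obj, min_repeats=2):
        """Replace string values that repeat with "$n" references into a table.
        (A real format would need to escape literal strings that look like "$n".)"""
        counts = {}

        def scan(node):
            if isinstance(node, dict):
                for v in node.values():
                    scan(v)
            elif isinstance(node, list):
                for v in node:
                    scan(v)
            elif isinstance(node, str):
                counts[node] = counts.get(node, 0) + 1

        scan(obj)
        table = [s for s, n in counts.items() if n >= min_repeats]
        index = {s: f"${i}" for i, s in enumerate(table)}

        def swap(node):
            if isinstance(node, dict):
                return {k: swap(v) for k, v in node.items()}
            if isinstance(node, list):
                return [swap(v) for v in node]
            if isinstance(node, str):
                return index.get(node, node)
            return node

        return {"$strings": table, "data": swap(obj)}

    def uncondense(packed):
        """Invert condense() by resolving "$n" references against the table."""
        table = packed["$strings"]

        def swap(node):
            if isinstance(node, dict):
                return {k: swap(v) for k, v in node.items()}
            if isinstance(node, list):
                return [swap(v) for v in node]
            if isinstance(node, str) and node.startswith("$") and node[1:].isdigit():
                return table[int(node[1:])]
            return node

        return swap(packed["data"])

    doc = {"users": [{"role": "admin"}, {"role": "admin"}, {"role": "viewer"}]}
    packed = condense(doc)
    assert uncondense(packed) == doc
    print(json.dumps(packed))

The point is only that a model can produce a correct implementation of a scheme like this even when the exact format was invented after its training cutoff.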

replies(2): >>43647551 #>>43676009 #
2. namaria ◴[] No.43647551[source]
> "they can only produce code that's almost identical to what's in their training data"

Again, you're misinterpreting me in a way that suggests you're reacting to the perception that someone attacked one of your core beliefs, rather than considering what I am saying and conversing about that.

I never even used the words "exact same thing" or "almost identical". Not even synonyms. I just said overfitting and quoted from an OpenAI/Anthropic paper that said "predict plausible changes to code from examples of changes"

Think about that. Don't react, think. Why do you equate overfitting and plausibility prediction with "exact" and "identical"? That is very obviously not what I said.

What I am getting at is that a cannon will kill the mosquito. But drawing a fly swatter on the cannonball and saying the plastic ones are obsolete now would be in bad faith. And there is no need to tell someone pointing that out that they are claiming the cannon can only fire at mosquitoes that have already been swatted.

replies(1): >>43647620 #
3. simonw ◴[] No.43647620[source]
I don't think I understood your point then. I matched it with the common "LLMs can only produce code that's similar to what they've seen before" argument.

Reading back, you said:

> I often see people wondering whether some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to the diffs it was trained on.

I'll be honest: I don't understand what you mean by "overfitting to diffs it was trained on" there.

Maybe I don't understand what "overfitting" means in this context?

(I'm afraid I didn't understand your cannon / fly swatter analogy either.)

replies(1): >>43647978 #
4. namaria ◴[] No.43647978{3}[source]
It's overkill. The models do not capture knowledge about coding; they overfit to the dataset. When one distills data into a useful model, that model can be used to predict the future behavior of the system.

That is the premise of LLM-as-AI: by training these models on enough data, knowledge of the world is purported to have been captured, creating something useful that can be leveraged to process new input and predict the trajectory of the system in some phase space.

But this, I argue, is not the case. The models merely overfit to the training data, hence the variable results people perceive. When their intentions and prompt fit the training data, the model appears to give good output. When they do not, the model does not "reason" about it or "infer" anything. It fails. It gives you gibberish or goes in circles, or worse, if there is some "agentic" arrangement, it fails to terminate and burns tokens until you intervene.

It's overkill. And I am pointing out that it is overkill. It's not a clever system for creating code for any given situation; it overfits to the training data set. And your response is to claim that my argument is something else: not that it's overkill, but that it can only kill dead things. I never said that. I can see it's more than capable of spitting out useful code even if that exact code is not in the training dataset.

But it is just automating the process of going through Google, the docs, and Stack Overflow and assembling something for you. You might be good at searching, and lucky, and get just what you need. Or you might not be used to picking the right keywords, or be using some uncommon language, or be working in a domain that happens to not be well represented, and then it feels less useful. But instead of just coming up short the way a search does, the model overkills and wastes your time and god knows how much subsidized energy and compute. Lucky you if you're not burning tokens on some agentic monstrosity.

replies(2): >>43647993 #>>43648989 #
5. simonw ◴[] No.43647993{4}[source]
If that's the case, it turns out that what I want is a system that's "overfitted to the dataset" on code, since I'm getting incredibly useful results for code out of it.

(I'm not personally interested in the whole AGI thing.)

replies(1): >>43648232 #
6. namaria ◴[] No.43648232{5}[source]
Good man, I never said anything about AGI. Why do you keep responding to things I never said?

This whole exchange was you having knee-jerk reactions to things you imagined I said. It has been incredibly frustrating. And at the end you shrug and say "eh it's useful to me"??

I am talking about this because of deceitfulness, resource efficiency, and the societal implications of the technology.

replies(1): >>43648414 #
7. simonw ◴[] No.43648414{6}[source]
"That is the premise of LLM-as-AI" - I assumed that was an AGI reference. My definition of AGI is pretty much "hyped AI". What did you mean by "LLM-as-AI"?

In my own writing I don't even use the term "AI" very often because its meaning is so vague.

You're right to call me out on this: I did, in this earlier comment - https://news.ycombinator.com/item?id=43644662#43647037 - commit the sin of responding to something you hadn't actually said.

(Worse than that, I said "... is uninformed in my opinion" which was rude because I was saying that about a strawman argument.)

I did that thing where I saw an excuse to bang on one of my pet peeves (people saying "LLMs can't create new code if it's not already in their training data") and jumped at the opportunity.

I've tried to continue the rest of the conversation in good faith though. I'm sorry if it didn't come across that way.

replies(1): >>43651778 #
8. fergal_reid ◴[] No.43648989{4}[source]
You are correct that variable results could be a symptom of a failure to generalise well beyond the training set.

Such failure could happen if the models were overfit, or for other reasons. I don't think 'overfit', which is pretty well defined, is exactly the word you mean to use here.

However, I respectfully disagree with your claim. I think they are generalising well beyond the training dataset (though not as far beyond as, say, a good programmer would - at least not yet). I further think they are learning semantically.

I can't prove it in a comment, except to say that there's simply no way they'd be able to successfully manipulate such large pieces of code, using English-language instructions, if they weren't great at generalisation and OK at understanding semantics.

replies(1): >>43651066 #
9. namaria ◴[] No.43651066{5}[source]
I understand your position. But I think you're underestimating just how much training data is used and how much information can be encoded in hundreds of billions of parameters.

But this is the crux of the disagreement. I think the models overfit to the training data, hence the fluctuating behavior. And you think they show generalization and semantic understanding. Which, yeah, they apparently do. But the failure modes, in my opinion, show that they don't, and would be explained by overfitting.

10. mdp2021 ◴[] No.43651778{7}[source]
> My definition of AGI is pretty much

Simon, intelligence exists (and unintelligence exists). When you write «I'm not claiming LLMs can invent new computer science», you imply intelligence exists.

We can implement it. And it is somewhat urgent, because intelligence is a very desirable form of wealth - there is a definite scarcity of it. It is even more urgent now that the recent hype has made some people perversely confused about the idea of intelligence.

We can and must go well beyond the current state.

11. anon373839 ◴[] No.43676009[source]
I've spent a fair amount of time trying to coax assistance out of LLMs when designing novel or custom neural network architectures. They are sometimes helpful with narrow aspects of this. But more often, they disregard key requirements in favor of the common patterns they were trained on.