
2025 AI Index Report

(hai.stanford.edu)
166 points by INGELRII | 6 comments
mrdependable ◴[] No.43645990[source]
I always see these reports about how much better AI is than humans now, but I can't even get it to help me with pretty mundane problem solving. Yesterday I gave Claude a file with a few hundred lines of code, what the input should be, and told it where the problem was. I tried until I ran out of credits and it still could not work backwards to tell me where things were going wrong. In the end I just did it myself and it turned out to be a pretty obvious problem.

The strange part with these LLMs is that they get weirdly hung up on things. I try to direct them away from a certain type of output and somehow they keep going back to it. It's like the same problem I have with Google where if I try to modify my search to be more specific, it just ignores what it doesn't like about my query and gives me the same output.

replies(4): >>43646008 #>>43646119 #>>43646496 #>>43647128 #
namaria ◴[] No.43646496[source]
It's overfitting.

Some people say they find LLMs very helpful for coding, some people say they are incredibly bad.

I often see people wondering whether a given coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that: it's overfitting to the diffs it was trained on.

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/abs/2206.08896
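
Roughly, a "predict a plausible change" training sample might be packed like this. This is a made-up sketch to illustrate the idea, not the paper's actual format; the marker strings and helper are my own:

    # Made-up sketch of a diff-style training sample: given a file before a
    # human commit plus the commit message, the model is trained to emit the
    # diff. Marker tokens and field names are illustrative, not the paper's.
    def make_diff_example(before_code: str, commit_msg: str, unified_diff: str) -> dict:
        prompt = (
            "<before>\n" + before_code +
            "<msg>\n" + commit_msg + "\n" +
            "<diff>\n"
        )
        return {"prompt": prompt, "completion": unified_diff}

    before = "def add(a, b):\n    return a - b\n"   # buggy original file
    diff = (
        "--- a/math_utils.py\n"
        "+++ b/math_utils.py\n"
        "@@ -1,2 +1,2 @@\n"
        " def add(a, b):\n"
        "-    return a - b\n"
        "+    return a + b\n"
    )
    example = make_diff_example(before, "Fix sign error in add()", diff)
    print(example["prompt"] + example["completion"])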

replies(2): >>43646676 #>>43651662 #
simonw ◴[] No.43646676[source]
... which explains why some models are better at code than others. The best coding models (like Claude 3.7 Sonnet) are likely that good because Anthropic spent an extraordinary amount of effort cultivating a really good training set for them.

I get the impression one of the most effective tricks is to load your training set up with as much code as possible that has comprehensive automated tests that pass already.
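
If that's the trick, the filtering step could be as simple as only admitting repos whose test suites actually pass. A speculative sketch of the shape of it (my guess, not anything Anthropic has described; pytest and the directory layout here are assumptions):

    # Speculative sketch: keep only repositories whose test suites pass, so the
    # training set skews toward code that is at least internally consistent.
    # The pytest invocation and "mirrored_repos" directory are assumptions.
    import subprocess
    from pathlib import Path

    def tests_pass(repo_dir: Path, timeout: int = 300) -> bool:
        """Run the repo's test suite; treat a zero exit code as 'tests pass'."""
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q"],
                cwd=repo_dir,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    def filter_candidate_repos(candidates: list[Path]) -> list[Path]:
        """Return only the repos whose tests pass, as training-set candidates."""
        return [repo for repo in candidates if tests_pass(repo)]

    if __name__ == "__main__":
        repos = [p for p in Path("mirrored_repos").iterdir() if p.is_dir()]
        kept = filter_candidate_repos(repos)
        print(f"{len(kept)} of {len(repos)} repos have passing tests")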

replies(2): >>43646863 #>>43646981 #
namaria ◴[] No.43646981[source]
> ... which explains why some models are better at code than others.

No. It explains why models seem better at code in given situations. When your prompt maps to diffs in the training data that are useful to you, they seem great.

replies(1): >>43647037 #
simonw ◴[] No.43647037{3}[source]
I've been writing code with LLM assistance for over two years now and I've had plenty of situations where I am 100% confident the thing I am doing has never been done by anyone else before.

I've tried things like searching all of the public code on GitHub for every possible keyword relevant to my problem.

... or I'm writing code against libraries which didn't exist when the models were trained.

The idea that models can only write code if they've seen code that does the exact same thing in the past is uninformed in my opinion.

replies(2): >>43647176 #>>43647229 #
1. namaria ◴[] No.43647176[source]
> The idea that models can only write code if they've seen code that does the exact same thing in the past is deeply uninformed in my opinion.

This is a conceited interpretation of what I said.

replies(1): >>43647287 #
2. xboxnolifes ◴[] No.43647287[source]
If this isn't what you meant, then what did you mean? To me, it's exactly how I read what you said.
replies(1): >>43647482 #
3. namaria ◴[] No.43647482[source]
I am sorry but that's nonsense.

I quoted the paper "Evolution through Large Models", written in collaboration between OpenAI and Anthropic researchers:

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/pdf/2206.08896

> The idea that models can only write code if they've seen code that does the exact same thing in the past

How do you get "code that does the exact same thing" from "predicting plausible changes?"

replies(1): >>43647676 #
4. simonw ◴[] No.43647676{3}[source]
That paper describes an experimental diff-focused approach from 2022. It's not clear to me how relevant it is to the way models like Claude 3.7 Sonnet (thinking) and o3-mini work today.
replies(1): >>43647989 #
5. namaria ◴[] No.43647989{4}[source]
If you do not think past research by OpenAI and Anthropic on how to use LLMs to generate code is relevant to how Anthropic LLMs generate code three years later, I really don't think it is possible to have a reasonable conversation about this topic with you.
replies(1): >>43648238 #
6. simonw ◴[] No.43648238{5}[source]
Can we be sure that research became part of their mainline model development process as opposed to being an interesting side-quest?

Are Gemini and DeepSeek and Llama and other strong coding models using the same ideas?

Llama and DeepSeek are at least slightly more open about their training processes so there might be clues in their papers (that's a lot of stuff to crunch through though).