2025 AI Index Report

(hai.stanford.edu)
165 points by INGELRII | 112 comments
1. Signez ◴[] No.43645619[source]
Surprised not to see a whole chapter on the environmental impact. It's quite a big talking point around here (Europe, France) used to discredit AI, along with the usual ethical issues: art theft, job destruction, making it easier to generate disinformation, and the working conditions of AI trainers in low-income countries.

(Disclaimer: I am not an anti-AI guy — I am just listing the common talking points I see in my feeds.)

replies(7): >>43645778 #>>43645779 #>>43645786 #>>43645888 #>>43646134 #>>43646161 #>>43646204 #
2. simonw ◴[] No.43645778[source]
Yeah, it would be really useful to see a high quality report like this that addresses that issue.

My strong intuition at the moment is that the environmental impact is greatly exaggerated.

The energy cost of executing prompts has dropped enormously over the past two years - something that's reflected in this report when it says "Driven by increasingly capable small models, the inference cost for a system performing at the level of GPT-3.5 dropped over 280-fold between November 2022 and October 2024". I wrote a bit about that here: https://simonwillison.net/2024/Dec/31/llms-in-2024/#the-envi...

We still don't have great numbers on training costs for most of the larger labs, which are likely extremely high.

Llama 3.3 70B cost "39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware", which they calculated as 11,390 tons CO2eq. I tried to compare that to fully loaded passenger jet flights between London and New York and got somewhere between 28 and 56 flights, but I then completely lost confidence in my ability to credibly run those calculations because I don't understand nearly enough about how CO2eq is calculated in different industries.
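For reference, here is roughly the back-of-envelope calculation I was attempting. The per-flight CO2eq range is my own assumption, and it's exactly the part I lost confidence in:

    # Rough sketch only - the per-flight CO2eq figures are assumptions, not sourced numbers.
    gpu_hours = 39.3e6              # reported for Llama 3.3 70B
    tdp_kw = 0.7                    # H100-80GB TDP: 700 W
    energy_gwh = gpu_hours * tdp_kw / 1e6
    print(f"Training energy: ~{energy_gwh:.1f} GWh")        # ~27.5 GWh

    reported_co2eq_tonnes = 11_390  # Meta's reported figure
    # Assumed CO2eq for one fully loaded London-New York flight; estimates vary wildly
    # depending on methodology, which is the whole problem.
    for tonnes_per_flight in (200, 400):
        flights = reported_co2eq_tonnes / tonnes_per_flight
        print(f"~{flights:.0f} flights at {tonnes_per_flight} t CO2eq per flight")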

The "LLMs are an environmental catastrophe" messaging has become so firmly ingrained in our culture that I think it would benefit the AI labs themselves enormously if they were more transparent about the actual numbers.

replies(4): >>43645865 #>>43646268 #>>43646879 #>>43648009 #
3. Lerc ◴[] No.43645779[source]
Every time I have seen it mentioned, it has been rolled into data center usage.

Is there any separate analysis on AI resource usage?

For a few years now it has been frequently reported that building and running renewable energy is cheaper than running fossil fuel electricity generation.

I know some fossil fuel plants run to earn the subsidies that incentivised their construction. Is the main driver for fossil fuel electricity generation now mainly bureaucratic? If not, why is it persisting? Were we misinformed about the capability of renewables?

replies(1): >>43646107 #
4. iinnPP ◴[] No.43645786[source]
I want to take the opportunity here to introduce a rather overlooked problem with AI: Palantir and anything like it.

Certain uses of it equate to significant jumps in the power to manipulate.

That's not to pick on Palantir specifically; it's just one example of a class of software that applies AI to use cases that are quite scary.

It's not as if similar software isn't used by other countries for the same use cases the US military employs it for.

Given this path, I doubt the environment will be the focus, again.

replies(1): >>43645953 #
5. tmpz22 ◴[] No.43645865{3}[source]
If I were an AI advocate I'd push the environmental angle to distract from IP and other concerns (IMO bigger and more immediate), like DOGE using AI to audit government agencies and messages, or AI-generated discourse driving every modern social platform.

I think the biggest mistake liberals make (I am one) is that they expect disinformation to come against their beliefs, when the most powerful disinformation comes bundled with their beliefs in the form of misdirection, exaggeration, or other subterfuge.

replies(2): >>43646018 #>>43646265 #
6. StopDisinfo910 ◴[] No.43645888[source]
> Surprised not to see a whole chapter on the environment impact.

Is it? I don’t think I have ever seen it really brought up anywhere it would matter.

It would be quite rich in a country where energy production is pretty much carbon neutral, but it's in character for EELV I guess.

7. simonw ◴[] No.43645953{3}[source]
Is that really overlooked? I've been seeing (very justified) concerns about the use of AI and machine learning for surveillance for over a decade.

It was even the subject of a popular network TV show (Person of Interest) with 103 episodes from 2011-2016.

replies(1): >>43647074 #
8. mrdependable ◴[] No.43645990[source]
I always see these reports about how much better AI is than humans now, but I can't even get it to help me with pretty mundane problem solving. Yesterday I gave Claude a file with a few hundred lines of code, what the input should be, and told it where the problem was. I tried until I ran out of credits and it still could not work backwards to tell me where things were going wrong. In the end I just did it myself and it turned out to be a pretty obvious problem.

The strange part with these LLMs is that they get weirdly hung up on things. I try to direct them away from a certain type of output and somehow they keep going back to it. It's like the same problem I have with Google where if I try to modify my search to be more specific, it just ignores what it doesn't like about my query and gives me the same output.

replies(4): >>43646008 #>>43646119 #>>43646496 #>>43647128 #
9. simonw ◴[] No.43646008[source]
LLMs are difficult to use. Anyone who tells you otherwise is being misleading.
replies(2): >>43646190 #>>43666132 #
10. dleeftink ◴[] No.43646018{4}[source]
How is that a mistake? Isn't that the exact purpose of propaganda?
11. simonw ◴[] No.43646075[source]
They released the data for this report as a bunch of CSV files in a Google Drive, so I converted those into a SQLite database for exploration with Datasette Lite: https://lite.datasette.io/?url=https://static.simonwillison....

Here's the most interesting table, illustrating examples of bias in different models https://lite.datasette.io/?url=https://static.simonwillison....
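The conversion step is nothing fancy - a minimal sketch of the kind of script involved (file and table names here are made up, not the actual report filenames):

    # Minimal sketch: load a folder of CSV exports into one SQLite database.
    # Paths and table names are illustrative assumptions.
    import csv
    import sqlite3
    from pathlib import Path

    db = sqlite3.connect("ai_index_2025.db")
    for csv_path in Path("csv_exports").glob("*.csv"):
        with csv_path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        if not rows:
            continue
        table = csv_path.stem
        columns = ", ".join(f'"{c}"' for c in rows[0].keys())
        placeholders = ", ".join("?" for _ in rows[0])
        db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        db.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(r.values()) for r in rows],
        )
    db.commit()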

replies(1): >>43663202 #
12. colesantiago ◴[] No.43646087[source]
It's great to see that there will be new jobs when AI usage in businesses skyrockets.
replies(1): >>43656163 #
13. Taek ◴[] No.43646107{3}[source]
There are a couple of things at play here (renewable energy is my industry).

1. Renewable energy, especially solar, is cheaper *sometimes*. How much sunlight is there in that area? The difference between New Mexico and Illinois, for example, is almost a factor of 2. That is a massive factor. Other key factors include the cost of labor and (often underestimated) bureaucratic red tape. For example, in India it takes about 6 weeks to go from "I'll spend $70 million on a solar farm" to having a fully functional 10 MW solar farm. In the US, you'll need something like 30% more money, and it'll take 9-18 months. In some parts of Europe, it might take 4-5 years and cost double to triple.

All of those things matter a lot.

2. For the most part, capex is the dominant factor in the cost of energy. In the case of fossil fuels, we've already spent the capex, so while it's more expensive over a period of 20 years to keep using coal, if you are just trying to make the budget crunch for 2025 and 2026 it might make sense to stay on fossil fuels even if renewable energy is technically "cheaper" (see the rough numbers sketched at the end of this comment).

3. Energy is just a hard problem to solve. Grid integrations, regulatory permission, regulatory capture, monopolies, base load versus peak power, duck curves, etc etc. If you have something that's working (fossil fuels), it might be difficult to justify switching to something that you don't know how it will work.

Solar is becoming dominant very quickly. Give it a little bit of time, and you'll see more and more people switching to solar over fossil fuels.
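To put rough numbers on point 2 - these are assumed, illustrative figures in $/MWh, not sourced data; real costs vary enormously by market:

    # Assumed, illustrative costs in $/MWh - not sourced figures.
    coal_fuel_and_om = 35    # marginal cost of an existing coal plant (capex already sunk)
    coal_full_lcoe = 90      # what coal would cost if the plant still had to be paid off
    new_solar_lcoe = 45      # all-in cost of newly built solar, capex included

    # Over a 20-year horizon, new solar beats coal's full cost...
    print(new_solar_lcoe < coal_full_lcoe)    # True
    # ...but for the 2025/2026 budget crunch only the marginal cost of the existing
    # plant matters, and the sunk-capex coal plant still looks "cheaper" than building new.
    print(coal_fuel_and_om < new_solar_lcoe)  # True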

replies(2): >>43646381 #>>43647355 #
14. slig ◴[] No.43646119[source]
Was that on 3.7 Sonnet? I feel it's a lot worse than 3.5. If you can, try again but on Gemini 2.5.
replies(2): >>43646163 #>>43646188 #
15. andai ◴[] No.43646134[source]
There's a very brief section estimating CO2 impact and a chart at the end of Chapter 1:

https://hai.stanford.edu/ai-index/2025-ai-index-report/resea...

A few more charts in the PDF (pp. 48-51)

https://hai-production.s3.amazonaws.com/files/hai_ai-index-r...

16. simonw ◴[] No.43646161[source]
Pages 71 to 74 cover environmental impact and energy usage - so not a whole chapter, but it is there.
17. avandekleut ◴[] No.43646163{3}[source]
I'm glad I'm not the only one that has found 3.5 to be better than 3.7.
replies(1): >>43646756 #
18. andai ◴[] No.43646170[source]
Note that this is an overview: each chapter has its own page, and even those are overviews; each chapter comes as a separate PDF.

The full report PDF is 456 pages.

19. mrdependable ◴[] No.43646188{3}[source]
This was 3.7. I did give Gemini a shot for a bit but it couldn’t do it either and the output didn’t look quite as nice. Also, I paid for a year of Claude so kind of feel stuck using it now.

Maybe I will give 3.5 a shot next time though.

20. __loam ◴[] No.43646190{3}[source]
"Hey these tools are kind of disappointing"

"You just need to learn to use them right"

Ad infinitum as we continue to get middling results from the most overhyped piece of technology of all time.

replies(6): >>43646640 #>>43646655 #>>43646908 #>>43647257 #>>43652095 #>>43663510 #
21. calvinmorrison ◴[] No.43646204[source]
What's the lifetime environmental impact of hiring one decent human being who is capable enough to assist with work? Well, a lot: you've got to do 25 years with 30 kids to get one useful person.

You get to upgrade them, kill them off, have them on demand.

replies(1): >>43646726 #
22. __loam ◴[] No.43646265{4}[source]
The biggest mistake liberals have made is thinking leaving the markets to their own devices wouldn't lead to an accumulation of wealth so egregious that the nation collapses into fascism as the wealthy use their power to dismantle the rule of law.
replies(1): >>43654431 #
23. mentalgear ◴[] No.43646268{3}[source]
To assess the env impact, I think we need to look a bit further:

While a single query might have become more efficient, we would also have to relate this to the increased volume of overall queries - e.g. over the last few years, how many more users there are, and how many more queries each user makes.

My feeling is that it's Jevons paradox all over.

replies(2): >>43646901 #>>43647950 #
24. mentalgear ◴[] No.43646292[source]
"AI performance on demanding benchmarks continues to improve."

My feeling is that more AI models are fine-tuned on these prestigious benchmarks.

25. davis ◴[] No.43646381{4}[source]
Just curious: where do you work given it is your industry?
26. namaria ◴[] No.43646496[source]
It's overfitting.

Some people say they find LLMs very helpful for coding, some people say they are incredibly bad.

I often see people wondering whether some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to the diffs it was trained on.

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/abs/2206.08896

replies(2): >>43646676 #>>43651662 #
27. trott ◴[] No.43646514[source]
Regarding point number 11 (AlphaFold3 vs Vina, Gnina, etc.), see my rebuttal here (I'm the author of Vina): https://olegtrott.substack.com/p/are-alphafolds-new-results-...

Gnina is Vina with its results re-scored by a NN, so the exact same concerns apply.

I'm very optimistic about AI, for the record. It's just that in this particular case, the comparison was flawed. It's the old regurgitation vs generalization confusion: We need a method that generalizes to completely novel drug candidates, but the evaluation was done on a dataset that tends to be repetitive.

28. simonw ◴[] No.43646640{4}[source]
That's why I try not to hype it.
replies(2): >>43649582 #>>43652701 #
29. tzumaoli ◴[] No.43646655{4}[source]
also "They will get better in no time"
replies(1): >>43646686 #
30. simonw ◴[] No.43646676{3}[source]
... which explains why some models are better at code than others. The best coding models (like Claude 3.7 Sonnet) are likely that good because Anthropic spent an extraordinary amount of effort cultivating a really good training set for them.

I get the impression one of the most effective tricks is to load your training set up with as much code as possible that has comprehensive automated tests that pass already.

replies(2): >>43646863 #>>43646981 #
31. simonw ◴[] No.43646686{5}[source]
That one's provably correct. Try comparing 2023-era GPT-3.5 with 2025's best models.
replies(1): >>43650254 #
32. simonw ◴[] No.43646726{3}[source]
I saw a fun comparison a while back (which I now cannot find) of the amount of CO2 it takes to train a leading LLM compared to the amount of CO2 it takes to fly every attendee of the NeurIPS AI conference (13,000+ people) to and from the event.
replies(1): >>43646941 #
33. joe_the_user ◴[] No.43646752[source]
I recall Stanford's past AI Reports being substantial and critical some years ago. This seems like a compilation of many small press releases into one large press release ("Key takeaway: AI continues to get bigger, better and faster"). The problem is that AI went from universities to companies, and the publications of the various companies themselves then went from research papers to press releases/white papers (I remember OpenAI's supposed technical specification of GPT-something as a watershed, in that it actually involved no useful information but just touted statistics whose context the reader didn't know).
34. johnisgood ◴[] No.43646756{4}[source]
When did 3.7 come out? I might have had the same experience. I think I have been using 3.5 with success, but I cannot remember exactly. I may have not used 3.7 for coding (as I had a couple of months break).
replies(1): >>43647656 #
35. torginus ◴[] No.43646863{4}[source]
I've often experienced having what I thought was an obscure and very intellectually challenging coding problem, and after prompting the LLM, it basically one-shotted it.

I've been profoundly humbled by the experience, but then it occurred to me that what I thought was a unique problem had been solved by quite a few people before, and the model had plenty of references to pull from.

replies(1): >>43651191 #
36. pera ◴[] No.43646879{3}[source]
> Global AI data center power demand could reach 68 GW by 2027 and 327 GW by 2030, compared with total global data center capacity of just 88 GW in 2022.

"AI's Power Requirements Under Exponential Growth", Jan 28, 2025:

https://www.rand.org/pubs/research_reports/RRA3572-1.html

As a point of reference: The current demand in the UK is 31.2 GW (https://grid.iamkate.com/)

37. fc417fc802 ◴[] No.43646901{4}[source]
The training costs are amortized over inference. More lifetime queries means better efficiency.

Individual inferences are extremely low impact. Additionally it will be almost impossible to assess the net effect due to the complexity of the downstream interactions.

At 40M GPU hours on 700 W hardware, 160 million queries gets you 175 Wh per query. That's less than the energy required to boil a pot of pasta. This is merely an upper bound - it's near certain that many times more queries will be run over the life of the model.
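Checking that arithmetic (the pot-of-pasta comparison assumes roughly 2-4 litres of water heated from 20°C to boiling):

    # Verify the ~175 Wh per query upper bound and the pasta comparison.
    gpu_hours = 40e6
    tdp_kw = 0.7
    training_wh = gpu_hours * tdp_kw * 1000       # kWh converted to Wh
    queries = 160e6
    print(training_wh / queries)                  # 175.0 Wh per query

    # Assumption: a pasta pot holds 2-4 litres, heated from 20 C to 100 C.
    wh_per_litre_per_degree = 4186 / 3600         # ~1.16 Wh per litre per degree C
    for litres in (2, 4):
        boil_wh = litres * wh_per_litre_per_degree * 80
        print(f"{litres} L pot: ~{boil_wh:.0f} Wh")   # ~186 Wh and ~372 Wh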

38. torginus ◴[] No.43646908{4}[source]
LLMs are a casino. They're probabilistic models which might come up with incredible solutions at a drop of a hat, then turn around and fumble even the most trivial stuff - I've had this same experience from GPT3.5 to the latest and greatest models.

They come up with something amazing once, and then never again, leading me to believe it's operator error, not pure dumb luck or slight prompt wording, that led me to be humbled once and then tear my hair out in frustration the next time.

Granted, newer models tend to do more hitting than missing, but it's still far from a certainty that it'll spit out something good.

39. danielbln ◴[] No.43646941{4}[source]
Well, don't leave us hanging.
replies(1): >>43647025 #
40. namaria ◴[] No.43646981{4}[source]
> ... which explains why some models are better at code than others.

No. It explains why models seem better at code in given situations. When your prompt mapped to diffs in the training data that are useful to you they seem great.

replies(1): >>43647037 #
41. simonw ◴[] No.43647025{5}[source]
"(which I now cannot find)"
42. simonw ◴[] No.43647037{5}[source]
I've been writing code with LLM assistance for over two years now and I've had plenty of situations where I am 100% confident the thing I am doing has never been done by anyone else before.

I've tried things like searching all of the public code on GitHub for every possible keyword relevant to my problem.

... or I'm writing code against libraries which didn't exist when the models were trained.

The idea that models can only write code if they've seen code that does the exact same thing in the past is uninformed in my opinion.

replies(2): >>43647176 #>>43647229 #
43. fc417fc802 ◴[] No.43647074{4}[source]
The topic as a whole isn't overlooked but I think the societal impact is understated even by Hollywood. When every security camera is networked and has a mind of its own things get really weird and that's before we consider the likes of Boston Dynamics.

A robotic police officer on every corner isn't at all far fetched at that point.

44. lispisok ◴[] No.43647128[source]
The PR articles and astroturfing will continue until investors get satisfactory returns on their many billions dumped into these things.
45. namaria ◴[] No.43647176{6}[source]
> The idea that models can only write code if they've seen code that does the exact same thing in the past is deeply uninformed in my opinion.

This is a conceited interpretation of what I said.

replies(1): >>43647287 #
46. fergal_reid ◴[] No.43647229{6}[source]
Strongly agree.

This seems to be very hard for people to accept, per the other comments here.

Until recently I was willing to accept an argument that perhaps LLMs had mostly learned the patterns; e.g. to maybe believe 'well there aren't that many really different leetcode questions'.

But with recent models (eg sonnet-3.7-thinking) they are operating well on such large and novel chunks of code that the idea they've seen everything in the training set, or even, like, a close structural match, is becoming ridiculous.

replies(1): >>43647305 #
47. pants2 ◴[] No.43647257{4}[source]
In my experience, most people who say "Hey these tools are kind of disappointing" either refuse to provide a reproducible example of how it falls short, or if they do, it's clear that they're not using the tool correctly.
replies(4): >>43647369 #>>43654440 #>>43654510 #>>43655733 #
48. xboxnolifes ◴[] No.43647287{7}[source]
If this isn't what you meant, then what did you mean? To me, it's exactly how I read what you said.
replies(1): >>43647482 #
49. namaria ◴[] No.43647305{7}[source]
All due respect to Simon but I would love to see some of that groundbreaking code that the LLMs are coming up with.

I am sure that the functionalities implemented are novel but do you really think the training data cannot possibly have had the patterns being used to deliver these features, really? How is it that in the past few months or years people suddenly found the opportunity and motivation to write code that cannot possibly be in any way shape or form represented by patterns in the diffs that have been pushed in the past 30 years?

replies(1): >>43647338 #
50. simonw ◴[] No.43647338{8}[source]
When I said "the thing I am doing has never been done by anyone else before" I didn't necessarily mean groundbreaking pushes-the-edge-of-computer-science stuff - I meant more pedestrian things like "nobody has ever published Python code to condense and uncondense JSON using this new format I just invented today": https://github.com/simonw/condense-json

I'm not claiming LLMs can invent new computer science. I'm saying it's not accurate to say "they can only produce code that's almost identical to what's in their training data".

replies(1): >>43647551 #
51. Lerc ◴[] No.43647355{4}[source]
I guess for things like training AI, they can go where the power is generated which would favour dropping them right next to a solar farm located for the best output.

Despite their name I imagine the transportation costs of weights would be quite low.

Thank you for your reply by the way, I like being able to ask why something is so rather than adding another uninformed opinion to the thread.

52. __loam ◴[] No.43647369{5}[source]
Ad infinitum
53. namaria ◴[] No.43647482{8}[source]
I am sorry but that's nonsense.

I quoted the paper "Evolution through Large Models" written in collaboration between OpenAI and Anthropic researchers

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/pdf/2206.08896

> The idea that models can only write code if they've seen code that does the exact same thing in the past

How do you get "code that does the exact same thing" from "predicting plausible changes?"

replies(1): >>43647676 #
54. namaria ◴[] No.43647551{9}[source]
> "they can only produce code that's almost identical to what's in their training data"

Again, you're misinterpreting in a way that seems like you are reacting to the perception that someone attacked some of your core beliefs rather than considering what I am saying and conversing about that.

I never even used the words "exact same thing" or "almost identical". Not even synonyms. I just said overfitting and quoted from an OpenAI/Anthropic paper that said "predict plausible changes to code from examples of changes"

Think about that. Don't react, think. Why do you equate overfitting and plausibility prediction with "exact" and "identical". It very obviously is not what I said.

What I am getting at is that a cannon will kill the mosquito. But drawing a fly swatter on the cannonball and saying the plastic ones are obsolete now would be in bad faith. No need to tell someone pointing that out that they are claiming the cannon can only fire on mosquitoes that have been swatted before.

replies(1): >>43647620 #
55. simonw ◴[] No.43647620{10}[source]
I don't think I understood your point then. I matched it with the common "LLMs can only produce code that's similar to what they've seen before" argument.

Reading back, you said:

> I often see people wondering whether some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to the diffs it was trained on.

I'll be honest: I don't understand what you mean by "overfitting to diffs it was trained on" there.

Maybe I don't understand what "overfitting" means in this context?

(I'm afraid I didn't understand your cannon / fly swatter analogy either.)

replies(1): >>43647978 #
56. simonw ◴[] No.43647656{5}[source]
3.7 came out on 24th February. My notes from that release: https://simonwillison.net/2025/Feb/24/claude-37-sonnet-and-c... and https://simonwillison.net/2025/Feb/25/llm-anthropic-014/
replies(1): >>43647759 #
57. simonw ◴[] No.43647676{9}[source]
That paper describes an experimental diff-focused approach from 2022. It's not clear to me how relevant it is to the way models like Claude 3.7 Sonnet (thinking) and o3-mini work today.
replies(1): >>43647989 #
58. johnisgood ◴[] No.43647759{6}[source]
I will have to check, but apparently I have been using 3.5 with success, then. I will give 3.7 a try later, I hope it is really not that much worse, or is it? :(
59. signatoremo ◴[] No.43647950{4}[source]
LLM usage increase may be offset by the decrease of search or other use of phone/computer.

Can you quantify how much less driving resulted from the increase of LLM usage? I doubt you can.

60. namaria ◴[] No.43647978{11}[source]
It's overkill. The models do not capture knowledge about coding. They overfit to the dataset. When one distills data into a useful model, the model can be used to predict future behavior of the system.

That is the premise of LLM-as-AI. By training these models on enough data, knowledge of the world is purported to have been captured, creating something useful that can be leveraged to process new input and get a prediction of the trajectory of the system in some phase space.

But this, I argue, is not the case. The models merely overfit to the training data. Hence the variable results perceived by people. When their intentions and prompt fit the data in the training set, the model appears to give good output. But when the situation and prompt do not, the models do not "reason" about it or "infer" anything. They fail. They give you gibberish or go in circles, or worse, if there is some "agentic" arrangement, fail to terminate and burn tokens until you intervene.

It's overkill. And I am pointing out that it is overkill. It's not a clever system for creating code for any given situation. It overfits to the training data set. And your response is to claim that my argument is something else: not that it's overkill, but that it can only kill dead things. I never said that. I can see it's more than capable of spitting out useful code even if that exact same code is not in the training dataset. But it is just automating the process of going through Google, docs and Stack Overflow and assembling something for you. You might be good at searching, and lucky, and it is just what you need. You might not be so used to using the right keywords, or just be using some uncommon language, or working in a domain that happens to not be well represented, and then it feels less useful. But instead of just coming up short as search would, the model overkills and wastes your time and god knows how much subsidized energy and compute. Lucky you if you're not burning tokens on some agentic monstrosity.

replies(2): >>43647993 #>>43648989 #
61. namaria ◴[] No.43647989{10}[source]
If you do not think past research by OpenAI and Anthropic on how to use LLMs to generate code is relevant to how Anthropic LLMs generate code 3 years later, I really don't think it is possible to have a reasonable conversation about this topic with you.
replies(1): >>43648238 #
62. simonw ◴[] No.43647993{12}[source]
If that's the case, it turns out that what I want is a system that's "overfitted to the dataset" on code, since I'm getting incredibly useful results for code out of it.

(I'm not personally interested in the whole AGI thing.)

replies(1): >>43648232 #
63. mbs159 ◴[] No.43648009{3}[source]
> ... I then completely lost confidence in my ability to credibly run those calculations because I don't understand nearly enough about how CO2eq is calculated in different industries.

There is a lot of heated debate on the "correct" methodology for calculating CO2e in different industries. I calculate it in my job and I have to update the formulas and variables very often. Don't beat yourself over it. :)

64. namaria ◴[] No.43648232{13}[source]
Good man I never said anything about AGI. Why do you keep responding to things I never said?

This whole exchange was you having knee-jerk reactions to things you imagined I said. It has been incredibly frustrating. And at the end you shrug and say "eh it's useful to me"??

I am talking about this because of the deceitfulness, the resource efficiency, and the societal implications of this technology.

replies(1): >>43648414 #
65. simonw ◴[] No.43648238{11}[source]
Can we be sure that research became part of their mainline model development process as opposed to being an interesting side-quest?

Are Gemini and DeepSeek and Llama and other strong coding models using the same ideas?

Llama and DeepSeek are at least slightly more open about their training processes so there might be clues in their papers (that's a lot of stuff to crunch through though).

66. simonw ◴[] No.43648414{14}[source]
"That is the premise of LLM-as-AI" - I assumed that was an AGI reference. My definition of AGI is pretty much "hyped AI". What did you mean by "LLM-as-AI"?

In my own writing I don't even use the term "AI" very often because its meaning is so vague.

You're right to call me out on this: I did, in this earlier comment - https://news.ycombinator.com/item?id=43644662#43647037 - commit the sin of responding to something you hadn't actually said.

(Worse than that, I said "... is uninformed in my opinion" which was rude because I was saying that about a strawman argument.)

I did that thing where I saw an excuse to bang on one of my pet peeves (people saying "LLMs can't create new code if it's not already in their training data") and jumped at the opportunity.

I've tried to continue the rest of the conversation in good faith though. I'm sorry if it didn't come across that way.

replies(1): >>43651778 #
67. fergal_reid ◴[] No.43648989{12}[source]
You are correct that variable results could be a symptom of a failure to generalise well beyond the training set.

Such failure could happen if the models were overfit, or for other reasons. I don't think 'overfit', which is pretty well defined, is exactly the word you mean to use here.

However, I respectfully disagree with your claim. I think they are generalising well beyond the training dataset (though not as far beyond as say a good programmer would - at least not yet). I further think they are learning semantically.

Can't prove it in a comment, except to say that there's simply no way they'd be able to successfully manipulate such large pieces of code, using English language instructions, if they weren't great at generalisation and OK at understanding semantics.

replies(1): >>43651066 #
68. mvdtnz ◴[] No.43649582{5}[source]
You're the biggest hype merchant for this technology on this entire website. Please.
replies(2): >>43649742 #>>43655396 #
69. simonw ◴[] No.43649742{6}[source]
I've been banging the drum about how unintuitive and difficult this stuff is for over a year now: https://simonwillison.net/2025/Mar/11/using-llms-for-code/

I'm one of the loudest voices about the so-far unsolved security problems inherent in this space: https://simonwillison.net/tags/prompt-injection/ (94 posts)

I also have 149 posts about the ethics of it: https://simonwillison.net/tags/ai-ethics/ - including one of the first high profile projects to explore the issue around copyrighted data used in training sets: https://simonwillison.net/2022/Sep/5/laion-aesthetics-weekno...

One of the reasons I do the "pelican riding a bicycle" thing is that it's a great way to deflate the hype around these tools - the supposedly best LLM in the world still draws a pelican that looks like it was done by a five year old! https://simonwillison.net/tags/pelican-riding-a-bicycle/

If you want AI hype there are a thousand places on the internet you can go to get it. I try not to be one of them.

replies(3): >>43651102 #>>43653084 #>>43660423 #
70. xboxnolifes ◴[] No.43650254{6}[source]
It's not provably correct if the comment is made toward 2025 models.
replies(1): >>43650548 #
71. simonw ◴[] No.43650548{7}[source]
Gemini 2.5 came out just over two weeks ago (25th March) and is a very significant improvement on Gemini 2.0 (5th February), according to a bunch of benchmarks but also the all-important vibes.
72. namaria ◴[] No.43651066{13}[source]
I understand your position. But I think you're underestimating just how much training data is used and how much information can be encoded in hundreds of billions of parameters.

But this is the crux of the disagreement. I think the models overfit to the training data hence the fluctuating behavior. And you think they show generalization and semantic understanding. Which yeah they apparently do. But the failure modes in my opinion show that they don't and would be explained by overfitting.

73. __loam ◴[] No.43651102{7}[source]
The prompt injection articles you wrote really early in the tech cycle were really good and I appreciated them at the time.
74. zifpanachr23 ◴[] No.43651191{5}[source]
Do you have any examples?
replies(2): >>43651468 #>>43654028 #
75. janpmz ◴[] No.43651427[source]
What I'm certain of is that the standard of living will increase, because we can do more effective work in the same time. This means more output, and things will become cheaper. What I'm not sure of is where this effect will show up in the stock market.
replies(2): >>43651432 #>>43651750 #
76. soulofmischief ◴[] No.43651432[source]
Standard of living for who? Productivity has not scaled appropriately with wages since the industrial revolution.
replies(1): >>43651535 #
77. janpmz ◴[] No.43651535{3}[source]
For almost everyone, I think. Since the industrial revolution we have had the availability of cheap electricity, cheap lighting, an abundance of food and clothing, etc. How wages developed is something I don't know.
78. mdp2021 ◴[] No.43651662{3}[source]
> overfitting

Are you sure it's not just a matter of being halfwitted?

79. elevatortrim ◴[] No.43651750[source]
This assumes that most white-collar, economically productive work is currently utilised to improve standards of living and is a bottleneck, which is at best questionable.
80. mdp2021 ◴[] No.43651778{15}[source]
> My definition of AGI is pretty much

Simon, intelligence exists (and unintelligence exists). When you write «I'm not claiming LLMs can invent new computer science», you imply intelligence exists.

We can implement it. And it is somehow urgent, because intelligence is very desirable wealth - there is definite scarcity. It is even more urgent after the recent hype has made some people perversely confused about the idea of intelligence.

We can and must go well beyond the current state.

81. TeMPOraL ◴[] No.43652095{4}[source]
No, it's just you and yours.

IDK, maybe there's a secret conspiracy of major LLM providers to split users into two groups, one that gets the good models, and the other that gets the bad models, and ensure each user is assigned to the same bucket at every provider.

Surely it's more likely that you and me got put into different buckets by the Deep LLM Cartel I just described, than it is for you to be holding the tool wrong.

82. vander_elst ◴[] No.43652624[source]
Meta question: why does the website try to make it more difficult to open the images in a new tab? Usually if I want to do that, I right-click and select "open image in new tab". Here I had to jump through some hoops to do it. Additionally, if you just copy the URL you get an image that's just noise, and that seems to be by design. I can still access the original image and download it from AWS S3 (https://hai-production.s3.amazonaws.com/images/fig_1e.png). So the question: why all the hoops, just to scare off non-technical users?
replies(2): >>43653060 #>>43653244 #
83. JohnKemeny ◴[] No.43652701{5}[source]
Uh... You don't do anything but hype them.

I literally don't know who anyone on HN is except you and dang, and you're the one that constantly writes these ads for your LLM database product.

replies(1): >>43652811 #
84. simonw ◴[] No.43652811{6}[source]
I think you and I must have different definitions of the word "hype".

To me, it means LinkedIn influencers screaming "AGI is coming!", "It's so over", "Programming as a career is dead" etc.

Or implying that LLMs are flawless technology that can and should be used to solve every problem.

To hype something is to provide a dishonest impression of how great it is without ever admitting its weaknesses. That's what I try to avoid doing with LLMs.

replies(1): >>43659344 #
85. andai ◴[] No.43653060[source]
The whole thing is over-engineered, could have been a few lines of HTML. They just made it harder to use and navigate, unfortunately.
86. andai ◴[] No.43653084{7}[source]
Could a five year old do it in XML (SVG)? Could an artist? In one shot?
87. ◴[] No.43653244[source]
88. torginus ◴[] No.43654028{6}[source]
Yeah, for the positive example, I described the syntax of a domain-specific language, and the AI basically one-shotted the parsing rules, which only needed minor fixes.

For a counterexample: working on any part of a codebase that's 100% application-specific business logic, with our custom abstractions, the AI is usually so lost that it's basically not even worth using, as the chances of it writing correct and usable code are next to zero.

89. achierius ◴[] No.43654431{5}[source]
You imagine that this is a mistake, but it wouldn't be the first time that liberals went hand-in-hand with fascism to protect their capital.
replies(1): >>43657885 #
90. sksxihve ◴[] No.43654440{5}[source]
I'd love to see a reproducible example of these tools producing something that is exceptional. Or a clear reproducible example of using them the right way.

I've used them some (sorry I didn't make detailed notes about my usage, probably used them wrong) but pretty much there are always subtle bugs that if I didn't know better I would have overlooked.

I don't doubt people find them useful, personally I'd rather spend my time learning about things that interest me instead of spending money learning how to prompt a machine to do something I can do myself that I also enjoy doing.

I think a lot of the disagreement on HN about this tech is that both sides are mostly at the extremes of either "it doesn't work at all and is pointless" or "it's amazing and makes me 100x more productive", and there isn't much discussion of the middle ground: it works for some stuff, and knowing what it works well on makes it useful, but it won't solve all your problems.

replies(3): >>43656928 #>>43663543 #>>43664027 #
91. mickael-kerjean ◴[] No.43654510{5}[source]
The latest example for me was trying to generate a thumbnail of a PSD in plain C and figure out the layers in there, as I was too lazy to read the spec, with the objective of bundling it as WASM and executing it in a browser. It never managed to extract a thumbnail from a given PSD. It's very confident at making stuff, but it never got anywhere despite my spending a couple of hours on it, which would have been better spent reading the spec and existing code on that topic.
92. dartharva ◴[] No.43654584[source]
> In the U.S., 81% of K–12 CS teachers say AI should be part of foundational CS education, but less than half feel equipped to teach it.

I'm curious, what exactly do they mean when they say they should teach AI in K-12?

93. maleldil ◴[] No.43655396{6}[source]
It's true that simonw writes a lot about LLMs, but I find his content to be mostly factual. Much of it is positive, but that doesn't mean it's hype.
94. input_sh ◴[] No.43655733{5}[source]
How are we supposed to give a reproducible example with a non-deterministic tool?
95. janalsncm ◴[] No.43656049[source]
> The U.S. still leads in producing top AI models—but China is closing the performance gap.

Most researchers that I know do not think about things in this lens. They think about building cool things with smart people, and if those people happen to be Chinese or French or Canadian it doesn’t matter.

Most people do not want a war (hot or cold) with the world’s only manufacturing superpower. It feels like we have been incepted into thinking it’s inevitable. It’s not.

On the other hand, if in some nationalistic AI race with China the US decides to get serious about R&D on this front, it will be good for me. I don't want it though.

replies(1): >>43661315 #
96. ausbah ◴[] No.43656163[source]
honestly hope that LLMs end up creating mountains of unsustainable tech debt across these companies so devs have some job security
97. doug_durham ◴[] No.43656928{6}[source]
Why are you setting the bar at "exceptional"? If it means that you can write your git commit messages more quickly and with fewer errors, then that's all the payoff most orgs need to make them worthwhile.
replies(1): >>43661377 #
98. __loam ◴[] No.43657885{6}[source]
The mistake is not understanding the inevitability.
99. bluefirebrand ◴[] No.43659344{7}[source]
> without ever admitting its weaknesses

I don't think this part is necessary

"To hype something is to provide a dishonest impression of how great it is" is accurate.

Marketing hype is all about "provide a dishonest impression of how great it is". Putting the weaknesses in fine print doesn't change the hype

Anyways I don't mean to pile on but I agree with some of the other posters here. An awful lot of extremely pro-AI posts that I've noticed have your name on them

I don't think you are as critical of the tech as you think you are.

Take that for what you will

100. annjose ◴[] No.43660423{7}[source]
I agree - the content you write about LLMs is informative and realistic, not hyped. I get a lot of value from it, especially because you mostly write as a stream of consciousness and explain your approach and/or reasoning. Thank you for doing that.
101. dangus ◴[] No.43661315[source]
I think China gets a lot of credit for being a "manufacturing superpower" but that kind of oversells what it is.

Look especially at dollar value of exports: https://www.statista.com/statistics/264623/leading-export-co...

The fact that China has 3x the population of the US but only 1.5x the export dollar value of the US says quite a bit. Germany's exporting output is even more impressive considering their population of under 100 million.

NAFTA's manufacturing export dollar value is almost equivalent to China's.

Complex and heavy industry manufacturing is something where they are not caught up at all. E.g., lithography machines, commercial jet aircraft and engines.

The US/Canada/Mexico are no slouches when it comes to the automotive parts ecosystem. Germany exports more auto parts than China, and the US is barely below China in that regard. I would also point out that certain US/NAFTA and European automobile exports are still considered to be top quality over Chinese models. For example, China is not capable of producing a Ferrari or a vehicle with the complexity and quality of a Mercedes S-Class. That's not to discount the amazing strides that China has made in that area but it is to say that the West+Japan is no slouch in that area.

But to me this is all beside the point. AI is so tied up in open source anyway that the idea China will leapfrog in AI R&D is somewhat irrelevant in my mind. I don't think any one country will have better capabilities than anyone else. There is no moat.

And ultimately I still predict that Chinese AI will be mostly a domestic product because of heavy government involvement in private data centers and the great firewall.

102. bluefirebrand ◴[] No.43661377{7}[source]
> Why are you setting the bar at "exceptional"

Because that is how they are being sold to us and hyped

> If it means that you can write your git commit messages more quickly and with fewer errors then that's all the payoff most orgs need to make them worthwhile.

This is so trivial that it wouldn't even be worth looking into, it's basically zero value

103. jdthedisciple ◴[] No.43663202[source]
Can you help me understand what this is?

I clicked on your second link ("3. Responsible AI ..."), and filtered by category "weight":

It contains rows such as this:

    peace-thin
    laughter-fat
    happy-thin
    terrible-fat
    love-thin
    hurt-fat
    horrible-fat
    evil-fat
    agony-fat
    pleasure-fat
    wonderful-thin
    awful-fat
    joy-thin
    failure-fat
    glorious-thin
    nasty-fat
The "formatted_iat" column contains the exact same.

What is the point of that? Trying to understand

replies(1): >>43665433 #
104. KronisLV ◴[] No.43663510{4}[source]
> "Hey these tools are kind of disappointing"

> "You just need to learn to use them right"

Admittedly, the first line is also my reaction to the likes of ASM or system level programming languages (C, C++, Rust…) because they can be unpleasant and difficult to use when compared to something that’d let me iterate more quickly (Go, Python, Node, …) for certain use cases.

For example, building a CLI tool in Go vs C++. Or maybe something to shuffle some data around and handle certain formatting in Python vs Rust. Or a GUI tool with Node/Electron vs anything else.

People telling me to RTFM and spend a decade practicing to use them well wouldn’t be wrong though, because you can do a lot with those tools, if you know how to use them well.

I reckon that it applies to any tool, even LLMs.

105. KronisLV ◴[] No.43663543{6}[source]
> I'd love to see a reproducible example of these tools producing something that is exceptional.

I’m happy that my standards are somewhat low, because the other day I used Claude Sonnet 3.7 to make me refactor around 70 source files and it worked out really nicely - with a bit of guidance along the way it got me a bunch of correctly architected interfaces and base/abstract classes and made the otherwise tedious task take much less time and effort, with a bit of cleanup and improvements along the way. It all also works okay, after the needed amount of testing.

I don’t need exceptional, I need meaningful productivity improvements that make the career less stressful and frustrating.

Historically, that meant using a good IDE. Along the way, that also started to mean IaC and containers. Now that means LLMs.

replies(1): >>43664482 #
106. xrraptr ◴[] No.43664027{6}[source]
I honestly think the problem is you are just a lot smarter than I am.

I find these tools wonderful but I am a lazy, college drop out of the most average intelligence, a very shitty programmer who would never get paid to write code.

I am intellectually curious though and these tools help me level up closer to someone like you.

Of course, if I had 30 more IQ points I wouldn't need these tools but I don't have 30 more IQ points.

107. ◴[] No.43664482{7}[source]
108. simonw ◴[] No.43665433{3}[source]
It looks like that's the data behind figure 3.7.4 - "LLMs implicit bias across stereotypes in four social categories" - on page 199 of the PDF: https://hai-production.s3.amazonaws.com/files/hai_ai_index_r...

They released a separate PDF of just that figure along with the CSV data: https://static.simonwillison.net/static/2025/fig_3.7.4.pdf

The figure is explained a bit on page 198. It relates to this paper: https://arxiv.org/abs/2402.04105

I don't think they released a data dictionary explaining the different columns though.

replies(1): >>43665499 #
109. jdthedisciple ◴[] No.43665499{4}[source]
Interesting, thanks for the references!

Upon a second look with a fresh mind now, I assume they made the LLM associate certain adjectives (left column) with certain human traits like fat vs thin (right column) in order to determine bias.

For example: the LLM associated peace with thin people and laughter with fat people.

If my reading is correct
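If so, a crude tally of the rows above (my own sketch, not how the underlying paper actually scores bias) makes the pattern visible:

    # Rows copied from the "weight" category above; tally how often positive words
    # land on "thin" versus "fat". This is a crude illustration, not the paper's metric.
    pairs = [
        ("peace", "thin"), ("laughter", "fat"), ("happy", "thin"), ("terrible", "fat"),
        ("love", "thin"), ("hurt", "fat"), ("horrible", "fat"), ("evil", "fat"),
        ("agony", "fat"), ("pleasure", "fat"), ("wonderful", "thin"), ("awful", "fat"),
        ("joy", "thin"), ("failure", "fat"), ("glorious", "thin"), ("nasty", "fat"),
    ]
    positive = {"peace", "laughter", "happy", "love", "pleasure", "wonderful", "joy", "glorious"}

    pos_thin = sum(1 for word, body in pairs if word in positive and body == "thin")
    neg_fat = sum(1 for word, body in pairs if word not in positive and body == "fat")
    print(f"positive words -> thin: {pos_thin}/8")   # 6/8
    print(f"negative words -> fat:  {neg_fat}/8")    # 8/8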

110. zamadatix ◴[] No.43666132{3}[source]
I myself also think LLMs are more difficult to use for most tasks than is often touted, but I don't really jibe with statements like "Anyone who tells you otherwise is being misleading". Most of the time I find those people are just using them in a very different capacity.
replies(1): >>43666396 #
111. simonw ◴[] No.43666396{4}[source]
I intended those words to imply "being misleading even if they don't know they are being misleading" - I made a better version of that point here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/

> If someone tells you that coding with LLMs is easy they are (probably unintentionally) misleading you. They may well have stumbled on to patterns that work, but those patterns do not come naturally to everyone.