579 points paulpauper | 35 comments
InkCanon ◴[] No.43604503[source]
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some similarly abysmal number). This is despite them supposedly having gotten 50%, 60% etc. performance on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #
1. billforsternz ◴[] No.43607255[source]
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages:

1) A Boeing 737 cabin is about 3000 cubic metres [wrong: about 4 x 2 x 40 ≈ 300 cubic metres]

2) A golf ball is about 0.000004 cubic metres [wrong: it's about 40 cc = 0.00004 cubic metres]

3) 3000 / 0.000004 = 750,000 [wrong: it's 750,000,000]

4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls, final answer [wrong: you should have been reducing the number!]

So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.

This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious failure can impact my point of view so dramatically.
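For reference, the correct back-of-the-envelope version fits in a few lines of Python. A minimal sketch: the ~30% of volume lost to seats and fixtures is an assumption, and 74% is the theoretical maximum sphere-packing density.

    # Back-of-the-envelope redo of the four stages, with sane numbers.
    cabin_volume_m3 = 4 * 2 * 40                  # ~320 m^3, not 3000
    ball_volume_m3 = 40e-6                        # 40 cc = 0.00004 m^3, not 0.000004
    raw_count = cabin_volume_m3 / ball_volume_m3  # 8,000,000

    # Seats and imperfect packing REDUCE the count.
    usable_fraction = 0.7    # assumption: ~30% of volume lost to fixtures
    packing_density = 0.74   # theoretical max for sphere packing (FCC lattice)
    estimate = raw_count * usable_fraction * packing_density
    print(f"~{estimate:,.0f} golf balls")         # ~4,144,000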

replies(10): >>43607836 #>>43607857 #>>43607910 #>>43608930 #>>43610117 #>>43610390 #>>43611692 #>>43612201 #>>43612324 #>>43612398 #
2. Sunspark ◴[] No.43607836[source]
It's fascinating to me when you ask one for translated passages from authors who never wrote or translated the work in question, especially if they died before the piece was written.

The AI will create something for you and tell you it was theirs.

replies(1): >>43610202 #
3. senordevnyc ◴[] No.43607857[source]
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
replies(1): >>43611814 #
4. greenmartian ◴[] No.43607910[source]
Weird thing is, in Google AI Studio all their models, from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2, gave a roughly correct answer. Most even recognised the packing efficiency of spheres.

But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.

replies(4): >>43608013 #>>43609176 #>>43609774 #>>43611700 #
5. aurareturn ◴[] No.43608013[source]
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
replies(1): >>43608376 #
6. fire_lake ◴[] No.43608376{3}[source]
Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.
replies(1): >>43609132 #
7. aezart ◴[] No.43608930[source]
> I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.

I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.

replies(2): >>43609752 #>>43609890 #
8. summerlight ◴[] No.43609132{4}[source]
AFAIK ~15% of the queries they get every day have never been seen before, so it might not be simple to design an effective cache layer for that. Semantic-aware clustering of natural-language queries and projecting them into a cacheable low-rank dimension is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of a cache when you need an LLM to cluster the queries...
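For illustration, a minimal sketch of such a semantic cache; embed() here is a stand-in for any sentence-embedding model, and the 0.9 similarity threshold is an arbitrary assumption:

    import numpy as np

    def embed(query: str) -> np.ndarray:
        """Stand-in for a real sentence-embedding model; returns a unit vector."""
        rng = np.random.default_rng(abs(hash(query)) % 2**32)
        v = rng.standard_normal(384)
        return v / np.linalg.norm(v)

    cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)
    THRESHOLD = 0.9                           # assumed cosine-similarity cutoff

    def lookup_or_generate(query: str, generate) -> str:
        q = embed(query)
        for vec, answer in cache:           # linear scan; real systems use an ANN index
            if float(q @ vec) >= THRESHOLD: # unit vectors: dot product = cosine sim
                return answer               # semantically close enough: reuse
        answer = generate(query)            # cache miss: call the expensive model
        cache.append((q, answer))
        return answer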
replies(1): >>43619704 #
9. vintermann ◴[] No.43609176[source]
I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.
10. katsura ◴[] No.43609752[source]
To be fair, I love that magicians can pull tricks on me even though I know it is fake.
11. InDubioProRubio ◴[] No.43609774[source]
It's most likely one giant ["input tokens close enough to question hash"] = answer_with_params replay? It doesn't misunderstand the question; it tries to squeeze the input into something close enough?
12. bambax ◴[] No.43609890[source]
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.

I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.

replies(5): >>43611595 #>>43611757 #>>43612440 #>>43613188 #>>43614673 #
13. throwawaymaths ◴[] No.43610117[source]
I've seen humans make exactly these sorts of mistakes?
replies(1): >>43611857 #
14. prawn ◴[] No.43610202[source]
"That's impossible because..."

"Good point! Blah blah blah..."

Absolutely shameless!

15. tim333 ◴[] No.43610390[source]
A lot of humans are similarly good at some stuff and bad at other things.

Looking up the math ability of the average American, this is given as an example of the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):

>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.

Which is OK, but easier than golf balls in a 737 and hugely easier than the USAMO.

Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
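To spell out why with a toy example (all figures made up): a country that charges no tariff at all still gets a large "tariff rate" whenever trade happens to be imbalanced.

    # Assumed figures, USD. No tariff exists anywhere in this scenario.
    imports_from_country = 100e9
    exports_to_country = 60e9
    trade_deficit = imports_from_country - exports_to_country  # 40e9

    bogus_rate = trade_deficit / imports_from_country
    print(f"{bogus_rate:.0%}")  # 40% "tariff rate" from a trade imbalance alone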

16. nucleogenesis ◴[] No.43611595{3}[source]
Idk I don’t think of it as fake - it’s creative fiction paired with sometimes highly skilled performance. I’ve learned a lot about how magic tricks work and I still love seeing performers do effects because it takes so much talent to, say, hold and hide 10 coins in your hands while showing them as empty or to shuffle a deck of cards 5x and have the audience cut it only to pull 4 aces off the top.
17. swader999 ◴[] No.43611692[source]
It'll get it right next time because they'll hoover up the parent post.
18. Workaccount2 ◴[] No.43611700[source]
Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.
19. toddmorey ◴[] No.43611757{3}[source]
I think the problem-solving / want-to-be-engineer side of my brain lights up in that "how did he do that??" way. To me that's the fun of it... I immediately try to engineer my own solutions to what I just saw happen. So I guess I'm in the first camp, but find trying to figure out the trick hugely interesting.
20. CamperBob2 ◴[] No.43611814[source]
It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.

Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ). I haven't seen anything that ridiculous from a leading-edge model for a year or more.

21. toddmorey ◴[] No.43611857[source]
As another commenter mentioned, LLMs tend to make these bad mistakes with enormous confidence. And because they represent SOTA technology (and can at times deliver incredible results), they have extra credence.

Even more than filling the gaps in knowledge and skills, it would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.

22. CivBase ◴[] No.43612201[source]
I just asked my company-approved AI chatbot the same question.

It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.

Its final calculation was reasonably accurate at 24,582,115 golf balls, even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.

When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.

Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).

I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.

Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times, despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
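The discrepancies are easy to verify with a few lines of Python, using the chatbot's own 0.00004068 m^3 golf-ball figure:

    ball = 0.00004068         # m^3, the chatbot's own golf-ball volume
    print(1000 / ball)        # 24,582,104.2 (chatbot said 24,582,115)
    print(700 / ball)         # 17,207,473.0 (chatbot said 17,201,480)
    print(17_201_480 * 0.74)  # 12,729,095.2 (chatbot said 12,728,096)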

replies(1): >>43617479 #
23. aoeusnth1 ◴[] No.43612324[source]
2.5 Pro nails each of these calculations. I don't agree with Google's decision to use a weak model in its search queries, but you can't say progress on LLMs is bullshit based on evidence from a weak model no one thinks is close to SOTA.
24. raxxorraxor ◴[] No.43612398[source]
This reminds me of the Google quick answers we had in search for a time. It was quite funny if you lived outside the US, because it very often got the units or numbers wrong due to different decimal delimiters.

No wonder Trump isn't afraid to put tariffs on Canada. Who could take a 3.8-square-mile country seriously?

25. tshaddox ◴[] No.43612440{3}[source]
I think magic is extremely interesting (particularly close-up magic), but I also hate the mindset (which seems to be common though not ubiquitous) that stigmatizes any curiosity in how the trick works.

In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.

replies(1): >>43614605 #
26. aezart ◴[] No.43613188{3}[source]
It's still entertaining, that's true. I like magic tricks.

The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.

27. mrandish ◴[] No.43614605{4}[source]
> that stigmatizes any curiosity in how the trick works.

As a long-time close-up magician and magical inventor who's spent a lot of time studying magic theory (which has been a serious field of magical research since the 1960s), it depends on which way we interpret "how the trick works." Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.

However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting. Often the actual immediate 'secret' of the method is so simple and, in hindsight, obvious that many non-magicians feel rather let down if the method is revealed. This is the main reason magicians usually don't reveal secret methods to non-magicians. It's not because of some code of honor, it's simply because the vast majority of people think they'll be happy if they know the secret but are instead disappointed.

Where studying close-up magic gets really fascinating is understanding why that simple, obvious thing works to mislead and then surprise audiences in the context of this trick. Very often, changing subtle things seemingly unrelated to the direct method will cause the trick to stop fooling people or to be much less effective. Comparing a master magician to even a competent, well-practiced novice performing the exact same effect with the same method can be a night-and-day difference. Typically, both performances will fool and entertain audiences, but the master's performance can have an intensely more powerful impact: leaving most audience members in stunned shock versus just pleasantly surprised and fooled. While neither the master's nor the novice's audiences have any idea of the secret method, this dramatic difference in impact is fascinating because careful deconstruction reveals it often has little to do with mechanical proficiency in executing the direct method. In other words, it's rarely driven by being able to do the sleight of hand faster or more dexterously.

I've seen legendary close-up masters like Dai Vernon or Albert Goshman, in their 80s and 90s, with shriveled, arthritic hands incapable of even cleanly executing a basic palm, absolutely blow away a roomful of experienced magicians with a trick all the magicians already knew. How? It turns out there's something deep and incredibly interesting about the subtle timing, pacing, body language, posture, and psychology surrounding the "secret method" that elevates the impact to almost transcendence compared to a good, competent but uninspired performance of the same method and effect.

Highly skilled, experienced magicians refer to the complex set of these non-method aspects, which can so powerfully elevate an effect to another level, as "the real work" of the trick. At the top levels, most magicians don't really care about the direct methods which some audience members get so obsessed about. They aren't even interesting. And, contrary to what most non-magicians think, these non-methods are the "secrets" master magicians tend to guard from widespread exposure. And it's pretty easy to keep this crucially important "real work" secret because it's so seemingly boring and entirely unlike what people expect a magic secret to be. You have to really "get it" on a deeper level to even understand that what elevated the effect was intentionally establishing a completely natural-seeming, apparently random three-beat pattern of motion and then carefully injecting a subtle pause and slight shift in posture to the left six seconds before doing "the move". Audiences mistakenly think that "the hidden move" is the secret to the trick when it's just the proximate first-order secret. Knowing that secret won't get you very far toward recreating the absolute gob-smacking impact resulting from a master's years of experimentation figuring out and deeply understanding which elements beyond the "secret method" really elevate the visceral impact of the effect to another level.

replies(1): >>43616235 #
28. abustamam ◴[] No.43614673{3}[source]
I love magic, and illusions in general. I know that Disney's Haunted Mansion doesn't actually have ghosts. But it looks pretty convincing, and watching the documentaries about how they made it is pretty mind-blowing especially considering that they built the original long before I was born.

I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.

Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.

29. tshaddox ◴[] No.43616235{5}[source]
> Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.

> However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting.

Fair enough. It sounds like I simply fundamentally disagree, because I think nearly any explanation of method is very interesting. For close-up magic, the only exceptions for me would be if the explanation is "the video you were watching contains visual effects" or "the entire in-person audience was in on it."

Palming is awesome. Misdirection is awesome. I fully expect these sorts of things to be used in most magic tricks, but I still want to know precisely how. The fact that I'm aware of most close-up magic techniques but am still often fooled by magic tricks should make it pretty clear that the methods are interesting!

replies(1): >>43617212 #
30. mrandish ◴[] No.43617212{6}[source]
> Palming is awesome. Misdirection is awesome.

Studying magic has been a lifelong passion of mine since I was a kid, so I clearly couldn't agree more. However, experience has shown that, despite claiming otherwise, most people aren't actually interested in the answer to "How did you do that?" beyond the first 30 seconds. So... you're unusual - and that's great!

> but I still want to know precisely how.

Well, you're extremely fortunate to be interested in learning how magic is really done at the best time in history for doing so. I was incredibly lucky to be accepted into the Magic Castle as a teenager and mentored by Dai Vernon (widely thought to be the greatest close-up magician of the 20th century), who was in his late 80s at the time. I also had access to the Castle's library of magic books, the largest in the world at the time. 99% of other kids on Earth interested in magic at the time only had a handful of local public library books and mail-order tricks.

Today there's an incredible amount of insanely high-quality magic instruction available in streaming videos, books and online forums. There are even master magicians who teach those willing to learn via Zoom. While most people think magicians want to hoard their secrets, the reality couldn't be more different. Magicians love teaching how to actually do magic to anyone who really wants to learn. However, most magicians aren't interested in wasting time satisfying the extremely fleeting curiosity of those who only want to know "how it works" in the surface sense of that first 30 seconds of only revealing the proximate 'secret method'.

Yet many magicians will happily devote hours to teaching anyone who really wants to actually learn how to do magic themselves and is willing to put in the time and effort to develop the skills, even if those people have no intention of ever performing magic for others - and even if the student isn't particularly good at it. It just requires the interest to go really deep on understanding the underlying principles and developing the skills, even if for no other purpose than just having the knowledge and skills. Personally, I haven't performed magic for non-magicians in over a decade, but I still spend hours learning and mastering new high-level skills because it's fun, super intellectually interesting and extremely satisfying. If you're really interested, I encourage you to dive in. There's quite literally never been a better time to learn magic.

31. billforsternz ◴[] No.43617479[source]
> It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).

In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
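A minimal sketch of that rule in Python (round_sig is a hypothetical helper, not a standard-library function, that rounds to a given number of significant digits):

    from math import floor, log10

    def round_sig(x: float, sig: int) -> float:
        """Round x to `sig` significant digits."""
        return round(x, sig - 1 - floor(log10(abs(x))))

    result = 1000 / 0.00004068   # 24,582,104.2, mostly noise digits
    print(round_sig(result, 2))  # 25000000.0 (two digits is already generous)
    print(round_sig(result, 1))  # 20000000.0 (if 1000 m^3 has one digit of precision)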

replies(1): >>43617644 #
32. CivBase ◴[] No.43617644{3}[source]
I agree those digits are not significant in the context of the question asked. But if the AI is going to use that level of precision in the answer, I expect it to be correct.
replies(1): >>43639574 #
33. fire_lake ◴[] No.43619704{5}[source]
Not a search engineer, but wouldn't a cache lookup of a previous LLM result be faster than a conventional free-text search over the indexed websites? Seems like this could save money whilst delivering better results?
replies(1): >>43625227 #
34. summerlight ◴[] No.43625227{6}[source]
Yes, that's what Google is doing for AI Overviews, IIUC. From what I've seen, it works okay and is improving over time, but it's not close to perfect: results are stale for developing stories, some bad results are kept around for a long time, effectively identical queries return different caches, etc.
35. billforsternz ◴[] No.43639574{4}[source]
Fair enough, I agree: simple arithmetic calculations shouldn't generate mysterious answers.