Gemini 2.5 Pro got 72.9%
o3 high gets 81.3%, o4-mini high gets 68.9%
> Codex CLI is fully open-source at https://github.com/openai/codex today.
It’s confusing. If I’m confused, it’s confusing. This is UX 101.
Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.
The big step function happened when they could search the web but not much else has changed in my limited experience.
Incredible how resilient the Claude models have been as the best-in-class for coding.
[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).
GPT-N.m -> Non-reasoning
oN -> Reasoning
oN+1-mini -> Reasoning but speedy; cut-down version of an upcoming oN model (unclear if true or marketing)
It would be nice if they actually stick to this pattern.
ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high.
I subscribe to Pro but don't yet see the new models (either in the Android app or on the web version).

They even provide a description in the UI of each model before you select it, and it defaults to a model for you.
If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.
• o3 pricing (per 1M tokens):
  - Input: $10.00
  - Cached input: $2.50
  - Output: $40.00
• o1 pricing (per 1M tokens):
  - Input: $15.00
  - Cached input: $7.50
  - Output: $60.00
o4-mini pricing remains the same as o3-mini.

It confirms for the speaker that they're absolutely right: names are arbitrary.
While also politely, implicitly, pointing out that the core issue is it doesn't matter to you --- which is fine! --- but that being the 10th person to say as much may just be contributing to dull conversation.
"ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."
with rate limits unchanged
{Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}
Where:
* Size is XS/S/M/L/XL/XXL to indicate overall capability level
* Quarter/Year like Q2-25
* Speed/Accuracy indicated as Fast/Balanced/Precise
* Optional specialty tag like Code/Vision/Science/etc
Example model names:
* L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)
* M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)
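A toy sketch of how mechanical such names would be to work with (the scheme here is just this comment's proposal, not any vendor's actual naming):

```python
import re

# Parses the proposed {Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty} names.
NAME_RE = re.compile(
    r"^(?P<size>XS|S|M|L|XL|XXL)"
    r"-(?P<quarter>Q[1-4]-\d{2})"
    r"-(?P<mode>Fast|Balanced|Precise)"
    r"(?:-(?P<specialty>\w+))?$"
)

def parse_model_name(name: str) -> dict:
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a valid model name: {name!r}")
    return m.groupdict()

print(parse_model_name("L-Q2-25-Fast-Code"))
# {'size': 'L', 'quarter': 'Q2-25', 'mode': 'Fast', 'specialty': 'Code'}
print(parse_model_name("M-Q4-24-Balanced"))
# {'size': 'M', 'quarter': 'Q4-24', 'mode': 'Balanced', 'specialty': None}
```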
[0] swebench.com/#verified
This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones. In fact based on some of the other comments here it sounds like these are just updates to their existing model, but they release them as new models to create more media buzz.
During the live-stream the subtitles are shown line by line.
When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.
Line-by-line subtitles are shown when the uploader provides captions themselves for an existing video. The only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.
Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.
But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.
Good thing I stopped working a few hours ago
EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(
Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?
On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.
GPT-4o mini: The new moon in August 2025 will occur on August 12.
Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.
Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.
o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]
Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]
I got different answers, mostly wrong. My calendars (both paper and app versions) show me August 23 as the date.
And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
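For the record, the literal answer being asked for is just a handful of User-agent rules. A sketch below, with the crawler tokens (Baidu, Sogou, 360/So.com, ByteDance) written from memory, so verify them against each engine's documentation before relying on them:

```
User-agent: Baiduspider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: 360Spider
Disallow: /

User-agent: Bytespider
Disallow: /
```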
I would also never ask a coworker for this precise number either.
system
OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.
> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.
Arguably this shouldn't be counted though?
[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...
I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.
https://techcrunch.com/2025/04/15/openai-is-reportedly-devel...
The play now seems to be less AGI, more "too big to fail" / use all the capital to morph into a FAANG bigtech.
My bet is that they'll develop a suite of office tools that leverage their model, chat/communication tools, a browser, and perhaps a device.
They're going to try to turn into Google (with maybe a bit of Apple and Meta) before Google turns into them.
Near-term, I don't see late stage investors as recouping their investment. But in time, this may work out well for them. There's a tremendous amount of inefficiency and lack of competition amongst the big tech players. They've been so large that nobody else could effectively challenge them. Now there's a "startup" with enough capital to start eating into big tech's more profitable business lines.
https://krausest.github.io/js-framework-benchmark/2025/table...
To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.
Is there some well known equivalent to Moores Law for token use? We're headed in a direction where LLM control loops can run 24/7 generating tokens to reason about live sensor data, and calling tools to act on it.
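The kind of always-on control loop being described is conceptually simple; a minimal sketch, where `llm`, `sensors`, and `tools` are hypothetical interfaces rather than any real library:

```python
import time

def agent_loop(llm, sensors, tools, poll_seconds=5):
    """Read sensors, let the model reason over the latest readings, execute
    whatever tool call it asks for, and feed the result back in. Runs forever."""
    history = []
    while True:
        readings = {name: read() for name, read in sensors.items()}
        history.append({"role": "user", "content": f"sensor data: {readings}"})
        decision = llm(history)                   # may return a tool request
        if decision.get("tool"):
            result = tools[decision["tool"]](**decision.get("args", {}))
            history.append({"role": "tool", "content": str(result)})
        history = history[-50:]                   # crude context-window cap
        time.sleep(poll_seconds)
```

The token-use question follows directly: every iteration burns reasoning tokens whether or not anything interesting happened.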
If this starts looking differently and the pace picks up, I won't be giving analysis on OpenAI anymore. I'll start packing for the hills.
But to OpenAI's credit, I also don't see how minting another FAANG isn't an incredible achievement. Like - wow - this tech giant was willed into existence. Can't we marvel at that a little bit without worrying about LLMs doing our taxes?
Now we're up to o4, AGI is still not even in near site (depending on your definition, I know). And OpenAI is up to about 5000 employees. I'd think even before AGI a new model would be able to cover for at least 4500 of those employees being fired, is that not the case?
Once you get to this point you're putting the paradox of choice on the user - I used to use a particular brand of toothpaste for years until it got to the point where I'd be in the supermarket looking at a wall of toothpaste all by the same brand with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it became impossible to know which was the right product within the brand.
If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.
I'm bullish on the models, and my first quiet 5 minutes after the announcement was spent thinking about how many of the people I walked past would have a different day if the computer Just Did It(tm) (I don't think their day would be different, so I'm not bullish on ASI-even-if-achieved, I guess?)
I think binary analysis that flips between "this is a propped up failure, like when banks get bailouts" and "I'd run away from civilization" isn't really worth much.
Now they just need a decent usage dashboard that doesn’t take a day to populate or require additional GCP monitoring services to break out the model usage.
> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:
> We sample multiple parallel attempts with the scaffold above
> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.
> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.
> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.
While this is entirely logical in theory, this is how you get LG-style naming like "THE ALL NEW LG-CFT563-X2"
I mean, it makes total sense, it tells you exactly the model, region, series and edition! Right??
I almost wonder if this is intentional ... because when you create a quagmire of insane inter-dependent billing scenarios you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers in order to wade through the muck on your behalf.
Dealing with OpenAI's API's is a straight up nightmare.
On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.
We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).
person a: "I just got an new macbook pro!"
person b: "Nice! I just got a Lenovo YogaPilates Flipfold XR 3299 T92 Thinkbookpad model number SRE44939293X3321"
...
person a: "does that have oled?"
person b: "Lol no silly that is model SRE44939293XB3321". Notice the B in the middle?!?! That is for OLED.
AI is currently in a high-growth expansion phase. This leads to rapid iteration and fragmentation because getting things released is the most important thing.
When the models start to plateau or the demands on the industry are for profit you will see consolidation start.
Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.
They jokingly admitted that they’re bad at naming in the 4.1 reveal video, so they’re certainly aware of the problem. They’re probably hoping to make the model lineup clearer after some of the older models get retired, but the current mess was certainly entirely foreseeable.
Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.
Because if they removed access to o3-mini — which I have tested, costed, and built around — I would be very angry. I will probably switch to o4-mini when the time is right.
For UX, the GPT info in the thread would be collapsed by default, and both users would have the discretion to click to expand it.
Both seem to be better at prompt following and have more up to date knowledge.
But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.
If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.
And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.
Never thought I'd see the day I was leaving comments for my AI agent coworker.
Assuming OpenAI are correct that o3 is strictly an improvement over o1, then I don't see why they'd keep o1 around. When they upgrade GPT-4o they don't let you use the old version, after all.
Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?
I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.
I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:
* agent & tool based dev (cloud) - [top 3 models]
* agent & tool based dev (local) - m1, m2, m3
* code review / high level analysis - ...
* general tech questions - ...
* technical writing (ADRs, needs assessments, etc) - ...
Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).
In ChatGPT, o4-mini is replacing o3-mini. It's a straight 1-to-1 upgrade.
In the API, o4-mini is a new model option. We continue to support o3-mini so that anyone who built a product atop o3-mini can continue to get stable behavior. By offering both, developers can test both and switch when they like. The alternative would be to risk breaking production apps whenever we launch a new model and shut off developers without warning.
I don't think it's too different from what other companies do. Like, consider Apple. They support dozens of iPhone models with their software updates and developer docs. And if you're an app developer, you probably want to be aware of all those models and docs as you develop your app (not an exact analogy). But if you're a regular person and you go into an Apple store, you only see a few options, which you can personalize to what you want.
If you have concrete suggestions on how we can improve our naming or our product offering, happy to consider them. Genuinely trying to do the best we can, and we'll clean some things up later this year.
Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our unified naming scheme originally made room for 999 versions, but we didn't make it past 3.
However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode/decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful.
In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
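For reference, a minimal sketch of the interface described above (bytes in, string out) rather than the long-only encoder the models keep reaching for:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(data: bytes) -> str:
    if not data:
        return ""
    n = int.from_bytes(data, "big")      # treat the bytes as one big integer
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    # int.from_bytes drops leading zero bytes, so restore them explicitly
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * leading_zeros + "".join(reversed(out))

print(base62_encode(b"hello"))  # 7tQLFHz
```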
The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even one year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, who built a competitive model in a year that runs for much cheaper).
That's not a problem in and of itself. It's only a problem if the models aren't good enough.
Judging by ChatGPT's adoption, people seem to think they're doing just fine.
Oh, not that I haven't been as knocked about in the interim, of course. I'm not really claiming I'm better, and these are frightening times; I hope I'm neither projecting nor judging too harshly. But even trying to discount for the possibility, there still seems something new left to explain.
Why would anyone use a social network run by Sam Altman? No offense but his reputation is chaotic neutral to say the least.
Social networks require a ton of momentum to get going.
BlueSky already ate all the momentum that X lost.
Or maybe it’s just nonsensical to compare the number of employees across companies - especially when they don’t do nearly the same thing.
On a related note, wait until you find out how many more employees that Apple has than Google since Apple has hundreds of retail employees.
Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.
Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.
As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.
As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!
I have not seen any sort of "If you're using X.122, upgrade to X.123, before 202X. If you're using X.120, upgrade to anything before April 2026, because the model will no longer be available on that date." ... Like all operating systems and hardware manufacturers have been doing for decades.
Side note, it's amusing that stable behavior is only available on a particular model with a sufficiently low temperature setting. As near-AGI shouldn't these models be smart enough to maintain consistency or improvement from version to version?
So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.
"On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"
It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.
"The new moon in August 2025 falls on Friday, August 22, 2025"
Now, I did not specify the timezone I was in, so our difference between the 22nd and 23rd appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.
There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.
Not directly from OpenAI - but people in the industry are advertising how these advanced models can replace employees, yet they keep going on hiring tears (including OpenAI). Let's see the first company stand behind their models and replace 50% of their existing headcount with agents. That to me would be a sign these things are going to replace people's jobs. Until I see that, if OpenAI can't figure out how to replace humans with models, then no one will.
I mean, could you imagine if today's announcement was: the chatgpt.com webdev team has been laid off, and all new features and fixes will be completed by Codex CLI + o4-mini? That would mean they believe in the product they're advertising. Until they do something like that, they'll keep trusting those human engineers and try selling other people on the dream.
Did you miss the 4o image generation announcement from roughly three weeks ago?
https://news.ycombinator.com/item?id=43474112
Combining a multimodal LLM+ImageGen puts them pretty significantly ahead of the curve at least in that domain.
Demonstration of the capabilities:
It's got all deprecations, ordered by date of announcement, alongside shutdown dates and recommended replacements.
Note that we use the term deprecated to mean slated for shutdown, and shutdown to mean when it's actually shut down.
In general, we try to minimize developer pain by supporting models for as long as we reasonably can, and we'll give a long heads up before any shutdown. (GPT-4.5-preview was a bit of an odd case because it was launched as a potentially temporary preview, so we only gave a 3-month notice. But generally we aim for much longer notice.)
when are the long list of 'enterprise' coworkers, who have glibly and overconfidently answered questions without doing math or looking them up, going to be fired?
The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.
The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor), so the code becomes fairly cryptic as a result and not nice to read. Maybe optimized too much for speed.
Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.
And the CoT summary keeps mentioning my name which is irritating.
They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.
I tried it last night with Gemini 2.5 Pro and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.
I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.
Gemini 2.5 refuses to answer this because it is too political.
What tool were you using for this?
OpenAI is at a much earlier stage in their adventures and probably doesn't have that much baggage. Given their age and revenue streams, their headcount is quite substantial.
I remember being extremely surprised when I could ask GPT3 to rotate a 3d model of a car in its head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.
It really depends on how much you want to shift the goalposts on what constitutes "simple".
I think reading this makes it even clearer that the 70.3% score should just be discarded from the benchmarks. "I got a 7%-8% higher SWE benchmark score by doing a bunch of extra work and sampling a ton of answers" is not something a typical user is going to have already set up when logging onto Claude and asking it a SWE style question.
Personally, it seems like an illegitimate way to juice the numbers to me (though Claude was transparent with what they did so it's all good, and it's not uninteresting to know you can boost your score by 8% with the right tooling).
It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.
Imagine if every time your favorite SaaS had an update, they renamed the product. Yesterday you were using Slack S7, and today you're suddenly using Slack 9S-o. That was fine in the desktop era, when new releases happened once a year - not every few weeks. You just can't keep up with all the versions.
I think they should just stick with one brand and announce new releases as just incremental updates to that same brand/product (even if the underlying models are different): "the DeepSearch Update" or "The April 2025 Reasoning Update" etc.
The model picker should be replaced entirely with a router that automatically detects which underlying model to use. Power users could have optional checkboxes like "Think harder" or "Code mode" as settings, if they want to guide the router toward more specialized models.
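A toy sketch of what that hidden router could look like; the heuristics and model names here are placeholders, not how OpenAI actually routes anything:

```python
def route(prompt: str, think_harder: bool = False, code_mode: bool = False) -> str:
    """Pick an underlying model for a request; the user only ever sees 'ChatGPT'."""
    if code_mode or "```" in prompt:
        return "o4-mini-high" if think_harder else "o4-mini"
    if think_harder or len(prompt) > 4_000:
        return "o3"
    return "gpt-4.1"  # cheap default for everyday chat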
ChatGPT images are the biggest thing on social media right now. My wife is turning photos of our dogs into people. There's a new GPT4o meme trending on TikTok every day. Using GPT4o as the basis of a social media network could be just the kickstart a new social media platform needs.
> (Or at least because O(log) increases in model performance became unreasonably costly?)
But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.
OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.
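To make that framing concrete (taking "O(log) performance" at face value, which is an assumption, not a published scaling law): if performance grows as

$$P(C) \approx a \log C + b,$$

then a fixed capability gain $\Delta P$ costs a multiplicative factor in compute,

$$C_{\text{new}} = C_{\text{old}} \, e^{\Delta P / a},$$

i.e. linear improvement demands exponentially growing spend, which is exactly the widening gap described above.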
My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.
I pretty much stopped shopping around once Gemini 2.0 Flash came out.
For general, cloud-centric software development help, it does the job just fine.
I'm honestly quite fond of this Gemini model. I feel silly saying that, but it's true.
Then we wanted the computers to reason like humans, so we built LLMs.
Now we want the LLMs to do calculations really quickly.
It doesn't seem like we'll ever be satisfied.
And, BTW, I thought that LLMs are computers too ;-0
Regex joke [1], but the standards joke will do just fine also :)
[1] Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
GPT 4o and Sora are incredibly viral and organic and it's taking over TikTok, Instagram, and all other social media.
If you're not watching casual social media you might miss it, but it's nothing short of a phenomenon.
ChatGPT is now the most downloaded app this month. Images are the reason for that.
But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.
```
Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.

In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM).
```
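The 22nd-vs-23rd discrepancy really is just the three-hour PDT/ET offset; a quick sanity check with Python's zoneinfo (same instant, two wall clocks):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

new_moon = datetime(2025, 8, 22, 23, 6, tzinfo=ZoneInfo("America/Los_Angeles"))
print(new_moon.astimezone(ZoneInfo("America/New_York")))
# 2025-08-23 02:06:00-04:00 -- already the 23rd on the US east coast
```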
Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" usecases well (agent in a loop, not in an agentic workflow scaffold[0]).
OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).
o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.
My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7
tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions" which then led GPT-3.5/GPT-4, and ultimately the success of ChatGPT. This new agent paradigm, requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.
[0]https://www.anthropic.com/engineering/building-effective-age...
https://x.com/METR_Evals/status/1912594122176958939
—-
The AlexNet paper which kickstarted the deep learning era in 2012 was ahead of the 2nd-best entry by 11%. Many published AI papers then advanced SOTA by just a couple percentage points.
o3 high is about 9% ahead of o1 high on livebench.ai and there are also quite a few testimonials of their differences.
Yes, AlexNet made major strides in other aspects as well but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.
It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.
Ref:
- https://proceedings.neurips.cc/paper_files/paper/2012/file/c...
If I want to take advantage of a new model, I must validate that the structured queries I've made to the older models still work on the new models.
The last time I did a validation and update. Their Responses. Had. Changed.
API users need dependability, which means they need older models to keep being usable.
Thank you for your service, I use your work with great anger (check my github I really do!)
Compared to what a software engineer is able to do, it is very much a simple logic task. Or the average person having a non-trivial job. Or a beehive organizing its existence, from its amino acids up to hive organization. All those things are magnitudes harder than chess.
> I remember being extremely surprised when I could ask GPT3 to rotate a 3d model of a car in it's head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.
It's not reasoning its way there. Somebody asked something similar at some point in the corpus, and that corpus also contained the answers. That's why it can answer. After quite a small number of moves, the chess board is unique and you can't fake it. You need to think ahead. A task which computers are traditionally very good at. Even trained chess players are. That LLMs are not goes to show that they are very far from AGI.
The best model you can play with is decent for a human - https://github.com/adamkarvonen/chess_gpt_eval
SOTA models can't play it because these companies don't really care about it.
So far the only significant human replacement I'm seeing AI enable is in low-end, entry level work. For example, fulfilling "gig work" for Fiverr like spending an hour or two whipping up a relatively low-quality graphic logo or other basic design work for $20. This is largely done at home by entry-level graphic design students in second-world locales like the Philippines or rural India. A good graphical AI can (and is) taking some of this work from the humans doing it. Although it's not even a big impact yet, primarily because for non-technical customers, the Fiverr workflow can still be easier or more comfortable than figuring out which AI tool to use and how to get what they really want from it.
The point is that this Fiverr piece-meal gig work is the lowest paying, least desirable work in graphic design. No one doing it wants to still be doing it a year or two from now. It's the Mcdonald's counter of their industry. They all aspire to higher skill, higher paying design jobs. They're only doing Fiverr gig work because they don't yet have a degree, enough resume credits or decent portfolio examples. Much like steam-powered bulldozers and pile drivers displaced pick axe swinging humans digging railroad tunnels in the 1800s, the new technology is displacing some of the least-desirable, lowest-paying jobs first. I don't yet see any clear reason this well-established 200+ year trend will be fundamentally different this time. And history is littered with those who predicted "but this time it'll be different."
I've read the scenarios which predict that AI will eventually be able to fundamentally and repeatedly self-improve autonomously, at scale and without limit. I do think AI will continue to improve but, like many others, I find the "self-improve" step to be a huge and unevidenced leap of faith. So, I don't think it's likely, for reasons I won't enumerate here because domain experts far smarter than I am have already written extensively about them.
Deep learning models will continue to improve as we feed them more data and use more compute, but they will still fail at even very simple tasks as long as the input data are outside their training distribution. The numerous examples of ChatGPT (even the latest, most powerful versions) failing at basic questions or tasks illustrate this well. Learning from data is not enough; there is a need for the kind of system-two thinking we humans develop as we grow. It is difficult to see how deep learning and backpropagation alone will help us model that. https://medium.com/thoughts-on-machine-learning/why-sam-altm...
With the right knowledge and web searches, one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information, but then started to hallucinate some details and used them in further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.
What's even worse, in the thinking trace it looks like it is aware it does not have an answer and that the 399 is just an estimate. But in the answer itself it confidently states it found the correct value.
Essentially, it lied to me that it doesn’t really know and provided me with an estimate without telling me.
Now, I'm perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn't. Not to lie to my face.
Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46
An AlphaStar-type model would wipe the floor at chess.
Example of edits (not quite surgical but good): https://chatgpt.com/share/68001b02-9b4c-8012-a339-73525b8246...
Other days I remember that humans like "handmade" furniture, and live performances, and unique styles, and human contact.
Perhaps there's life in us still?
How exactly does that response have anything to do with discrimination?
Google included a SWE-bench score of 63.8% in their announcement for Gemini 2.5 Pro: https://blog.google/technology/google-deepmind/gemini-model-...
We're living in extremely uncertain times, with multiple global crises taking place at the same time, each of which could develop into a turning point for humankind.
At the same time, predatory algorithms do whatever it takes to make people addicted to media, while mental health care remains inaccessible for many.
I feel like throwing a tantrum almost every single day.
Thought for 3m 51s
Short answer → you can’t.
The breathtaking thing is not the model itself, but that someone as smart as Cowen (and he's not the only one) is uttering "AGI" in the same sentence as any of these models. Now, I'm not a hater, and for many tasks they are amazing, but they are, as of now, not even close to AGI, by any reasonable definition.

Do you use any of them? Are you a developer? Just because a model is non-deterministic doesn't mean developers don't want some level of consistency, whether it be about capabilities, cost, latency, call structure, etc.
llm install llm-openai-plugin
llm install llm-hacker-news
llm -m openai/o3 -f hn:43707719 -s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
https://gist.github.com/simonw/a35f39b070978e703d9eb8b1aa7c0... - cost 2,684 input, 2,452 output (of which 896 were reasoning tokens), which is 12.492 cents.

Then again with o4-mini, using the exact same content (hence the hash ID for -f):
llm -m openai/o4-mini \
-f f16158f09f76ab5cb80febad60a6e9d5b96050bfcf97e972a8898c4006cbd544 \
-s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
Output: https://gist.github.com/simonw/b11ba0b11e71eea0292fb6adaf9cd...

Cost: 2,684 input, 2,681 output (of which 1,088 were reasoning tokens) = 1.4749 cents
The above uses these two plugins: https://github.com/simonw/llm-openai-plugin and https://github.com/simonw/llm-hacker-news - taking advantage of new -f "fragments" feature I released last week: https://simonwillison.net/2025/Apr/7/long-context-llm/
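The cost arithmetic checks out if you assume the listed per-million-token API prices at launch (o3: $10 input / $40 output; o4-mini: $1.10 input / $4.40 output), with reasoning tokens billed as output:

```python
def cost_cents(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """price_in/price_out are dollars per 1M tokens; returns cost in cents."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000 * 100

print(cost_cents(2684, 2452, 10.00, 40.00))  # 12.492   (o3 run)
print(cost_cents(2684, 2681, 1.10, 4.40))    # ~1.4749  (o4-mini run)
```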
OpenAI's progress lately:
2024 December - first reasoning model (official release)
2025 February - deep research
2025 March - true multi-modal image generation
2025 April - reasoning model with tools
I'm not sure why people say they haven't done much. We couldn't even dream of stuff like this five years ago, and now releasing groundbreaking/novel features every month is considered "meh"... I think we're spoiled and can't appreciate anything anymore :)

These models cannot even make legal chess moves. That's incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding. Many kinds of tasks are never going to be possible for LLMs unless that changes. Programming is one of those tasks.
LLMs on the other hand are weird in ways we don't expect computers to be. Based upon the previous prompting, training datasets, and biases in the model, a response to something like "What time is dinner?" can be "Just a bit after 5", "Quarter after 5", or "Dinner is at 17:15 CDT". Setting one's priors can be important to the performance of the model, much in the same way we do this visually and contextually with other humans.
All that said, people will find AI problematic for the foreseeable future because it behaves somewhat human like in responses and does so with confidence.
I wonder if any of the people that quit regret doing so.
Seems a lot like Chicken Little behavior - "Oh no, the sky is falling!"
How anyone with technical acumen thinks current AI models are conscious, let alone capable of writing new features and expanding their abilities is beyond me. Might as well be afraid of calculators revolting and taking over the world.
o3-mini wasn't even in second place for non-STEM tasks, and in today's announcement they don't even publish benchmarks for those. What's impressive about Gemini 2.5 Pro (and was also really impressive with R1) is how good the model is for a very broad range of tasks, not just benchmaxing on AIME.
(Unless you're arguing against the idea that LLMs are making programmers obsolete, in which case I fully agree with you.)
OpenAI Codex CLI: Lightweight coding agent that runs in your terminal - https://news.ycombinator.com/item?id=43708025
Yeah they can. There's a link I shared to prove it which you've conveniently ignored.
LLMs learn by predicting, failing and getting a little better, rinse and repeat. Pre-training is not like reading a book. LLMs trained on chess games play chess just fine. They don't make the silly mistakes you're talking about and they very rarely make illegal moves.
There's gpt-3.5-turbo-instruct, which I already shared and which plays at around 1800 Elo. Then there's this grandmaster-level chess transformer - https://arxiv.org/abs/2402.04494. There are also a couple of models that were trained in the EleutherAI Discord that reached about 1100-1300 Elo.
I don't know what the peak of LLM Chess playing looks like but this is clearly less of a 'LLMs can't do this' problem and more 'Open AI/Anthropic/Google etc don't care if their models can play Chess or not' problem.
So are they capable of reasoning now, or would you like to shift the goalposts?
I think it is AGI, seriously. Try asking it lots of questions, and then ask yourself: just how much smarter was I expecting AGI to be?
That's his whole argument!!!! This is so frustrating coming from a public intellectual. "You don't need rigorous reasoning to answer these questions, baybeee, just go with your vibes." Complete and total disregard for scientific thinking, in favor of confirmation bias and ideology.

It's also super concise with code. Where Claude 3.7 and Gemini 2.5 will write a ton, o4-mini will write a tiny portion of it while accomplishing the same task.
On the flip side, in its conciseness, it's more lazy with implementation than the other leading models missing features.
For fixing very complex typescript types, I've previously found that o1 outperformed the others. o4-mini seems to understand things well here.
I still think gemini will continue to be my favorite model for code. It's more consistent and follows instructions better.
However, openAI's more advanced models have a better shot at providing a solution when gemini and claude are stuck.
Maybe there's a win here in having o4-mini or o3 do a first draft for conciseness, revise with gemini to fill in what's missed (but with a base that is not overdone), and then run fixes with o4-mini.
Things are still changing quite quickly.
Thinking it's a computer makes you do dumb things with them that they simply have never done a good job with.
Build intuitions about what they do well and intuitions about what they don't do well and help others learn the same things.
Don't encourage people to have poor ideas about how they work, it makes things worse.
Would you ask an LLM for a phone number? If it doesn't use a function call, the answer is simply not worth having.
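For example, the way to make "just look it up" non-optional in the API is to declare a tool and force its use. A sketch using the OpenAI Python SDK as I recall it; the `lookup_phone_number` function is hypothetical, something you'd implement against a real directory:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_phone_number",  # hypothetical directory lookup on our side
        "description": "Look up a contact's phone number in a trusted directory.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}]

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What's the phone number for Acme support?"}],
    tools=tools,
    tool_choice="required",  # don't let the model answer from memory
)
print(resp.choices[0].message.tool_calls)
```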
I notice too that it employs a different style of code where it often puts the assignment on a different line, which looks like it's trying to maintain a ~80-character line limit, but it does so in places where the entire line of code is only about 40 characters.
o4-mini gets much closer (but I'm pretty sure it fumbles at the last moment): https://chatgpt.com/share/680031fb-2bd0-8013-87ac-941fa91cea...
We're pretty bad at model naming and communicating capabilities (in our defense, it's hard!), but o4-mini is actually a _considerably_ better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.
There are 8 billion+ instances of general intelligence on the planet; there isn't a shortage. I'd rather see AI do data science and applied math at computer speeds. Those are the hard problems, a lot of the AGI problems (to human brains) are easy.
Also “what I would expect from a professional philosopher”, is that your argument, really?
But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.
But they are all quite similar and so far these new models are similar but faster IMO.
On one hand the answers became a lot more comprehensive and deep. It’s now able to give me very advanced explanations.
On the other hand, it started overloading the answers with information. Entire concepts became single sentence summaries. Complex topics and theorems became acronyms. In a way I’m feeling overwhelmed by the information it’s now throwing at me. I can’t tell if it’s actually smarter or just too complicated for me to understand.
Below is a spreadsheet I bookmarked from a previous HN discussion. Its information dense but you can just look at the composite scores to get a quick idea how things compare.
https://docs.google.com/spreadsheets/u/1/d/1foc98Jtbi0-GUsNy...
For example, JaneStreet monthly puzzles. Surprisingly, the new o3 was able to solve this month's (previous models were not), which was an easier one. Believe me, I am not trying to minimize the overall achievement -- what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems.
That might not be enough even, but that should be the minimum bar for even having the conversation.
# OpenAI Models

## Reasoning Models (o-series)
- All `oX` (o-series) models are reasoning models.
- Use these for complex, multi-step reasoning tasks.

## Flagship/Core Models
- All `X.x` and `Xo` models are the core models.
- Use these for one-shot results.
- Examples: 4o, 4.1

## Cost Optimized
- All `-mini` and `-nano` models are cheaper, faster models.
- Use these for high-volume, low-effort tasks.

## Flagship vs Reasoning (o-series) Models
- Latest flagship model = 4.1
- Latest reasoning model = o3
- The flagship models are general purpose, typically with larger context windows. These rely mostly on pattern matching.
- The reasoning models are trained with extended chain-of-thought and reinforcement learning. They work best with tools, code, and other multi-step workflows. Because tools are used, the accuracy will be higher.

# List of Models

## 4o (omni)
- 128K context window
- Use: complex multimodal applications requiring the top level of reliability and nuance

## 4o-mini
- 128K context window
- Use: multimodal reasoning for math, coding, and structured outputs
- Use: cheaper than `4o`; use when you can trade off accuracy vs speed/cost
- Don't use: when high accuracy is needed

## 4.1
- 1M context window
- Use: for large context ingest, such as full codebases
- Use: for reliable instruction following and comprehension
- Don't use: for high-volume/faster tasks

## 4.1-mini
- 1M context window
- Use: for large context ingest
- Use: when a tradeoff can be made with accuracy vs speed

## 4.1-nano
- 1M context window
- Use: for high-volume, near-instant responses
- Don't use: when accuracy is required
- Examples: classification, autocompletion, short answers

## o3
- 200K context window
- Use: for the most challenging reasoning tasks in coding, STEM, and vision that demand deep chain-of-thought and tool use
- Use: agentic workflows leveraging web search, Python execution, and image analysis in one coherent loop
- Don't use: for simple tasks, where a lighter model will be faster and cheaper

## o4-mini
- 200K context window
- Use: high-volume needs where reasoning and cost should be balanced
- Use: for high-throughput applications
- Don't use: when accuracy is critical

## o4-mini-high
- 200K context window
- Use: when o4-mini results are not satisfactory, but before moving to o3
- Use: complex tool-driven reasoning, where o4-mini results are not satisfactory
- Don't use: when accuracy is critical

## o1-pro-mode
- 200K context window
- Use: highly specialized science, coding, or reasoning jobs that benefit from extra compute for consistency
- Don't use: for simple tasks

## Models Sorted for Complex Coding Tasks (my opinion)
1. o3
2. Gemini 2.5 Pro
3. Claude 3.7
4. o1-pro-mode
5. o4-mini-high
6. 4.1
7. o4-mini
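One wrinkle worth adding to the list above: in the API there is no separate `o4-mini-high` model; as far as I can tell, the ChatGPT "-high" variants correspond to the reasoning-effort setting, something like:

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",  # "low" / "medium" / "high"
    messages=[{"role": "user", "content": "Fix the type error in this function: ..."}],
)
print(resp.choices[0].message.content)
```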
So far for me that’s not been too much of a roadblock. Though I still find overall Gemini struggles with more obscure issues such as SQL errors in dbt
I'm simultaneously impressed that they can do that, and also wondering why the heck that's so impressive (isn't "is this tool in this list?" something GPT-3 was able to handle?) and why 4.1 still fails at it too—especially considering it's hyped as the agentic coder model!
That's pretty damning for the general intelligence aspect of it, that they apparently had to special-case something so trivial... and I say that as someone who's really optimistic about this stuff!
That being said, the new "enhanced" web search seems great so far, and means I can finally delete another stupid 10 line Python script from 2023 that I shouldn't have needed in the first place ;)
(...Now if they'd just put 4.1 in the Chat... why the hell do I need to use a 3rd party UI for their best model!)
Just jokes, idk anything about either.
\s
Our hypothesis is that o4-mini is a much better model, but we'll wait to hear feedback from developers. Evals only tell part of the story, and we wouldn't want to prematurely deprecate a model that developers continue to find value in. Model behavior is extremely high dimensional, and it's impossible to prevent regressions on 100% of use cases/prompts, especially if those prompts were originally tuned to the quirks of the older model. But if the majority of developers migrate happily, then it may make sense to deprecate at some future point.
We generally want to give developers as stable an experience as possible, and not force them to swap models every few months whether they want to or not. Personally, I want developers to spend >99% of their time thinking about their business and <1% of their time thinking about what the OpenAI API is requiring of them.
https://g.co/gemini/share/c8fb1c9795e4
Of note, the final step in the CoT is:
> Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.
and then the response is in line with that.
If you're in tiers 1–3, you can still get access - you just need to verify your org with us here:
https://help.openai.com/en/articles/10910291-api-organizatio...
I recognize that verification is annoying, but we eventually had to resort to this as otherwise bad actors will create zillions of accounts to violate our policies and/or avoid paying via credit card fraud/etc.
I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.
OTOH if you tell it to write a Base62 encoder in C#, it does consistently produce an API that can be called with byte arrays: https://g.co/gemini/share/6076f67abde2
As usual, it's a frustrating experience for anything more complex than the usual problems everyone else does.
Some interesting hallucinations going on here!
The one on the official leaderboard is the 63% score. Presumably because of all the extra work they had to do for the 70% score.
switch (testFile) {
    case "test1.ase": /* run this because it's a particular case */ break;
    case "test2.ase": /* run this because it's a particular case */ break;
    default:
        // run something that's not working, but that's ok because the previous
        // cases should give the right output for all the test files ...
        break;
}
You don't have to pretrain it for every little thing but it should come as no surprise that a complex non-trivial game would require it.
Even if you explained all the rules of chess clearly to someone brand new to it, it will be a while and lots of practice before they internalize it.
And like I said, LLM pre-training is less like a machine reading text and more like evolution. If you gave it a corpus of chess rules, you'd only be training a model that knows how to converse about chess rules.

Do humans require less 'pre-training'? Sure, but then again, that's on the back of millions of years of evolution. Modern NNs initialize random weights and have relatively very little inductive bias.
So there's "the meta", and there's "that strategy is meta", or "that strategy is the meta."
Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic.
I'll often end up with a task that looks something like this:
* Implement Foo with a relation to FooBar.
* Foo should have X, Y, Z features
* We have an existing pattern for Fidget in BigFidget. Look at that for implementation
* Make sure you account for A, B, C. Check Widget for something similar.
It works surprisingly well.
Will it? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try, insofar as the whole thing is still feasible.

But why? Why should Artificial General Intelligence require things a good chunk of humans wouldn't be able to do? Are those guys no longer general intelligences?
I'm not saying this definition is 'wrong' but you have to realize at this point, the individual words of that acronym no longer mean anything.
(This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)
That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.
For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.
What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying I don't know every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B type models where accuracy is king.
The findings are open sourced on a repo too https://github.com/augmentcode/augment-swebench-agent
In the "Leaderboard">"Language" tab, it lists the top models in various categories such as overall, coding, math, and creative writing.
In the "Leaderboard">"Price Analysis" tab, it shows a chart comparing models by cost per million tokens.
In the "Prompt-to-Leaderboard" tab, there is even an LLM to help you find LLMs -- you enter a prompt, and it will find the top models for your particular prompt.
It sounds like it means "have a bunch of models, one that's an expert in physics, one that's an expert in health etc and then pick the one that's a best fit for the user's query".
It's not that. The "experts" are each another giant opaque blob of weights. The model is trained to select one of those blobs, but they don't have any form of human-understandable "expertise". It's an optimization that lets you avoid using ALL of the weights for every run through the model, which helps with performance.
https://huggingface.co/blog/moe#what-is-a-mixture-of-experts... is a decent explanation.
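To make that concrete, here is a minimal, illustrative sketch of an MoE layer in PyTorch. It is not any particular model's implementation; the sizes and the top-k routing are assumptions chosen for readability. The point is that the "experts" are just interchangeable feed-forward blocks, and a learned router decides which ones run for each token:

    # Minimal mixture-of-experts sketch (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, dim=64, num_experts=8, top_k=2):
            super().__init__()
            # Each "expert" is an opaque feed-forward block; none has human-readable expertise.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )
            self.router = nn.Linear(dim, num_experts)  # learned gating
            self.top_k = top_k

        def forward(self, x):  # x: (tokens, dim)
            gate_logits = self.router(x)
            weights, idx = gate_logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            # Only the top-k experts run for each token -- that's the performance win.
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    x = torch.randn(10, 64)
    print(TinyMoE()(x).shape)  # torch.Size([10, 64])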
I'll make my case. To me, if you look at how the phrase is usually used -- "when humans have achieved AGI...", etc -- it evokes a science fiction turning point that implies superhuman performance in more or less every intellectual task. It's general, after all. I think of Hal or the movie Her. It's not "Artifical General Just-Like-Most-People-You-Know Intelligence". Though we are not there yet, either, if you consider the full spectrum of human abilities.
Few things would demonstrate general superhuman reasoning ability more definitively than machines producing new, useful, influential math results at a faster rate than people. With that achieved, you would expect it could start writing fiction and screenplays and comedy as well as people too (it's still very far from that, imo), but maybe not; maybe those skills develop at different paces, and then I still wouldn't want to call it AGI. But I think truly conquering mathematics would get me there.
Increasingly I find that AI at this point is good enough I am rarely stepping in to "do it myself".
That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.
I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively
> I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
Yeah, I also don't understand the NBA. Every single one of those players show themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.
I got stuck with a different LLM until I checked the official documentation; it was spouting nonsense about features removed 2+ years ago, I suppose, or just making stuff up.
So maybe this is just too hard for a “non-research” mode. I’m still disappointed it lied to me instead of saying it couldn’t find an answer.
I've been out in the streets, protesting and raising awareness of climate change. I no longer do. It's a pointless waste of time. Today, the climate change deniers are in charge.
This is the key answer right here.
LLMs are great at interpolating and extrapolating based on context. Interpolating is far less error-prone. The problem with interpolating is that you need to start with accurate points so that interpolating between them leads to expected and relatively accurate estimates.
What we are seeing is the result of developers being oblivious to higher-level aspects of coding, such as software architecture, proper naming conventions, disciplined choice of dependencies and dependency management, and even best practices. Even basic requirements-gathering.
Their own personal experience is limited to diving into existing code bases and patching them here and there. They often screw up the existing software architecture because their lack of insight and awareness leads them to post PRs that get the job done at the expense of polluting the whole codebase into an unmanageable mess.
So these developers crack open an LLM and prompt it to generate code. They use their insights and personal experience to guide their prompts. Their experience reflects what they do on a daily basis. The LLMs of course generate code from their prompts, and the result is underwhelming. Garbage-in, garbage-out.
It's the LLM's fault, right? All the vibe coders out there showcasing good results must be frauds.
The telltale sign of how poor these developers are is how they dump responsibility for their failure to get LLMs to generate acceptable results onto the models not being good enough. The same models that have proven effective at creating whole projects from scratch in other people's hands are somehow incapable of the smallest changes in theirs. It's weird how that sounds, right? If only the models were better... Better at what? At navigating your input to achieve things that others already achieve? That's certainly the model's fault, isn't it?
A bad workman always blames his tools.
I don't really agree. There's certainly a showboating factor, not to mention there is currently a gold rush to tap this movement and capitalize on it. However, I personally managed to create a fully functioning web app from scratch with Copilot + VS Code, using a mix of GPT-4 and o1-mini. I'm talking about both backend and frontend, with basic auth in place. I am by no means an expert, but I did it in an afternoon. Call it BS, but the truth of the matter is that the app exists.
In this thread, however, there are varying experiences, ranging from amazing to awful. I'm not saying anyone is wrong; all I'm saying is that this wide range of operational accuracy is what will eventually pop the AI bubble: these models can't be reliably deployed almost anywhere with any certainty or guarantees of any sort.
https://xcancel.com/TransluceAI/status/1912552046269771985 / https://news.ycombinator.com/item?id=43713502 is a discussion of these hallucinations.
As for the hash, could it have simply found a listing for the package with hashes provided and used that hash?
Are you saying that it deliberately lied to you?
> With right knowledge and web searches one can answer this question in a matter of minutes at most.
Reminded me of the Dunning-Kruger curve, with the AI model at the first peak and you at the latter.
Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations.
To get an idea of how good a model is on non-STEM tasks, you need to challenge it on stuff that is harder than this for LLMs, like summarization without hallucination or creative writing. OpenAI's non-thinking models are usually very good at these, but not their thinking models, whereas other players (be it Google, Anthropic, or DeepSeek) manage to make models that can be very good at both.
Pretty much, yeah. Now, "deliberately" does imply some kind of agency or even consciousness, which I don't believe these models have; it's probably the result of overfitting, reward hacking, or some other issue from training. But the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact that it doesn't know the answer, but it provides one anyway).
So with vibe coding, sure, you can create some shitty thing which WORKS, but once it becomes bigger than a small shitty thing, it gets harder and harder to work with, because the code is so terrible when you're purely vibe coding.
Current frontier models are better than average humans in many skills but worse in others. Ethan Mollick calls it “jagged frontier” which sounds about right.
You can use Open WebUI with DeepSeek V3 0324 via API, with, for example, DeepInfra as the provider for your embedding and text-generation models.
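For what it's worth, most of these providers expose an OpenAI-compatible endpoint, which is what Open WebUI talks to. A rough sketch of the same setup from a plain script; the base URL and model id below are assumptions to check against DeepInfra's docs:

    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_DEEPINFRA_API_KEY",                # placeholder
        base_url="https://api.deepinfra.com/v1/openai",  # assumed OpenAI-compatible endpoint
    )

    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3-0324",            # assumed model id on the provider
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)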
I've lived in this neighborhood a long time, and there are a couple of old folks' homes a block or so from here. Both have excellent views, on one frontage each, of an extremely historic cemetery, which I have always found a wonderfully piquant example of my adopted hometown's occasionally wire-brush sense of humor. But I bring it up to mention that the old folks don't seem to have much concern for spoons other than to eat with, and they are protesting the present situation regularly and at considerable volume, and every time I pass about my errands I make a point of raising a fist and hollering "hell yeah!" just like most of the people who drive past honk in support.
Will you tell them it's pointless?
A few people were doing that.
With LLMs, anyone can do that. And more.
It's important to frame the scenario correctly. I repeat: I created everything in an afternoon just for giggles, and I challenged myself to write zero lines of code.
> So vibe coding, sure you can create some shitty thing which WORKS (...)
You're somehow blindly labelling a hypothetical output as "shitty", which only serves to show your bias. In the meantime, anyone who is able to churn out a half-functioning MVP in an afternoon is praised as a 10x developer. There's a contrast in there, where the same output is described as shitty or outstanding depending on who does it.
No one cares about promises. The only thing that matters are the tangibles we have right now.
Right now we have a class of tools that help us write multidisciplinary apps with a few well-crafted prompts and zero code involved.
When I was trying to understand what is happening with hallucination, GPT gave me this:
> It's called hallucinating when LLMs get things wrong because the model generates content that sounds plausible but is factually incorrect or made-up—similar to how a person might "see" or "experience" things that aren't real during a hallucination.
From that we can see that they fundamentally don't know what is correct. While they can get better at predicting correct answers, no-one has explained how they are expected to cross the boundary from "sounding plausible" to "knowing they are factually correct". All the attempts so far seem to be about reducing the likelihood of hallucination, not fixing the problem that they fundamentally don't understand what they are saying.
Until/unless they are able to understand the output enough to verify the truth then there's a knowledge gap that seems dangerous given how much code we are allowing "AI" to write.
What it can do is write and execute code to generate the correct output, but isn't that cheating?
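As a sketch of what "write and execute code to check the output" looks like in practice (the generated snippet here is a hardcoded stand-in for a real model response, and the helper is hypothetical):

    import subprocess
    import sys
    import tempfile
    import textwrap

    # Stand-in for code an LLM might emit for "write a function that adds two numbers".
    generated_code = textwrap.dedent("""
        def add(a, b):
            return a + b
        print(add(2, 3))
    """)

    def run_and_check(code: str, expected_stdout: str) -> bool:
        # Run the generated code in a separate process and compare stdout to a known test case.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
        return result.stdout.strip() == expected_stdout

    print(run_and_check(generated_code, "5"))  # True only if the generated code passes the check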
But I have to say, his views on LLMs seem a little premature. He definitely has a unique viewpoint on what "general intelligence" is, which might not apply broadly to most jobs. I think he "interviews" them as if they were guests on his podcast and bases his judgement on how they compare to his other extremely smart guests.
It's clearer when you try it via AI Studio, where they have censorship-level toggles.
Yes, you can keep finetuning your model on every chat you have with it. You can definitely make it remember everything you have ever said. LLMs are excellent at remembering their training data.
>I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)
Genuinely intrigued by what kind of "psychological/emotional problem I've been dealing with for years" an AI could solve in a matter of hours after its release.
If I went through with the changes it suggested, I wouldn't have a bootable machine.
You can get pretty good results by copying the output from Firefox's Reader View into your project, for example: about:reader?url=https://learnxinyminutes.com/ocaml/
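If you want to script that instead of copying by hand, here is a rough sketch using the readability-lxml package (a different extractor than Firefox's Reader View, so the output will differ; the package choice is my assumption, not something from the original tip):

    # pip install requests readability-lxml lxml
    import requests
    import lxml.html
    from readability import Document

    html = requests.get("https://learnxinyminutes.com/ocaml/", timeout=30).text
    doc = Document(html)                                    # readability's cleaned-up view of the page
    text = lxml.html.fromstring(doc.summary()).text_content()
    print(doc.title())
    print(text[:500])                                       # paste-able plain text for your project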
Considering that M$ obviously trains over GitHub data, I'm a bit pissed, honestly, even if I get GH Copilot Pro for free.
This is the ai-2027.com argument. LLMs only really have to get good enough at coding (and then researching), and it's singularity time.
It also assumes they "understand" enough to be able to extract the correct output to test against.