It appears that OpenAI is in panic mode after the release of DeepSeek. Before they were confident in competing against Google on any AI model they release.

Now they are scrambling against open-source after their disastrous operator demonstration and using this deep research demo as cover. Nothing that Google or Perplexity could not already do themselves.

By the end of them month, this feature is going be added by a bunch of other open-source projects and this feature won't be as interesting very quickly.

replies(1): >>42920459 #

10. ◴[03 Feb 25 00:24 UTC] No.42913382[source]▶

>>42913251 (OP) #

11. PartiallyTyped ◴[03 Feb 25 00:25 UTC] No.42913389[source]▶

>>42913251 (OP) #

I feel that a lot of this can already be achieved via aider (not affiliated), and any of the top models.

replies(1): >>42913494 #

12. ◴[03 Feb 25 00:25 UTC] No.42913390[source]▶

>>42913251 (OP) #

13. blackeyeblitzar ◴[03 Feb 25 00:25 UTC] No.42913395[source]▶

>>42913365 #

I’m not sure I understand what you mean by “the button”. If you’re comparing this to DeepSeek’s copying, it’s not really the same thing right? DeepSeek essentially stole intellectual property by violating OpenAI’s terms of service. As I understand it, this is a copy of Google’s Deep Research

replies(4): >>42913478 #>>42913484 #>>42913486 #>>42913495 #

14. DigitalSea ◴[03 Feb 25 00:27 UTC] No.42913417[source]▶

>>42913251 (OP) #

Not sure if people picked up on it, but this is being powered by the unreleased o3 model. Which might explain why it leaps ahead in benchmarks considerably and aligns with the claims o3 is too expensive to release publicly. Seems to be quite an impressive model and the leading out of Google, DeepSeek and Perplexity.

replies(7): >>42913431 #>>42913449 #>>42913546 #>>42913602 #>>42913952 #>>42914290 #>>42914844 #

15. chrismarlow9 ◴[03 Feb 25 00:29 UTC] No.42913430[source]▶

>>42913251 (OP) #

This smells like when Google released Gemini to have a product in the space.

replies(2): >>42913562 #>>42914567 #

16. ◴[03 Feb 25 00:29 UTC] No.42913431[source]▶

>>42913417 #

17. febin ◴[03 Feb 25 00:29 UTC] No.42913436[source]▶

>>42913251 (OP) #

Is this "deep research" tool exploiting open knowledge creators, using their work without compensation?

replies(7): >>42913450 #>>42913461 #>>42913564 #>>42913592 #>>42914073 #>>42914283 #>>42914956 #

18. usaar333 ◴[03 Feb 25 00:30 UTC] No.42913442[source]▶

>>42913251 (OP) #

Overall impressive.

Though, the jump for Gaia relative to SOTA is relatively not that high. Especially given that this is o3

19. xbmcuser ◴[03 Feb 25 00:31 UTC] No.42913449[source]▶

>>42913417 #

It was expensive as they wanted to charge more for it but deepseek has forced their hand

replies(2): >>42913565 #>>42913923 #

20. johnneville ◴[03 Feb 25 00:31 UTC] No.42913450[source]▶

>>42913436 #

would it be open knowledge if it required payment to access ?

21. ldjkfkdsjnv ◴[03 Feb 25 00:32 UTC] No.42913456[source]▶

>>42913251 (OP) #

Say whatever you want about openAI, they are shipping more than any other company on the planet.

replies(3): >>42913493 #>>42913516 #>>42914095 #

22. scarab92 ◴[03 Feb 25 00:32 UTC] No.42913461[source]▶

>>42913436 #

Are you exploiting open knowledge creators, using their work without compensation?

replies(1): >>42913496 #

23. YmiYugy ◴[03 Feb 25 00:33 UTC] No.42913462[source]▶

>>42913251 (OP) #

If I understood the graphs correctly, it only achieves 20% pass rate on their internal tests. So I have to wait 30min and pay a lot of money just to sift through walls of most likely incorrect text? Unless the possibility of hallucinations is negligible, this is just way too much content to review at once. The process probably needs to be a lot more iterative.

replies(8): >>42913483 #>>42913574 #>>42913685 #>>42913698 #>>42913867 #>>42914288 #>>42914603 #>>42918940 #

24. lompad ◴[03 Feb 25 00:36 UTC] No.42913478{3}[source]▶

>>42913395 #

Deepseek proved that there is no moat. Thus no path to profitability for openai, anthropic & co.

Stealing from thieves is fine by me. Sama was the one claiming that all information could be used to train LLMs, without permisdion of the copyright holders.

Now the same is being done to openai. Well, too bad.

replies(1): >>42914522 #

25. tmnvdb ◴[03 Feb 25 00:36 UTC] No.42913483[source]▶

>>42913462 #

Only if you are asking questions at the level of a cutting edge benchmark

replies(1): >>42913614 #

26. figers ◴[03 Feb 25 00:36 UTC] No.42913484{3}[source]▶

>>42913395 #

Didn't OpenAI steal everyone's data they could consume from the internet? Actively being sued by the NY Times and others for this...

replies(1): >>42914713 #

27. therealpygon ◴[03 Feb 25 00:37 UTC] No.42913486{3}[source]▶

>>42913395 #

Care to explain how something that cannot be copyrighted and was not generated by a human is “intellectual property“? Or are you just parroting a narrative?

replies(1): >>42914730 #

28. thefourthchime ◴[03 Feb 25 00:37 UTC] No.42913489[source]▶

>>42913251 (OP) #

OpenAI has a deep bench. I bet they pushed this out to change the narrative about deepseek

replies(1): >>42913532 #

29. adriand ◴[03 Feb 25 00:37 UTC] No.42913491[source]▶

>>42913251 (OP) #

Feels like only a matter of time before these crawlers are blocked from large swathes of the internet. I understand that they’re already prohibited from Reddit and YouTube. If that spreads, this approach might be in trouble.

replies(8): >>42913543 #>>42913591 #>>42913628 #>>42913637 #>>42913652 #>>42914185 #>>42916122 #>>42916480 #

30. apstls ◴[03 Feb 25 00:37 UTC] No.42913492[source]▶

>>42913368 #

What is the current state of DSPy optimizers? When I originally checked it out it appeared to just be optimizing the set of examples used for n-shot prompting.

31. kortilla ◴[03 Feb 25 00:37 UTC] No.42913493[source]▶

>>42913456 #

What does that even mean? Treating each iterative model as a new product is not any different than Google changing its search or youtube recommendation algorithm.

Different pre-cooked prompts and filters don’t really amount to new products either, despite them being marketed as such. It’s like adobe treating each tool in photoshop as its own product.

replies(1): >>42913519 #

32. tmnvdb ◴[03 Feb 25 00:37 UTC] No.42913494[source]▶

>>42913389 #

Do you have any benchmarks to back up your 'feelings'?

replies(1): >>42913551 #

33. caspper69 ◴[03 Feb 25 00:37 UTC] No.42913495{3}[source]▶

>>42913395 #

I chuckle every time I see this. Poor OpenAI.

Meanwhile, their entire training corpus was the result of scraping the intellectual property and copyrighted materials of THE ENTIRE PUBLIC INTERNET.

Woe is them to be sure.

replies(1): >>42914426 #

34. febin ◴[03 Feb 25 00:37 UTC] No.42913496{3}[source]▶

>>42913461 #

The creators are aware that a human is using this, can we say the same for AI, does it have their consent?

replies(1): >>42913553 #

35. reader9274 ◴[03 Feb 25 00:39 UTC] No.42913513[source]▶

>>42913251 (OP) #

I think we're all reaching AI fatigue. Fewer and fewer people care anymore

replies(3): >>42913549 #>>42913656 #>>42915259 #

36. viraptor ◴[03 Feb 25 00:40 UTC] No.42913516[source]▶

>>42913456 #

How do you even compare different companies? I'd say the massive farms ship every year more than OpenAI ever did.

37. tmnvdb ◴[03 Feb 25 00:40 UTC] No.42913519{3}[source]▶

>>42913493 #

Have you even watched the video? This is a new capability and not a trivial one.

38. gwerbret ◴[03 Feb 25 00:42 UTC] No.42913531[source]▶

>>42913251 (OP) #

To anyone who's tried it: how does it handle captchas? I can't imagine that OpenAI's IP addresses are anyone's favorites for unfettered access to web properties these days.

replies(2): >>42913921 #>>42919020 #

39. btown ◴[03 Feb 25 00:42 UTC] No.42913532[source]▶

>>42913489 #

Also named specifically to muddle the SEO for the term "deep." Nothing that OpenAI does is unintentional.

replies(6): >>42913556 #>>42913594 #>>42913606 #>>42913622 #>>42913668 #>>42913782 #

40. xt00 ◴[03 Feb 25 00:42 UTC] No.42913534[source]▶

>>42913251 (OP) #

"will find, analyze, and synthesize hundreds of online sources"

Synthesize? Seems like the wrong word -- I think they would want to say something like, "analyze, and synthesize useful outputs from hundreds of online sources"..

replies(3): >>42913587 #>>42913630 #>>42913816 #

41. drcode ◴[03 Feb 25 00:43 UTC] No.42913543[source]▶

>>42913491 #

I suppose there is an equilibrium, where sites that penalize these types of crawlers will also get less traffic from people reading ai citations, so for many sites the upsides of allowing it will be greater than the downsides.

42. tmnvdb ◴[03 Feb 25 00:43 UTC] No.42913544[source]▶

>>42913251 (OP) #

Eating popcorn while the scaling doubters scramble to move the goalposts for the nth time.

replies(2): >>42913659 #>>42915262 #

43. bbor ◴[03 Feb 25 00:43 UTC] No.42913546[source]▶

>>42913417 #

Interesting, thanks for highlighting! Did not pick up on that. Re:"leading", tho:

Effectiveness in this task environment is well beyond the specific model involved, no? Plus they'd be fools (IMHO) to only use one size of model for each step in a research task -- sure, o3 might be an advantage when synthesizing a final answer or choosing between conflicting sources, but there are many, many steps required to get to that point.

replies(1): >>42915086 #

44. rvnx ◴[03 Feb 25 00:43 UTC] No.42913549[source]▶

>>42913513 #

Especially this is not a breakthrough justifying a 340B USD valuation, but rather the work that junior developers can do; implement a loop of Bing Searches connected to an LLM.

replies(1): >>42913586 #

45. PartiallyTyped ◴[03 Feb 25 00:43 UTC] No.42913551{3}[source]▶

>>42913494 #

I really don't like the snarky tone of the parent comment.

Nonetheless, I don't think this is even something that can easily be benchmarked. I'd recommend you take a look at aider [1], and consider how I drew similarities between it and what's presented here.

Has ClosedAI presented any benchmarks / evaluation protocols?

[1] https://aider.chat/

replies(1): >>42913919 #

46. handfuloflight ◴[03 Feb 25 00:44 UTC] No.42913553{4}[source]▶

>>42913496 #

Then consent is granted by transitive property because these AI are yielded by humans.

replies(1): >>42914058 #

47. ADeerAppeared ◴[03 Feb 25 00:44 UTC] No.42913555[source]▶

>>42913251 (OP) #

I'm sorry but what the fuck is this product pitch?

Anyone who's done any kind of substantial document research knows that it's a NIGHTMARE of chasing loose ends & citogenesis.

Trusting an LLM to critically evaluate every source and to be deeply suspect of any unproven claim is a ridiculous thing to do. These are not hard reasoning systems, they are probabilistic language models.

replies(3): >>42913638 #>>42913741 #>>42914021 #

48. bbor ◴[03 Feb 25 00:44 UTC] No.42913556{3}[source]▶

>>42913532 #

I have never believed a conspiracy theory more instantly. Deep Search vs. DeepSeek is way more than enough to confuse the average layman! Especially when you're googling something you heard about at work a few hours ago, or on Bloomberg TV

replies(1): >>42913588 #

49. lysace ◴[03 Feb 25 00:44 UTC] No.42913562[source]▶

>>42913430 #

Eh, not really. Google failed to launch first out of internal political dysfunction and then made a crash effort to launch something to counter the first ChatGPT release.

I highly doubt that the concerns of internal political commissars were holding up this particular openai release.

replies(1): >>42916670 #

50. tmnvdb ◴[03 Feb 25 00:45 UTC] No.42913564[source]▶

>>42913436 #

How is using public information "exploitation"? A human researcher with Google would do the same.

replies(1): >>42914069 #

51. Sparkyte ◴[03 Feb 25 00:45 UTC] No.42913565{3}[source]▶

>>42913449 #

Rightfully so, some models are getting super efficient.

52. brokensegue ◴[03 Feb 25 00:46 UTC] No.42913574[source]▶

>>42913462 #

26.6% on humanity's last exam is actually impressive.

pass rate really only matters in context of the difficulty of the tasks

replies(1): >>42913971 #

53. cye131 ◴[03 Feb 25 00:46 UTC] No.42913575[source]▶

>>42913251 (OP) #

Does anyone actually have access to this? It says available for pro users on the website today - I have pro via my employer but see no "deep research" option in the message composer.

replies(8): >>42913593 #>>42913623 #>>42914236 #>>42914246 #>>42914362 #>>42914823 #>>42914854 #>>42917612 #

54. hi_hi ◴[03 Feb 25 00:47 UTC] No.42913579[source]▶

>>42913251 (OP) #

This is terrifying. Even though they acknowledge the issues with hallucinations/errors, that is going to be completely overlooked by everyone using this, and then injecting the outputs into their own powerpoints.

Management Consulting was bad enough before the ability to mass produce these graphs and stats on a whim. At least there was some understanding behind the scenes of where the numbers came from, and sources would/could be provided.

The more powerful these tools become, the more prevelant this effect of seepage will become.

replies(7): >>42913597 #>>42913642 #>>42913709 #>>42913713 #>>42914919 #>>42915243 #>>42915568 #

55. nicce ◴[03 Feb 25 00:47 UTC] No.42913587[source]▶

>>42913534 #

On the other hand, accurate if it is prone for hallucination…

56. tmnvdb ◴[03 Feb 25 00:47 UTC] No.42913586{3}[source]▶

>>42913549 #

Peak HN comment

replies(1): >>42913674 #

57. bonoboTP ◴[03 Feb 25 00:48 UTC] No.42913588{4}[source]▶

>>42913556 #

You might as well say that DeepSeek wanted to cause confusion with DeepMind. Deep isn't such a distinguishing name, deep learning has been a buzzword since 2012.

replies(1): >>42913756 #

58. bbor ◴[03 Feb 25 00:48 UTC] No.42913591[source]▶

>>42913491 #

TBF OpenAI in particular bought access to Reddit. Otherwise yeah this is my main confusion with all of these products, Perplexity being the biggest -- how do you get around the status-quo of refusing access to bots? Just to start off with, there is no Google Search API, and they work hard to make sure headless browsers can't access the normal service.

They do say "Currently, deep research can access the open web...", so maybe "open" there implies something significant. Like, "websites that have agreements with OpenAI and/or do not enforce norobot policies".

replies(1): >>42914076 #

59. protocolture ◴[03 Feb 25 00:48 UTC] No.42913592[source]▶

>>42913436 #

You exploited my eyes by making me read this comment. Wheres my compensation.

60. snewman ◴[03 Feb 25 00:48 UTC] No.42913593[source]▶

>>42913575 #

Two different people I know with pro subscriptions report not having access yet.

replies(1): >>42914176 #

61. leonheld ◴[03 Feb 25 00:48 UTC] No.42913594{3}[source]▶

>>42913532 #

Oh God, this is such an astute observation. I think it worked so well on me that I didn't even think about the "deep" portion initially. Goes to show how effective these things are psychologically.

62. ai-christianson ◴[03 Feb 25 00:49 UTC] No.42913602[source]▶

>>42913417 #

Has anyone here tried it out yet?

replies(2): >>42914074 #>>42914866 #

63. kevlened ◴[03 Feb 25 00:49 UTC] No.42913606{3}[source]▶

>>42913532 #

It's more likely this is a response to Gemini Deep Research released in December

https://blog.google/products/gemini/google-gemini-deep-resea...

replies(2): >>42913651 #>>42913680 #

64. rvnx ◴[03 Feb 25 00:50 UTC] No.42913614{3}[source]▶

>>42913483 #

This is one of the actual questions:

> In Greek mythology, who was Jason's maternal great-grandfather?

https://www.google.com/search?q=In+Greek+mythology%2C+who+wa...

replies(4): >>42913751 #>>42913754 #>>42914173 #>>42915030 #

65. dougb5 ◴[03 Feb 25 00:52 UTC] No.42913622{3}[source]▶

>>42913532 #

Does the naming scheme they've used for models so far suggest that they care about SEO?

66. greatpostman ◴[03 Feb 25 00:52 UTC] No.42913623[source]▶

>>42913575 #

Have pro, can’t see it yet

67. scarab92 ◴[03 Feb 25 00:52 UTC] No.42913628[source]▶

>>42913491 #

I doubt those crawler rules will be honoured for long.

I wouldn’t even be surprised if a law is passed requiring sites to provide equal access to humans whether accessed directly or via these models.

It’s too important an innovation to stall, especially considering the US’s competitors (China) won’t respect robots.txt either.

68. tmnvdb ◴[03 Feb 25 00:52 UTC] No.42913630[source]▶

>>42913534 #

You can synthesize the parts to get the whole. Both uses are correct AFAIK

69. cj ◴[03 Feb 25 00:53 UTC] No.42913637[source]▶

>>42913491 #

Anyone selling anything would want to remain crawlable if people use this to research something that could lead to a purchase.

replies(1): >>42913838 #

70. timsh ◴[03 Feb 25 00:53 UTC] No.42913638[source]▶

>>42913555 #

this is so precise. I guess we’ll need a global version of https://datacolada.org/ quite soon to not get hit by a bus in every scientific field

71. tmnvdb ◴[03 Feb 25 00:54 UTC] No.42913642[source]▶

>>42913579 #

> At least there was some understanding behind the scenes of where the numbers came from, and sources would/could be provided.

Oh Sweet summer child.

replies(1): >>42918843 #

72. VerdisQuo5678 ◴[03 Feb 25 00:54 UTC] No.42913649[source]▶

>>42913251 (OP) #

The accuracy of this tool does not matter. This is exclusively designed for box ticking "reports" that nobody reads and a produced for the sake of itself.

replies(2): >>42913682 #>>42913884 #

73. ldjkfkdsjnv ◴[03 Feb 25 00:55 UTC] No.42913650[source]▶

>>42913251 (OP) #

So much cynicism and hate in these comments, especially as we are likely witnessing AGI come to life. Its still early, but it might be coming. Where is the excitement? This is an interesting time to be alive.

HN has a huge cultural problem that makes this website almost irrelevant. All the interesting takes have moved to X/twitter

replies(8): >>42913676 #>>42913765 #>>42913916 #>>42913940 #>>42914121 #>>42914182 #>>42914446 #>>42916186 #

74. nicce ◴[03 Feb 25 00:55 UTC] No.42913651{4}[source]▶

>>42913606 #

Two birds with one stone: timing for Deepseek and feature for Gemini

75. crazylogger ◴[03 Feb 25 00:55 UTC] No.42913652[source]▶

>>42913491 #

This is trivially bypassed by OpenAI asking the user to take control of their computer (or a sandboxed browser within it,) then for all intents and purposes it’s the user themselves accessing your site (with some productivity/accessibility aid from OAI.)

76. bonoboTP ◴[03 Feb 25 00:55 UTC] No.42913656[source]▶

>>42913513 #

Sure if you're viewing this as some kind of spectator thing, or entertainment, maybe it's less interesting. But it doesn't really matter whether "people care". What matters is whether it's useful and has impact. It's enough if the small number of people use it for whom it is useful. It doesn't matter if the average Joe on the street is excited by it.

Few people care or even know about various advances in various specialized fields. It's enough if AI simply seeps into various applications in boring and non-flashy ways for it to have significant effects that will affect a wider range of people, whether they get hyped by the news announcements or not. Jobs etc.

An analogy: the Internet as such is not very exciting nowadays, certainly not in the way it was exciting in the 90s with all the news segments about surfing the information superhighway or whatever. There was a lot of buzz around the web, but then it got normalized. It didn't disappear, it just got taken for granted. No average person got excited around HTML5 or IPv6. It just chugs along in the background. AI will similarly simply build into the fabric of how things get done. Sometimes visibly to the average person, sometimes just behind the scenes.

replies(1): >>42914269 #

77. elicksaur ◴[03 Feb 25 00:55 UTC] No.42913659[source]▶

>>42913544 #

Its number for one of the benchmark has:

**with browsing + python tools

Maybe we have different definitions of scaling?

replies(1): >>42913906 #

78. ◴[03 Feb 25 00:57 UTC] No.42913668{3}[source]▶

>>42913532 #

79. wilg ◴[03 Feb 25 00:57 UTC] No.42913669[source]▶

>>42913251 (OP) #

I think this looks cool. Apparently unlike everyone else on this website?

replies(1): >>42913776 #

80. rvnx ◴[03 Feb 25 00:57 UTC] No.42913674{4}[source]▶

>>42913586 #

Doesn't make it untrue.

Agents that can search the internet exist for a while now and have been essentially solved and happily used in platforms like Perplexity.

It's really "meh", very far from revolutionary.

Keep in mind this company is trying to convince everybody they need 500B USD now (through the Stargate project).

replies(3): >>42913714 #>>42914669 #>>42915080 #

81. bonoboTP ◴[03 Feb 25 00:58 UTC] No.42913676[source]▶

>>42913650 #

HN is and has always been quite negative/pessimistic/cynical in general. That Dropbox comment was quite a long time ago already.

82. petra ◴[03 Feb 25 00:58 UTC] No.42913680{4}[source]▶

>>42913606 #

That Google product isnt that good, it can't really replace research done by a person.

replies(3): >>42913841 #>>42915060 #>>42915184 #

83. arbywhy ◴[03 Feb 25 00:58 UTC] No.42913682[source]▶

>>42913649 #

99% of corpo upper management slide deck work. ai only makes more of this useless pencil-neck board of directors slop.

replies(1): >>42913892 #

84. spyckie2 ◴[03 Feb 25 00:59 UTC] No.42913685[source]▶

>>42913462 #

I mean you want it to grill your steak and eat it for you too?

I mean I too can complain that my iPhone doesn’t automatically screen out spammers and send my mom flowers on Mother’s Day.

replies(1): >>42913788 #

85. roenxi ◴[03 Feb 25 01:00 UTC] No.42913698[source]▶

>>42913462 #

Maybe. Not enough data to say. Say it does a days worth of work in a query. It is sensible to use if it takes less than a day to review ~5 days worth of work. I don't know if we're near that threshold yet but conceptually this would work well for actual research where the amount of preparation is large compared to the amount of output written.

And eyeballing the benchmarks, it'll probably reach a >50% rate per query by the end of the year. Seems to double every model or two.

86. autoconfig ◴[03 Feb 25 01:01 UTC] No.42913709[source]▶

>>42913579 #

Either you care about being correct or you don't. If you don't care then it doesn't matter whether you made it up or the AI did. If you care then you'll fact check before publishing. I don't see why this changes.

replies(10): >>42913768 #>>42913843 #>>42913844 #>>42914080 #>>42914117 #>>42914243 #>>42915119 #>>42915840 #>>42920726 #>>42936702 #

87. scarab92 ◴[03 Feb 25 01:02 UTC] No.42913713[source]▶

>>42913579 #

Think of it like a vaccine.

The majority of human written consultant reports are already complete rubbish. Low accuracy, low signal-to-noise, generic platitudes in a quantity-over-quality format.

LLMs are innoculating people to this kind of low information value content.

People who produce LLM quality output, are now being accused of using LLMs, and can no longer pretend to be adding value.

The result of this is going to be higher quality expectations from consultants and a shaking out of people who produce word vommit rather than accurate, insightful, contextually relevent information.

replies(3): >>42913886 #>>42915909 #>>42920728 #

88. spyckie2 ◴[03 Feb 25 01:03 UTC] No.42913714{5}[source]▶

>>42913674 #

To go from partially automated to fully automated is thousands of non trivial edge cases and unforeseen decision points that must be tamed.

To say this is trivial is like saying the one shot ai prompted twitter clone is the same thing as twitter.

Peak HN indeed.

89. therealmarv ◴[03 Feb 25 01:05 UTC] No.42913732[source]▶

>>42913251 (OP) #

I don't know. OpenAI is so bad in naming... the average person on the street will confuse Deepseek with Deep Research. Also not to forget o1, o3 ... 4o

replies(3): >>42913745 #>>42913750 #>>42913760 #

90. Havoc ◴[03 Feb 25 01:05 UTC] No.42913738[source]▶

>>42913251 (OP) #

The descriptions of the product sounded substantially more impressive than the actual samples tbh.

Still I think there is a big market for this sort of „go away for 30 mins and figure this out“ style agent

replies(1): >>42914797 #

91. panarky ◴[03 Feb 25 01:06 UTC] No.42913741[source]▶

>>42913555 #

> they are probabilistic language models

This is like arguing an Airbus cannot possibly fly because it is 165 tonnes of aluminum, steel and plastic.

The proof is in the fact that it flies, not what it is constructed from.

replies(1): >>42914232 #

92. szvsw ◴[03 Feb 25 01:06 UTC] No.42913745[source]▶

>>42913732 #

> the average person on the street will confuse Deepseek with Deep Research.

That’s probably a feature not a bug (from OpenAI’s perspective…).

93. hipadev23 ◴[03 Feb 25 01:07 UTC] No.42913750[source]▶

>>42913732 #

Yes.

94. tmnvdb ◴[03 Feb 25 01:07 UTC] No.42913751{4}[source]▶

>>42913614 #

This is a hard question for language models since it targets one of their known weaknesses.

replies(3): >>42913804 #>>42913858 #>>42913875 #

95. elicksaur ◴[03 Feb 25 01:08 UTC] No.42913754{4}[source]▶

>>42913614 #

In Greek mythology, Jason's maternal great-grandfather was Einstein.

96. viraptor ◴[03 Feb 25 01:08 UTC] No.42913756{5}[source]▶

>>42913588 #

Deepmind is not a consumer product. Gemini is part of it but nobody calls it deepmind.

replies(1): >>42913817 #

97. tmnvdb ◴[03 Feb 25 01:09 UTC] No.42913760[source]▶

>>42913732 #

You're not wrong but it feels like bikeshedding at this point.

98. layer8 ◴[03 Feb 25 01:09 UTC] No.42913761[source]▶

>>42913251 (OP) #

From the demo: “Use bullets and tables where necessary for clarity.” It’s weird that it would be necessary to specify that. I suppose they want to showcase that you can influence the output style, but it’s strange that you’d have to explicitly specify the use of something that is “necessary for clarity”. It comes across as either a flaw in the default execution, or as a merely performative incantation.

99. roenxi ◴[03 Feb 25 01:10 UTC] No.42913765[source]▶

>>42913650 #

We're looking at trends that may well obliterate the economic value of a well trained human mind sitting behind a keyboard all day. That is a bit of a threat to most people on HN if the trending continues at the current rate and direction.

100. prng2021 ◴[03 Feb 25 01:10 UTC] No.42913766[source]▶

>>42913251 (OP) #

"Deep research was trained using end-to-end reinforcement learning"

Does this mean they skipped supervised fine tuning like DeepSeek did with R1?

replies(1): >>42914204 #

101. spaceywilly ◴[03 Feb 25 01:10 UTC] No.42913768{3}[source]▶

>>42913709 #

I think a lot about how differentiating facts and quality content is like differentiating signal from noise in electronics. The signal to noise ratio on many online platforms was already quite low. Tools like this will absolutely add more noise, and arguably the nature of the tools themselves make it harder to separate the noise.

I think this is a real problem for these AI tools. If you can’t separate the signal from the noise, it doesn’t provide any real value, like an out of range FM radio station.

replies(1): >>42913967 #

102. tmnvdb ◴[03 Feb 25 01:11 UTC] No.42913776[source]▶

>>42913669 #

HN is full of people who want to feel smart by complaining.

replies(1): >>42930490 #

103. jasonjmcghee ◴[03 Feb 25 01:12 UTC] No.42913779[source]▶

>>42913251 (OP) #

Surprised more comments aren't mentioning deepseek has this feature (for free) already. Assuming this is why OpenAI scrambled to release it.

The examples they have on the page work well on chat.deepseek.com with r1 and search options both enabled.

Do I blindly trust the accuracy of either though? Absolutely not. I'm pretty concerned about these models falling into gaming SEO and finding inaccurate facts and presenting them as fact. (How easy is it to fool / prompt inject these models?)

But has utility if held right.

replies(2): >>42913820 #>>42913887 #

104. xnx ◴[03 Feb 25 01:12 UTC] No.42913782{3}[source]▶

>>42913532 #

Google publicly announced a model named "Deep Research" on December 11th: https://blog.google/products/gemini/google-gemini-deep-resea...

105. pjs_ ◴[03 Feb 25 01:12 UTC] No.42913784[source]▶

>>42913251 (OP) #

McKinsey mode

replies(3): >>42914738 #>>42914827 #>>42915821 #

106. scarab92 ◴[03 Feb 25 01:13 UTC] No.42913788{3}[source]▶

>>42913685 #

Why doesn't the iPhone screen spammers yet? Pixel has had this feature for a decade.

replies(1): >>42916014 #

107. layer8 ◴[03 Feb 25 01:15 UTC] No.42913804{5}[source]▶

>>42913751 #

Users don’t care about how hard something is for LLMs if they receive incorrect output.

108. spyckie2 ◴[03 Feb 25 01:15 UTC] No.42913806[source]▶

>>42913251 (OP) #

Why is HN not creating policy against moral prigotry? There is no useful discussion here anymore.

Seriously begging the mods to take a closer look, or at least PG to not abandon his curated internet space.

replies(1): >>42913830 #

109. pjot ◴[03 Feb 25 01:16 UTC] No.42913816[source]▶

>>42913534 #

From New Oxford dictionary:

  > combine (a number of things) into a coherent whole: pupils should synthesize the data they have gathered | Darwinian theory has been synthesized with modern genetics.

110. bonoboTP ◴[03 Feb 25 01:16 UTC] No.42913817{6}[source]▶

>>42913756 #

The point is, "deep" is an extremely generic word in the AI space.

111. nicce ◴[03 Feb 25 01:16 UTC] No.42913820[source]▶

>>42913779 #

I wish Kagi would work with similar performance. Their lenses feature is perfect for this and they already filter out most of the SEO spam based on trackers and other typical red flags.

112. esafak ◴[03 Feb 25 01:18 UTC] No.42913830[source]▶

>>42913806 #

priggery or bigotry? And what are you referring to?

113. reaperman ◴[03 Feb 25 01:19 UTC] No.42913838{3}[source]▶

>>42913637 #

Not necessarily. Southwest airlines doesnt allow itself on price comparison sites or Google Flights.

Amazon listings are blocked from google shopping and other price comparison sites.

replies(3): >>42914545 #>>42915788 #>>42920312 #

114. nicce ◴[03 Feb 25 01:19 UTC] No.42913841{5}[source]▶

>>42913680 #

Just one tool in the toolbox. It helps to see if some sources have been missed.

115. layer8 ◴[03 Feb 25 01:19 UTC] No.42913843{3}[source]▶

>>42913709 #

People are much less scrupulous using LLM output than making up stuff themselves, because then they can blame the LLM.

116. hi_hi ◴[03 Feb 25 01:19 UTC] No.42913844{3}[source]▶

>>42913709 #

Because maybe you want to, but you have a boss breathing down your neck and KPIs to meet and you haven't slept properly in days and just need a win, so you get the AI to put together some impressive looking graphs and stats that will look impressive in that client showcase thats due in a few hours.

Things aren't quite so black and white in reality.

replies(1): >>42914162 #

117. esafak ◴[03 Feb 25 01:20 UTC] No.42913847[source]▶

>>42913251 (OP) #

Is there a benchmark we can compare this against You.com's research mode? It looks like R1 forced them to release o3 prematurely and give it Internet access. And they didn't want to say they released o3 so they called it 'Deep Research'.

118. andyg_blog ◴[03 Feb 25 01:21 UTC] No.42913858{5}[source]▶

>>42913751 #

Greek mythology? But seriously please elaborate for my less educated self.

replies(2): >>42914055 #>>42914950 #

119. itkovian_ ◴[03 Feb 25 01:22 UTC] No.42913867[source]▶

>>42913462 #

Here's an example of the type of question it is acheiving 20% on;

The set of natural transformations between two functors F,G ⁣:C→DF,G:C→D can be expressed as the end Nat(F,G)≅∫AHomD(F(A),G(A)). Nat(F,G)≅∫A HomD (F(A),G(A)).

Define set of natural cotransformations from FF to GG to be the coend CoNat(F,G)≅∫AHomD(F(A),G(A)). CoNat(F,G)≅∫AHomD (F(A),G(A)).

Let: - F=B∙(Σ4)∗/F=B∙ (Σ4 )∗/ be the under ∞∞-category of the nerve of the delooping of the symmetric group Σ4Σ4 on 4 letters under the unique 00-simplex ∗∗ of B∙Σ4B∙ Σ4 . - G=B∙(Σ7)∗/G=B∙ (Σ7 )∗/ be the under ∞∞-category nerve of the delooping of the symmetric group Σ7Σ7 on 7 letters under the unique 00-simplex ∗∗ of B∙Σ7B∙ Σ7 .

How many natural cotransformations are there between FF and GG?

replies(6): >>42914310 #>>42914528 #>>42914988 #>>42915478 #>>42915923 #>>42921302 #

120. 11101010001100 ◴[03 Feb 25 01:23 UTC] No.42913875{5}[source]▶

>>42913751 #

It's categorically more than a weakness.

121. tomrod ◴[03 Feb 25 01:24 UTC] No.42913884[source]▶

>>42913649 #

The new term for this is "AI Loopidity", highlighting the unintelligent ouroboros nature of one side using AI to generate content and then another side to consume content.

replies(1): >>42914719 #

122. layer8 ◴[03 Feb 25 01:24 UTC] No.42913886{3}[source]▶

>>42913713 #

This has been downvoted, but I think there’s actually a chance it might become true (until AGI comes along at least).

123. starchild3001 ◴[03 Feb 25 01:24 UTC] No.42913887[source]▶

>>42913779 #

Not really accurate. The "Search" functionality you're describing in DeepSeek is comparable to OpenAI's existing "Search GPT." OpenAI's recent announcement refers to a more advanced capability, similar to Gemini's existing "deep research" feature. DeepSeek's current offerings are significantly more limited in scope.

replies(3): >>42913965 #>>42914018 #>>42914809 #

124. reaperman ◴[03 Feb 25 01:25 UTC] No.42913892{3}[source]▶

>>42913682 #

“Pencil-neck” is a strange insult to use here. How are software developers, or hardware design engineers, or finance workers any less “pencil-neck” than “board of directors”?

125. tmnvdb ◴[03 Feb 25 01:27 UTC] No.42913906{3}[source]▶

>>42913659 #

I would consider unsupervised tool usage an achievement

replies(1): >>42914607 #

126. rvz ◴[03 Feb 25 01:28 UTC] No.42913916[source]▶

>>42913650 #

> "So much cynicism and hate in these comments, especially as we are likely witnessing AGI come to life. Its still early, but it might be coming. Where is the excitement? This is an interesting time to be alive."

Maybe you can define what "AGI" really means and what the end-game and the economic implications are when 'AGI" is some-what achieved? OpenAI somehow believes that they haven't achieved "AGI" yet, which they continue to do this on purpose for obvious reasons.

The first hint I will give you is that it certainly won't be a utopia.

127. tmnvdb ◴[03 Feb 25 01:28 UTC] No.42913919{4}[source]▶

>>42913551 #

Yes, they show benchmarks in the article linked here. Did you not read it?

replies(1): >>42918539 #

128. layer8 ◴[03 Feb 25 01:29 UTC] No.42913921[source]▶

>>42913531 #

And is it smart enough to use archive.today for paywalled articles. ;)

129. willy_k ◴[03 Feb 25 01:29 UTC] No.42913923{3}[source]▶

>>42913449 #

They’ve only released o3-mini, which is a powerful model but not the full o3 that is being claimed as too expensive to release. That being said, DeepSeek for sure forced their hand to release o3-mini to the public.

replies(2): >>42915341 #>>42916026 #

130. spyckie2 ◴[03 Feb 25 01:30 UTC] No.42913927[source]▶

>>42913251 (OP) #

Is this ability really a prerequisite to AGI and ASI?

Reasoning, problem solving, research validation - at the fundamental outset it is all refinement thinking.

Research is one of those areas where I remain skeptical it is that important because the only valid proof is in the execution outcome, not the compiled answer.

For instance you can research all you want about the best vacuum on the internet but until you try it out yourself you are going to be caught in between marketing, fake reviews, influencers, etc. maybe the science fields are shielded from this (by being boring) but imagine medical pharmas realizing that they can get whatever paper to say whatever by flooding the internet with their curated blog articles containing advanced medical “research findings”. At some point you cannot trust the internet at all and I imagine that might be soon.

I worry especially with the rapidly changing landscape of the amount of generated text in the internet that research will lose a lot of value due to massive amounts of information garbage.

It will be a thing we used to do when the internet was still “real”.

replies(3): >>42914133 #>>42914744 #>>42914986 #

131. layer8 ◴[03 Feb 25 01:32 UTC] No.42913940[source]▶

>>42913650 #

“May you live in interesting times” is usually taken as a curse. ;)

More seriously, it’s unclear why one should be excited by the prospect of AGI, especially when instrumentalized by corporations and authoritarian governments.

132. mistercheph ◴[03 Feb 25 01:34 UTC] No.42913952[source]▶

>>42913417 #

I'm sure o3 will be a generation ahead of whatever deepseek, google and meta are doing today when it launches in 10 months, super impressive stuff.

replies(1): >>42914406 #

133. jasonjmcghee ◴[03 Feb 25 01:36 UTC] No.42913965{3}[source]▶

>>42913887 #

Doesn't seem like access is available to try "deep research" yet on OpenAI, so I can only speak to what I tried, which was their examples on the blog post (using DeepSeek w/ R1 + Search) and results were pretty similar.

AFAIK OpenAI's current offering uses 4o, and it does a web search and then pipes it into 4o. I'm guessing adding CoT + other R1/o3 like stuff is one of the key effective differences. But time will tell how different it is. Maybe it's a dramatic improvement.

134. WOTERMEON ◴[03 Feb 25 01:37 UTC] No.42913967{4}[source]▶

>>42913768 #

Not only that: by publishing noise, you’re lowering the signal/noise ratio.

135. ◴[03 Feb 25 01:38 UTC] No.42913971{3}[source]▶

>>42913574 #

136. WiSaGaN ◴[03 Feb 25 01:46 UTC] No.42914018{3}[source]▶

>>42913887 #

SearchGPT is bad because its underlying model is not a reasoning one. Deepseek one mentioned above is closer to deep research than searchgpt.

137. lukeschlather ◴[03 Feb 25 01:47 UTC] No.42914021[source]▶

>>42913555 #

o1 and o3 are definitely not your run of the mill LLM. I've had o1 correct my logic, and it had correct math to back up why I was wrong. I'm very skeptical, but I do think at some point AI is going to be able to do this sort of thing.

138. ejang0 ◴[03 Feb 25 01:49 UTC] No.42914043[source]▶

>>42913251 (OP) #

Can anyone confirm if this is available in Canada and other countries? This site says "We are still working on bringing access to users in the United Kingdom, Switzerland, and the European Economic Area." But I'm not sure about other countries. I don't have Pro currently, only Plus.

replies(3): >>42914048 #>>42930745 #>>42936422 #

139. carbocation ◴[03 Feb 25 01:50 UTC] No.42914048[source]▶

>>42914043 #

I don't even see it in the US right now.

replies(1): >>42916559 #

140. tmnvdb ◴[03 Feb 25 01:51 UTC] No.42914055{6}[source]▶

>>42913858 #

LLMs often don't do well on tasks that require composition into smaller subtasks. In this case there is a chain of relations that depend on the previous result.

141. hnisoss ◴[03 Feb 25 01:51 UTC] No.42914058{5}[source]▶

>>42913553 #

Yea but guy paying closedai to get "insights" that basically copy-pasted content from my blog is definitely violating my blogs copyright, and in the end no coin comes to me either. What about that?

replies(1): >>42914190 #

142. hnisoss ◴[03 Feb 25 01:53 UTC] No.42914069{3}[source]▶

>>42913564 #

So its fine for OpenAI to effectively sell your CC BY-NC content to others?

143. hnisoss ◴[03 Feb 25 01:54 UTC] No.42914073[source]▶

>>42913436 #

Of course. It's a child's play for SamA et al.

144. maroonblazer ◴[03 Feb 25 01:54 UTC] No.42914074{3}[source]▶

>>42913602 #

Per the below, seems it's not available to many yet.

https://news.ycombinator.com/item?id=42913575

145. wahnfrieden ◴[03 Feb 25 01:54 UTC] No.42914076{3}[source]▶

>>42913591 #

Client-side browsers that crawl for users (and prompt for logins or captcha as needed) won't be as easily blockable

146. sbarre ◴[03 Feb 25 01:55 UTC] No.42914080{3}[source]▶

>>42913709 #

How hard it is to produce credible-looking bullshit makes a really big difference in these scenarios.

Consultants aren't the ones doing the fact-checking, that falls to the client, who ironically tend to assume the consultants did it.

147. navigaid ◴[03 Feb 25 01:57 UTC] No.42914095[source]▶

>>42913456 #

Name one single open source model released by OpenAI since 2020

148. azinman2 ◴[03 Feb 25 02:00 UTC] No.42914117{3}[source]▶

>>42913709 #

When things are easy, you’re going to take the easy path even if it means quality goes down. It’s about trade offs. If you had to do it yourself, perhaps quality would have been higher because you had no other choice.

Lots of kids don’t want to do homework. That said, previously many would because there wasn’t another choice. But now they can just ask ChatGPT for the answers they’ll write that down verbatim with zero learning taking place.

Caring isn’t a binary thing or works in isolation.

replies(3): >>42914715 #>>42915214 #>>42916335 #

149. rpcope1 ◴[03 Feb 25 02:01 UTC] No.42914121[source]▶

>>42913650 #

> especially as we are likely witnessing AGI come to life

Man, I've got a great deal on some oceanfront property in Wyoming for you.

150. observationist ◴[03 Feb 25 02:03 UTC] No.42914133[source]▶

>>42913927 #

It's a direction in a vast landscape, not a feature of itself - being better at different tasks, like search generally, and research in conjunction with reasoning, gets the model closer to AGI. An AGI will be able to do these tasks - so the point of the research is to have more Venn diagrams of capabilities like these to help narrow down the view on things that might actually be fundamental mechanisms involved in AGI.

Moravec detailed the idea of a landscape of human capabilities slowly being submerged by AI capabilities, and the point at which AI can do anything a human can, in practice or in principle, we'll know for certain we've reached truly general AI. This idea includes things like feeling pain and pleasure, planning, complex social, oral, and ethical dynamics, and anything else you can possibly think of as relevant to human intelligence. Deep Research is just another island being slowly submerged by the relentless and relentlessly accelerating flood.

replies(1): >>42914200 #

151. 6gvONxR4sf7o ◴[03 Feb 25 02:05 UTC] No.42914149[source]▶

>>42913251 (OP) #

There are some people in the blogosphere who are known experts in their niche or even niche-famous because they write popular useful stuff. And there are a ton more people who write useful stuff because they want that 'exposure.' At least, they do in the very broadest sense of writing it for another human to read it. I wonder if these people will keep writing when their readership is all bots. Dead internet here we come.

replies(2): >>42914184 #>>42915426 #

152. dauhak ◴[03 Feb 25 02:07 UTC] No.42914162{4}[source]▶

>>42913844 #

I mean those same conditions already just lead the human to cutting corners and making stuff up themselves. You're describing the problem where bad incentives/conditions lead to sloppy work, that happens with or without AI

Catching errors/validating work is obviously a different process when they're coming from an AI vs a human, but I don't see how it's fundamentally that different here. If the outputs are heavily cited then that might go someway into being able to more easily catch and correct slip-ups

replies(2): >>42914305 #>>42915282 #

153. pama ◴[03 Feb 25 02:10 UTC] No.42914173{4}[source]▶

>>42913614 #

No it is not an actual question on this exam. From the paper: “To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable, and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, though contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (e.g., precise historical details, trivia, local customs) and have specific, unambiguous answers…”. (Emphasis mine)

replies(2): >>42914454 #>>42916230 #

154. ◴[03 Feb 25 02:10 UTC] No.42914176{3}[source]▶

>>42913593 #

155. qgin ◴[03 Feb 25 02:11 UTC] No.42914182[source]▶

>>42913650 #

Never underestimate HN's capacity to be cynical about literally everything

156. seanmcdirmid ◴[03 Feb 25 02:11 UTC] No.42914184[source]▶

>>42914149 #

I'm all for writing just for the bots, if I can figure it out. A lot of academic papers aren't really read anyways, just briefly glanced at so they can be cited together, large publications like journal pubs or dissertations even less so. But the ability to add to a world of knowledge that is very easy to access by people who want to use it...that is very appealing to me as an author. No more trudging through a bunch of papers with titles that might be relevant to what I want to know about...and no more trudging through my papers, I'm OK with that.

157. optimalsolver ◴[03 Feb 25 02:11 UTC] No.42914185[source]▶

>>42913491 #

Big Tech Podcast listener?

158. handfuloflight ◴[03 Feb 25 02:12 UTC] No.42914190{6}[source]▶

>>42914058 #

Could you provide an example where OpenAI outputting verbatim quotes actually constitutes the copyright violation? Because mechanically retrieving relevant quotes seems analogous to grep/search - the copyright status would depend on how downstream users transform and use that content. Like how quoting your blog in a technical analysis or critique is fair use, but wholesale republishing isn't. This suggests the violation occurs at usage time, not retrieval time.

159. numba888 ◴[03 Feb 25 02:14 UTC] No.42914200{3}[source]▶

>>42914133 #

> hings like feeling pain and pleasure

can machine feel? without that there is no AGI according to definition above.

and the second question: are animals "GI"? they don't have language and don't solve math problems, never heard of np-complete.

replies(1): >>42914599 #

160. OutOfHere ◴[03 Feb 25 02:15 UTC] No.42914204[source]▶

>>42913766 #

No, it just suggests that RL was used over a base SFT model, and moreover that RL here was tuned to this research task. Personally I don't think that RL is strictly necessary for this task at all, but perhaps it helps.

161. ADeerAppeared ◴[03 Feb 25 02:20 UTC] No.42914232{3}[source]▶

>>42913741 #

> The proof is in the fact that it flies, not what it is constructed from.

And LLMs do not.

> "But it looks like reasoning to me"

My condolences. You should go see a doctor about your inability to count the number of 'R's in a word.

replies(3): >>42914355 #>>42914694 #>>42915202 #

162. fosterfriends ◴[03 Feb 25 02:20 UTC] No.42914236[source]▶

>>42913575 #

I have pro, in US, not seeing yet

replies(1): >>42914963 #

163. ADeerAppeared ◴[03 Feb 25 02:22 UTC] No.42914243{3}[source]▶

>>42913709 #

> If you care then you'll fact check before publishing.

Doing a proper fact check is as much work as doing the entire research by hand, and therefore, this system is useless to anyone who cares about the result being correct.

> I don't see why this changes.

And because of the above this system should not exist.

164. fizx ◴[03 Feb 25 02:23 UTC] No.42914246[source]▶

>>42913575 #

same same

165. InkCanon ◴[03 Feb 25 02:27 UTC] No.42914269{3}[source]▶

>>42913656 #

Not sure if it's just me, but it looks like all SOTA companies are doubling down to chase the new benchmark, which beyond hype, doesn't seem to translate into real world uses. Why don't these companies just plug it into a popular git repo and say, hey our AI fixed these 100 issues! Or something real? The only people who seem to be doing something real is DeepMind.

166. febin ◴[03 Feb 25 02:30 UTC] No.42914283[source]▶

>>42913436 #

I see many are offended, but I am genuinely asking a question.

I want to understand does this mean it's ethical for anyone to create a research AI tool that will go through arXiv and related GitHub repo and use it to solve problems, implement ideas like cursor.

167. random_cynic ◴[03 Feb 25 02:30 UTC] No.42914288[source]▶

>>42913462 #

The difference is that it takes few minutes to an hour at most so it can be run multiple times a day, using the results of previous runs to further refine the search and reasoning process to get better outcomes. Pretty much how any human research works but much faster and with potentially vastly more world-knowledge and reasoning capability than average humans. And these capabilities will rapidly improve with further RL.

168. lordofgibbons ◴[03 Feb 25 02:30 UTC] No.42914290[source]▶

>>42913417 #

> Which might explain why it leaps ahead in benchmarks considerably and aligns with the claims o3 is too expensive to release publicly

It's the only tool/system (I won't call it an LLM) in their released benchmarks that has access to tools and the web. So, I'd wager the performance gains are strictly due to that.

If an LLM (o3) is too expensive to be released to the public, why would you use it in a tool that has to make hundreds of inference calls to it to answer a single question? You'd use a much cheaper model. Most likely o3-mini or o1-mini combined with o4-mini for some tasks.

replies(1): >>42917836 #

169. hi_hi ◴[03 Feb 25 02:33 UTC] No.42914305{5}[source]▶

>>42914162 #

Yep, I agree with this to some extent, but I think the difference in the future is all that stress will be bypassed and people will reach for the AI from the start.

Previously there was alot of stress/pressure which might or might not have led to sloppy work (some consultants are of a high quality). With this, there will be no stress which will (always?) lead to sloppy work. Perhaps there's an argument for the high quality consultants using the tools to produce accurate and high quality work. There will obviously be a sliding scale here. Time will tell.

I'd wager the end result will be sloppy work, at scale :-)

170. Bjorkbat ◴[03 Feb 25 02:38 UTC] No.42914326[source]▶

>>42913251 (OP) #

Actually sounds pretty cool, but the graph on expert level tasks is confusing my expectations. Saying it has a pass rate of less than 20% sounds a lot like saying this thing is wrong most of the time.

Granted, these strike me as difficult tasks and I’d likely ask it to do far simpler things, but I’m not really sure what to expect from looking at these graphs.

Ah, but the fact that it bothers to cite its sources is a huge plus. Between that and its search abilities it sounds valuable to me

replies(1): >>42914482 #

171. EcommerceFlow ◴[03 Feb 25 02:39 UTC] No.42914332[source]▶

>>42913251 (OP) #

Can't even get Sunday nights off trying to keep up fml.

172. jaco6 ◴[03 Feb 25 02:41 UTC] No.42914337[source]▶

>>42913251 (OP) #

I see lots of warranted skepticism about the capabilities of this tool, but the reality is that this is an incremental step toward full automation of white collar labor. No, it will not make all analysts jobless overnight. But it may reduce hiring of said people by 5 or 10 percent. And as people get better at using the tool and the tool itself gets better, those numbers will grow. Remember that it took decades for the giant pool of typing secretaries in Mad Men to disappear, but they did disappear. Gone forever. Interestingly, anger about the diminishment of secretarial male white collar work in Germany due to the spread of the typewriter a few decades earlier was one of the drivers of the Nazi Party’s popularity (see Evans, the Rise of the Third Reich).

AI’s triumph in the white collar workplace will be gradual, not instantaneous. And it will be grimly quiet, because no one likes white collar workers the way they like blue collar workers, for some odd reason, and there’s no tradition of solidarity among white collar workers. Everyone will just look up one day and find that the local Big Corp headquarters is…empty.

173. chachamatcha ◴[03 Feb 25 02:46 UTC] No.42914362[source]▶

>>42913575 #

also US based, have pro and still no access.

174. petesergeant ◴[03 Feb 25 02:53 UTC] No.42914406{3}[source]▶

>>42913952 #

I’m not sure if you’re implying this subtly in your comment or not, as it’s early here, but it does of course need to be a generation ahead of what 10 months of their competitors moving forward have done too. Nobody is standing still

replies(1): >>42915027 #

175. blackeyeblitzar ◴[03 Feb 25 02:57 UTC] No.42914426{4}[source]▶

>>42913495 #

OpenAI’s scraping will likely be ruled as fair use.

176. dutchbookmaker ◴[03 Feb 25 03:02 UTC] No.42914446[source]▶

>>42913650 #

I would be more excited if it wasn't $200 a month to try.

I don't feel like OpenAI does a good job of getting me excited either.

Find the perfect snowboard? How can that idea get pitched and make the final cut for a $200 a month service? The NFL kicker example is also completely ridiculous.

The business and UX example seems interesting. Would love to see more.

177. neynt ◴[03 Feb 25 03:03 UTC] No.42914454{5}[source]▶

>>42914173 #

It's example #7 on https://lastexam.ai/

replies(1): >>42918951 #

178. michaelgiba ◴[03 Feb 25 03:03 UTC] No.42914456[source]▶

>>42913251 (OP) #

Gemini has had this for a month or two, also named "Deep Research" https://blog.google/products/gemini/google-gemini-deep-resea...

Meta question: what's with all of the naming overlap in the AI world? Triton (Nvidia, OpenAI) and Gro{k,q} (X.ai, groq, OpenAI) all come to mind

replies(8): >>42914559 #>>42914660 #>>42914888 #>>42915135 #>>42915211 #>>42918214 #>>42918218 #>>42918497 #

179. pazimzadeh ◴[03 Feb 25 03:03 UTC] No.42914457[source]▶

>>42913251 (OP) #

> In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied?

Aren't there more than one articles that did not mention plasmons or plasmonics in Scientific Reports in 2012?

Also, did they pay for access to all journal contents? that would be useful

replies(1): >>42914477 #

180. getnormality ◴[03 Feb 25 03:04 UTC] No.42914463[source]▶

>>42913251 (OP) #

The demo on global e-commerce trends seems less useful than a Google search, where the AI answer will at least give you links to the claimed information.

181. jmount ◴[03 Feb 25 03:06 UTC] No.42914476[source]▶

>>42913251 (OP) #

I had no idea there was a market for "Compile a research report on how the retail industry has changed in the last 3 years. Use bullets and tables where necessary for clarity." I imagine reading such a result is pure torture.

182. nicce ◴[03 Feb 25 03:06 UTC] No.42914477[source]▶

>>42914457 #

Maybe that is the only one with open access

183. random_cynic ◴[03 Feb 25 03:08 UTC] No.42914482[source]▶

>>42914326 #

I think that's mostly because of the access to information it has. Much of the highly useful information is not on the public internet or shows up on search engines, only domain experts know about them. Also, the websites may be paywalled or gated by login. So a better comparison would be if the models had the same level of access as an expert.

184. blackeyeblitzar ◴[03 Feb 25 03:16 UTC] No.42914522{4}[source]▶

>>42913478 #

> Stealing from thieves is fine by me. Sama was the one claiming that all information could be used to train LLMs, without permisdion of the copyright holders.

OpenAI and other LLMs scraping the internet is probably covered under fair use. DeepSeek’s violation of OpenAI’s terms is pretty clearly a violation of their terms and not legal.

replies(1): >>42915405 #

185. Davidzheng ◴[03 Feb 25 03:17 UTC] No.42914528{3}[source]▶

>>42913867 #

btw isn't this question at least really badly worded (and maybe incorrect?) the definitions they give for F and G are categories not functors... (and both categories are in fact one object with contractible space of morphisms...)

186. shlomo_z ◴[03 Feb 25 03:20 UTC] No.42914545{4}[source]▶

>>42913838 #

Your point is completely valid, but... Southwest now has an arrangement with Google Flights to allow their listings there.

187. ◴[03 Feb 25 03:20 UTC] No.42914554[source]▶

>>42913251 (OP) #

188. james_promoted ◴[03 Feb 25 03:21 UTC] No.42914559[source]▶

>>42914456 #

I've always thought the Triton situation was intentional since the name isn't generic and because the companies are stepping on each others toes here (Nvidia's Triton simplifying owning your inference; OpenAI's Triton eroding the need for familiarity with CUDA). I couldn't figure out who publicly used the name first though.

189. airstrike ◴[03 Feb 25 03:22 UTC] No.42914564[source]▶

>>42913251 (OP) #

"Deep research" is now somehow synonymous to searching online for stats and pulling stuff from Statista? And when I want to make changes to that report, do I have to tweak my prompt and get an entirely different document?

Not sure if I'm too tired and can't see it but the lack of images/examples of the resulting report in this announcement doesn't inspire a lot of confidence just yet.

190. xnx ◴[03 Feb 25 03:23 UTC] No.42914567[source]▶

>>42913430 #

I agree that OpenAI is trying to stay relevant by announcing a lot of have baked products with little to no availability.

> when Google released Gemini to have a product in the space.

Bard preceded Gemini.

191. highfrequency ◴[03 Feb 25 03:26 UTC] No.42914589[source]▶

>>42913251 (OP) #

Can it compile and run (non-Python) code as part of its tool use? Compile-run steps always seemed like they would be a huge value add during reasoning loops - it feels very silly to get output from ChatGPT, try to run it in terminal, get an error and paste the error to have ChatGPT immediately fix it. Surely it should be able to run code during the reasoning loop itself?

replies(1): >>42914736 #

192. xwolfi ◴[03 Feb 25 03:28 UTC] No.42914599{4}[source]▶

>>42914200 #

Are we not machines anyway ? Ofc a machine can feel, just need to have priorities that are aligned to itself, and use strong feedback when that self is either in danger or on the right path to preservation...

Feelings are nothing very special you know...

193. RayVR ◴[03 Feb 25 03:28 UTC] No.42914600[source]▶

>>42913251 (OP) #

Each release from openAI gives me less hope for them and this whole AI boom. They should be leading the charge of highlighting how the current generation of LLMs fail, not churning out half-baked overhyped products.

Yes, they can do some cool tricks, and tool calling is fun. No one should trust the output of these models, though. The hallucinations are bad, and my experience with the "reasoning" models is that as soon as they fuck up (they always do) they go off the rails worse than the base LLMs.

194. dyauspitr ◴[03 Feb 25 03:28 UTC] No.42914603[source]▶

>>42913462 #

On questions even specialists in that field can’t answer correctly.

195. elicksaur ◴[03 Feb 25 03:29 UTC] No.42914607{4}[source]▶

>>42913906 #

But it’s not simply scaling. Who is moving the goalposts exactly?

196. stonogo ◴[03 Feb 25 03:35 UTC] No.42914660[source]▶

>>42914456 #

It's a sort of unofficial trade association where they coalesce on specific redefinitions of terms to meet their sales and PR efforts. First they came for "intelligence," then "open source," then "reason," and it will continue. Any word which the PR wants but they can't achieve gets redefined -- "grok" is a perfect example, since in the original sci-fi book it meant "total understanding." The mythological Triton ruled the deeps, so the "deep learning" sales copy immediately co-opted it.

replies(1): >>42915351 #

197. CamperBob2 ◴[03 Feb 25 03:37 UTC] No.42914669{5}[source]▶

>>42913674 #

Let us know when your Bing-bot scores over 20% on the HLE benchmark.

replies(1): >>42920086 #

198. CamperBob2 ◴[03 Feb 25 03:42 UTC] No.42914694{4}[source]▶

>>42914232 #

OK, what's your next move, now that letter-counting has been solved by the current generation of frontier models?

CoT reasoning is reasoning, whether you like it or not. If you don't understand that, it means the models are already smarter than you.

199. blackeyeblitzar ◴[03 Feb 25 03:45 UTC] No.42914713{4}[source]▶

>>42913484 #

Yes those cases will be interesting. By default a lot of copyrighted content may be legal to use for training (in the US but also many other places) under what’s called fair use. The cases you’re referring to will likely reinforce this, but it isn’t known yet. Note that it’s not just OpenAI on that side of the argument but also other (non tech) organizations that believe protecting fair use here is current law and essential.

200. simonw ◴[03 Feb 25 03:46 UTC] No.42914715{4}[source]▶

>>42914117 #

"Lots of kids don’t want to do homework"

Sure, but if you're a professional you have to care about your reputation. Presenting hallucinated cases from ChatGPT didn't go very well for that lawyer: https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-...

replies(1): >>42915577 #

201. sockaddr ◴[03 Feb 25 03:46 UTC] No.42914719{3}[source]▶

>>42913884 #

Similar to “Bullshit jobs”

All the AI commercials are designed to appeal to people that don’t produce any actual value but haven’t been detected by the system yet.

Need to send email to boss? Press magic button! Job well done, idiot.

Someone send you big scary email? Press magic button! Good job dummy!

Someone wants to go eat some Italian with you, push magic button for totally not-ad result. Enjoy your Olive Garden, moron.

replies(1): >>42915808 #

202. blackeyeblitzar ◴[03 Feb 25 03:47 UTC] No.42914730{4}[source]▶

>>42913486 #

Trade secrets are protected by the law. It doesn’t require copyright.

replies(1): >>42924019 #

203. simonw ◴[03 Feb 25 03:47 UTC] No.42914736[source]▶

>>42914589 #

It sounds like it can run Python, which means it has access to Code Interpreter, which means it can run various other languages as well if you can convince it to do so.

I've used Code Interpreter to compile and run C code - https://simonwillison.net/2024/Mar/23/building-c-extensions-... - and I've managed to get it to run JavaScript (by uploading a Deno binary) and even Lua and PHP in the past as well: https://til.simonwillison.net/llms/code-interpreter-expansio...

204. sockaddr ◴[03 Feb 25 03:48 UTC] No.42914738[source]▶

>>42913784 #

Heh

205. simonw ◴[03 Feb 25 03:49 UTC] No.42914744[source]▶

>>42913927 #

> Is this ability really a prerequisite to AGI and ASI?

That depends entirely on how you choose to define "AGI".

206. TechDebtDevin ◴[03 Feb 25 03:58 UTC] No.42914797[source]▶

>>42913738 #

This is 5-10 years out. What OpenAI is displaying here I've been able to do with relatively little code, a bit of scraping and far less capable models for a year. I really don't see what is novel or useful here.

replies(2): >>42916483 #>>42928676 #

207. TechDebtDevin ◴[03 Feb 25 03:59 UTC] No.42914809{3}[source]▶

>>42913887 #

Are you unaware that there is a "Deepthink (R1)" button right next to the "Search" button on DeepSeek's Chat app. Its been there for some time, even before all the hype regarding R1.

replies(2): >>42914890 #>>42914895 #

208. labanimalster ◴[03 Feb 25 04:02 UTC] No.42914823[source]▶

>>42913575 #

same here

209. TechDebtDevin ◴[03 Feb 25 04:02 UTC] No.42914827[source]▶

>>42913784 #

More like high school intern mode.

210. bitshiftfaced ◴[03 Feb 25 04:05 UTC] No.42914844[source]▶

>>42913417 #

> but this is being powered by the unreleased o3 model

What makes you believe that?

replies(1): >>42914905 #

211. nycdatasci ◴[03 Feb 25 04:07 UTC] No.42914854[source]▶

>>42913575 #

Pro user. No access like everyone else.

OpenAI is very much in an existential crisis and their poor execution is not helping their cause. Operator or “deep research” should be able to assume the role of a Pro user, run a quick test, and reliably report on whether this is working before the press release right?

replies(1): >>42916064 #

212. nycdatasci ◴[03 Feb 25 04:07 UTC] No.42914862[source]▶

>>42913251 (OP) #

Pro user. No access like everyone else.

replies(2): >>42914928 #>>42914974 #

213. nycdatasci ◴[03 Feb 25 04:08 UTC] No.42914866{3}[source]▶

>>42913602 #

Pro user. No access like everyone else.

214. resters ◴[03 Feb 25 04:11 UTC] No.42914883[source]▶

>>42913251 (OP) #

Still not seeing access on my account.

replies(1): >>42914908 #

215. samplatt ◴[03 Feb 25 04:11 UTC] No.42914888[source]▶

>>42914456 #

>Meta question

I think you have to prefix the query with "@Meta AI", hope this helps

216. ◴[03 Feb 25 04:12 UTC] No.42914890{4}[source]▶

>>42914809 #

217. starchild3001 ◴[03 Feb 25 04:13 UTC] No.42914895{4}[source]▶

>>42914809 #

I'm well aware of that. That is not what openai calls "deep research".

218. lolpanda ◴[03 Feb 25 04:14 UTC] No.42914900[source]▶

>>42913251 (OP) #

"synthesize large amounts of online information" does it heavily depend on the search engine performance and relevance of the search results? I don't see any mention of Google or Bing. Is this using their internal search engine then?

219. _bin_ ◴[03 Feb 25 04:15 UTC] No.42914905{3}[source]▶

>>42914844 #

they explicitly stated it in the launch

replies(1): >>42915026 #

220. _bin_ ◴[03 Feb 25 04:15 UTC] No.42914908[source]▶

>>42914883 #

they're not giving it to us lowly $20/month users yet :( gotta take out a second mortgage and throw them 200/month if you want it now

replies(1): >>42918973 #

221. _bin_ ◴[03 Feb 25 04:16 UTC] No.42914919[source]▶

>>42913579 #

let's be real for a sec, i've done consulting and have a lot of friends who still do. three times in four, your mckinsey report isn't super well-founded in reality and involves a lot of guesstimation.

222. samplatt ◴[03 Feb 25 04:18 UTC] No.42914928[source]▶

>>42914862 #

That's the third time in this thread you've stated "OpenAI is in an existential crisis". It looks very suspicious.

replies(1): >>42928206 #

223. _bin_ ◴[03 Feb 25 04:21 UTC] No.42914950{6}[source]▶

>>42913858 #

it tests syllogistic reasoning: Jason's mother was Tyro, whose father was Poesidon, whose father was Kronos. it also tests whether it "eagerly" rather than comprehensively considers something: a maternal great-grandfather could be the father of either one's maternal grandmother or maternal grandfather. so the answer could also be king Aeolus of the Etruscans.

ideally a model would be able to answer this accurately and completely.

replies(1): >>42915082 #

224. rapjr9 ◴[03 Feb 25 04:22 UTC] No.42914956[source]▶

>>42913436 #

It is also an agent, so it is using you without compensation for your work.

225. _bin_ ◴[03 Feb 25 04:24 UTC] No.42914963{3}[source]▶

>>42914236 #

what about a full refresh of the page or perhaps jump into the dev tools and check "disable cache"

could also be aggressive caching from cloudflare. could be they're just trying to announce more stuff to maintain cachet and can't yet support all users forking over 200/month.

replies(1): >>42915542 #

226. _bin_ ◴[03 Feb 25 04:25 UTC] No.42914974[source]▶

>>42914862 #

man you work for high flyer or something? i know that's not really a fair question but oai still seems to lead the pack. i know it's a hype-y area but responding to one (1) model that's comparable to o4 but cheaper with "guys it's so over for openai" is excessive.

replies(1): >>42928275 #

227. RandomWorker ◴[03 Feb 25 04:26 UTC] No.42914983[source]▶

>>42913251 (OP) #

I’m a researcher and honestly not worried. 1. Developing the right question has always been the largest barrier to great research. Not sure OpenAI can develop the right question without the Human experience. The second biggest part of my role is influencing people that my questions are the right questions. Which is made easier when you have a thorough understanding of the first. That being said, I’m sure there will be many people here that will tell me that algorithms already influence people, and ai can think through much of any issues there are.

I do use these systems from time to time, but it just never renders any specific information that would make it great research.

replies(3): >>42915076 #>>42915607 #>>42915880 #

228. BeetleB ◴[03 Feb 25 04:26 UTC] No.42914986[source]▶

>>42913927 #

> For instance you can research all you want about the best vacuum on the internet but until you try it out yourself you are going to be caught in between marketing, fake reviews, influencers, etc.

So you wouldn't use this tool for those types of use cases.

But still, a valid point. I recall I once wanted to compare Hydroflask, Klean Kanteen and Thermos to see how they perform for hot/cold drinks. I was looking specifically for articles/posts where people had performed actual measurements. But those were very hard to find, with almost all Google hits being generic comparisons with no hard data. That didn't stop them from ranking ("Hydroflask is better for warm drinks!")

Would I be able to get this to ignore all of those and use only ones where actual experiments were performed. And moreover, filter out duplicates (e.g. one guy does an experiment, and several other bloggers link to his post and repeat his findings in their own posts - it's one experiment but with many search results).

229. perching_aix ◴[03 Feb 25 04:27 UTC] No.42914988{3}[source]▶

>>42913867 #

It's very interesting to think about what kind of "mental model" might it have, if it's capable of "understanding" all this (to me) gibberish, but is then unable to actually work the problem.

230. bitshiftfaced ◴[03 Feb 25 04:34 UTC] No.42915026{4}[source]▶

>>42914905 #

The linked article says,

> Powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis, it leverages reasoning to search, interpret, and analyze massive amounts of text, images, and PDFs on the internet, pivoting as needed in reaction to information it encounters.

If that's what you're referring to, then it doesn't seem that "explicit" to me. For example, how do we know that it doesn't use less thinking than o3-mini? Google's version of deep research uses their "not cutting edge version" 1.5 model, after all. Are you referring to something else?

replies(1): >>42916157 #

231. bruce511 ◴[03 Feb 25 04:34 UTC] No.42915027{4}[source]▶

>>42914406 #

I read a fair amount of sarcasm in the parent comment ;)

232. johnfn ◴[03 Feb 25 04:35 UTC] No.42915030{4}[source]▶

>>42913614 #

Did you intentionally flip through all the questions to find the one that seemed the easiest? If so, why? That's question #7, and all other 7 questions in the sample set seem ridiculously difficult to me.

233. alvah ◴[03 Feb 25 04:40 UTC] No.42915060{5}[source]▶

>>42913680 #

It absolutely can replace the research done by one person, for my use case at least. It’s also available on their $20/month subscription, unlike OpenAI’s $200/month.

234. RayVR ◴[03 Feb 25 04:45 UTC] No.42915076[source]▶

>>42914983 #

100% agree.

These systems serve best at augmenting information discovery. When I'm tackling a new area or looking for the right terminology, these models provide a quick shortcut because they have good probabilistic "understanding" of my naive, jargon-free description. This allows me to pull in all of the jargon for the area of research I'm interested in, and move on to actually useful resources, whether that be journal articles, textbooks, or - rarely - online posts/blogs/videos.

the current "meta" is probably something like Elicit + notebookLM + Claude for accelerating understanding of complex topics and extracting useful parts. But, again, each step requires that I am closely involved, from selecting the "correct" papers, to carefully aggregating and grooming the information pulled in from notebookLM, to judging the the usefulness of Claude's attempts to extract what I have asked for

235. elashri ◴[03 Feb 25 04:45 UTC] No.42915077[source]▶

>>42913251 (OP) #

It is actually interesting for people working in academia. I would like to test it but no way I can afford $200/m right now.

Can someone test it with this prompt.

"As a research assistant with comprehensive knowledge of particle physics, please provide a detailed analysis of next-generation particle collider projects currently under consideration by the international physics community.

The analysis should encompass the major proposed projects, including the Future Circular Collider (FCC) at CERN, International Linear Collider (ILC), Compact Linear Collider (CLIC), various Muon Collider proposals, and any other significant projects as of 2024.

For each proposal, examine the planned energy ranges and collision types, estimated timeline for construction and operation, technical advantages and challenges, approximate costs, and key physics goals. Include information about current technical design reports, feasibility studies, and the level of international support and collaboration.

Present a thorough comparative analysis that addresses technical feasibility, cost-benefit considerations, scientific potential for new physics discoveries, timeline to first data collection, infrastructure requirements, and environmental impact. The projects should be compared in terms of their relative strengths, weaknesses, and potential contributions to advancing our understanding of fundamental physics.

Please format the response as a structured technical summary suitable for presentation at a topical meeting of particle physicists. Where appropriate, incorporate relevant figures and tables to facilitate clear comparisons between proposals. Base your analysis on information from peer-reviewed sources and official design reports, focusing on the most current available data and design specifications.

Consider the long-term implications of each proposal, including potential upgrade paths, flexibility for future modifications, and integration with existing research infrastructure."

replies(3): >>42915085 #>>42915864 #>>42917383 #

236. alvah ◴[03 Feb 25 04:45 UTC] No.42915080{5}[source]▶

>>42913674 #

I haven’t tried the OpenAI version yet, as I’m on their peasant-level $20 plan, but the Google equivalent is way superior to Perplexity (I use both extensively). The web search Perplexity carries out is superficial compared to the Google product; it misses a large percentage of what Gemini Deep Research finds, and for a particular task in my business this makes a huge difference.

237. nimithryn ◴[03 Feb 25 04:46 UTC] No.42915082{7}[source]▶

>>42914950 #

I think there are more possible answers? Jason's mother differs depending on the author...

For example, Jason's mother was Philonis, daughter of Mestra, daughter of Daedalion, son of Hesporos. So Jason's maternal great-grandfather was Hesporos.

238. sagarpatil ◴[03 Feb 25 04:47 UTC] No.42915085[source]▶

>>42915077 #

I’ll do it. AFL right now.

239. xendipity ◴[03 Feb 25 04:47 UTC] No.42915086{3}[source]▶

>>42913546 #

I don't believe we have any indication that the big offerings (claude.ai, Gemini, operator, tasks, canvas, chatgpt) use multiple models in one call (other than for different modalities like having Gemini create an image). It seems to actually be very difficult technically and I'm curious as to why.

I wonder how much of an impact our being still so early in the productization phase of this all is. Like it takes a ton of work and training and coordination to get multiple models synced up into an offering and I think the companies are still optimizing for getting new ideas out there rather truly optimizing them.

replies(1): >>42915974 #

240. RainyDayTmrw ◴[03 Feb 25 04:53 UTC] No.42915119{3}[source]▶

>>42913709 #

It's possible that you care, but the person next to you doesn't, and external pressures force you to keep up with the person who's willing to shovel AI slop. Most of us don't have a complete luxury of the moral high ground at our jobs.

replies(2): >>42915145 #>>42916253 #

241. toomim ◴[03 Feb 25 04:58 UTC] No.42915135[source]▶

>>42914456 #

John Stewart had something to say about this: https://youtu.be/Byg8VZdKK88?si=pX1WbtRwZCBGpwHS&t=141

replies(1): >>42916141 #

242. doomroot ◴[03 Feb 25 05:00 UTC] No.42915145{4}[source]▶

>>42915119 #

It looks like the moral high just came more in demand.

243. sadeshmukh ◴[03 Feb 25 05:09 UTC] No.42915184{5}[source]▶

>>42913680 #

Nobody was going to hire a researcher for a quick question.

244. panarky ◴[03 Feb 25 05:13 UTC] No.42915202{4}[source]▶

>>42914232 #

"Even though that Airbus looks like it's flying it's really not because my personal definition of 'flying' requires feathers and flapping wings."

245. shihab ◴[03 Feb 25 05:15 UTC] No.42915211[source]▶

>>42914456 #

From the creator of Triton (OpenAI)-

"PS: The name Triton was coined in mid-2019 when I released my PhD paper on the subject. I chose not to rename the project when the "TensorRT Inference Server" was rebranded as "Triton Inference Server" a year later since it's the only thing that ties my helpful PhD advisors to the project."

246. financypants ◴[03 Feb 25 05:16 UTC] No.42915214{4}[source]▶

>>42914117 #

what about tests?

247. n144q ◴[03 Feb 25 05:23 UTC] No.42915243[source]▶

>>42913579 #

I think that ship has sailed many years ago since Facebook allowed false information to spread freely on their site (if not earlier).

248. khazhoux ◴[03 Feb 25 05:27 UTC] No.42915259[source]▶

>>42913513 #

Incorrect. We are not all reaching AI fatigue.

249. khazhoux ◴[03 Feb 25 05:27 UTC] No.42915262[source]▶

>>42913544 #

Trying to parse this. What are you saying?

250. tikhonj ◴[03 Feb 25 05:30 UTC] No.42915282{5}[source]▶

>>42914162 #

Making it easier and cheaper to cut corners and make stuff up will result in more cut corners and more made up stuff. That's not good.

Same problem I have with code models, honestly. We already have way too much boilerplate and bad code; machines to generate more boilerplate and bad code aren't going to help.

replies(1): >>42915305 #

251. mquander ◴[03 Feb 25 05:36 UTC] No.42915305{6}[source]▶

>>42915282 #

The technology also makes it easier and cheaper to make good things, so the direction of the outcome isn't guaranteed.

252. shawabawa3 ◴[03 Feb 25 05:44 UTC] No.42915341{4}[source]▶

>>42913923 #

o3 mini was previewed in December. Deepseek maybe made them release it a few weeks early but it was already on its way

replies(1): >>42916024 #

253. albert_e ◴[03 Feb 25 05:47 UTC] No.42915351{3}[source]▶

>>42914660 #

Also "accuracy" as a measure of model's performance used to mean something objective in the traditional ML world.

Now with LLMs it is what human evaluators feel about the LLM output?

replies(1): >>42915761 #

254. simion314 ◴[03 Feb 25 06:00 UTC] No.42915405{5}[source]▶

>>42914522 #

>DeepSeek’s violation of OpenAI’s terms is pretty clearly a violation of their terms and not legal.

Here is a new thing you learn today, ToS are not laws, you can ignore any ToS and at worst the company might close your account.

255. lmm ◴[03 Feb 25 06:04 UTC] No.42915426[source]▶

>>42914149 #

Of course they will. Loads of people go around taking hundreds of photos with the biggest camera they can afford even though no-one else will ever willingly look at them.

256. slaterbug ◴[03 Feb 25 06:17 UTC] No.42915478{3}[source]▶

>>42913867 #

As someone who doesn't understand anything beyond the word 'set' in that question, can anyone give an indication of how hard of a problem that actually is (within that domain)?

Also I'm curious as to what percentage of the questions in this benchmark are of this type / difficulty, vs the seemingly much easier example of "In Greek mythology, who was Jason's maternal great-grandfather?".

I'd imagine the latter is much easier for an LLM, and almost trivial for any LLM with access to external sources (such as deep research).

257. rajnathani ◴[03 Feb 25 06:21 UTC] No.42915499[source]▶

>>42913251 (OP) #

I remember about 10-15 years ago that Ray Kurzweil (who still works at Google) or someone at Google had this idea for what Google should be able to do: About doing deep research by itself with a simple search query. I can't find the source. Obviously it didn't pan out without transformers.

258. energy123 ◴[03 Feb 25 06:29 UTC] No.42915542{4}[source]▶

>>42914963 #

I relogged, disabled cache and reloaded the page with Ctrl+Shift+R but it doesn't show up.

replies(1): >>42915659 #

259. anthonyshort ◴[03 Feb 25 06:34 UTC] No.42915568[source]▶

>>42913579 #

Then the hallucinated research is published in an article which is then cited by other AI research, continuing the push the false information until it’s hard to know where the lie started.

260. corentin88 ◴[03 Feb 25 06:36 UTC] No.42915576[source]▶

>>42913251 (OP) #

Curious about the use cases here. Building AI Agents? But which one?

261. PeterStuer ◴[03 Feb 25 06:36 UTC] No.42915577{5}[source]▶

>>42914715 #

That's a lawyer in an adverserial situation. Business consultants tell their clients what they want to believe, the facts be dammed.

replies(1): >>42915751 #

262. ◴[03 Feb 25 06:42 UTC] No.42915607[source]▶

>>42914983 #

263. nijaar ◴[03 Feb 25 06:51 UTC] No.42915659{5}[source]▶

>>42915542 #

same here. pro in the US and still no access. i even logged in using my phone and a different browser

264. picografix ◴[03 Feb 25 07:05 UTC] No.42915718[source]▶

>>42913251 (OP) #

I think deep research as a service could be a really strong use case for enterprises, as long as they have access to non-public data. I assume that most of this guarded data is high quality, and seeing progress in these areas might end up being even more impressive than it is now.

265. rsanek ◴[03 Feb 25 07:13 UTC] No.42915751{6}[source]▶

>>42915577 #

it sounds like ai doesn't really change that situation

replies(1): >>42915956 #

266. yorwba ◴[03 Feb 25 07:15 UTC] No.42915761{4}[source]▶

>>42915351 #

Traditional ML is no stranger to measuring accuracy in terms of agreement with human evaluators.

replies(1): >>42915979 #

267. rsanek ◴[03 Feb 25 07:19 UTC] No.42915788{4}[source]▶

>>42913838 #

they finally bowed mid last year https://www.nerdwallet.com/article/travel/southwest-google-f...

268. rsanek ◴[03 Feb 25 07:24 UTC] No.42915808{4}[source]▶

>>42914719 #

i think the apple ads are the poster child here. hopefully we can see more inventive ones than just serving lazy people.

269. rsanek ◴[03 Feb 25 07:27 UTC] No.42915821[source]▶

>>42913784 #

don't be hyperbolic. deep research would need to help cause an opioid crisis for to get to that level.

270. n4r9 ◴[03 Feb 25 07:30 UTC] No.42915840{3}[source]▶

>>42913709 #

It's a bit like saying "my kids are going to hit themselves anyway, so it doesn't matter if I give them foam rods or metal rods".

replies(1): >>42919423 #

271. ◴[03 Feb 25 07:34 UTC] No.42915864[source]▶

>>42915077 #

272. GeoAtreides ◴[03 Feb 25 07:38 UTC] No.42915880[source]▶

>>42914983 #

> Developing the right question has always been the largest barrier to great research.

I thought funding was the biggest barrier to great research

273. DrSiemer ◴[03 Feb 25 07:47 UTC] No.42915909{3}[source]▶

>>42913713 #

Exactly what will happen with art. The tolerance for low quality output will decrease.

274. baal80spam ◴[03 Feb 25 07:51 UTC] No.42915923{3}[source]▶

>>42913867 #

That's easy Dave: 42.

275. asimpletune ◴[03 Feb 25 07:57 UTC] No.42915956{7}[source]▶

>>42915751 #

But the point is it does if you count making it worse changing the situation.

276. someothherguyy ◴[03 Feb 25 07:59 UTC] No.42915974{4}[source]▶

>>42915086 #

...or its all a farce, for now.

277. albert_e ◴[03 Feb 25 08:00 UTC] No.42915979{5}[source]▶

>>42915761 #

A customer churn model or revenue forecast did have hard objective data (ground truth) to compare against - isn't it?

replies(1): >>42916750 #

278. senordevnyc ◴[03 Feb 25 08:07 UTC] No.42916014{4}[source]▶

>>42913788 #

Pixel hasn’t even been around for a decade.

replies(1): >>42916851 #

279. sdesol ◴[03 Feb 25 08:10 UTC] No.42916024{5}[source]▶

>>42915341 #

I guess the question is, did DeepSeek force them to rethink pricing? It's crazy how much cheaper it (v3 and R1) is, but considering they (Deepseek) can't keep up with demand, the price is kind of moot right now. I really do hope they get the hardware to support the API again. The v3 and R1 models that are hosted by others are still cheap compared to the incumbents, but nothing can compete with DeepSeek on price and performance.

280. kandesbunzler ◴[03 Feb 25 08:10 UTC] No.42916026{4}[source]▶

>>42913923 #

no they didn't, this was literally all announced in December with a release date for January

281. kandesbunzler ◴[03 Feb 25 08:18 UTC] No.42916064{3}[source]▶

>>42914854 #

How many times are you going to post this exact same comment here? Are you a Chinese bot or something?

282. anon373839 ◴[03 Feb 25 08:21 UTC] No.42916083[source]▶

>>42913251 (OP) #

Setting aside how well it works, I think this is a pretty nice demonstration of how to do UX for an agentic RAG app. I like that the intermediate steps have been pushed out to a sidebar, with updates that both provide some transparency about the process and make the high latency more palatable.

283. felindev ◴[03 Feb 25 08:27 UTC] No.42916122[source]▶

>>42913491 #

While people might attempt that, it's going to be an arms race, just like ads vs adblocks. There's already multiple crawlers that present fake user-agent when their original one is blocked. Temptation of more data is just to irresistible to them

284. auggierose ◴[03 Feb 25 08:28 UTC] No.42916127[source]▶

>>42913251 (OP) #

The flow reminds me a bit of undermind.ai.

285. rob_c ◴[03 Feb 25 08:28 UTC] No.42916130[source]▶

>>42913251 (OP) #

Feels more and more like openAI doesn't have "that next big thing".

To be clear I'm constantly impressed with what they have and what I get as a customer, but the delivery since 4 hasn't exactly been in line with Altman's Musk-tier vapoware promises...

286. gorgoiler ◴[03 Feb 25 08:29 UTC] No.42916136[source]▶

>>42913251 (OP) #

For “deep research” I’m also reading “getting the answers right”.

Most people I talk to are at the point now where getting completely incorrect answers 10% of the time — either obviously wrong from common sense, or because the answers are self contradictory — undermines a lot of trust in any kind of interaction. Other than double checking something you already know, language models aren’t large enough to actually know everything. They can only sound like they do.

What I’m looking for is therefore not just the correct answer, but the correct answer in an amount of time that’s faster than it would take me to research the answer myself, and also faster than it takes me to verify the answer given by the machine.

It’s one thing to ask a pupil to answer an exam paper to which you know the answers. It’s a whole next level to have it answer questions to which you don’t know the answers, and on whose answers you are relying to be correct.

replies(8): >>42916376 #>>42916453 #>>42916822 #>>42916920 #>>42917151 #>>42917748 #>>42919707 #>>42920231 #

287. justaj ◴[03 Feb 25 08:30 UTC] No.42916141{3}[source]▶

>>42915135 #

Without the tracking bits: https://youtu.be/Byg8VZdKK88#t=141

replies(1): >>42945492 #

288. taran_narat ◴[03 Feb 25 08:32 UTC] No.42916145[source]▶

>>42913251 (OP) #

isn't this just perplexity?

289. golol ◴[03 Feb 25 08:34 UTC] No.42916157{5}[source]▶

>>42915026 #

o3-mini is not really "a version of the o3 model", it is a different model (less parameters). So their language strongly suggests, imo, that Deep Research is powered by a model with the same number of parameters as o3.

290. crvdgc ◴[03 Feb 25 08:40 UTC] No.42916186[source]▶

>>42913650 #

AGI aside, sometimes HN critics/cynicism indeed points out the exact reason why something wouldn't work and is vindicated after the fact, e.g. Apple Vision Pro. I guess it's just hard to predict the future and for me, it's interesting to listen to even pure contrarians.

291. gigatexal ◴[03 Feb 25 08:43 UTC] No.42916207[source]▶

>>42913251 (OP) #

Ok so I do this as a noob in some field. How do I know or trust the research conclusions? How do I know it’s not hallucinated its conclusions? I’ll likely have to do my own research to just verify it and then if I did I might as well have done the research myself.

292. freehorse ◴[03 Feb 25 08:47 UTC] No.42916230{5}[source]▶

>>42914173 #

I am selling a bridge, it is a great bargain.

293. navigate8310 ◴[03 Feb 25 08:50 UTC] No.42916253{4}[source]▶

>>42915119 #

It's the high reps fault then of not caring about quality. Either you assimilate in that low quality lower management using AI slop or change job.

294. freehorse ◴[03 Feb 25 09:04 UTC] No.42916334[source]▶

>>42913251 (OP) #

I love that when "open"ai releases things last year or so, they do not actually release them. So we get the chance in the meantime to all enjoy a bunch of speculative, shilling comments here about this next great thing being miles ahead of competitors/close to AGI/the tool that will actually do X thing that others complain so far llms are failing to do.

295. jstummbillig ◴[03 Feb 25 09:04 UTC] No.42916335{4}[source]▶

>>42914117 #

I don't think it follows that taken an easier path would mean quality goes down.

296. sandos ◴[03 Feb 25 09:11 UTC] No.42916376[source]▶

>>42916136 #

I mean this all falls down due to the need of verification:

"Limitations Deep research unlocks significant new capabilities, but it’s still early and has limitations. It can sometimes hallucinate facts in responses or make incorrect inferences"

How do I know which parts are false? It will take as long to verify as to research!

replies(1): >>42917205 #

297. igleria ◴[03 Feb 25 09:24 UTC] No.42916453[source]▶

>>42916136 #

It's really worrying to me, even as a self proclaimed "LLM <-> AI" skeptic, to see what kind of stuff people pretend to get out from an LLM. Typewriter monkeys as a service almost.

Still useful for the odd task here and there, but not as useful as all the money being invested in this (except for the companies getting that money, that is).

edit: actual example of something I'd expect a real AI to be able to solve by itself, but currently LLMs fail miserably https://x.com/RadishHarmers/status/1885884032220643587

replies(1): >>42916542 #

298. sumedh ◴[03 Feb 25 09:27 UTC] No.42916480[source]▶

>>42913491 #

> Feels like only a matter of time before these crawlers are blocked from large swathes of the internet.

How would you know its a crawler?

299. sumedh ◴[03 Feb 25 09:28 UTC] No.42916483{3}[source]▶

>>42914797 #

Probably the accuracy.

300. mdp2021 ◴[03 Feb 25 09:37 UTC] No.42916542{3}[source]▶

>>42916453 #

> Typewriter monkeys as a service almost. // Still useful for the odd task here and there

1) Paramount task: searching in naturally structured language, as opposed to keywords. Odd tasks: oh yes, several tasks of fuzzy sophisticated text processing previously unsolved.

2) They translate NN encodings in natural language! The issue remains about the quality of /what/ they translate in natural language, but one important chunk of the problem* is in a way solved...

Now, I've been probably one of the most vocal here, shouting "That's the opposite of intelligence!" - even in the past 24 hours -, but be objective: there are also progresses ...

(* Around five years ago we were still stuck with Hinton's problem of interpreting pronouns as pointers in "the item won't fit in the case: it's too big" vs "the item won't fit in the case: it's too small" - look at it now...)

replies(1): >>42916591 #

301. carbocation ◴[03 Feb 25 09:41 UTC] No.42916559{3}[source]▶

>>42914048 #

(Update: it’s visible for me now.)

302. igleria ◴[03 Feb 25 09:46 UTC] No.42916591{4}[source]▶

>>42916542 #

Of course I see progress, but I feel like the bridge of "thinks" versus "regurgitates" is still far off, if it is still in the horizon with the current approach. IMHO.

edit: furthermore, LLMs probably tackle very little "real state" in the "make machines THINK" land. But a crucial piece on the overall puzzle.

303. chrismarlow9 ◴[03 Feb 25 09:59 UTC] No.42916670{3}[source]▶

>>42913562 #

That's some fancy words friend.

304. ptrrrrrrppr ◴[03 Feb 25 10:14 UTC] No.42916750{6}[source]▶

>>42915979 #

you seem to think all classical ML models were supervised, which isn't true. and we have metrics for unsupervised approaches as well

305. chombier ◴[03 Feb 25 10:25 UTC] No.42916822[source]▶

>>42916136 #

> and also faster than it takes me to verify the answer given by the machine.

I always thought there was a kind of NP-flavor to the problems for which LLMs-like AI are helpful in practice, in the sense that solving the problem may be hard but checking the solution must be fast.

Unless the domain can accomodate errors/hallucination, checking the solution (by a human) should be exponentially faster than finding it (by some AI) otherwise there's little practical gain.

306. dazzaji ◴[03 Feb 25 10:27 UTC] No.42916832[source]▶

>>42913251 (OP) #

Late Sunday night, I gained access to OpenAI’s newly launched Deep Research and immediately tested it on a draft blog post about Uniform Electronic Transactions Act (UETA) compliance and AI-agent error handling [1]. Here’s what I found:

Within minutes, it generated a detailed, well-cited research report that significantly expanded my original analysis, covering: * Legal precedents & case law interpretations (including a nuanced breakdown of UETA Section 10). * Comparative international frameworks (EU, UK, Canada). * Real-world technical implementations (Stripe’s AI-driven transaction handling). * Industry perspectives & business impact (trust, risk allocation, compliance). * Emerging regulatory standards (EU AI Act, FTC oversight, ISO/NIST AI governance).

What stood out most was its ability to: - Synthesize complex legal, business, and technical concepts into clear, actionable insights. - Connect legal frameworks, industry trends, and real-world case studies. - Maintain a business-first focus, emphasizing practical benefits. - Integrate 2024 developments with historical context for a deeper analysis.

The depth and coherence of the output were comparable to what I would expect from a team of domain experts—but delivered in a fraction of the time.

From the announcement: Deep Research leverages OpenAI’s next-generation model, optimized for multi-step research, reasoning, and synthesis. It has already set new performance benchmarks, achieving 26.6% accuracy on Humanity’s Last Exam (the highest of any OpenAI model) and a 72.57% average accuracy on the GAIA Benchmark, demonstrating advanced reasoning and research capabilities.

Currently available to Pro users (with up to 100 queries per month), it will soon expand to Plus and Team users. While OpenAI acknowledges limitations—such as occasional hallucinations and challenges in source verification—its iterative deployment strategy and continuous refinement approach are promising.

My key takeaway: This LLM agent-based tool has the potential to save hours of manual research while delivering high-quality, well-documented outputs. Automating tasks that traditionally require expert-level investigation, it can complete complex research in 5–30 minutes (just 6 minutes for my task), with citations and structured reasoning.

I don’t see any other comments yet from people who have actually used it, but it’s only been a few hours.I’d love to hear how it’s performing for others. What use cases have you explored? How did it do?

(Note: This review is based on a single use case. I’ll provide further updates as I conduct broader testing.)

[1] https://www.dazzagreenwood.com/p/ueta-and-llm-agents-a-deep-...

replies(1): >>42917014 #

307. scarab92 ◴[03 Feb 25 10:31 UTC] No.42916851{5}[source]▶

>>42916014 #

The Pixel branding is 12 years old, and IIRC this feature also existed in Nexus before that.

replies(1): >>42919989 #

308. Xuban ◴[03 Feb 25 10:33 UTC] No.42916859[source]▶

>>42913251 (OP) #

This make sense, I often use the normal search feature to research a very large ammount of information and it mostly does not work well. If the new search feature increases the number of websites scrapped and the pertinence of the websites, I'm all in.

309. timabdulla ◴[03 Feb 25 10:40 UTC] No.42916899[source]▶

>>42913251 (OP) #

I just gave it a whirl. Pretty neat, but definitely watch out for hallucinations. For instance, I asked it to compile a report on myself (vain, I know.) In this 500-word report (ok, I'm not that important, I guess), it made at least three errors.

It stated that I had 47,000 reputation points on Stack Overflow -- quite a surprise to me, given my minimal activity on Stack Overflow over the years. I popped over to the link it had cited (my profile on Stack Overflow) and it seems it confused my number of people reached (47k) with my reputation, a sadly paltry 525.

Then it cited an answer I gave on Stack Overflow on the topic of monkey-patching in PHP, using this as evidence for my technical expertise. Turns out that about 15 years ago, I _asked_ a question on this topic, but the answer was submitted by someone else. Looks like I don't have much expertise, after all.

Finally, it found a gem of a quote from an interview I gave. Or wait, that was my brother! Confusingly, we founded a company together, and we were both mentioned in the same article, but he was the interviewee, not I.

I would say it's decent enough for a springboard, but you should definitely treat the output with caution and follow the links provided to make sure everything is accurate.

replies(9): >>42918384 #>>42918393 #>>42918413 #>>42918436 #>>42918487 #>>42918880 #>>42918972 #>>42921017 #>>42921709 #

310. jeswin ◴[03 Feb 25 10:44 UTC] No.42916920[source]▶

>>42916136 #

> Most people I talk to are at the point now where getting completely incorrect answers 10% of the time

A year back that number was 30%, and a couple of years back it was 60%. There will be a point where it'll be good enough. There are also better and better ways to verify answers these days.

It'll never be a solution for everything, but that's similar to many engineering problems we have: for example, ORMs aren't great for all types of queries, but they're sufficient for a good part of them.

replies(1): >>42917417 #

311. timabdulla ◴[03 Feb 25 11:00 UTC] No.42917014[source]▶

>>42916832 #

I tried it on a few things I was familiar with just to assess its reliability.

The first was on a topic with which I am deeply familiar -- myself -- and it made three factual errors in a 500-word report: https://news.ycombinator.com/item?id=42916899

The second was a task to do an industry analysis on a space in which I worked for about ten years. I think its overall synthesis was good (it accorded with my understanding of the space), but there were a number of errors in the statistics and supporting evidence it compiled, based upon my random review of the source material.

I think the product is cool and will definitely be helpful, but I would still recommend verifying its outputs. I think the process of verification is less time-consuming than the process of researching and writing, so that is likely an acceptable compromise in many cases.

312. smusamashah ◴[03 Feb 25 11:13 UTC] No.42917084[source]▶

>>42913251 (OP) #

    can sometimes hallucinate facts in responses or make incorrect inferences, though at a notably lower rate than existing ChatGPT models, according to internal evaluations. It may struggle with distinguishing authoritative information from rumors, and currently shows weakness in confidence calibration, often failing to convey uncertainty accurately

Taken from the limitations section.

These tools are just good at creating pollution. I don't see the point of delegating a (not just) research where 1% blatant mistakes are acceptable. These need much better grounding before handing out to masses.

I can not take any output by these tools (google summaries, comment summaries by amazon, youtube summaries etc etc) while knowing for a fact some of that is a total lie. I can not tell which part is a lie. e.g. If LLM says that in any given text the sentiment is divided, it could be just one person with an opposing view.

If same task was given to a person, I could reason with that person on any conclusion. These tools will reason on their hallucinations.

313. joanfihu ◴[03 Feb 25 11:24 UTC] No.42917140[source]▶

>>42913251 (OP) #

There is no way I'll read all that text from the demos...

AskPandi has a similar feature called "Super Search" that essentially checks more sources and self validates it's own answers.

iT's AgEnTic.

The answers are easier to digest, if you search for products, you'll get a list of products with images, prices and retailers.

314. herculity275 ◴[03 Feb 25 11:25 UTC] No.42917151[source]▶

>>42916136 #

My worry is that all these recent capabilities attempt to minimize hallucinations by relying on extensive web search, however web itself is being actively degraded by unfiltered LLM output. After a certain point running your research agent against a ~5-year-old snapshot of the web will be strictly more accurate (for non-current affairs queries) than querying live web.

315. monkeydust ◴[03 Feb 25 11:36 UTC] No.42917197[source]▶

>>42913251 (OP) #

What a decent setup to replicate via open model and agent framework? One thing I have struggled with is getting comprehensive web searches using an agentic framework.

316. fosterfriends ◴[03 Feb 25 12:11 UTC] No.42917383[source]▶

>>42915077 #

Ran it for you using o3-high! Here's a link to the results: https://chatgpt.com/share/67a0b227-8ee4-800f-a8ed-882e7bab97...

Hopefully helpful, happy to test others for you :)

-- Raw output --

Understood. I will conduct a detailed technical analysis of next-generation particle collider projects, focusing on the Future Circular Collider (FCC), International Linear Collider (ILC), Compact Linear Collider (CLIC), Muon Collider proposals, and other significant initiatives.

The report will include: - Planned energy ranges and collision types - Estimated construction and operation timelines - Technical advantages and challenges - Approximate costs - Key physics goals - Review of technical design reports and feasibility studies - International collaboration and support - Comparative analysis on feasibility, cost-benefit, scientific potential, infrastructure needs, and environmental impact - Long-term considerations, including upgrade paths and integration with existing facilities

The analysis will be formatted as a structured technical summary, suitable for presentation at a topical meeting of particle physicists, with comparative tables and figures where relevant.

I will update you when the report is ready.

# Next-Generation Particle Collider Projects: A Comparative Technical Analysis

## Introduction High-energy physics is charting plans for new colliders to extend beyond the LHC and its High-Luminosity upgrade (HL-LHC). The discovery of the 125 GeV Higgs boson has motivated proposals for a dedicated “Higgs factory” lepton collider to study its properties with precision, as well as future energy-frontier machines to search for new physics. Several next-generation collider initiatives are under consideration internationally as of 2024. These include CERN’s proposed *Future Circular Collider (FCC)*, Japan’s *International Linear Collider (ILC)*, CERN’s *Compact Linear Collider (CLIC)*, various designs for a *Muon Collider*, China’s *Circular Electron-Positron Collider (CEPC)* and its successor *Super Proton-Proton Collider (SppC)*, among others. Each proposal differs in collision type (electron-positron, proton-proton, muon-muon, etc.), energy scale, technology, timeline, cost, and physics focus. This summary reviews each project’s key parameters – *planned energy ranges, collision types, timeline, technical advantages/challenges, cost, and physics goals* – based on technical design reports and feasibility studies. A comparative analysis then contrasts their *technical feasibility, cost-benefit, scientific potential for discoveries, timeline to first data, infrastructure needs, and environmental impact*, highlighting the relative strengths and weaknesses of each approach. We also discuss long-term implications such as upgrade paths, flexibility for future modifications, and integration with existing infrastructure.

(Citations refer to official reports and peer-reviewed sources using the format 【source†lines】.)

## Future Circular Collider (FCC) – CERN - *Type and Energy:* The FCC is a *proposed 100 km circular collider* at CERN that would be realized in stages. The first stage, *FCC-ee*, is an electron-positron ($e^+e^-$) collider with center-of-mass energy tunable from ~90 GeV up to 350–365 GeV, covering the Z boson pole, WW threshold, Higgs production (240 GeV), and top-quark pair threshold (~350 GeV). A second stage, *FCC-hh*, would use the same tunnel for a proton-proton collider at up to *100 TeV* center-of-mass energy (an order of magnitude above the LHC’s 14 TeV). Heavy-ion collisions (e.g. Pb–Pb) are also envisioned. An *FCC-eh* option (electron-hadron collisions) is considered by adding a high-energy electron injector to collide with the proton beam. This integrated FCC program thus spans both *precision lepton* collisions and *energy-frontier hadron* collisions.

- *Timeline:* The conceptual schedule foresees *FCC-ee construction in the 2030s* and a start of operations by around *2040* (as the LHC/HL-LHC program winds down). According to the FCC Conceptual Design Report, an $e^+e^-$ Higgs factory could begin delivering physics in ~2040, running for 15–20 years. The *hadron collider FCC-hh* would be constructed subsequently (using the same tunnel and upgraded infrastructure), aiming for *first proton-proton collisions in the late 2050s】. This staged approach (lepton collider first, hadron later) mirrors the successful *LEP–LHC sequence*, leveraging the $e^+e^-$ machine to produce great precision data (and to build infrastructure) before pushing to the highest energies with the hadron machine. ...

(Too long for HN to write more)

replies(1): >>42918479 #

317. dimitri-vs ◴[03 Feb 25 12:16 UTC] No.42917417{3}[source]▶

>>42916920 #

It contributes little to discuss a hypothetical future. Maybe we'll have fusion energy, delivery drones, everyone using VR, etc. Maybe we will go into a deep recession due to trade wars, or maybe not.

The meaningful discussion is about how they perform NOW and the edge cases that have persisted since GPT-2 which no one has yet found a good solution for.

replies(1): >>42917700 #

318. dimitri-vs ◴[03 Feb 25 12:41 UTC] No.42917612[source]▶

>>42913575 #

I have access as of ~3 hours ago. Using the Win desktop app too, which is behind on some features (Operator, tasks). I open up any of the models and it shows up as a `(Deep research)` tag on the input field next to the web search option. Didn't clear cache or anything.

319. infecto ◴[03 Feb 25 12:52 UTC] No.42917700{4}[source]▶

>>42917417 #

We already have delivery drones though.

I disagree though, it is useful as this problem has been whittled down and I think there is expectation that there will be continued effort. Its of course worth discussing but I find that for my workflows, I rarely encounter issues with hallucinations, they certainly exist but its gotten to a point that I don't have major issue with it.

replies(1): >>42918392 #

320. shakes_mcjunkie ◴[03 Feb 25 12:57 UTC] No.42917748[source]▶

>>42916136 #

> What I’m looking for is therefore not just the correct answer, but the correct answer in an amount of time that’s faster than it would take me to research the answer myself, and also faster than it takes me to verify the answer given by the machine.

This is why I haven't found AI tools very useful. I find my self spending more time verifying and fixing it's answers than I would have just doing or learning the darn thing myself.

replies(1): >>42917903 #

321. og_kalu ◴[03 Feb 25 13:09 UTC] No.42917836{3}[source]▶

>>42914290 #

>why would you use it in a tool that has to make hundreds of inference calls to it to answer a single question? You'd use a much cheaper model.

The same reason a lot of people switched to GPT-4 when it came out even though it was much more expensive than 3 - doesn't matter how cheap it is if it isn't good enough/much worse.

322. 7thpower ◴[03 Feb 25 13:17 UTC] No.42917903{3}[source]▶

>>42917748 #

It is added cognitive load, but there is a lot of value in async tasks if you can trust the output or if the opportunity cost of validating is low.

The challenge with something like this for research, in its current state, is you’ll need to go double check it because you don’t trust it and it will end up effectively being a list of links.

It’s progress though and evidently good enough to find a sweet NSX in Japan, which is all some really need.

323. regularjack ◴[03 Feb 25 13:21 UTC] No.42917932[source]▶

>>42913251 (OP) #

Of course, they had to weasel the word "deep" in there.

324. sharpshadow ◴[03 Feb 25 13:34 UTC] No.42918038[source]▶

>>42913251 (OP) #

Are they launching a new feature after some other AI got the attention to get the attention back?

325. chabes ◴[03 Feb 25 13:50 UTC] No.42918214[source]▶

>>42914456 #

> what's with all of the naming overlap in the AI world? Triton (Nvidia, OpenAI) and Gro{k,q} (X.ai, groq, OpenAI) all come to mind

They seem to be ok with outsourcing any and all creativity to a language model, so it’s not surprising that they can’t come up with unique names themselves.

replies(1): >>42922036 #

326. kavalerov ◴[03 Feb 25 13:51 UTC] No.42918218[source]▶

>>42914456 #

I am afraid Gemini's version is not really very "deep" - it surfaces a lot of information, but on a quite superficial level. OAIs version seems to make that one step forward to proper depth.

We found in our experience it is pretty hard to force LLM to do something in proper depth, and OAI's deep research definitely feels like one of the first examples from big labs on how this can be done. What we typically see is that it is not even the "agent" part that is hard to do, but how to force model to not "forget" to go deep...

327. titzer ◴[03 Feb 25 13:53 UTC] No.42918233[source]▶

>>42913251 (OP) #

It's great that none of these AI models are being foisted on us by advertising companies.

328. axpy906 ◴[03 Feb 25 14:09 UTC] No.42918354[source]▶

>>42913251 (OP) #

Don’t most researchers have a local setup plugged into Olama so that they do NOT share their search information?

329. giarc ◴[03 Feb 25 14:14 UTC] No.42918384[source]▶

>>42916899 #

What's faster? Writing a 500 word report "from scratch" by researching the topic yourself, vs. having AI write it then having to fact check every answer and correct each piece manually?

This is why I don't use AI for anything that requires a "correct" answer. I use it to re-write paragraphs or sentences to improve readability etc, but I stop short of trusting any piece of info that comes out from AI.

330. skywhopper ◴[03 Feb 25 14:14 UTC] No.42918392{5}[source]▶

>>42917700 #

At best, a proof of concept of experimental delivery drones exist, but only for small, lightweight items, and only in a few places, only in the right weather, and only if you place a target on your driveway and are there to receive the item in person, and all at the cost of a very high noise level. That's not exactly a real service.

replies(1): >>42943191 #

331. mdp2021 ◴[03 Feb 25 14:15 UTC] No.42918393[source]▶

>>42916899 #

> Then it cited an answer I gave on Stack Overflow [...] using this as evidence for my technical expertise. Turns out that about 15 years ago, I _asked_ a question on this topic, but the answer was submitted by someone else

Artificial dementia...

Some parties are releasing products much earlier than the ability to ship well working products (I am not sure that their legal cover will be so solid), but database aided outputs should and could become a strong limit to that phenomenon of remembering badly. Very linearly, like humans: get an idea, then compare it to the data - it is due diligence and part of the verification process in reasoning. It is as if some moves outside linear pure product progress reasoning are swaying the RnD towards directions outside the primary concerns. It's a form of procrastination.

332. RobinL ◴[03 Feb 25 14:16 UTC] No.42918413[source]▶

>>42916899 #

Interesting

You might find it amusing to compare it to: https://hn-wrapped.kadoa.com/timabdulla

(Ref:https://news.ycombinator.com/item?id=42857604)

replies(3): >>42918554 #>>42919630 #>>42920892 #

333. prof-dr-ir ◴[03 Feb 25 14:19 UTC] No.42918436[source]▶

>>42916899 #

> Pretty neat, but definitely watch out for hallucinations.

That would be exactly my verdict of any product based on LLMs in the past few years.

334. fosterfriends ◴[03 Feb 25 14:24 UTC] No.42918479{3}[source]▶

>>42917383 #

Honestly, these are the smartest and overall best LLM outputs I've ever seen to date. Loving Deep Research, feels like another level up in the race

replies(1): >>42920981 #

335. toasteros ◴[03 Feb 25 14:24 UTC] No.42918487[source]▶

>>42916899 #

"Pretty neat, but definitely watch out for hallucinations."

We'd never hire someone who just makes stuff up (or at least keep them employed for long). Why are we okay with calling "AI" tools like this anything other than curious research projects?

Can't we just send LLMs back to the drawing board until they have some semblance of reliability?

replies(11): >>42918517 #>>42918573 #>>42918981 #>>42919588 #>>42919860 #>>42920087 #>>42920744 #>>42920980 #>>42921656 #>>42922203 #>>42922968 #

336. svara ◴[03 Feb 25 14:26 UTC] No.42918497[source]▶

>>42914456 #

> Gemini has had this for a month or two,

Would have loved to try it when they released it, but I'm apparently in the wrong country. I think it's not available outside the US (?). OpenAI and DeepSeek have no such issues. It's a bummer really, I'm happy paying for this but they don't want me to.

replies(1): >>42937640 #

337. oldstrangers ◴[03 Feb 25 14:29 UTC] No.42918517{3}[source]▶

>>42918487 #

> Can't we just send LLMs back to the drawing board until they have some semblance of reliability?

Well at this point they've certainly proven a net gain for everyone regardless of the occasional nonsense they spew.

replies(5): >>42918546 #>>42918939 #>>42919393 #>>42920475 #>>42920542 #

338. PartiallyTyped ◴[03 Feb 25 14:32 UTC] No.42918539{5}[source]▶

>>42913919 #

I don’t think you actually read it. The benchmarks are in reference to the model that’s underlying deep-research, and not deep-research itself. For the latter, they have anecdata from scientists.

339. DanHulton ◴[03 Feb 25 14:33 UTC] No.42918546{4}[source]▶

>>42918517 #

That is... debatable. You may be entirely inside the bubble, there.

replies(3): >>42918721 #>>42918746 #>>42918956 #

340. wholinator2 ◴[03 Feb 25 14:34 UTC] No.42918554{3}[source]▶

>>42918413 #

This is... very uncomfortable. An (expanded) AI summary of my HN and reddit usage would appear to be a pretty complete representation of my "online" identity/character. I remember when people would browse your entire comment history just to find something to discredit you on reddit, and that behavior was _heavily_ discouraged. Now, we can just run an AI model to follow you and sentence you to a hell of being permanently discredited online. Give it a bunch of accounts to rotate through, send some voting power behind it (reddit or hn), and just pick apart every value you hold. You could obliterate someone's will to discuss anything online. You could effectively silence all but the most stubborn, and those people you would probably drive insane.

It's a very interesting usecase though, filter through billions of comments and give everyone a score on which real life person they probably are. I wonder if say, Ted Cruz hides behind a username somewhere.

replies(1): >>42919615 #

341. ◴[03 Feb 25 14:36 UTC] No.42918573{3}[source]▶

>>42918487 #

342. sivm ◴[03 Feb 25 14:39 UTC] No.42918601[source]▶

>>42913251 (OP) #

I used it once to research language learning and had my pro mode taken away pending review for abuse.

343. taikahessu ◴[03 Feb 25 14:51 UTC] No.42918721{5}[source]▶

>>42918546 #

Not sure if this was posted as humour, but I don't feel that way. In today's world, where I certainly would consider taking the blue pill, I'm having a blast with LLMs!

It has helped me learn stuff incredibly faster. Especially I find them useful for filling the gaps of knowledge and exploring new topics in my own way and language, without needing to wait an answer from a human (that could also be wrong).

Why does it feel, that "we are entirely inside the bubble" for you?

replies(3): >>42919116 #>>42919678 #>>42920242 #

344. Alifatisk ◴[03 Feb 25 14:53 UTC] No.42918734[source]▶

>>42913251 (OP) #

When I saw new to llms, I used Bing ai in a fun way. So when I was writing my report, it was sometimes hard to find discussions or material about a certain topic.

What I did was to ask Bing ai about that topic and it returned information aswell as sources to where it found those, so I picked up all those links and researched them myself.

Bing ai was a great resource for finding relevant links, this was until I found out about perplexity, my life haven't been the same since.

345. orangepanda ◴[03 Feb 25 14:54 UTC] No.42918746{5}[source]▶

>>42918546 #

You overestimate the importance of being correct

346. opdahl ◴[03 Feb 25 15:03 UTC] No.42918843{3}[source]▶

>>42913642 #

Hi tmnvdb, since you seem to love these super smart LLMs I thought it would be fun to have openais o3-mini-high analyze your recent comments in contrast to the Hacker News Comment Guidelines. Here is the output it gave me, hope it helps you:

------

Hey, I've noticed a few things in your style that are both strengths and opportunities for improvement:

Strengths:

- You clearly have deep knowledge and back up your points with solid data and examples.

- Your confidence and detailed analysis make your arguments compelling.

Opportunities:

- At times, your tone can feel a bit combative, which might shut down conversation.

- Focusing on critiquing ideas rather than questioning someone's honesty can help keep the discussion constructive.

- A clearer structure in longer posts could make your points even more accessible.

Overall, your passion and expertise shine through—tweaking the tone a bit might help foster even more productive debates.

------

Just reply here if you want the full 500+ words analysis that goes into more detail.

347. vessenes ◴[03 Feb 25 15:06 UTC] No.42918880[source]▶

>>42916899 #

Interesting!

I wonder if it’s carried over too much of that ‘helpful’ DNA from 4o’s RLHF. In that case, maybe asking for 500 words was the difficult part — it just didn’t have enough to say based on one SO post and one article, but the overall directives assume there is, and so the model is put into a place where it must publish..

Put another way, it seems this model faithfully replicates the incentives most academics have — publish a positive result, or get dinged. :)

Did it pick up your HN comments? Kadua claims that’s more than enough to roast me, … and it’s not wrong. It seems like there’s enough detail about you (or me) there to do a better job summarizing.

replies(1): >>42919363 #

348. throwaway123lol ◴[03 Feb 25 15:10 UTC] No.42918904[source]▶

>>42913251 (OP) #

This is so lame. This feels like another desperate attempt to stay relevant cobbled together after the DeepSeek announcement last week. What was the other attempt they made? Skip a version number to seem like more progress was made (o1->o3)? From what I can tell "o3" is just the same as o1 with an extra reasoning-effort parameter.

Oh and "Deep research" is available to people on the $200 per month plan? Lol - cool. I've been using DeepSeek a lot more recently and it's so incredibly good even with all the scaling issues.

349. aiono ◴[03 Feb 25 15:12 UTC] No.42918939{4}[source]▶

>>42918517 #

No, from the research around it the findings are mixed. There is no consensus that it's net gain.

350. throwaway123lol ◴[03 Feb 25 15:12 UTC] No.42918940[source]▶

>>42913462 #

Yeah it can be more iterative. Just use individual queries and build on it yourself. This is all this is doing. It's a trick, and OpenAI is a PR hype company at this stage.

351. pama ◴[03 Feb 25 15:13 UTC] No.42918951{6}[source]▶

>>42914454 #

This is an example of the submitted questions. Because it is possible to search it on the web, it is not an example of the accepted questions.

352. ◴[03 Feb 25 15:14 UTC] No.42918956{5}[source]▶

>>42918546 #

353. brushfoot ◴[03 Feb 25 15:15 UTC] No.42918972[source]▶

>>42916899 #

I disagree that this is a useful springboard. And I say that as an AI optimist.

A report full of factual errors that a careful intern wouldn't make is worse than useless (yes, yes, I've mentored interns).

If the hard part is the language, then do the research yourself, write an outline, and have the LLM turn it into complete sentences. That would at least be faster.

Here's the thing, though: If you do that, you're effectively proving that prose style is the low-value part of the work, and may be unnecessary. Which, as much as it pains to me say as a former English major, is largely true.

354. resters ◴[03 Feb 25 15:15 UTC] No.42918973{3}[source]▶

>>42914908 #

I have the $200/month version. Deep Research arrived this morning.

So far I tried it on one problem and it seems limited by the "front end" being 4o-mini. It ignored most of my initial prompt and also ignored the previous research it asked for which I provided. The final output was high quality and definitely was enriched by the web searching it did, but it left out a crucially important dimension of the problem because it was unable to ingest the background info I provided adequately.

I'd like to see a version of it where the front end model is o1-pro

355. throwing_away ◴[03 Feb 25 15:16 UTC] No.42918981{3}[source]▶

>>42918487 #

> We'd never hire someone who just makes stuff up (or at least keep them employed for long).

This is contrary to my experience.

replies(1): >>42921109 #

356. feznyng ◴[03 Feb 25 15:19 UTC] No.42919020[source]▶

>>42913531 #

You can buy residential proxies to pretend you're a regular person IIRC, some of the browser automation companies do that to bypass rate limiting, captchas, etc.

357. dingnuts ◴[03 Feb 25 15:26 UTC] No.42919116{6}[source]▶

>>42918721 #

>It has helped me learn stuff incredibly faster. Especially I find them useful for filling the gaps of knowledge and exploring new topics in my own way and language

and then you verify every single fact it tells you via traditional methods by confirming them in human-written documents, right?

Otherwise, how do you use the LLM for learning? If you don't know the answer to what you're asking, you can't tell if it's lying. It also can't tell if it's lying, so you can't ask it.

If you have to look up every fact it outputs after it does, using traditional methods, why not skip to just looking things up the old fashioned way and save time?

Occasionally an LLM helps me surface unknown keywords that make traditional searches easier, but they can't teach anything because they don't know anything. They can imagine things you might be able to learn from a real authority, but that's it. That can be useful! But it's not useful for learning alone.

And if you're not verifying literally everything an LLM tells you.. are you sure you're learning anything real?

replies(3): >>42919415 #>>42919912 #>>42924402 #

358. timabdulla ◴[03 Feb 25 15:48 UTC] No.42919363{3}[source]▶

>>42918880 #

I didn't actually give it a goal of writing any particular length, but I do think that perhaps given my not-so-large online footprint, it may have felt "pressured" to generate content that simply isn't there.

It didn't pick up my HN comments, probably because my first and last name are not in my profile, though obviously that is my handle in a smooshed-together form.

359. kees99 ◴[03 Feb 25 15:51 UTC] No.42919393{4}[source]▶

>>42918517 #

"Occasional nonsense" doesn't sound great, but would be tolerable.

Problem is - LLMs pull answers from their behind, just like a lazy student on the exam. "Halucinations" is the word people use to describe this.

Those are extremely hard to spot - unless you happen to know the right answer already, at which point - why ask? And those are everywhere.

One example - recently there was quite a discussion about llm being able to understand (and answer) base16 (aka "hex") encoding on the fly, so I went on to try base64, gzipped base64, zstd-compressed base64, etc...

To my surprise, LLM got most of those encoding/compressions right, decoded/uncompressed the question, and answered it flawlessly.

But with few encodings, LLM detected base64 correctly, got compression algorithm correctly, and then... instead of decompressing, made up a completely different payload, and proceeded to answer that. Without any hint of anything sinister going.

We really need LLMs to reliably calculate and express confidence. Otherwise they will remain mere toys.

replies(1): >>42919606 #

360. kardos ◴[03 Feb 25 15:53 UTC] No.42919415{7}[source]▶

>>42919116 #

The Gell-Mann amnesia effect applies to LLMs as well!

https://en.m.wikipedia.org/wiki/Gell-Mann_amnesia_effect

361. ctoth ◴[03 Feb 25 15:54 UTC] No.42919423{4}[source]▶

>>42915840 #

Maybe this would make sense if you saw the whole world as "kids" that you had to protect. As an adult who lives in an adult world, I would like adults to have access to metal tools and not just foam ones.

replies(1): >>42919675 #

362. z7 ◴[03 Feb 25 15:54 UTC] No.42919428[source]▶

>>42913251 (OP) #

Business and technical analysis of DeepSeek's entire R&D history with extrapolations:

https://chatgpt.com/share/67a0d59b-d020-8001-bb88-dc9869d52b...

363. cdblades ◴[03 Feb 25 16:06 UTC] No.42919588{3}[source]▶

>>42918487 #

> Why are we okay with calling "AI" tools like this anything other than curious research projects?

Because they are a way to launder liability while reducing costs to produce a service.

Look at the AI-based startups y-combinator has been funding. They match that description.

364. oldstrangers ◴[03 Feb 25 16:07 UTC] No.42919606{5}[source]▶

>>42919393 #

Yeah, what you said represents a 'net gain' over not having any of that at all.

365. throwaway519 ◴[03 Feb 25 16:08 UTC] No.42919615{4}[source]▶

>>42918554 #

throwaway/anonymous.

not just for when discussion of the content not the personality behind it is important.

366. dlivingston ◴[03 Feb 25 16:08 UTC] No.42919630{3}[source]▶

>>42918413 #

I put my profile in [0] and it's mostly silly; a few comments extracted and turned into jokes. No deep insights into me, and my "Top 3 Technologies" are hilariously wrong (I've never written a single line of TypeScript!)

[0]: https://hn-wrapped.kadoa.com/dlivingston

367. n4r9 ◴[03 Feb 25 16:11 UTC] No.42919675{5}[source]▶

>>42919423 #

I guess I can replace "kid" with "toddler" and add "unsupervised" at the end.

368. squigz ◴[03 Feb 25 16:11 UTC] No.42919678{6}[source]▶

>>42918721 #

> without needing to wait an answer from a human (that could also be wrong).

The difference is you have some reassurances that the human is not wrong - their expertise and experience.

The problem with LLMs, as demonstrated by the top-level comment here, is that they constantly make stuff up. While you may think you're learning things quickly, how do you know you're learning them "correctly", for lack of a better word?

Until an LLM can say "I don't know", I really don't think people should be relying on them as a first-class method of learning.

369. squigz ◴[03 Feb 25 16:14 UTC] No.42919707[source]▶

>>42916136 #

> They can only sound like they do.

More importantly, I think, is that they are incapable of not doing so. Have we figured out how to make an LLM realize and answer that it doesn't know an answer?

370. deeviant ◴[03 Feb 25 16:24 UTC] No.42919860{3}[source]▶

>>42918487 #

Yeah, I used to hire people, but then one of them made a mistake, now I'm done with them forever, they are useless. It is not I, who is directing the workers, who cannot create a process that is resistant to errors, it's definitely the fact that all people are worthless until they make no errors as there truly is no other way of doing things other than telling your intern to do a task then having them send it directly to the production line.

371. taikahessu ◴[03 Feb 25 16:28 UTC] No.42919912{7}[source]▶

>>42919116 #

I guess it all depends on the topic and levels of trust. How can I be certain that I have a brain? I just have to take something for granted, don't I? Of course I will "verify" the "important stuff", but what is important? How can I tell? Most of the time only thing I need is a pointer in the right direction. Wrong advice? I know when I get there I suppose.

I can remember numerous things I was told while growing up, that aren't actually true. Either by plain lies and rumours or because of the long list of our cognitive biases.

> If you have to look up every fact it outputs after it does, using traditional methods, why not skip to just looking things up the old fashioned way and save time?

What is the old fashioned way? I mean people learn "truths" these days from Tiktok and Youtube. Some of the stuff is actually very good, you just have to distill it based on the stuff I was being taught at school. Nonody has yet declared LLMs as a subtitute for schools, maybe they soon will, but neither "guarantees" us anything. We could as well be taught political agendas.

I could order a book about construction, but I wouldn't build a house without asking a "verified" expert. Some people build anyway and we get some catastrofic results.

Levels of trust, it's all games and play until it gets serious, like what to eat or doing something that involves life threatening physics. I take it as playing with a toy. Surely something great have come up from only a few piece of legos?

> And if you're not verifying literally everything an LLM tells you.. are you sure you're learning anything real?

I guess you shouldn't do it that way. But really, so far the topics I've rigorously explored with ChatGPT for example, have been better than your average journalism. What is real?

replies(1): >>42920442 #

372. senordevnyc ◴[03 Feb 25 16:35 UTC] No.42919989{6}[source]▶

>>42916851 #

Haha, are you referring to the Chromebook Pixel? How is that relevant to stopping spam calls?

Pixel phone launched in 2016.

373. rvnx ◴[03 Feb 25 16:44 UTC] No.42920086{6}[source]▶

>>42914669 #

It's literally a browsing agent that searches the internet and they know the questions in advance when preparing the agent

Without internet: 10%

With internet: 23%

In addition:

> We found that the ground-truth answers for one dataset were widely leaked online

in very small letters, and they blocked these URLs at runtime but not training time.

It's not bad, but not revolutionary at all compared to the leap that was GPT-2 from GPT-3, or GPT-4o to DeepSeek-R1

replies(1): >>42920352 #

374. ramon156 ◴[03 Feb 25 16:44 UTC] No.42920087{3}[source]▶

>>42918487 #

3k a month vs ~500 dollars a month. That's all u need to know. Not saying its as good, but its all some managers care about

375. DoctorOetker ◴[03 Feb 25 16:47 UTC] No.42920124[source]▶

>>42913251 (OP) #

Would formalizing Wiles' proof of Fermat's Last Theorem be considered deep research? Is it able to formalize it in say metamath's set.mm?

Or is the position of OpenAI that Wiles' proof is incomplete?

replies(1): >>42925214 #

376. HarHarVeryFunny ◴[03 Feb 25 16:56 UTC] No.42920231[source]▶

>>42916136 #

> language models aren’t large enough to actually know everything

I'd say they don't know anything.

An LLM base model, before it is post-trained with RL, just has access to a sliced and diced corpus of human output. Take the contents of 4chan and WikiPedia, put in blender and mix and chop into "training sample" sized bites, then learn the statistical regularities of this blended mess. It is what it is - not exactly what I'd call a knowledge base, even though there are bits of knowledge in there.

When you add RL-based post-training for reasoning, all you are doing is trying get the model to be more selective when you are sampling from it - encouraging it to suppress some statistics, and emphazise others, such that when you sample from it the output looks more like valid reasoning steps and/or conclusions, per the verified reasoning examples you train it on.

I'm well aware of how useful RL-tuned models (whatever the goal) can be, but at the end of the day all they are doing is taking a statistical babbler and saying "try to output patterns more like this". It's not exactly a recipe for factuality or rationality - we've just gone from hallucination-prone base models, to gaslighting-prone RL-tuned "reasoning" models that output stuff that sounds like coherent reasoning.

What missing from all of this - what makes it different from how animals learn - it that the model has no experience of it's own, no autonomy or motivation to explore, learn and verify, and hence no episodic memories of how it learnt something (tried it and ran controlled experiments, or just overheard it on the bus), and what that implies about it's trustworthiness.

It's amazing that LLMs work as well as they do, a reflection of how much of what we do can be accomplished by reactive pattern matching, but if you want to go beyond that to something that can learn and discern the truth for itself, this seems the wrong paradigm altogether.

replies(1): >>42953962 #

377. toasteros ◴[03 Feb 25 16:56 UTC] No.42920242{6}[source]▶

>>42918721 #

Are you sure it's helped you learn?

In the early days of ChatGPT where it seemed like this fun new thing, I used it to "learn" C. I don't remember anything it told me, and none of the answers it gave me were anything that I couldn't find elsewhere in different forms - heck I could have flipped open Kernighan & Ritchie to the right page and got the answer.

I had a conversation with an AI/Bitcoin enthusiast recently. Maybe that already tells you everything you need to know about this person, but to the hammer the point home, they made a claim to similar to you: "I learn much more and much better with AI". They also said they "fact check" things it "tells" them. Some moments later they told me "Bitcoin has its roots in Occupy Wall Street".

A simple web search tells you that Bitcoin is conceived a full 2 years before Occupy. How can they be related?

It's a simple error that can be fact checked simply. It's a pretty innocuous falsity in this particular case - but how many more falsehoods have they collected? How do those falsehoods influence them on a day-by-day basis?

How many falsehoods influence you?

A very well meaning activist posted a "comprehensive" list of all the programs that were to be halted by the grants and loans freezes last week. Some of the entries on the list weren't real, or not related to the freeze. They revealed they used ChatGPT to help compile the list and then went down one-by-one to verify each one.

With such meticulous attention to detail, incorrect information still filtered through.

Are you sure you are learning?

replies(2): >>42920446 #>>42921751 #

378. yencabulator ◴[03 Feb 25 17:02 UTC] No.42920312{4}[source]▶

>>42913838 #

> Amazon listings are blocked from google shopping

I see Amazon results there all the time. 3 of the visible 8 sponsored results are Amazon, in the non-sponsored results an Amazon listing is either first or second in every category.

379. CamperBob2 ◴[03 Feb 25 17:06 UTC] No.42920352{7}[source]▶

>>42920086 #

If they "knew the questions in advance," why'd they need Internet access at all? The ability to use the same data sources humans would use is not the insult you seem to think it is.

Again: the assertion was yours, so let us know the results of your own work.

380. dingnuts ◴[03 Feb 25 17:15 UTC] No.42920442{8}[source]▶

>>42919912 #

> What is the old fashioned way?

Looking in a resource written by someone with sufficient ethos that they can be considered trustworthy .

> What is real?

I'm not arguing ontology about systems that can't do arithmetic. you're not arguing in good faith at all

381. panarky ◴[03 Feb 25 17:15 UTC] No.42920446{7}[source]▶

>>42920242 #

When your bitcoiner friend told you something that's not true, that's a human who hallucinated, not an LLM.

Maybe we're already at AGI and just don't know it because we overestimate the capabilities of most humans.

replies(1): >>42920936 #

382. blackeyeblitzar ◴[03 Feb 25 17:16 UTC] No.42920459[source]▶

>>42913379 #

I don’t think you’re comparing the right things here. This feature is more like Google’s Deep Research, which basically goes off and does a whole lot of search and compute to produce something more like a full research report. This has nothing to do with open weight models like DeepSeek (note: DeepSeek, Llama, etc are NOT open source). This feature doesn’t just require the research on the model but also enormous compute. Plus anyone using such a feature for real work is not going to be using DeepSeek or whatever, but a product with trustworthy practices and guarantees.

replies(1): >>42939033 #

383. majormajor ◴[03 Feb 25 17:18 UTC] No.42920475{4}[source]▶

>>42918517 #

I think as these things get more integrated into customer service workflows - especially for things like insurance claims - there's gonna start being a lot more buyer's remorse on everyone's part.

We've tried for decades to turn people into reliable robots, now many companies are running to replace people robots with (maybe less reliable?) robot-robots. What could go wrong? What are the escalation paths going to be? Who's going to be watching them?

384. hawaiianbrah ◴[03 Feb 25 17:24 UTC] No.42920542{4}[source]▶

>>42918517 #

A net gain for everyone? Tell that to the artists its screwing over!

385. bilater ◴[03 Feb 25 17:38 UTC] No.42920694[source]▶

>>42913251 (OP) #

Not quite the agent they are building but I have an open source alternative that lets you use a variety of models, based on links of your choice to generate reports: https://github.com/btahir/open-deep-research

386. mlsu ◴[03 Feb 25 17:41 UTC] No.42920726{3}[source]▶

>>42913709 #

If 20% of people don't care about being correct, the rest of everyone can deal with that. If 80% of people don't care about being correct, the rest of us will not be able to deal with that.

Same thing as misinformation. A sufficient quantitative difference becomes a qualitative difference at some point.

387. randcraw ◴[03 Feb 25 17:41 UTC] No.42920728{3}[source]▶

>>42913713 #

I don't think so. Instead of SEO, I think we'll soon see 'LLMO' dominating such uses, where LLM summaries are reshaped by vendors and etailers to misrepresent facts in ways that favor them over others.

I suspect this can be done simply by poisoning a query with supplemental suggestions of sources to use in a RAG, many of which don't even have to be publicly available but are made accessible to the LLM (perhaps by submitting hidden URLs that mislead the summary along with the query).

But even after such a practice is uncovered and roundly maligned, that won't stop the infinite supply of net con men from continuing to inject their poisons into the background that drives deep research, so long as the LLM maker doesn't actively oppose this practice actively and publicly -- which none of them have been willing to do with any other LLM operational details so far.

In fact, I predict that if a LLM summary like DR's does NOT soon provide references to the sources of the facts it relies on, in no time users will disregard such summaries to be yet more uselessly unreliable pfaff from yet another net disreputable -- as we do with search engine summaries now.

388. kenjackson ◴[03 Feb 25 17:43 UTC] No.42920744{3}[source]▶

>>42918487 #

You can use them for whatever you like, or not use them. Everyone has a different bar for when technology is useful. My dad doesn't think EVs are useful due to the long charge times, but there are others who find it fully acceptable.

389. TheGradfather ◴[03 Feb 25 17:47 UTC] No.42920806[source]▶

>>42913251 (OP) #

The OpenAI Deep Research graph showing tool calls vs pass rate reveals something fascinating about how these models handle increasing amounts of information. The relationship follows a logistic curve that plateaus around 16% pass rate, even as we allow more tool calls.

This plateau behavior reflects something deeper about our current approach to AI. We've built transformer architectures partly inspired by simplified observations of human cognition - particularly how our brains use attention mechanisms to filter and process information. And like human attention, these models have inherent constraints: each attention layer normalizes scores to sum to 1, creating a fixed "attention budget" that must be distributed across all inputs.

A recent paper (https://arxiv.org/abs/2501.19399) explores this limitation, showing how standard attention becomes increasingly diffuse with longer contexts. Their proposed "Scalable-Softmax" helps maintain focused attention at longer ranges, but still shows diminishing returns - pushing the ceiling higher rather than eliminating it.

But here's the deeper question: As we push toward AGI and potentially superintelligent systems, should we remain bound by architectures modeled on our current understanding of human cognition? The human brain's limited attention mechanism evolved under specific constraints and for specific purposes. While it's remarkably effective for human-level intelligence, it might be fundamentally limiting for artificial systems that could theoretically process information in radically different ways.

Looking at the Deep Research results through this lens, the plateau might not just be a technical limitation to overcome, but a sign that we need to fundamentally rethink how artificial systems could process and integrate information. Instead of trying to stretch the capabilities of attention-based architectures, perhaps we need to explore entirely different paradigms of information processing that aren't constrained by biological analogues.

This isn't to dismiss the remarkable achievements of transformer architectures, but rather to suggest that the path to AGI might require breaking free from some of our biologically-inspired assumptions. What would an architecture that processes information in ways fundamentally different from human cognition look like? How might it integrate and reason about information without the constraints of normalized attention?

Would love to hear thoughts from others working on these problems, particularly around novel approaches that move beyond our current biological inspirations.

390. ComputerGuru ◴[03 Feb 25 17:57 UTC] No.42920892{3}[source]▶

>>42918413 #

That.. seems to just take a few (three or four) random comments that received some attention and then extrapolate an entire profile based on (incorrectly) interpreting their contents?

https://hn-wrapped.kadoa.com/ComputerGuru

391. toasteros ◴[03 Feb 25 18:00 UTC] No.42920936{8}[source]▶

>>42920446 #

The assertion is that they "learned" that Bitcoin came from Occupy from an AI.

If AI is teaching you, you are going to collect a thousand papercuts of lies.

392. dumbfounder ◴[03 Feb 25 18:04 UTC] No.42920980{3}[source]▶

>>42918487 #

Why not just verify the output? It’s faster than generating the entire thing yourself. Why do you need perfection in a productivity tool?

replies(1): >>42921792 #

393. elashri ◴[03 Feb 25 18:04 UTC] No.42920981{4}[source]▶

>>42918479 #

Thank you very much for doing that. It is actually somehow impressive. It got a lot of big picture comparison and points correct. There are problem with some details but overall it does save some work for initial search process.

What I like is that it asked you before clarifying questions before but I wonder if it just generic. Because the prompt mentioned that this would be for "presentation at a topical meeting of particle physicists" but still asked its last question about

> Intended Audience: Should the analysis assume a general physics audience or a more specialized group of particle physicists?

Also probably expected but it didn't include or reference graphs/plots.

394. machiaweliczny ◴[03 Feb 25 18:07 UTC] No.42921017[source]▶

>>42916899 #

This is very bearish for current AI. Seems like 99% reliability is still too small with compounding errors. But I wonder of this is inherently specific to longer context or if this just depends on how it’s trained. In theory longer context => more errors

Although I think people are the same, too big problem and you are getting lost unless taking it in bites, so seems like OpenAI implementation is just bad because o3 hallucination benchmark shouldn’t lead to such poor performance

395. kadushka ◴[03 Feb 25 18:16 UTC] No.42921109{4}[source]▶

>>42918981 #

Our president begs to differ! Or pretty much any elected official for that matter.

replies(1): >>42928039 #

396. YmiYugy ◴[03 Feb 25 18:38 UTC] No.42921302{3}[source]▶

>>42913867 #

Do we actually know whether it got this specific example right? It got 20% on HLE, but I think a few questions are quite a bit easier.

397. roflyear ◴[03 Feb 25 19:11 UTC] No.42921656{3}[source]▶

>>42918487 #

> We'd never hire someone who just makes stuff up

We do all the time - of course we do, all the time.

398. Bjorkbat ◴[03 Feb 25 19:18 UTC] No.42921709[source]▶

>>42916899 #

So, I still think this is a cool tool for search reasons, but otherwise the tendency to hallucinate makes it questionable as a researcher.

Hypothetically speaking, if the time you saved is now spent verifying the statements of your AI researcher, then did you really save any time at all?

If the answers aren't important enough to verify, then was it ever even important enough to actually research to begin with?

399. taikahessu ◴[03 Feb 25 19:22 UTC] No.42921751{7}[source]▶

>>42920242 #

I guess the real learning happens outside the AI, here in real life. Does the code run? Sure, it's on my local and not in production, but I would've never have the patience to get "that new thing working" without AI as assistant.

Does the food taste good? Oops, there's a bit too much vegetables here, they are never gonna fit in this pan of mine. Not a big deal, next time I'll be wiser.

AI is like a hypothesis machine. You're gonna have to figure out if the output is true. Few years ago, just testing any machine's "intelligence" was pretty quickly done and machine failed miserably. Now, the accuracy is astounishing in comparison.

> How many falsehoods influence you?

That is a great question. The answer is definitely not zero. I try to live by with a hacker mentality and I'm an engineer by trade. I read news and comments, which I'm not sure is good for me. But you also need some compassion towards oneself. It's not like ripping everything open will lead to salvation. I believe the truth does set you free, eventually. But all in one's time...

Anyway, AI is a tool like any other. Someone will hammer their fingers with it. I just don't understand the hate. It's not like we're drinking any AI koolaids here. It's just like it was 30 years ago (in my personal journey), you had a keyboard and a machine, you asked it things and got gibberish. Now the conversation with it just started to get interesting. Peace.

400. toasteros ◴[03 Feb 25 19:25 UTC] No.42921792{4}[source]▶

>>42920980 #

At that point why not just... I dunno, do the research yourself?

replies(1): >>42922343 #

401. dncbfwa ◴[03 Feb 25 19:47 UTC] No.42922036{3}[source]▶

>>42918214 #

lol

402. nomel ◴[03 Feb 25 20:02 UTC] No.42922203{3}[source]▶

>>42918487 #

LLM are "great" in some use cases, "ok" in others, and "laughable" in more.

Some people might find $500 worth of value, in their specific use case, in those "great" and "ok" categories, where they get more value than "lies" out of it.

A few verifiable lies, vs hours of time, could be worth it for some people, with use cases outside of your perspective.

403. tuckerman ◴[03 Feb 25 20:12 UTC] No.42922343{5}[source]▶

>>42921792 #

Perhaps because the time to proofread/correct is less than to do it from scratch? That would still make it a valuable tool

replies(2): >>42922856 #>>42923847 #

404. enknamel ◴[03 Feb 25 20:13 UTC] No.42922351[source]▶

>>42913251 (OP) #

So they did a RAG on the whole internet? Basically Google search results summary but better?

405. Yoric ◴[03 Feb 25 20:46 UTC] No.42922856{6}[source]▶

>>42922343 #

But is it?

406. rybosome ◴[03 Feb 25 20:55 UTC] No.42922968{3}[source]▶

>>42918487 #

This doesn’t make LLMs worthless, you just need to structure your processes around fallibility. Much like a well designed release pipeline is built with the expectation that devs will write bugs that shouldn’t ship.

407. toasteros ◴[03 Feb 25 21:59 UTC] No.42923847{6}[source]▶

>>42922343 #

How?

It's given you some information and now you have to seek out a source to verify that it's correct.

Finding information is hard work. It's why librarian is a valuable skilled profession. What you've done by suggesting that I should "verify" or "proofread" what a glorified, water-wasting Markov chain has given me now entails me looking up that information to verify that it's correct. That's...not quite doubling the work involved but it's adding an unnecessary step.

I could have searched for the source in the first instance. I could have gone to the library and asked for help.

We spent time coming up with a question ("prompt engineering"! hah!), we used up a bunch of electricity for an answer to be generated and now you...want me to search up that answer to find the source? Why did we do the first step?

People got undergraduate degrees - hell, even PhDs - before generative AI.

Look up the tweet from someone who said "Sometimes when coming up with a good prompt for ChatGPT, I sometimes come up with the answer myself without needing to submit".

replies(2): >>42925621 #>>42927084 #

408. therealpygon ◴[03 Feb 25 22:13 UTC] No.42924019{5}[source]▶

>>42914730 #

Explain the trade secrets contained in non-copyrightable AI outputs and the reasonable efforts OpenAI takes to keep its AI output “secret”. Or are you confused about what a “trade secret” actually is?

409. dialup_sounds ◴[03 Feb 25 22:45 UTC] No.42924402{7}[source]▶

>>42919116 #

Saying you need to verify "literally everything" both overestimates the frequency of hallucinations and underestimates the amount of wrong found in human-written sources. e.g. the infamous case of Google's AI recommending Elmer's glue on pizza was literally a human-written suggestion first: https://www.reddit.com/r/Pizza/comments/1a19s0/my_cheese_sli...

410. teleforce ◴[03 Feb 25 23:39 UTC] No.42924979[source]▶

>>42913251 (OP) #

What a coincident of releasing deep research for your product when one of your main competitors has DeepSeek R1 as their best performant version /s

Seriously, for the past 20+ years it's hard to imagine doing research without Google platform namely Google Search, Scholar, Patent and Book, but now it seems agent AI based on LLM is the way to. In twenty years in the future it will be hard to imagine that doing research without them. But as many people already pointed out Google probably the best company by far to perform this emerging AI based research. In data eco-system terms (refer to any book on data engineering), Google has already perform has the most important data preparation and data engineering upstream activities including data ingestion and transformation. Now given their vast amount of processed data they can just serve it to downstream data analytics or AI for performing research with minimum error/hallucinations as possible. According to Google there is no moat for any companies against open source LLM, but if any company that can has the moat it will be Google itself.

411. DoctorOetker ◴[04 Feb 25 00:00 UTC] No.42925214[source]▶

>>42920124 #

ah yes, when you get downvoted for asking Questions, not even Claims.

412. gqgs ◴[04 Feb 25 00:20 UTC] No.42925444[source]▶

>>42913251 (OP) #

If I'm understanding this correctly it sounds functionally similar to the report generating project Standard released a few weeks ago [1].

[1] https://storm.genie.stanford.edu/

413. murloc_oracle ◴[04 Feb 25 00:34 UTC] No.42925621{7}[source]▶

>>42923847 #

Verifying information is an order of magnitude easier than compiling it or synthesizing it in the first place. Prompt engineering is an order of magnitude easier still. This is obvious to most people, but apparently it needs to be said.

An entire day of generating responses with ChatGPT uses less water and energy than your morning shower. You seem terribly concerned about signaling the virtues of abstaining from technology use on behalf of purported resource misuse, yet you're sitting at a computer typing away.

You're not a serious person, and you're wasting everyone's time. Please leave the internet and go play with rocks in a cave.

replies(1): >>42932253 #

414. tuckerman ◴[04 Feb 25 02:47 UTC] No.42927084{7}[source]▶

>>42923847 #

Sometimes you don't need sources to verify something is correct, its something you can directly verify. To reduce it to the easiest version of this, I ask for code to do something, it writes me code, I run my unit test, it passes, my time is saved!

For other things, it depends, but if I'm asking it to do a survey I can look at its results and see if they fit what I'm looking for, check the sources it gives me, etc. People pay analysts/paralegals/assistants to do exactly this kind of work all the time expecting that they will need to check it over. I don't see how this is any different.

I don't think the library/electricity responses are serious but to move on to the point about degrees... people also got those degrees before calculators, before computers, before air travel, before video calls, before the internet, before electricity, yet all of those things assist in creating knowledge. I think its perfectly reasonable to look at these LLMs/chat assistants in the same light: as a tool that can augment human productivity in its own way.

replies(1): >>42932288 #

415. Gud ◴[04 Feb 25 04:39 UTC] No.42928039{5}[source]▶

>>42921109 #

Not everything is politics. Already your president gets too much media spotlight.

416. nycdatasci ◴[04 Feb 25 05:00 UTC] No.42928206{3}[source]▶

>>42914928 #

I thought it was an appropriate response in multiple contexts. If I’m wrong, please provide a rebuttal or counterpoint.

417. nycdatasci ◴[04 Feb 25 05:10 UTC] No.42928275{3}[source]▶

>>42914974 #

Appreciate the thoughtful engagement. I work for a large, US-based investment firm. No relationship w/ High Flyer.

This isn’t a single model. Almost the entire leadership team around sama has left and almost certainly agrees with me on this. OpenAI’s business model is not sustainable.

418. littlestymaar ◴[04 Feb 25 05:22 UTC] No.42928371[source]▶

>>42913251 (OP) #

There's an open source version of this that you can use with local LLMs: https://www.reddit.com/r/LocalLLaMA/comments/1gvlzug/i_creat...

In fact, it's been three month since the release and I wouldn't even be surprised if OpenAI took inspiration from it.

419. sponnath ◴[04 Feb 25 06:11 UTC] No.42928676{3}[source]▶

>>42914797 #

One of the biggest issues with these things is reliability. o3 likely increases that quite a bit. The idea itself isn't novel but I don't see why this wouldn't be useful?

420. martin293 ◴[04 Feb 25 06:57 UTC] No.42929005{3}[source]▶

>>42913529 #

Such an informative, well meaning comment

421. frontalier ◴[04 Feb 25 10:09 UTC] No.42930490{3}[source]▶

>>42913776 #

is this you?

https://news.ycombinator.com/item?id=42913544

replies(1): >>42934302 #

422. steveridout ◴[04 Feb 25 10:44 UTC] No.42930745[source]▶

>>42914043 #

Just use a VPN to pretend you're in the US and it works.

423. martin82 ◴[04 Feb 25 12:04 UTC] No.42931299[source]▶

>>42913251 (OP) #

They reduce one expensive but mostly useless gimmick after the other.

No one if ever going to trust the output this generates and then needs to spend so much time fact checking that they might as well do the entire research from scratch themselves...

424. toasteros ◴[04 Feb 25 13:46 UTC] No.42932253{8}[source]▶

>>42925621 #

You made a new account just to post this; I'm flattered! Perhaps your normal account is tied to your professional identity?

Do take care.

425. toasteros ◴[04 Feb 25 13:49 UTC] No.42932288{8}[source]▶

>>42927084 #

I'm interested to hear more about how you can verify information without a source. What are you looking at when you search for the verification, exactly?

replies(1): >>42938578 #

426. tmnvdb ◴[04 Feb 25 16:14 UTC] No.42934302{4}[source]▶

>>42930490 #

Gleefully celebrating the defeat of my enemies is not complaining.

427. ejang0 ◴[04 Feb 25 18:18 UTC] No.42936422[source]▶

>>42914043 #

https://help.openai.com/en/articles/10500283-deep-research-f... "Pro users (located outside the UK, Switzerland, and the European Economic Area) have access to 100 deep research tasks per month." So to me that must include Canada, right? I signed up for Pro. There it is. (For future reference, for someone else.)

428. michael_swift ◴[04 Feb 25 18:39 UTC] No.42936702{3}[source]▶

>>42913709 #

don't you think the problem of checking for correctness then becomes more insidious then? we now can generate hundreds of reports that look very professional on the surface. the usual things that would tip you off that this person was careless aren't there -- typos, poor sentence construction, missing references. just more noise to pick through for signal

429. hmottestad ◴[04 Feb 25 19:50 UTC] No.42937640{3}[source]▶

>>42918497 #

OpenAI Deep Research isn’t available in Norway at least, or the rest of Europe basically :(

replies(1): >>42962014 #

430. tuckerman ◴[04 Feb 25 20:59 UTC] No.42938578{9}[source]▶

>>42932288 #

Some code or maths proofs can be self supporting with things like unit tests or proof checkers as an example

431. rvz ◴[04 Feb 25 21:28 UTC] No.42939033{3}[source]▶

>>42920459 #

> This feature is more like Google’s Deep Research, which basically goes off and does a whole lot of search and compute to produce something more like a full research report.

Of course. It is in response to their disastrous operator demo which did not justify the $200 per month ChatGPT Pro subscription on top of the release of DeepSeek to make matters worse for them.

> This has nothing to do with open weight models like DeepSeek (note: DeepSeek, Llama, etc are NOT open source).

It obviously does. Even before they rushed this presentation, they made o3-mini available for ChatGPT free users so it in direct response to DeepSeek.

> This feature doesn’t just require the research on the model but also enormous compute. Plus anyone using such a feature for real work is not going to be using DeepSeek or whatever, but a product with trustworthy practices and guarantees.

Nothing that Perplexity + DeepSeek-R1 can already do.

So what is your point?

432. infecto ◴[05 Feb 25 03:06 UTC] No.42943191{6}[source]▶

>>42918392 #

You are sort of moving the goal post. The fact remains, drone delivery exists and is a solved problem. Major metro areas like Dallas, Texas have it through retailers like Walmart. Just because it does not meet your specific goal post does not mean it's a proof of concept.

433. saagarjha ◴[05 Feb 25 08:10 UTC] No.42945492{4}[source]▶

>>42916141 #

https://www.youtube.com/watch?v=Byg8VZdKK88&t=141s

434. southernplaces7 ◴[05 Feb 25 19:40 UTC] No.42953962{3}[source]▶

>>42920231 #

The amusing thing is that despite all you describe, and which many other people have described on this site and in others in great detail about the nature of LLMs, there are still many people who believe that these models essentially possess intelligence, and that it's little different from how a human mind expresses cognition.

I've seen this thinking to be especially prevalent among tech types (especially among many comments on this site), and more so than among average non-tech people I know.

It seems to be a reflection of a certain forced, almost ideological techno-reductionist thinking against the honestly complex and largely mysterious nature of consciousness. Many non-tech people on the other hand accept this mystery of consciousness and paradoxically are thus less likely to consider an LLM to be anything deeper than the clever but mindless pattern-matching trick that it is

435. hmottestad ◴[06 Feb 25 13:17 UTC] No.42962014{4}[source]▶

>>42937640 #

It seems to be available in Norway now.

↑