OpenAI, Google and Anthropic are struggling to build more advanced AI

(www.bloomberg.com)

625 points lukebennett | 4 comments | 13 Nov 24 13:28 UTC


LASR ◴[14 Nov 24 19:19 UTC] No.42140045[source]▶

Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?

I lead a team exploring cutting edge LLM applications and end-user features. It's my intuition from experience that we have a LONG way to go.

GPT-4o / Claude 3.5 are the go-to models for my team. Every combination of technical investment + LLMs yields a new list of potential applications.

For example, combining a human-moderated knowledge graph with an LLM and RAG allows you to build "expert bots" that understand your business context / your codebase / your specific processes and act almost like a human coworker on your team.

If you now give it some predictive / simulation capability - eg: simulate the execution of a task or project like creating a github PR code change, and test against an expert bot above for code review, you can have LLMs create reasonable code changes, with automatic review / iteration etc.

Similarly there are many more capabilities that you can ladder on and expose into LLMs to give you increasingly productive outputs from them.

Chasing after model improvements and "GPT-5 will be PhD-level" is moot imo. When did you last hire a PhD coworker who was productive on day 0? You need to onboard them with human expertise, and then give them execution space / long-term memories etc to be productive.

Model vendors might struggle to build something more intelligent. But my point is that we already have so much intelligence and we don't know what to do with that. There is a LOT you can do with high-schooler level intelligence at super-human scale.

Take a naive example. 200k context windows are now available. Most people, through ChatGPT, type out maybe 1500 tokens. That's a huge amount of untapped capacity. No human is going to type out 200k of context. Hence why we need RAG, and additional forms of input (eg: simulation outcomes) to fully leverage that.
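
A minimal sketch of what filling that unused context could look like, assuming retrieval has already produced scored text chunks. The function name `pack_context`, the `(score, text)` tuple shape, and the whitespace-based token count are all illustrative assumptions; a real system would use the model's own tokenizer.

```python
def pack_context(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-scored chunks that fit in the token budget.

    chunks: iterable of (score, text) pairs from some retrieval step (assumed).
    count_tokens: crude whitespace count by default, NOT a real tokenizer.
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        # Skip chunks that would overflow the budget; smaller ones may still fit.
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected, used


if __name__ == "__main__":
    retrieved = [(0.9, "a b c"), (0.5, "d e f g"), (0.8, "h i")]
    print(pack_context(retrieved, budget_tokens=5))
```

The greedy pass is the simplest possible policy; it ignores ordering and overlap between chunks, which a production RAG pipeline would likely need to handle.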

replies(43): >>42140086 #>>42140126 #>>42140135 #>>42140347 #>>42140349 #>>42140358 #>>42140383 #>>42140604 #>>42140661 #>>42140669 #>>42140679 #>>42140726 #>>42140747 #>>42140790 #>>42140827 #>>42140886 #>>42140907 #>>42140918 #>>42140936 #>>42140970 #>>42141020 #>>42141275 #>>42141399 #>>42141651 #>>42141796 #>>42142581 #>>42142765 #>>42142919 #>>42142944 #>>42143001 #>>42143008 #>>42143033 #>>42143212 #>>42143286 #>>42143483 #>>42143700 #>>42144031 #>>42144404 #>>42144433 #>>42144682 #>>42145093 #>>42145589 #>>42146002 #

afro88 ◴[14 Nov 24 20:17 UTC] No.42140726[source]▶

>>42140045 #

> potential applications > if you ... > for example ...

Yes there seems to be lots of potential. Yes we can brainstorm things that should work. Yes there are a lot of examples of incredible things in isolation. But it's a little bit like those youtube videos showing amazing basketball shots in 1 try, when in reality lots of failed attempts happened beforehand. Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.

Show me the things you / your team have actually built that have decent retention and metrics concretely proving efficiency improvements.

LLMs are so hit and miss from query to query that if your users don't have a sixth sense for a miss vs a hit, there may not be any efficiency improvement. It's a really hard problem with LLM based tools.

There is so much hype right now and people showing cherry picked examples.

replies(7): >>42140844 #>>42140963 #>>42141787 #>>42143330 #>>42144363 #>>42144477 #>>42148338 #

jihadjihad ◴[14 Nov 24 20:29 UTC] No.42140844[source]▶

>>42140726 #

> Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.

This has been my team's experience (and frustration) as well, and has led us to look at using LLMs for classifying / structuring, but not entrusting an LLM with making a decision based on things like a database schema or business logic.

I think the technology and tooling will get there, but the enormous amount of effort spent trying to get the system to "do the right thing" and the nondeterministic nature have really put us into a camp of "let's only allow the LLM to do things we know it is rock-solid at."
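
One way to read "classify / structure only" is sketched below, assuming the LLM is wrapped in a plain callable. The label set, the `ask_llm` parameter, and the `needs_human_review` fallback are hypothetical names for illustration, not anything from a specific library. The point is that the model's output is checked against a closed set, and anything else is routed away from automated decisions.

```python
# Hypothetical closed label set; a real deployment would define its own.
ALLOWED_LABELS = {"billing", "bug_report", "feature_request"}


def classify(ticket_text, ask_llm):
    """Ask the model for one label and accept it only if it is in the closed set.

    ask_llm: any callable taking a prompt string and returning a string
    (an assumption; swap in whatever client you actually use).
    """
    prompt = (
        "Classify the ticket into exactly one of "
        f"{sorted(ALLOWED_LABELS)}. Reply with the label only.\n\n{ticket_text}"
    )
    label = ask_llm(prompt).strip().lower()
    if label in ALLOWED_LABELS:
        return label
    # Anything outside the closed set is treated as a miss, not a decision.
    return "needs_human_review"


if __name__ == "__main__":
    fake_llm = lambda prompt: " Billing\n"  # stand-in for a real model call
    print(classify("I was charged twice this month.", fake_llm))
```

Business logic then branches on the validated label in ordinary code, so the nondeterministic part stays confined to a step whose output can be checked.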

replies(2): >>42141270 #>>42141797 #

sdesol ◴[14 Nov 24 21:11 UTC] No.42141270[source]▶

>>42140844 #

> "let's only allow the LLM to do things we know it is rock-solid at."

Even this is insanely hard in my opinion. The one thing that you would assume an LLM would excel at is spelling and grammar checking for the English language, but even the top model (GPT-4o) can be insanely stupid/unpredictable at times. Take the following example from my tool:

https://app.gitsense.com/?doc=6c9bada92&model=GPT-4o&samples...

5 models are asked if the sentence is correct and GPT-4o got it wrong all 5 times. It keeps complaining that GitHub is spelled like Github, when it isn't. Note, only 2 weeks ago, Claude 3.5 Sonnet did the same thing.

I do believe LLM is a game changer, but I'm not convinced it is designed to be public-facing. I see LLM as a power tool for domain experts, and you have to assume whatever it spits out may be wrong, and your process should allow for it.

Edit:

I should add that I'm convinced that not one single model will rule them all. I believe there will be 4 or 5 models that everybody will use and each will be used to challenge one another for accuracy and confidence.
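
A rough sketch of models challenging one another, assuming each model's answer has already been collected into a dict. The function name `cross_check`, the model names, and the 0.6 agreement threshold are arbitrary choices for illustration.

```python
from collections import Counter


def cross_check(answers, min_agreement=0.6):
    """Return (consensus_answer_or_None, agreement_ratio) across models.

    answers: dict mapping a model name to its answer string (assumed shape).
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize lightly so "Correct" and "correct" count as the same vote.
    counts = Counter(a.strip().lower() for a in answers.values())
    top, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    # Below the threshold, report no consensus rather than guessing.
    return (top if agreement >= min_agreement else None), agreement


if __name__ == "__main__":
    sample = {"model_a": "Correct", "model_b": "correct", "model_c": "incorrect"}
    print(cross_check(sample))
```

Agreement is only a proxy for accuracy: models can share the same blind spot, so a low-agreement result is more informative than a high-agreement one.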

replies(7): >>42141815 #>>42141930 #>>42142235 #>>42142767 #>>42142842 #>>42144019 #>>42145544 #

1. kristianp ◴[15 Nov 24 00:23 UTC] No.42142767[source]▶

>>42141270 #

> It keeps complaining that GitHub is spelled like Github, when it isn't

I feel like this is unfair. That's the only thing it got wrong? But we want it to pass all of our evals, even ones that perhaps a dictionary would be better at solving? Or even an LLM augmented with a dictionary.

replies(2): >>42143251 #>>42143364 #

2. MBCook ◴[15 Nov 24 01:52 UTC] No.42143251[source]▶

>>42142767 (TP) #

Does it matter?

As a user I want it to be right, even if that contradicts the normal rules of the language.

3. sdesol ◴[15 Nov 24 02:13 UTC] No.42143364[source]▶

>>42142767 (TP) #

My reason for commenting wasn't to say LLM sucks, but rather we need to get over the honeymoon phase. The fact that GPT-4o (one of the most advanced, if not the most advanced when it comes to non-programming tasks) hallucinated "Github" as the input should give us pause.

LLM has its place and it will forever change how we think about UX and other things, but we need to realize you really can't create a public-facing solution without significant safeguards, if you don't want egg on your face.
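
As one small example of such a safeguard, here is a sketch that drops any spelling complaint about a word that never appears in the input, which is the failure described with "Github". The function name `split_grounded` and the flagged-word list shape are assumptions for illustration.

```python
import re


def split_grounded(input_text, flagged_words):
    """Separate flagged words into those present in the input and those not.

    The match is case-sensitive and on whole words, so "Github" is not
    considered present in a sentence that only contains "GitHub".
    """
    present = set(re.findall(r"\w+", input_text))
    grounded = [w for w in flagged_words if w in present]
    hallucinated = [w for w in flagged_words if w not in present]
    return grounded, hallucinated


if __name__ == "__main__":
    sentence = "Push the brnach to GitHub."
    # Pretend the model flagged these two words as misspelled.
    print(split_grounded(sentence, ["brnach", "Github"]))
```

This only catches complaints about text that is not there; it says nothing about whether a grounded complaint is actually right.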

replies(1): >>42145712 #

4. netdevnet ◴[15 Nov 24 10:52 UTC] No.42145712[source]▶

>>42143364 #

I believe the honeymoon phase has long been over, even in the mainstream. Last year was the AI year. 2024 has seen nothing substantially good, and the only noteworthy thing is this article finally bringing into the public consciousness that we are past the AI peak and beyond the plateau, and that the freefall has already begun.

LLM investors will be reviewing their portfolios and will likely begin declining further investments without clear evidence of profits in the very near future. On the other side, LLM companies will likely try to downplay this and again promise the Moon.

And on and on the market goes
