625 points by lukebennett | 31 comments
LASR:
Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?

I lead a team exploring cutting edge LLM applications and end-user features. It's my intuition from experience that we have a LONG way to go.

GPT-4o / Claude 3.5 are the go-to models for my team. Every combination of technical investment + LLMs yields a new list of potential applications.

For example, combining a human-moderated knowledge graph with an LLM via RAG allows you to build "expert bots" that understand your business context / your codebase / your specific processes and act almost like a human coworker on your team.

If you now give it some predictive / simulation capability (e.g. simulating the execution of a task or project, like creating a GitHub PR code change, and testing it against an expert bot like the one above for code review), you can have LLMs create reasonable code changes, with automatic review / iteration etc.
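
Roughly, that loop might look like the sketch below. This is just an illustration: the `llm` callable and the APPROVED convention are placeholders of mine, not any particular vendor's API.

    from typing import Callable

    def iterate_change(task: str, context: str, llm: Callable[[str], str],
                       max_rounds: int = 3) -> str:
        """Draft a code change, then loop it through an 'expert bot' reviewer."""
        draft = llm(f"Context:\n{context}\n\nTask: {task}\n"
                    "Propose a code change as a diff.")
        for _ in range(max_rounds):
            review = llm(f"Context:\n{context}\n\nReview this change for correctness "
                         f"and style. Reply APPROVED if it is ready to merge.\n\n{draft}")
            if "APPROVED" in review:
                break
            draft = llm(f"Revise the change to address this review.\n\n"
                        f"Review:\n{review}\n\nChange:\n{draft}")
        return draft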

Similarly, there are many more capabilities that you can layer on and expose to LLMs to get increasingly productive outputs from them.

Chasing after model improvements and "GPT-5 will be PhD-level" is moot imo. When did you ever hire a PhD coworker who was productive on day 0? You need to onboard them with human expertise, and then give them execution space / long-term memories etc. to be productive.

Model vendors might struggle to build something more intelligent. But my point is that we already have so much intelligence and we don't know what to do with it. There is a LOT you can do with high-schooler-level intelligence at super-human scale.

Take a naive example. 200k-token context windows are now available. Most people, through ChatGPT, type out maybe 1,500 tokens. That's a huge amount of untapped capacity. No human is going to type out 200k tokens of context. Hence we need RAG, and additional forms of input (e.g. simulation outcomes), to fully leverage that.
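
Mechanically, filling that window is something like the greedy packer below. A sketch only: `score` stands in for whatever relevance function you use (BM25, embeddings, ...), and the 4-characters-per-token heuristic is a rough assumption.

    from typing import Callable

    def pack_context(query: str, docs: list[str],
                     score: Callable[[str, str], float],
                     budget_tokens: int = 200_000) -> str:
        """Greedily fill a large context window with the most relevant documents."""
        def approx_tokens(text: str) -> int:
            return len(text) // 4          # crude chars-per-token heuristic
        used = approx_tokens(query)
        picked = []
        for doc in sorted(docs, key=lambda d: score(query, d), reverse=True):
            if used + approx_tokens(doc) > budget_tokens:
                continue
            picked.append(doc)
            used += approx_tokens(doc)
        return "\n\n---\n\n".join(picked + [query])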

1. crystal_revenge:
I don't think we've even started to get the most value out of current-gen LLMs. For starters, very few people are even looking at sampling, which is a major part of model performance.

The theory behind these models so aggressively lags the engineering that I suspect there are many major improvements to be found just by understanding a bit more about what these models are really doing and redesigning them based on that.

I highly encourage anyone seriously interested in LLMs to start spending more time in the open-model space, where you can really take a look inside and play around with the internals. Even if you don't have the resources for model training, I feel it's worth personally understanding sampling and other potential tweaks to the model (there's lots of neat work on uncertainty estimation, manipulating the initial embeddings the prompts are assigned, intelligent backtracking, etc.).
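
Getting hands-on with sampling takes surprisingly little code. A minimal sketch with an open model via Hugging Face transformers (the model name is just an example; any small open causal LM works):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-0.5B"  # example; swap in any small open model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # distribution over the next token

    # Hand-rolled top-k + temperature sampling: the "internals" worth playing with.
    temperature, k = 0.7, 50
    probs = torch.softmax(logits / temperature, dim=-1)
    top = torch.topk(probs, k)
    choice = torch.multinomial(top.values / top.values.sum(), num_samples=1)
    print(tok.decode(top.indices[choice]))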

And from a practical side, I've started to realize that many people have been holding off on building things waiting for "that next big update", but there are so many small, annoying tasks that can be easily automated.

2. dr_dshiv:
> I've started to realize that many people have been holding off on building things waiting for "that next big update"

I’ve noticed this too; I’ve been calling it intellectual deflation. By analogy with monetary deflation: why buy now, when it may be cheaper in a month? Why do the work now, when it will be easier in a month?

3. vbezhenar:
Why optimise software today, when tomorrow Intel will release a CPU with 2x the performance?
4. sdenton4:
Curiously, Moore's law was predictable enough over decades that you could actually plan for the speed of next year's hardware quite reliably.

For LLMs, we don't even know how to reliably measure performance, much less plan for expected improvements.

5. throwing_away:
Call Nvidia, that sounds like a job for AI.
6. mikeyouse:
Moore's law became less of a prediction and more of a product roadmap as time went on. It helped coordinate investment and expectations across the entire industry, so everyone involved had the same understanding of timelines and benchmarks. I fully believe more investment would’ve ‘bent the curve’ of the trend line, but everyone was making money and there wasn’t a clear benefit to pushing the edge further.
7. ben_w:
Back when Intel regularly gave updates with 2x performance increases, people did make decisions based on the performance doubling schedule.
8. epicureanideal:
Or maybe it pushed everyone to innovate faster than they otherwise would’ve? I’m very interested to hear your reasoning for the other case though, and I am not strongly committed to the opposite view, or either view for that matter.
9. ppeetteerr:
The reason people are holding out is that the current generation of models is still pretty poor in many areas. You can have one draft an email, or review your email, but I wouldn't trust an LLM with anything mission-critical. The accuracy of the generated output is too low to be trusted in most practical applications.
10. jkaptur:
https://en.wikipedia.org/wiki/Osborne_effect
11. deegles:
My big question is what is being done about hallucination? Without a solution it's a giant footgun.
12. creativenolo:
Great & motivational comment. Any pointers on where to start playing with the internals and sampling?

Doesn’t need to be comprehensive, I just don’t know where to jump off from.

13. creativenolo:
> holding off on building things waiting for "that next big update", but there are so many small, annoying tasks that can be easily automated.

Also, we only hear about / see the examples that are meant to scale. Startups typically offer up something transformative, ready to soak up a segment of a market. That's hard with the current state of LLMs, and when you try their offerings, it's underwhelming. But there is richer, more nuanced, harder-to-reach fruit that is extremely interesting; it's just not clear where it would scale in and of itself.

14. kozikow:
> "The theory behind these models so aggressively lags the engineering"

The problem is that 99% of theories are hard to scale.

I am not an expert, as I work adjacent to this field, but I see the inverse: theory being dumbed down to increase parallelism/scalability.

15. saalweachter:
Any email you trust an LLM to write is one you probably don't need to send.
16. dheera:
Exactly. I think the current crop of models is capable of solving a lot of non-first-world problems. Many of them don't need full AGI to solve, especially if we start thinking outside Silicon Valley.
17. Tagbert:
Glib, but the reality is that there are lots of cases where you can use an AI in writing without entrusting it with the whole job blindly.

I mostly use AIs in writing as a glorified grammar checker that sometimes suggests alternate phrasing. I do the initial writing and send it to an AI for review. If I like the suggestions, I may incorporate some; others I ignore.

The only time I use it to write is when I have something like a status report and I'm having a hard time phrasing things. Then I may write a series of bullet points and send them through an AI to flesh out. Again, that's just the first stage; I take the result and edit it to get what I want.

It’s just a tool, not a creator.

18. dr_kiszonka:
Would you have any suggestions on how to play with the internals of these open models? I don't understand LLMs well and would love to spend some time experimenting, but I don't know where to start. Are any projects more appropriate for neophytes?
19. MBCook:
CAN anything be done? At a very low level they’re basically designed to hallucinate text until it looks like something you’re asking for.

It works disturbingly well. But because it doesn't have any actual intrinsic knowledge, it has no way of knowing when it made a "good" hallucination versus a "bad" one.

I’m sure people are working on piling things on top to try to influence what gets generated, or to catch and move away from errors that other layers spot... but how much effort and how many resources will be needed to make it "good enough" that people stop worrying about this?

In my mind the core problem is that people are trying to use these for things they're unsuitable for. Asking fact-based questions is asking for trouble. There isn't much of a wrong answer if you want to generate a bedtime story, or a bunch of test data that looks sort of like an example you give it.

If you ask it to find law cases on a specific point, you're going to raise a judge's ire, as many have already found.

20. jeswin:
Google wasn't absolutely accurate either (it still isn't). That didn't stop it from becoming worth many billions.

> You can have one draft an email, or review your email, but I wouldn't trust an LLM with anything mission-critical

My point is that an entire world lies between these two extremes.

21. wruza:
Afaiu, “sampling” here is controlled with (not only?) the top-k and temperature parameters in e.g. “text generation web ui”. You can probably find these in other frontends too.

This ofc implies local models, and that you have a decent CPU plus at least 64 GB of RAM to run models above 7B.

https://github.com/oobabooga/text-generation-webui

https://huggingface.co/models?pipeline_tag=text-generation&s...
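
For reference, the same knobs those front-ends expose are available directly in the underlying library. A small sketch with transformers (the model name is only an example):

    from transformers import pipeline

    # The same sampling knobs the web UIs expose (model name is just an example).
    gen = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B")
    out = gen("Once upon a time", do_sample=True, top_k=40, temperature=0.8,
              max_new_tokens=50)
    print(out[0]["generated_text"])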

22. osigurdson:
>> have something like a status report and I’m having a hard time phrasing things

I believe the comment above suggested that this type of email likely doesn't need to be sent. Is anyone really reading the status report? If they read it, what concrete decisions do they make based on it? We all get into this trap of doing what people ask of us, but it often isn't what shareholders and customers really care about.

23. fooker:
If Intel could do that, they would be the one with a 3 trillion market cap. Not Nvidia.
24. DiscourseFan:
I would say that anything you write can come back to you in the future, so don’t blindly sign your name on anything you didn’t review yourself.
25. netdevnet:
Why don't you give actual, concrete, testable examples, backed with evidence, where this is the case? Put some skin in the game.
26. netdevnet:
What do you want done about it? Hallucination is an intrinsic part of how LLMs work. What makes something a hallucination is the inconsistency between the hallucinated concept and reality, and reality is not part of how LLMs work. They do amazing things, but at the end of the day they are elaborate statistical machines.

Look behind the veil and see LLMs for what they really are, and you will maximise their utility, temper your expectations, and save yourself disappointment.

27. jacobr1:
A support ticket is a good middle ground; this is probably the area of most robust enterprise deployment: synthesizing knowledge to produce a draft reply, with some logic either to send it automatically or to have a human review it. There are both shitty and OK systems that save real money through case deflection and even improve satisfaction rates. Partly this works because human responses can also suck, so you're raising a low bar. But it is a real use case with real money and reputation on the line.
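
Schematically, such a system might look like the sketch below. The `kb_search` and `llm` callables and the self-reported confidence line are placeholder assumptions of mine; real deployments use better-calibrated signals.

    from typing import Callable

    def draft_reply(ticket: str, kb_search: Callable[[str], list[str]],
                    llm: Callable[[str], str], auto_send_threshold: float = 0.9):
        """Synthesize a draft support reply; route low-confidence drafts to a human."""
        articles = kb_search(ticket)[:5]
        reply = llm("Answer the customer using ONLY these knowledge-base articles:\n"
                    + "\n---\n".join(articles)
                    + f"\n\nTicket:\n{ticket}\n\nEnd with a line 'CONFIDENCE: <0-1>'.")
        if "CONFIDENCE:" in reply:
            body, _, conf = reply.rpartition("CONFIDENCE:")
            try:
                confidence = float(conf.strip())
            except ValueError:
                confidence = 0.0             # unparseable -> force human review
        else:
            body, confidence = reply, 0.0    # no confidence line -> human review
        action = "auto_send" if confidence >= auto_send_threshold else "human_review"
        return body.strip(), action
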
28. jacobr1:
Semantic search without LLMs is already making a dent. It still gives traditional results that need to be processed by a human, but you can get "better" search results.

And alongside that, there is a body of work on "groundedness" that basically post-processes output to compare it against its source material. It can still make logic errors and has a base error rate itself, but it can at least ensure you have clear citations for factual claims that match real documents. It doesn't fully ensure they are referenced correctly (though that is already the case even with real papers produced by humans).
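
A toy version of that post-processing step is below; crude string similarity stands in where real groundedness checkers use embedding or entailment (NLI) models.

    import difflib

    def ungrounded_claims(claims: list[str], sources: list[str],
                          threshold: float = 0.6) -> list[tuple[str, float]]:
        """Flag output sentences that don't closely match any source snippet.
        `claims` are sentences from the model output; `sources` are
        sentence-sized snippets from the retrieved documents."""
        flagged = []
        for claim in claims:
            best = max(difflib.SequenceMatcher(None, claim.lower(),
                                               s.lower()).ratio()
                       for s in sources)
            if best < threshold:
                flagged.append((claim, best))   # likely unsupported claim
        return flagged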

Also, consider that the baseline isn't perfection; it's a benchmark against real humans. Accuracy is getting much better in certain domains where we have good corpora. Part of assessing the accuracy of a system is going to be determining whether the generated content is "in distribution" for its training data. There is progress being made in this direction, so we could perhaps do a better job at the application level of making use of a "confidence" score of some kind, maybe even taking it into account in a chain-of-thought-like reasoning step.

People keep finding "obviously wrong" hallucinations that seem like proof things are still crap. But these systems keep getting better on benchmarks of retrieval accuracy, and the benchmarks keep getting better as people point out deficiencies in them. Perfection might not be possible, but consistently better than the average human seems within reach, and better than that seems feasible too. The challenge is that the class of mistakes might look different even if the overall error rate is lower.

29. ppeetteerr:
The keyword is "draft". You still need a person to review the response with knowledge of the context of the issue. It's the same as my email example.
30. ppeetteerr:
Google became a billion-dollar company by creating the best search and indexing service at the time and putting ads around the results (that, and YouTube). They didn't own the answer to the question.
31. Tagbert:
Considering that I do get questions and comments about the projects: yes, people are reading them.