OpenAI, Google and Anthropic are struggling to build more advanced AI

(www.bloomberg.com)

625 points lukebennett | 1 comments | 13 Nov 24 13:28 UTC


LASR ◴[14 Nov 24 19:19 UTC] No.42140045[source]▶

Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?

I lead a team exploring cutting edge LLM applications and end-user features. It's my intuition from experience that we have a LONG way to go.

GPT-4o / Claude 3.5 are the go-to models for my team. Every combination of technical investment + LLMs yields a new list of potential applications.

For example, combining a human-moderated knowledge graph with an LLM and RAG lets you build "expert bots" that understand your business context / your codebase / your specific processes and act almost like a human coworker on your team.

If you now give it some predictive / simulation capability - eg: simulate the execution of a task or project like creating a GitHub PR code change, and test it against an expert bot like the one above for code review - you can have LLMs create reasonable code changes, with automatic review / iteration etc.
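A minimal sketch of that propose / review / iterate loop, to make the idea concrete. Everything here is an assumption for illustration: `iterate_until_approved`, `toy_propose` and `toy_review` are hypothetical names, and the two toy functions are plain stand-ins for what would really be LLM calls (a code-writing model and an "expert bot" reviewer).

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: a proposer takes (task, previous feedback) and
# returns a change; a reviewer takes a change and returns (approved, feedback).
Proposer = Callable[[str, Optional[str]], str]
Reviewer = Callable[[str], Tuple[bool, str]]


def iterate_until_approved(
    task: str,
    propose: Proposer,
    review: Reviewer,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Loop propose -> review until approval or until max_rounds is used up.

    Returns (approved change or None, number of rounds run).
    """
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        change = propose(task, feedback)
        approved, feedback = review(change)
        if approved:
            return change, round_number
    return None, max_rounds


def toy_propose(task: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM that writes a code change and reacts to feedback.
    change = f"patch for: {task}"
    if feedback and "tests" in feedback:
        change += " (with tests)"
    return change


def toy_review(change: str) -> Tuple[bool, str]:
    # Stand-in for an "expert bot" reviewer that insists on tests.
    if "with tests" in change:
        return True, "looks good"
    return False, "please add tests"


if __name__ == "__main__":
    print(iterate_until_approved("fix login bug", toy_propose, toy_review))
```

The `max_rounds` cap is the important design choice: without it, a proposer and reviewer that never converge would loop forever, and a `None` result gives you a clean point to hand the task to a human.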

Similarly there are many more capabilities that you can ladder on and expose into LLMs to give you increasingly productive outputs from them.

Chasing after model improvements and "GPT-5 will be PhD-level" is moot imo. When did you last hire a PhD coworker who was productive on day 0? You need to onboard them with human expertise, and then give them execution space / long-term memories etc to be productive.

Model vendors might struggle to build something more intelligent. But my point is that we already have so much intelligence and we don't know what to do with that. There is a LOT you can do with high-schooler level intelligence at super-human scale.

Take a naive example. 200k context windows are now available. Most people, through ChatGPT, type out maybe 1500 tokens. That's a huge amount of untapped capacity. No human is going to type out 200k of context. Hence why we need RAG, and additional forms of input (eg: simulation outcomes) to fully leverage that.
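To put numbers on that: a rough sketch of the arithmetic, plus one naive way to spend the remaining budget on retrieved material. The 200k and 1500 figures come from the paragraph above; the function names, the output reserve, and the chunk sizes are made-up assumptions, and real token counts depend on the model's tokenizer.

```python
from typing import List, Tuple

CONTEXT_WINDOW_TOKENS = 200_000  # figure quoted above
TYPED_PROMPT_TOKENS = 1_500      # figure quoted above


def context_utilization(used_tokens: int, window_tokens: int = CONTEXT_WINDOW_TOKENS) -> float:
    """Fraction of the context window actually used."""
    return used_tokens / window_tokens


def fill_context(
    prompt_tokens: int,
    candidate_chunks: List[Tuple[str, int]],
    window_tokens: int = CONTEXT_WINDOW_TOKENS,
    reserve_for_output: int = 4_000,  # assumed headroom for the model's reply
) -> Tuple[List[str], int]:
    """Greedily add chunks (assumed pre-sorted by relevance) that still fit.

    Returns (names of chosen chunks, tokens left over).
    """
    budget = window_tokens - prompt_tokens - reserve_for_output
    chosen: List[str] = []
    for name, size in candidate_chunks:
        if size <= budget:
            chosen.append(name)
            budget -= size
    return chosen, budget


if __name__ == "__main__":
    # A typed prompt alone uses under 1% of the window.
    print(f"{context_utilization(TYPED_PROMPT_TOKENS):.2%}")
    # Hypothetical retrieved material: (name, token count).
    chunks = [("codebase_docs", 100_000), ("review_history", 90_000), ("wiki_dump", 50_000)]
    print(fill_context(TYPED_PROMPT_TOKENS, chunks))
```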


afro88 ◴[14 Nov 24 20:17 UTC] No.42140726[source]▶

>>42140045 #

> potential applications > if you ... > for example ...

Yes there seems to be lots of potential. Yes we can brainstorm things that should work. Yes there are a lot of examples of incredible things in isolation. But it's a little bit like those youtube videos showing amazing basketball shots in one try, when in reality lots of failed attempts happened beforehand. Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.

Show me the things you / your team have actually built that have decent retention and metrics concretely proving efficiency improvements.

LLMs are so hit and miss from query to query that if your users don't have a sixth sense for a miss vs a hit, there may not be any efficiency improvement. It's a really hard problem with LLM-based tools.

There is so much hype right now and people showing cherry picked examples.


fnordpiglet ◴[15 Nov 24 02:07 UTC] No.42143330[source]▶

>>42140726 #

We have built quite a few highly useful LLM applications in my org that have reduced cost and improved outcomes in several domains - fraud detection, credit analysis, customer support, and a variety of other spaces. By and large they operate as cognitive load reducers, but they also handle the vast majority of the work through automation, since in our uses false negatives are not as bad as false positives and the vast majority of things we analyze (99.999%+) are not true positives. As such the LLMs do a great job at anomaly detection and let us do tasks that would be prohibitively expensive with humans, whose false positive and false negative rates are considerably higher than the LLMs'.

I see these statements often here about “I’ve never seen an effective commercial use of LLMs,” which tells me you aren’t working with very creative and competent people in areas that are amenable to LLMs. In my professional network beyond where I work now I know at least a dozen people who have successful commercial applications of LLMs. They tend to be highly capable people able to build the end-to-end tool chains necessary (which is a huge gap) and understand how to compose LLMs in hierarchical agents with effective guard rails. Most ineffectual users of LLMs want them to be lazy buttons that obviate the need to think. They’re not - like any sufficiently powerful tool they require thought up front and are easy to use wrong. This will get better with time as patterns and tools emerge to get the most use out of them in a commercial setting. However the ability to process natural language and use an emergent (if not actual) abductive reasoning is absurdly powerful and was not practically possible 4 years ago - the assertion that such an amazing capability in an information or decisioning system is not commercially practical is on its face absurd.
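For readers wondering what a "guard rail" can look like in its simplest form, here is a minimal sketch: validate the model's raw output before acting on it, and route anything that fails to a human. This is an illustration under assumptions, not a description of what the commenter's org runs. The label set, the JSON shape, the confidence threshold and the function names are all hypothetical, and no particular guardrail product's API is used.

```python
import json
from typing import Optional

ALLOWED_LABELS = {"normal", "suspicious"}  # hypothetical label set


def check_model_output(raw_output: str, min_confidence: float = 0.8) -> Optional[dict]:
    """Return the parsed verdict if it passes every check, otherwise None."""
    try:
        verdict = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(verdict, dict):
        return None
    label = verdict.get("label")
    confidence = verdict.get("confidence")
    # Reject labels outside the expected vocabulary.
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # Reject missing, non-numeric, out-of-range or low confidence values.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0 or confidence < min_confidence:
        return None
    return {"label": label, "confidence": float(confidence)}


def route(raw_output: str) -> str:
    """Automate only when the output passes the checks; otherwise escalate."""
    return "automate" if check_model_output(raw_output) is not None else "human_review"


if __name__ == "__main__":
    print(route('{"label": "normal", "confidence": 0.93}'))
    print(route("Sure! Here is my analysis..."))
```

The point of the design is that a failed check never raises or guesses: the default path for anything unexpected is a human, which is what keeps the hit-and-miss outputs away from end users.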


topicseed ◴[15 Nov 24 02:27 UTC] No.42143440[source]▶

>>42143330 #

Do they build guardrails themselves or do they use an LLM guardrail API like Modelmetry or Langwatch?
