625 points lukebennett | 52 comments
LASR ◴[] No.42140045[source]
Question for the group here: do we honestly feel like we've exhausted the options for delivering value on top of the current generation of LLMs?

I lead a team exploring cutting edge LLM applications and end-user features. It's my intuition from experience that we have a LONG way to go.

GPT-4o / Claude 3.5 are the go-to models for my team. Every combination of technical investment + LLMs yields a new list of potential applications.

For example, combining a human-moderated knowledge graph with an LLM with RAG allows you to build "expert bots" that understand your business context / your codebase / your specific processes and act almost like a human coworker on your team.

If you now give it some predictive / simulation capability - eg: simulate the execution of a task or project like creating a github PR code change, and test against an expert bot above for code review, you can have LLMs create reasonable code changes, with automatic review / iteration etc.

Similarly there are many more capabilities that you can ladder on and expose into LLMs to give you increasingly productive outputs from them.

Chasing after model improvements and "GPT-5 will be PhD-level" is moot imo. When did you last hire a PhD coworker who was productive on day 0? You need to onboard them with human expertise, and then give them execution space / long-term memories etc to be productive.

Model vendors might struggle to build something more intelligent. But my point is that we already have so much intelligence and we don't know what to do with that. There is a LOT you can do with high-schooler level intelligence at super-human scale.

Take a naive example. 200k context windows are now available. Most people, through ChatGPT, type out maybe 1500 tokens. That's a huge amount of untapped capacity. No human is going to type out 200k of context. Hence why we need RAG, and additional forms of input (eg: simulation outcomes) to fully leverage that.
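
To put rough numbers on that gap, here is a minimal sketch. The 1,500 and 200,000 figures come from the paragraph above; `pack_chunks` and its whitespace-based token count are illustrative assumptions, not a real tokenizer or any particular RAG library.

```python
def context_utilization(prompt_tokens: int, context_window: int) -> float:
    """Return the fraction of the context window the prompt occupies."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return prompt_tokens / context_window


def pack_chunks(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily add retrieved chunks, in order, until the budget is hit.

    count_tokens is a crude whitespace approximation (an assumption);
    a real system would use the model's own tokenizer.
    """
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit
        packed.append(chunk)
        used += cost
    return packed, used


typed, window = 1_500, 200_000
print(f"{context_utilization(typed, window):.2%} of the window used")
print(f"{window - typed:,} tokens left for retrieved context")
```

The greedy packer assumes the chunks arrive already ranked by relevance; how to rank them is the hard part and is left out here.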

replies(43): >>42140086 #>>42140126 #>>42140135 #>>42140347 #>>42140349 #>>42140358 #>>42140383 #>>42140604 #>>42140661 #>>42140669 #>>42140679 #>>42140726 #>>42140747 #>>42140790 #>>42140827 #>>42140886 #>>42140907 #>>42140918 #>>42140936 #>>42140970 #>>42141020 #>>42141275 #>>42141399 #>>42141651 #>>42141796 #>>42142581 #>>42142765 #>>42142919 #>>42142944 #>>42143001 #>>42143008 #>>42143033 #>>42143212 #>>42143286 #>>42143483 #>>42143700 #>>42144031 #>>42144404 #>>42144433 #>>42144682 #>>42145093 #>>42145589 #>>42146002 #
1. afro88 ◴[] No.42140726[source]
> potential applications > if you ... > for example ...

Yes there seems to be lots of potential. Yes we can brainstorm things that should work. Yes there are a lot of examples of incredible things in isolation. But it's a little bit like those youtube videos showing amazing basketball shots in 1 try, when in reality lots of failed attempts happened beforehand. Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.

Show me the things you / your team have actually built that have decent retention and metrics concretely proving efficiency improvements.

LLMs are so hit and miss from query to query that if your users don't have a sixth sense for a miss vs a hit, there may not be any efficiency improvement. It's a really hard problem with LLM based tools.

There is so much hype right now and people showing cherry picked examples.

replies(7): >>42140844 #>>42140963 #>>42141787 #>>42143330 #>>42144363 #>>42144477 #>>42148338 #
2. jihadjihad ◴[] No.42140844[source]
> Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG) and it's incredibly hard to hide those from them.

This has been my team's experience (and frustration) as well, and has led us to look at using LLMs for classifying / structuring, but not entrusting an LLM with making a decision based on things like a database schema or business logic.

I think the technology and tooling will get there, but the enormous amount of effort spent trying to get the system to "do the right thing" and the nondeterministic nature have really put us into a camp of "let's only allow the LLM to do things we know it is rock-solid at."
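
One way to read "classify, but don't decide" is to accept the model's output only when it matches a closed set of labels and send everything else to a person. A minimal sketch, assuming a hypothetical label set and a raw string already returned by some model (no real API is called here):

```python
# Hypothetical label set; a real one would come from the application.
ALLOWED_LABELS = {"billing", "bug_report", "feature_request"}


def parse_label(raw_output: str, allowed=ALLOWED_LABELS):
    """Normalize the model's reply and accept it only if it is a known label."""
    label = raw_output.strip().lower().rstrip(".")
    return label if label in allowed else None


def route(raw_output: str):
    """Return ("auto", label) for a valid label, else hand off to a human."""
    label = parse_label(raw_output)
    if label is None:
        return ("human_review", raw_output)
    return ("auto", label)
```

This only guarantees the output is well-formed, not that the label is right; accuracy still has to be measured separately.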

replies(2): >>42141270 #>>42141797 #
3. VeejayRampay ◴[] No.42140963[source]
really agree with this and I think it's been the general experience: people wanting LLMs to be so great (or making money off them) kind of cherry picking examples that fit their narrative, which LLMs are good at because they produce amazing results some of the time like the deluxe broken clock that they are (they're right many many times a day)

at the end of the day though, it's not exactly reliable or particularly transformative when you get past the party tricks

4. sdesol ◴[] No.42141270[source]
> "let's only allow the LLM to do things we know it is rock-solid at."

Even this is insanely hard in my opinion. The one thing that you would assume LLM to excel at is spelling and grammar checking for the English language, but even the top model (GPT-4o) can be insanely stupid/unpredictable at times. Take the following example from my tool:

https://app.gitsense.com/?doc=6c9bada92&model=GPT-4o&samples...

5 models are asked if the sentence is correct and GPT-4o got it wrong all 5 times. It keeps complaining that GitHub is spelled like Github, when it isn't. Note, only 2 weeks ago, Claude 3.5 Sonnet did the same thing.
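
A complaint like this can at least be checked mechanically before it reaches a user. A minimal sketch, assuming the model's reply has already been parsed into the span it claims is wrong (the function name and the sample sentence are hypothetical, not taken from the tool linked above):

```python
def complaint_is_grounded(source_text: str, flagged_span: str) -> bool:
    """True only if the span the model complains about occurs in the input.

    The comparison is case-sensitive on purpose: "Github" and "GitHub"
    must be treated as different strings.
    """
    return flagged_span in source_text


sentence = "Our code is hosted on GitHub."
if not complaint_is_grounded(sentence, "Github"):
    print("Model flagged text that is not in the input; discard the complaint.")
```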

I do believe LLM is a game changer, but I'm not convinced it is designed to be public-facing. I see LLM as a power tool for domain experts, and you have to assume whatever it spits out may be wrong, and your process should allow for it.

Edit:

I should add that I'm convinced that not one single model will rule them all. I believe there will be 4 or 5 models that everybody will use and each will be used to challenge one another for accuracy and confidence.
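
A rough sketch of what "challenging one another" could look like: collect one answer per model and accept it only when enough of them agree. The threshold and the answer strings are assumptions for illustration; agreement is a proxy for confidence, not proof of correctness.

```python
from collections import Counter


def consensus(answers, min_agreement=0.6):
    """Return (answer, ratio) if the top answer reaches min_agreement,
    otherwise (None, ratio) so the caller can escalate to a human."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    top, n = counts.most_common(1)[0]
    ratio = n / len(answers)
    return (top, ratio) if ratio >= min_agreement else (None, ratio)
```

Note that this helps only when the models fail independently; the 5-out-of-5 miss described above would pass a vote taken over samples from a single model.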

replies(7): >>42141815 #>>42141930 #>>42142235 #>>42142767 #>>42142842 #>>42144019 #>>42145544 #
5. archiepeach ◴[] No.42141787[source]
To be fair in the human-based teams I've worked with in startups I couldn't show you products with decent retention.
6. ◴[] No.42141797[source]
7. SimianSci ◴[] No.42141815{3}[source]
> "I see LLM as a power tool for domain experts, and you have to assume whatever it spits out may be wrong, and your process should allow for it."

this gets to the heart of it for me. I think LLMs are an incredible tool, providing advanced augmentation on our already developed search capabilities. What advanced user doesn't want to have a colleague they can talk about their specific domain capacity with?

The problem comes from the hyperscaling ambitions of the players who were the first in this space. They quickly hyped up the technology beyond what it should have been.

replies(1): >>42145693 #
8. larodi ◴[] No.42141930{3}[source]
Those Apple engineers stated in a very clear tone:

- every time a different result is produced.

- no reasoning capabilities were categorically determined.

So this is it. If you want LLM - brace for different results and if this is okay for your application (say it’s about speech or non-critical commands) then off you go.

Otherwise simply forget this approach, particularly when you need reproducible discrete results.

I don’t think it gets any better than that, and nothing so far has indicated it will (with this particular approach to AGI or whatever the wet dream is)

replies(4): >>42141956 #>>42142010 #>>42142797 #>>42144428 #
9. marcellus23 ◴[] No.42141956{4}[source]
> Those Apple engineers

Which Apple engineers? Yours is the only reference to the company in this comment section or in the article.

replies(2): >>42142644 #>>42146113 #
10. verteu ◴[] No.42142010{4}[source]
(for reference: https://arxiv.org/pdf/2410.05229 )
11. malfist ◴[] No.42142235{3}[source]
I was using an LLM to help spot passive voice in my documents and it told me "We're making" was passive and I should change it to "we are making" to make it active.

Leaving aside that "we're" and "we are" are the same, it is absolutely active voice.

replies(1): >>42142538 #
12. sdesol ◴[] No.42142538{4}[source]
In the process of developing my tool, there are only 5 models (the first 5 in my models dropdown list) that I would use as a writing aid. If you used any other model, it really is a crapshoot with how bad they can be.
replies(1): >>42145279 #
13. Agingcoder ◴[] No.42142644{5}[source]
See arxiv paper just above
14. kristianp ◴[] No.42142767{3}[source]
> It keeps complaining that GitHub is spelled like Github, when it isn't

I feel like this is unfair. That's the only thing it got wrong? But we want it to pass all of our evals, even ones that perhaps a dictionary would be better at solving? Or even an LLM augmented with a dictionary.

replies(2): >>42143251 #>>42143364 #
15. rco8786 ◴[] No.42142797{4}[source]
There’s another option here though. Human supervised tasks.

There’s a whole classification of tasks where a human can look at a body of work and determine whether it’s correct or not in far less time than it would take for them to produce the work directly.

As a random example, having LLMs write unit tests.

replies(1): >>42148431 #
16. vidarh ◴[] No.42142842{3}[source]
I do contract work on fine-tuning efforts, and I can tell you that most humans aren't designed to be public-facing either.

While LLMs do plenty of awful things, people make the most incredibly stupid mistakes too, and that is what LLMs need to be benchmarked against. The problem is that most of the people evaluating LLMs are better educated than most and often smarter than most. When you see any quantity of prompts input by a representative sample of LLM losers, you quickly lose all faith in humanity.

I'm not saying LLMs are good enough. They're not. But we will increasingly find that there are large niches where LLMs are horrible and error prone yet still outperform the people companies are prepared to pay to do the task.

In other words, on one hand you'll have domain experts becoming expert LLM-wranglers. On the other hand you'll have public-facing LLMs eating away at tasks done by low paid labour where people can work around their stupid mistakes with process or just accepting the risk, same as they currently do with undertrained labor.

replies(3): >>42143411 #>>42143886 #>>42145953 #
17. MBCook ◴[] No.42143251{4}[source]
Does it matter?

As a user I want it to be right, even if that contradicts the normal rules of the language.

18. fnordpiglet ◴[] No.42143330[source]
We have built quite a few highly useful LLM applications in my org that have reduced cost and improved outcomes in several domains - fraud detection, credit analysis, customer support, and a variety of other spaces. By and large they operate as cognitive load reducers, but they also handle the vast majority of the work through automation, since in our uses false negatives are not as bad as false positives but the vast majority of things we analyze (99.999%+) are not true positives. As such the LLMs do a great job at anomaly detection and allow us to do tasks that would be prohibitively expensive with humans, whose false positive and false negative rates are considerably higher than the LLMs'.
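
To see why the false positive rate dominates at a base rate like that, here is a small worked calculation. Only the roughly 1-in-100,000 prevalence is taken from the paragraph above; the detection rate and false positive rate are made-up numbers for illustration, not figures from this system.

```python
def precision(prevalence: float, true_positive_rate: float,
              false_positive_rate: float) -> float:
    """Share of flagged items that are real positives (Bayes' rule)."""
    true_alerts = prevalence * true_positive_rate
    false_alerts = (1.0 - prevalence) * false_positive_rate
    total = true_alerts + false_alerts
    if total == 0:
        raise ValueError("no alerts would be raised with these rates")
    return true_alerts / total


# Hypothetical detector: catches 90% of real cases, wrongly flags 0.1%.
p = precision(prevalence=1e-5, true_positive_rate=0.9, false_positive_rate=0.001)
print(f"Fraction of alerts that are real: {p:.4f}")
```

Under these assumed rates, fewer than 1 in 100 alerts is real, which is why small changes in the false positive rate matter far more than small changes in the miss rate.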

I see these statements often here about “I’ve never seen an effective commercial use of LLMs,” which tells me you aren’t working with very creative and competent people in areas that are amenable to LLMs. In my professional network beyond where I work now I know at least a dozen people who have successful commercial applications of LLMs. They tend to be highly capable people able to build the end to end tool chains necessary (which is a huge gap) and understand how to compose LLMs in hierarchical agents with effective guard rails. Most ineffectual users of LLMs want them to be lazy buttons that obviate the need to think. They’re not - like any sufficiently powerful tool they require thought up front and are easy to use wrong. This will get better with time as patterns and tools emerge to get the most use out of them in a commercial setting. However the ability to process natural language and use an emergent (if not actual) abductive reasoning is absurdly powerful and was not practically possible 4 years ago - the assertion that such an amazing capability in an information or decisioning system is not commercially practical is absurd on its face.

replies(3): >>42143387 #>>42143440 #>>42143506 #
19. sdesol ◴[] No.42143364{4}[source]
My reason for commenting wasn't to say LLM sucks, but rather that we need to get over the honeymoon phase. The fact that GPT-4o (one of the most advanced, if not the most advanced when it comes to non-programming tasks) hallucinated "Github" as the input should give us pause.

LLM has its place and it will forever change how we think about UX and other things, but we need to realize you really can't create a public-facing solution without significant safeguards, if you don't want egg on your face.

replies(1): >>42145712 #
20. andai ◴[] No.42143387[source]
>compose LLMs in hierarchical agents with effective guard rails

Could you elaborate? Is this related to the "teams of specialized LLMs" concept I saw last year when Auto-GPT was getting a lot of hype?

21. sdesol ◴[] No.42143411{4}[source]
> While LLMs do plenty of awful things, people make the most incredibly stupid mistakes too

I am 100% not blaming the LLM, but rather VCs and the media for believing the VCs. Once we get over the hype and people realize there isn't a golden goose, the better off we will be. Once we accept that LLM is not perfect and that it is not what we are being sold, I believe we will find a place for it that will make a huge impact. Unfortunately for OpenAI and others, I don't believe they will play as big of a role as they would like us to believe/will.

22. topicseed ◴[] No.42143440[source]
Do they build guardrails themselves or do they use an LLM guardrail API like Modelmetry or Langwatch?
23. mhuffman ◴[] No.42143506[source]
>We have built quite a few highly useful LLM applications in my org that have reduced cost and improved outcomes in several domains

Apps that use LLMs or apps made with LLMs? In either case can you share them?

>which tells me you aren’t working with very creative and competent people

> In my professional network beyond where I work now I know at least a dozen people who have successful commercial applications of LLMs.

Apps that use LLMs or apps made with LLMs? In either case can you share them?

No one doubts that you can integrate LLMs into an application workflow and get some benefits in certain cases. That has been what the excitement and promise was about all along. They have a demonstrated ability to wrangle, extract, and transform data (mostly correctly) and generate patterns from data and prompts (hit and miss, usually with a lot of human involvement). All of which can be powerful. But outside of textual or visual chatbots or CRUD apps, no one wants to "put up or shut up" with a solid example that the top management of an existing company would sign off on. Only stories about awesome examples they and their friends are working on ... which often turn out to be CRUD apps or textual or visual chatbots. One notable standout is that generative image apps can be quite good in certain circumstances.

So, since you seem to have a real interest and actual examples of this, I am curious to see some that real companies would gamble the company on. And I don't mean some quixotic startup, I mean a company making real money now with customers that is confident in that app to the point they are willing to risk big. Because that last part is what companies do with other (non LLM) apps. I also know that people aren't perfect and wouldn't expect an LLM to be, just want to make sure I am not missing something.

24. intended ◴[] No.42143886{4}[source]
I have a side point here - There is a certain schizoid aspect to this argument that LLMs and humans make similar mistakes.

This means that on one hand firms are demanding RTO for culture and team work improvements. While on the other they will be ok with a tool that makes unpredictable errors like humans, but can never be impacted by culture and team work.

These two ideas lie in odd juxtaposition to each other.

replies(1): >>42146209 #
25. solid_fuel ◴[] No.42144019{3}[source]
I wouldn't expect an LLM to be good at spell checking, actually. The way they tokenize text before manipulating it makes them fairly bad at working with small sequences of letters.

I have had good luck using an LLM as a "sanity checking" layer for transcription output, though. A simple prompt like "is this paragraph coherent" has proven to be a pretty decent way to check the accuracy of whisper transcriptions.
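
A minimal sketch of that sanity-checking layer. The model call is passed in as `ask_llm`, a hypothetical callable that takes a prompt string and returns the reply text; no specific vendor API or Whisper interface is assumed, and the yes/no prompt wording is illustrative.

```python
def flag_incoherent(paragraphs, ask_llm):
    """Return the indices of paragraphs the model does not call coherent.

    ask_llm is an injected callable (prompt -> reply text), so any client
    or a test stub can be plugged in.
    """
    flagged = []
    for index, paragraph in enumerate(paragraphs):
        prompt = (
            "Is this paragraph coherent? Answer yes or no.\n\n" + paragraph
        )
        reply = ask_llm(prompt)
        # Anything other than a clear "yes" is treated as suspect.
        if not reply.strip().lower().startswith("yes"):
            flagged.append(index)
    return flagged
```

Flagged paragraphs would then go to a person or a second transcription pass rather than being corrected automatically.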

replies(1): >>42144176 #
26. sdesol ◴[] No.42144176{4}[source]
Yes this is a tokenization error. If you rewrite the sentence as shown below:

https://app.gitsense.com/?doc=905f4a9af74c25f&model=Claude+3...

Claude 3.5 Sonnet will now misinterpret "GitHub" as "Github".

27. physicsguy ◴[] No.42144363[source]
We’ve found that the text it generates in our RAG application is good, but it cocks up probably 5-10% of the time doing the inline references to the documents which users think is a bug and which we aren’t able to fix. This is static rather than interactively generated too
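
Some of those bad references can be caught after generation with a plain check. A minimal sketch, assuming citations are rendered as `[n]` with `n` indexing the retrieved documents starting at 1 (the format and names are assumptions, not details from this application):

```python
import re

# Assumed inline citation format: [1], [2], ...
CITATION = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, num_documents: int):
    """Return the sorted citation numbers that point at no retrieved document."""
    cited = {int(number) for number in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_documents)
```

This catches references to documents that do not exist; it cannot tell whether an existing document actually supports the sentence citing it.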
28. osigurdson ◴[] No.42144428{4}[source]
I wonder if there is a moral hazard here? Apple doesn't really have much in terms of AI, so maybe more likely to have an unfavorable view.
replies(2): >>42146106 #>>42146354 #
29. anilgulecha ◴[] No.42144477[source]
LLMs are not hype.

In education at least, we've actively improved efficiency by ~25% across a large swath of educators (direct time saved) - agentic evaluators, tutors and doubt clarifiers. The wins in this industry are clear. And this is that much more time to spend with students.

I also know from 1-1 conversations with my peers in the large-finance world that the efficiency improvements there are similar on multiple fronts.

replies(1): >>42145776 #
30. WA ◴[] No.42145279{5}[source]
OT: Your tool has a typo in the right hand side: "Claude 3.5 Sonnet Techincal writing checker"
replies(1): >>42147492 #
31. boredhedgehog ◴[] No.42145544{3}[source]
> I do believe LLM is a game changer, but I'm not convinced it is designed to be public-facing.

I think that, too, is a UX problem.

If you present the output as you do, as simple text on a screen, the average user will read it with the voice of an infallible Star Trek computer and be irritated by every mistake.

But if you present the same thing as a bunch of cartoon characters talking to each other, users might not only be fine with "egg in your face moments", as you put it, they will laugh about them.

The key is to move the user away from the idealistic mental model of what a computer is and does.

replies(2): >>42148581 #>>42157351 #
32. netdevnet ◴[] No.42145693{4}[source]
Welcome to capitalism. The market forces will squeeze max value out of them. I imagine that Anthropic and OpenAI will in the future be fully downsized and acquired by their main investors (Microsoft and Amazon) and will simply become part of their generic and faceless AI & ML teams once the current downward stage of the hype cycle reaches its close in the next 5-8 years.
replies(1): >>42167865 #
33. netdevnet ◴[] No.42145712{5}[source]
I believe the honeymoon phase finished long ago. Even in the mainstream, last year was the AI year. 2024 has seen nothing substantially good, and the only noteworthy thing is this article finally bringing into the public consciousness that we are past the AI peak, beyond the plateau, and that the freefall has already begun.

LLM investors will be reviewing their portfolios and will likely begin declining further investments without clear evidence of profits in the very near future. On the other side, LLM companies will likely try to downplay this and again promise the Moon.

And on and on the market goes

34. netdevnet ◴[] No.42145776[source]
They are partially hype though. That's what people here are arguing. There are benefits but their valuation is largely hype driven. AI is going to transform industries and humanity, yes. But AI does not mean LLM (even if LLM means AI). LLM raw potential was reached last year with GPT-4. From here on, the value will lie in exploiting the potential we already have to generate clever applications. Just like the internet provided a platform for new services, I expect LLMs to be the same but with a much smaller impact.
35. vidarh ◴[] No.42145953{4}[source]
Yikes, that was an unfortunate auto-correct and too late to edit. "LLM losers" was meant to be "LLM users".
replies(1): >>42148538 #
36. larodi ◴[] No.42146106{5}[source]
No, sadly they are just voicing the opinion already voiced by (many) other scientists.

My masters was text-to-sql and I can tell you hundreds of papers conclude that seq2seq and the transformer derivatives suck at logic even when you approach logic the symbolic way.

We’d love to find that production rules of any sort emerge as the transformer scales, but I’m yet to read such a paper.

37. larodi ◴[] No.42146113{5}[source]
Sorry, I thought this was already discussed on HN in a major topic, and it was hard for me to copy-paste the link on mobile. My apologies.
38. vidarh ◴[] No.42146209{5}[source]
I think this goes exactly to the point that a whole lot of things become acceptable once they become cheap enough.
replies(1): >>42148086 #
39. fennecfoxy ◴[] No.42146354{5}[source]
It's also true that Apple hasn't really created any of these technologies themselves; afaik they're using a mostly standard LLM architecture (not invented by Apple) combined with task-specific LoRAs (not invented by Apple). Has Apple actually created any genuinely new technologies or innovations for Apple Intelligence?
40. sdesol ◴[] No.42147492{6}[source]
Hey thanks! The error is in the config file. Will fix this.
41. intended ◴[] No.42148086{6}[source]
Since this is a comparison, what has been made comparatively cheaper?
replies(1): >>42148416 #
42. jacobr1 ◴[] No.42148338[source]
This is why we are only at the start of exploring the solution space. What applications don't require 100% accuracy? What tooling can we build that enables a human in the loop to choose between options? What options do we have to better test or check accuracy? There is a lot more to be done to invent hybrid systems that use other types of models or novel training data or heuristics or human workflows in novel ways that shore up the shortcomings ... but in aggregate allow us to do new things. It will take many years for us to figure out where this makes the most sense.
43. jacobr1 ◴[] No.42148416{7}[source]
We aren't talking about skilled knowledge work in Silicon Valley campuses. We are talking about work that might already have been outsourced to some cube-farm in the Philippines. Or routine office work that probably could already have been automated away by a line of business app in the 1980s, but is still done in some small office in Tulsa because it doesn't make sense to pay someone to write the code when 80% of the work is managing the data entry that still needs to be done regardless.

This more marginal labor is going to be easier to replace. Also plenty of the more "elite" type labor will be too, as it turns out it is more marginal. Already glue and boilerplate programming work is going this way; there is just so much more to do, and the important work of figuring out what should be done, that it hasn't displaced programmers yet. But it will for some fraction. WYSIWYG type websites for small business have come a long way and will only get better, so there will be less need for customization on the margin. Or light design work (like take my logo and plug it into this format for this charity tournament flyer).

replies(1): >>42150352 #
44. jacobr1 ◴[] No.42148431{5}[source]
Which is a good example, because accuracy can be improved significantly with even minor human guidance in task like unit tests. Human augmentation is extremely valuable.
45. tim333 ◴[] No.42148538{5}[source]
I thought you were maybe a bit rude there!
replies(1): >>42149423 #
46. tim333 ◴[] No.42148581{4}[source]
To be fair they usually have "ChatGPT can make mistakes. Check important info" type disclaimers.
replies(1): >>42149865 #
47. vidarh ◴[] No.42149423{6}[source]
Yeah, not my intent. I use LLMs a lot myself too...
48. sdesol ◴[] No.42149865{5}[source]
As mentioned earlier, unless you have a 6th sense for what is wrong, you won't know. If the message was "make sure to double check our response" then they get a pass, but they know people will just say "why shouldn't i just use google."
49. intended ◴[] No.42150352{8}[source]
Ok.

Well, I can see the direction you are going. I am unconvinced though - it hasn't threaded the needle.

Reason being

1) They are doing both in cube farms in the Philippines: RTO + replacement by GenAI.

2) In high tech, they are also trying to achieve these contradictory goals: RTO + increased GenAI capability to reduce manpower needs.

I can see a desire to reduce costs. I can't see how RTO to improve teamwork sits with using LLMs to do human work.

replies(1): >>42153386 #
50. salad-tycoon ◴[] No.42153386{9}[source]
That’s a lot of weight on RTO and why it’s being implemented. A company is fully able to have you RTO, maybe even move, and then fire you the next day/month/year; desiring increased teamwork is not mutually exclusive with preparing for layoffs. Plus, I imagine at these companies there are multiple hands all doing things for their own purposes and metrics without knowing what the other hand is doing. Mid-level Jan’s Christmas bonus depends on responding to exit-interview measurements showing workers leaving due to lack of teamwork; Bob’s bonus depends on quickly implementing the code.
51. BlueTemplar ◴[] No.42157351{4}[source]
> It looks like you're writing unsubstantiated nonsense. Would you like to turn it all caps ?

clippy.gif

52. parineum ◴[] No.42167865{5}[source]
> Welcome to capitalism. The market forces will squeeze max value out of them.

What a ringing endorsement.