95% of Companies See 'Zero Return' on $30B Generative AI Spend

(thedailyadda.com)

jawns ◴[21 Aug 25 16:36 UTC] No.44974805[source]▶

Full disclosure: I'm currently in a leadership role on an AI engineering team, so it's in my best interest for AI to be perceived as driving value.

Here's a relatively straightforward application of AI that is set to save my company millions of dollars annually.

We operate large call centers, and agents were previously spending 3-5 minutes after each call writing manual summaries of the calls.

We recently switched to using AI to transcribe and write these summaries. Not only are the summaries better than those produced by our human agents, they also free up the human agents to do higher-value work.

It's not sexy. It's not going to replace anyone's job. But it's a huge, measurable efficiency gain.

replies(39): >>44974847 #>>44974853 #>>44974860 #>>44974865 #>>44974867 #>>44974868 #>>44974869 #>>44974874 #>>44974876 #>>44974877 #>>44974901 #>>44974905 #>>44974906 #>>44974907 #>>44974929 #>>44974933 #>>44974951 #>>44974977 #>>44974989 #>>44975016 #>>44975021 #>>44975040 #>>44975093 #>>44975126 #>>44975142 #>>44975193 #>>44975225 #>>44975251 #>>44975268 #>>44975271 #>>44975292 #>>44975458 #>>44975509 #>>44975544 #>>44975548 #>>44975622 #>>44975923 #>>44976668 #>>44977281 #

dsr_ ◴[21 Aug 25 16:42 UTC] No.44974877[source]▶

>>44974805 #

Pro-tip: don't write the summary at all until you need it for evidence. Store the call audio as 24 kbit/s Opus, which works out to 180 KB per minute. After a year or whatever, delete the oldest audio.

There, I've saved you more millions.
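
A quick check of the storage arithmetic in this comment, as a minimal sketch. The bitrate is the one quoted in the comment; the call volume and average call length are hypothetical and chosen only to show the scale.

```python
def opus_storage_bytes(bitrate_kbps: float, minutes: float) -> float:
    """Bytes needed for `minutes` of audio at a constant bitrate in kilobits/s."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8

# The figure from the comment: one minute at 24 kbit/s, in kilobytes
per_minute_kb = opus_storage_bytes(24, 1) / 1000

# Hypothetical volume (not from the thread): 1 million calls a year, 6 minutes each
yearly_tb = opus_storage_bytes(24, 1_000_000 * 6) / 1e12

print(f"{per_minute_kb:.0f} KB per minute, about {yearly_tb:.2f} TB per year")
```

Under those assumed volumes, a year of recordings is on the order of a terabyte, which is what makes the "just keep the audio" suggestion cheap.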

replies(10): >>44974925 #>>44975015 #>>44975017 #>>44975057 #>>44975100 #>>44975212 #>>44975220 #>>44975321 #>>44975382 #>>44975421 #

sillyfluke ◴[21 Aug 25 17:17 UTC] No.44975421[source]▶

>>44974877 #

You also will have saved them all the cost of the AI summaries that are incorrect as well.

The parent states:

>Not only are the summaries better than those produced by our human agents...

Now, since they have not mentioned what it took to actually verify that the AI summaries were in fact better than the human agents' summaries, I'm sceptical they did the necessary due diligence.

Why do I think this? Because I have actually tried to do such a verification. In order to verify that the AI summary is actually correct you have to engage in the incredibly tedious task of listening to the original recording literally second by second and making sure that what is said does not conflict with the AI summary in question. Not only did the AI summary fail this test, it failed on the first recording I tested.

The AI summary stated that "Feature x was going to be in Release 3, not 4" whereas in the recording it is stated that the feature will be in Release 4, not 3, literally the opposite of what the AI said.

I'm sorry but the fact that the AI summary is nicely formatted and has not missed a major topic of conversation means fuck all if the details that are discussed are spectacularly wrong from a decision tracking perspective, as in literally the opposite of what is stated.

And I know "why" the AI summary fucked up: in that instance the topic of conversation was the confusion about which release that feature was going to be in, which is why the issue was a major item on the meeting agenda in the first place. Predictably, the AI failed to follow the convoluted discussion and "came to" the opposite conclusion.

In short, no fucking thanks.

replies(3): >>44975487 #>>44975553 #>>44975657 #

1. doorhammer ◴[21 Aug 25 17:26 UTC] No.44975553[source]▶

>>44975421 #

Again, not the OP, so I can't speak to exactly their use-case, but the vast majority of call center calls fall into really clear buckets.

To give you an idea: Phonetic transcription was the "state of the art" when I was a QA analyst. It broke call transcripts apart into a stream of phonemes and when you did a search, it would similarly convert your search into a string of phonemes, then look for a match. As you can imagine, this is pretty error prone and you have to get a little clever with it, but realistically, it was more than good enough for the scale we operated at.
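
As a toy illustration of the phoneme search idea described above (not the product the commenter used): the pronunciation table below is hypothetical and tiny, and it converts text to phonemes, whereas a real system would derive the phoneme stream from the audio itself.

```python
# Hypothetical pronunciation table; a real system gets phonemes from the audio.
PHONEMES = {
    "my": ["M", "AY"],
    "order": ["AO", "R", "D", "ER"],
    "was": ["W", "AH", "Z"],
    "late": ["L", "EY", "T"],
    "delivery": ["D", "IH", "L", "IH", "V", "ER", "IY"],
}

def to_phonemes(text):
    """Flatten a phrase into one stream of phonemes ('?' for unknown words)."""
    stream = []
    for word in text.lower().split():
        stream.extend(PHONEMES.get(word, ["?"]))
    return stream

def find_phrase(call_stream, query):
    """Return every offset where the query's phonemes occur contiguously."""
    needle = to_phonemes(query)
    n = len(needle)
    if n == 0:
        return []
    return [i for i in range(len(call_stream) - n + 1)
            if call_stream[i:i + n] == needle]

call = to_phonemes("my order was late delivery")
hits = find_phrase(call, "late delivery")
```

Exact matching like this is where the "pretty error prone" part comes from: one misrecognized phoneme breaks the match, so real searches need some tolerance for near misses.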

If it were an ecom site you'd already know the categories of calls you're interested in because you've been doing that tracking manually for years. Maybe something like "late delivery", "broken item", "unexpected out of stock", "missing pieces", etc.

Basically, you'd have a lot of known context to anchor the LLM's analysis, which would (probably) cover the vast majority of your calls, leaving you freed up to interact with outliers more directly.

At work as a software dev, having an LLM summarize a meeting incorrectly can be really really bad, so I appreciate the point you're making, but at a call center for an f500 company you're looking for trends and you're aware of your false positive/negative rates. Realistically, those can be relatively high and still provide a lot of value.
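
For readers who want the false positive/negative rates mentioned above made concrete, here is a minimal sketch. The ten labels are invented for illustration; `actual` stands for what a QA reviewer heard on the call and `predicted` for what the automated tagger said.

```python
def error_rates(predicted, actual):
    """Return (false_positive_rate, false_negative_rate) for boolean tags."""
    pairs = list(zip(predicted, actual))
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    negatives = sum(1 for _, a in pairs if not a)
    positives = sum(1 for _, a in pairs if a)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr

# Hypothetical sample: was the call really about a late delivery?
actual    = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

fpr, fnr = error_rates(predicted, actual)
print(f"false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
```

For trend tracking across many thousands of calls, rates like these can be tolerable as long as they are known and stable, which is the commenter's point.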

Also, if it's a really large company, they almost certainly had someone validate the calls, second-by-second, against the summaries (I know because that was my job for a period of time). That's a minimum bar for _any_ call analysis software so you can justify the spend. Sure, it's possible that was hand-waved, but as the person responsible for the outcome of the new summarization technique with LLMs, you'd be really screwing yourself to handwave a product that made you measurably less effective. There are better ways to integrate the AI hype train into a QA department than replacing the foundation of your analysis, if that's all you're trying to do.

replies(2): >>44975928 #>>44975980 #

2. sillyfluke ◴[21 Aug 25 17:56 UTC] No.44975928[source]▶

>>44975553 (TP) #

Thanks for the detailed domain-specific explanation. If we assume that some whale clients of the company will end up in the call center, is it not more probable that more competent human agents will be responsible for those calls, whereas in the alternative scenario it's pretty much the same AI agent addressing the whale client as the regular customers?

replies(1): >>44976448 #

3. Imustaskforhelp ◴[21 Aug 25 18:00 UTC] No.44975980[source]▶

>>44975553 (TP) #

I genuinely don't think that the GP is actually making someone actually listen to the transcription and summary and check if the summary is wrong.

I almost have this gut feeling that it's the case (I may be wrong though).

Like, imagine this: if the agent could just spend 3 minutes writing a summary, why would you use AI to create a summary and then have some other person listen to the whole audio recording and check if the summary is right?

Like, it would take an agent 3 minutes out of, let's say, a 1-hour-long conversation (call?).

On the other hand, you have someone listen to the whole 1-hour recording and then check the summary? That's now 1 hour compared to 3 minutes. Nah, I don't think so.

Even if we assume that multiple agents are contacted in the same call, they can all simply write the summary of what they did and to whom they redirected and just follow that line of summaries.

And after this, I think that your point about them really screwing themselves is accurate.

Kinda funny how the GP comment was the first thing that I saw in this post and how even I was kinda convinced that they are one of the smarter ones integrating AI, but your comment made me come to the realization that they are actually just screwing themselves.

Imagine the irony: a post about how AI companies are screwing themselves by burning a lot of money while the people using them don't get any value out of it.

And then the one on HN that sounded like it finally made sense for them is also not making sense... and they are screwing themselves over.

The irony is just ridiculous. So funny it made me giggle.

replies(1): >>44976362 #

4. doorhammer ◴[21 Aug 25 18:31 UTC] No.44976362[source]▶

>>44975980 #

They might not be, and their use-case might not be one I agree with. I can just imagine a plausible reality where they made a reasonable decision given the incentives and constraints, and I default to that.

I'm basically inferring how this would go down in the context I worked under, not the GP, because I don't know the details of their real context.

I think I'm seeing where I'm not being as clear as I could, though.

I'm talking about the lifecycle of a methodology for categorizing calls, regardless of whether or not it's a human categorizing them or a machine.

If your call center agent is writing summaries and categorizing their own calls, you still typically have a QA department of humans that listen to a random sample of full calls for any given agent on a schedule to verify that your human classifiers are accurately tagging calls. The QA agents will typically listen to them at like 4x speed or more, but mostly they're just sampling and validating the sample.

The same goes for _any_ automated process you want to apply at scale. You run it in parallel to your existing methodology and you randomly sample classified calls, verifying that the results were correct and you _also_ compare the overall results of the new method to the existing one, because you know how accurate the existing method is.
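
A minimal sketch of the "randomly sample and verify" step described above. The sample size, the count of correct calls, and the choice of a Wilson score interval are all assumptions made for illustration; the thread does not say how any particular QA department computes its accuracy figures.

```python
import math
import random

def sample_calls(call_ids, k, seed=0):
    """Pick k distinct calls uniformly at random for human review."""
    rng = random.Random(seed)
    return rng.sample(list(call_ids), k)

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for the true accuracy."""
    if n == 0:
        raise ValueError("need at least one reviewed call")
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical trial: 10,000 tagged calls, QA listens to 200 of them
reviewed = sample_calls(range(10_000), k=200)

# Hypothetical result: QA found the tag correct on 184 of the 200 calls
low, high = wilson_interval(correct=184, n=200)
print(f"estimated accuracy between {low:.3f} and {high:.3f}")
```

The interval is the reason a sample can stand in for the whole population: reviewing more calls narrows it, and the same calculation applies whether the tagger is a human agent or a model.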

But you don't do that for _every_ call.

You find a new methodology you think is worth trying and you trial it to validate the results. You compare the cost and accuracy of that method against the cost and accuracy of the old one. And you absolutely would often have a real human listen to full calls, just not _all_ of them.

In that respect, LLMs aren't particularly special. They're just a function that takes a call and returns some categories and metadata. You compare that to the output of your existing function.
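
To make the "function that takes a call and returns some categories" framing concrete, here is a small sketch that runs two taggers over the same calls and reports where they disagree. Both taggers are hypothetical stand-ins: `keyword_tagger` plays the existing method and `stub_llm_tagger` is a placeholder with hard-coded rules, not a real model call.

```python
from collections import Counter
from typing import Callable, Iterable

Classifier = Callable[[str], str]

def compare_methods(old: Classifier, new: Classifier, calls: Iterable[str]):
    """Return (agreement_rate, Counter of (old_label, new_label) disagreements)."""
    total = 0
    agree = 0
    disagreements = Counter()
    for call in calls:
        total += 1
        old_label, new_label = old(call), new(call)
        if old_label == new_label:
            agree += 1
        else:
            disagreements[(old_label, new_label)] += 1
    rate = agree / total if total else 0.0
    return rate, disagreements

def keyword_tagger(transcript: str) -> str:
    # Stand-in for the existing methodology
    return "late delivery" if "late" in transcript else "other"

def stub_llm_tagger(transcript: str) -> str:
    # Placeholder for the new method; a real trial would call the model here
    if "late" in transcript or "hasn't arrived" in transcript:
        return "late delivery"
    return "other"

calls = ["my package is late", "item hasn't arrived yet", "wrong size shirt"]
rate, diffs = compare_methods(keyword_tagger, stub_llm_tagger, calls)
```

The disagreements are the calls worth sending to a human reviewer first, since agreement alone does not say which method is right.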

But it's all part of the: New tech consideration? -> Set up conditions to validate quantitatively -> run trials -> measure -> compare -> decide

Then on a schedule you go back and do another analysis to make sure your methodology is still providing the accuracy you need it to, even if you haven't changed anything.

replies(1): >>44976754 #

5. doorhammer ◴[21 Aug 25 18:40 UTC] No.44976448[source]▶

>>44975928 #

Yeah, if I were running a QA department I wouldn't let LLMs anywhere near actual customers as far as trying to resolve a customer issue directly.

And, this is just a guess, but it's not uncommon that whale customers like that have their own dedicated account person and I'd personally stick with that model.

The use-case I'm like "huh, yeah, I could see that working well" is mostly around doing sentiment analysis and call tagging--maybe actual summaries that humans might read if I had a really well-designed context for the LLM to work within. Basically anything where you can have an acceptable false positive/negative rate.

6. Imustaskforhelp ◴[21 Aug 25 19:05 UTC] No.44976754{3}[source]▶

>>44976362 #

Man, firstly I wanted to say that I loved your comment that I responded to, and this comment too. I actually feel happy reading it, and maybe it's hard to explain, but maybe it's because I learned something new.

So firstly, I thought that you meant that they had to listen to every call, so uh yeah, a misunderstanding, since I admittedly don't know much about it. Still, it's great to hear from an expert.

I also don't know about the GP's context, but I truly felt like this because of what I said in some other comments too about how people are just slapping AI stickers on things and markets are rewarding it even though they are mostly being reckless in how they use AI (which the post basically says), and I thought of them as the same. I still doubt them, though. Only more context from their side can tell.

Secondly, I really appreciate the paragraph that you wrote about testing different strategies and how in-depth you went, man. It really feels like one of those comments that will be useful for me one day or another. Seriously, thanks!

replies(1): >>44978251 #

7. doorhammer ◴[21 Aug 25 21:22 UTC] No.44978251{4}[source]▶

>>44976754 #

Hey, thanks for saying that. I have huge gaps in time commenting on HN stuff because tbh, it's just social anxiety I don't need to sign up for :| so I really value someone taking the time to express appreciation if they got something out of my novels.

I don't ever want to come across like I think I know what's up better than someone else. I just want to share my perspective given my experience and if I'm wrong, hope someone will be kind when they point it out.

Tbh it's been a while since I've worked directly in a call center (I've done some consulting type stuff here and there since then, but not much) so I'm mostly just extrapolating based on new tech and people I still know in that industry.

Fwiw, the way I try to approach interpreting something like the GP's post is to try to predict the possible realities and decide which ones I think are most plausible. After that I usually contribute the less represented perspective--but only if I think it's plausible.

I think the reality you were describing is totally plausible. My gut feeling is that it's probably not what's happening, but I wouldn't bet any money on that.

If someone said "Pick a side. I'll give you $20k if you're right and take $20k if you're wrong" I'm just flat out not participating, lol. If I _had_ to participate I'd reluctantly take the benefit-of-the-doubt side, but I wouldn't love having to commit to something I'm not at all confident about.

As it stands it's just a fun vehicle to talk about call center dynamics. Weirdly, I think they're super interesting.
