
jawns:
Full disclosure: I'm currently in a leadership role on an AI engineering team, so it's in my best interest for AI to be perceived as driving value.

Here's a relatively straightforward application of AI that is set to save my company millions of dollars annually.

We operate large call centers, and agents were previously spending 3-5 minutes after each call writing manual summaries of the calls.

We recently switched to using AI to transcribe and write these summaries. Not only are the summaries better than those produced by our human agents, they also free up the human agents to do higher-value work.
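
A minimal sketch of this kind of pipeline - the model choice and the summarize helper below are illustrative, not a description of our actual stack:

    # Illustrative transcribe-then-summarize pipeline. Uses the open-source
    # openai-whisper package for speech-to-text; summarize() is a placeholder
    # for whichever LLM you point it at.
    import whisper

    def summarize(transcript: str) -> str:
        # Placeholder: send the transcript to your LLM with a prompt like
        # "Summarize this support call: issue, resolution, follow-ups."
        raise NotImplementedError

    model = whisper.load_model("base")  # small model; pick per accuracy needs

    def summarize_call(audio_path: str) -> str:
        transcript = model.transcribe(audio_path)["text"]
        return summarize(transcript)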

It's not sexy. It's not going to replace anyone's job. But it's a huge, measurable efficiency gain.

dsr_:
Pro-tip: don't write the summary at all until you need it for evidence. Store the call audio as 24 kbit/s Opus - that's 180 KB per minute. After a year or whatever, delete the oldest audio.
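
The arithmetic, plus a retention sweep - the directory layout and the exact one-year window here are assumptions:

    # 24 kbit/s Opus storage math, plus deleting anything past retention.
    import time
    from pathlib import Path

    BYTES_PER_MINUTE = 24 * 1000 // 8 * 60   # = 180,000 bytes, i.e. 180 KB/min
    RETENTION_SECONDS = 365 * 24 * 3600      # "after a year or whatever"

    def sweep(archive_dir: str) -> None:
        cutoff = time.time() - RETENTION_SECONDS
        for f in Path(archive_dir).glob("*.opus"):
            if f.stat().st_mtime < cutoff:
                f.unlink()                   # delete the oldest audio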

There, I've saved you more millions.

sillyfluke:
You'll also have saved them the cost of all the AI summaries that turn out to be incorrect.

The parent states:

>Not only are the summaries better than those produced by our human agents...

Now, since they haven't mentioned what it took to actually verify that the AI summaries were in fact better than the human agents', I'm sceptical they did the necessary due diligence.

Why do I think this? Because I have actually tried to do such a verification. To verify that an AI summary is actually correct, you have to engage in the incredibly tedious task of listening to the original recording literally second by second and making sure that what is said does not conflict with the AI summary in question. Not only did the AI summary fail this test, it failed on the very first recording I tested.

The AI summary stated that "Feature X was going to be in Release 3, not 4," whereas in the recording it is stated that the feature will be in Release 4, not 3 - literally the opposite of what the AI said.

I'm sorry, but the fact that the AI summary is nicely formatted and hasn't missed a major topic of conversation means fuck all if the details that are discussed are spectacularly wrong from a decision-tracking perspective - as in, literally the opposite of what was stated.

And I know "why" the AI summary fucked up: in that instance, the topic of conversation was the confusion about which release that feature was going to be in - that's why the issue was a major item on the meeting agenda in the first place. Predictably, the AI failed to follow the convoluted discussion and "came to" the opposite conclusion.

In short, no fucking thanks.

doorhammer:
Again, not the OP, so I can't speak to their exact use-case, but the vast majority of call center calls fall into really clear buckets.

To give you an idea: Phonetic transcription was the "state of the art" when I was a QA analyst. It broke call transcripts apart into a stream of phonemes and when you did a search, it would similarly convert your search into a string of phonemes, then look for a match. As you can imagine, this is pretty error prone and you have to get a little clever with it, but realistically, it was more than good enough for the scale we operated at.
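
A toy illustration of the scheme - real phonetic indexing was far more sophisticated, and the tiny grapheme-to-phoneme table here is made up:

    # Toy phonetic search: both transcript and query become phoneme strings,
    # then we look for a fuzzy match. The fuzziness is exactly what made the
    # real thing error prone.
    from difflib import SequenceMatcher

    G2P = {"late": "L EY T", "delivery": "D IH L IH V ER IY",
           "refund": "R IY F AH N D"}

    def to_phonemes(text: str) -> str:
        return " ".join(G2P.get(w, w) for w in text.lower().split())

    def search(query: str, transcript_phonemes: str,
               threshold: float = 0.8) -> bool:
        q = to_phonemes(query)
        window = len(q)
        for i in range(0, max(1, len(transcript_phonemes) - window + 1), 3):
            chunk = transcript_phonemes[i:i + window]
            if SequenceMatcher(None, q, chunk).ratio() >= threshold:
                return True
        return False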

If it were an ecom site you'd already know the categories of calls you're interested in because you've been doing that tracking manually for years. Maybe something like "late delivery", "broken item", "unexpected out of stock", "missing pieces", etc.

Basically, you'd have a lot of known context to anchor the LLM's analysis, which would (probably) cover the vast majority of your calls, leaving you freed up to interact with outliers more directly.
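
A sketch of what that anchoring can look like - call_llm is a stand-in for whatever completion API you use, and the categories are the ones you already track:

    # Constrain the LLM to the call categories you already track manually.
    CATEGORIES = ["late delivery", "broken item", "unexpected out of stock",
                  "missing pieces", "other"]

    def classify_call(transcript: str, call_llm) -> str:
        prompt = ("Classify this support call into exactly one of:\n"
                  + "\n".join(f"- {c}" for c in CATEGORIES)
                  + "\nRespond with the category name only.\n\n"
                  + "Transcript:\n" + transcript)
        label = call_llm(prompt).strip().lower()
        # Anything outside the known buckets goes to a human for review.
        return label if label in CATEGORIES else "other"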

At work as a software dev, having an LLM summarize a meeting incorrectly can be really, really bad, so I appreciate the point you're making, but at a call center for an F500 company you're looking for trends, and you're aware of your false positive/negative rates. Realistically, those can be relatively high and still provide a lot of value.

Also, if it's a really large company, they almost certainly had someone validate the calls, second-by-second, against the summaries (I know because that was my job for a period of time). That's a minimum bar for _any_ call analysis software so you can justify the spend. Sure, it's possible that was hand-waved, but as the person responsible for the outcome of the new summarization technique with LLMs, you'd be really screwing yourself to handwave a product that made you measurably less effective. There are better ways to integrate the AI hype train into a QA department than replacing the foundation of your analysis, if that's all you're trying to do.
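
Once you have that human-validated sample, the error-rate bookkeeping is straightforward; a minimal sketch:

    # Per-tag error rates from a human-validated sample of calls.
    # Each pair is (human_says_tag_applies, model_applied_tag).
    def tag_error_rates(validated: list[tuple[bool, bool]]) -> dict[str, float]:
        tp = sum(1 for h, m in validated if h and m)
        fp = sum(1 for h, m in validated if not h and m)
        fn = sum(1 for h, m in validated if h and not m)
        tn = sum(1 for h, m in validated if not h and not m)
        return {
            "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
            "false_negative_rate": fn / (tp + fn) if tp + fn else 0.0,
        }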

sillyfluke:
Thanks for the detailed domain-specific explanation. If we assume that some whale clients of the company will end up in the call center, isn't it more probable that more competent human agents will be responsible for those calls, whereas in the alternative scenario it's pretty much the same AI agent addressing the whale client as the regular customers?
doorhammer:
Yeah, if I were running a QA department I wouldn't let LLMs anywhere near actual customers as far as trying to resolve a customer issue directly.

And, this is just a guess, but it's not uncommon that whale customers like that have their own dedicated account person and I'd personally stick with that model.

The use-case where I'm like "huh, yeah, I could see that working well" is mostly around doing sentiment analysis and call tagging - maybe actual summaries that humans might read, if I had a really well-designed context for the LLM to work within. Basically anything where you can have an acceptable false positive/negative rate.
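
Shape-wise, something like this triage, where score_sentiment stands in for whatever model you run - the point is just that low-confidence calls fall through to people:

    # Auto-tag only above a confidence threshold; outliers go to humans.
    def triage(calls, score_sentiment, threshold: float = 0.85):
        auto_tagged, needs_human = [], []
        for call in calls:
            label, confidence = score_sentiment(call)  # e.g. ("negative", 0.91)
            if confidence >= threshold:
                auto_tagged.append((call, label))
            else:
                needs_human.append(call)
        return auto_tagged, needs_human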