←back to thread

317 points laserduck | 1 comments | | HN request time: 0.214s | source
Show context
lolinder ◴[] No.42157485[source]
One of the consistent problems I'm seeing over and over again with LLMs is people forgetting that they're limited by the training data.

Software engineers get hyped when they see the progress in AI coding and immediately begin to extrapolate to other fields—if Copilot can reduce the burden of coding so much, think of all the money we can make selling a similar product to XYZ industries!

The problem with this extrapolation is that the software industry is pretty much unique in the amount of information about its inner workings that is publicly available for training on. We've spent the last 20+ years writing millions and millions of lines of code that we published on the internet, not to mention answering questions on Stack Overflow (which still has 3x as many answers as all other Stack Exchanges combined [0]), writing technical blogs, hundreds of thousands of emails in public mailing lists, and so on.

Nearly every other industry (with the possible exception of Law) produces publicly-visible output at a tiny fraction of the rate that we do. Ethics of the mass harvesting aside, it's simply not possible for an LLM to have the same skill level in ${insert industry here} as they do with software, so you can't extrapolate from Copilot to other domains.

[0] https://stackexchange.com/sites?view=list#answers

replies(4): >>42157535 #>>42157654 #>>42157924 #>>42164051 #
1. mountainriver ◴[] No.42157535[source]
Yep, this is also the reason LLMs can probably work well for a lot more things if we did have the data