
317 points by laserduck | 1 comment
lolinder | No.42157485
One of the problems I keep seeing over and over with LLMs is people forgetting that they're limited by their training data.

Software engineers get hyped when they see the progress in AI coding and immediately begin to extrapolate to other fields—if Copilot can reduce the burden of coding so much, think of all the money we can make selling a similar product to XYZ industries!

The problem with this extrapolation is that the software industry is pretty much unique in how much information about its inner workings is publicly available to train on. We've spent the last 20+ years writing millions and millions of lines of code that we published on the internet, not to mention answering questions on Stack Overflow (which still has 3x as many answers as all other Stack Exchanges combined [0]), writing technical blogs, sending hundreds of thousands of emails to public mailing lists, and so on.

Nearly every other industry (with the possible exception of law) produces publicly visible output at a tiny fraction of the rate that we do. Ethics of the mass harvesting aside, it's simply not possible for an LLM to reach the same skill level in ${insert industry here} as it has with software, so you can't extrapolate from Copilot to other domains.

[0] https://stackexchange.com/sites?view=list#answers
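A minimal sketch of how one could spot-check the answer-count ratio cited in [0], assuming Python with the requests library and the public Stack Exchange API's /sites and /info endpoints (rate limits apply without an API key, and meta sites are simply counted with the rest of the network here):

    # Hedged sketch: compare Stack Overflow's total answers against the rest
    # of the Stack Exchange network using the public API. Quota limits may
    # require an API key when querying every site in one run.
    import requests

    API = "https://api.stackexchange.com/2.3"

    def list_sites() -> list[str]:
        # /sites lists every network site; api_site_parameter is the slug
        # the other endpoints accept. Pagination is ignored for brevity.
        r = requests.get(f"{API}/sites", params={"pagesize": 500})
        r.raise_for_status()
        return [s["api_site_parameter"] for s in r.json()["items"]]

    def total_answers(site: str) -> int:
        # /info returns per-site aggregate statistics, including total_answers.
        r = requests.get(f"{API}/info", params={"site": site})
        r.raise_for_status()
        return r.json()["items"][0]["total_answers"]

    so = total_answers("stackoverflow")
    rest = sum(total_answers(s) for s in list_sites() if s != "stackoverflow")
    print(f"Stack Overflow: {so:,} answers; all other sites: {rest:,}")
    print(f"Ratio: {so / rest:.1f}x")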

1. rm445 | No.42164051
Many other industries haven't yet been fully eaten by software. All kinds of data are locked away in proprietary formats, generated by humans without much automation. I don't think we know exactly where the frontiers are once someone puts in the work to build large datasets and automate the creation of synthetic training data. Whole industries could suddenly flip from 'impossible' to 'easy' for AI.