This is a tangent, but this neat illustration of how LLMs regurgitate their training material prompts me to voice a little prediction I've been nursing recently:
LLMs are better at generating the boilerplate of today's programming languages than they will be at generating the boilerplate of tomorrow's.
This is because not only will tomorrow's programming languages be newer, with little corpus to train the models on, but by the time a corpus is built, it will consist largely of LLM hallucinations that got checked into GitHub!
The internet that has been trawled to train the LLMs is already largely SEO spam etc., but the internet of the future will be much more so. The loop will feed into itself and the quality will get ever worse.