This PhD thesis gives a very good introduction: https://arxiv.org/abs/2104.10544
In the context of generative AI, it's precisely because we're dealing with lossy compression that it works at all. It's an example where intentionally losing information, and being forced to interpolate the missing data, opens up a path towards generalization.
Lossless LLMs would not be very interesting (other than for the typical uses we have for lossless compression). That paper is interesting precisely because it uses lossless compression, which is quite unusual in machine learning.
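A toy sketch of the distinction being drawn here (my own illustration, not from the paper; zlib and the byte-quantization scheme are just stand-ins): a lossless codec reconstructs the input bit-for-bit, while a lossy one throws information away up front, so decompression can only approximate the original.

```python
import zlib

data = bytes(range(256)) * 4

# Lossless: decompression recovers the input exactly.
packed = zlib.compress(data)
assert zlib.decompress(packed) == data

# Lossy (toy): quantize each byte to a multiple of 16 before compressing.
# The low nibble is discarded, so the exact input is unrecoverable;
# any reconstruction has to interpolate within that error band.
quantized = bytes((b // 16) * 16 for b in data)
restored = zlib.decompress(zlib.compress(quantized))
assert restored != data                                  # information is gone
assert all(abs(a - b) < 16 for a, b in zip(restored, data))  # but bounded error
```

The analogy to generative models is loose but instructive: what a lossy representation forces you to do, fill in detail that was never stored, is the behavior that looks like generalization.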
Do you want a few large corporations to have absolute and total control of all AI? Because that's how you get that: under that reasoning, Google and the like will just stick requirements in their terms of service saying they can train on your data, and they'll be the only parties that can effectively make useful models.
Copyright isn't some natural law; were it one, all your works would be preempted by their presence deep inside pi. Instead, it's a pragmatic compromise intended to give creators a time-limited monopoly on reproduction of their work in order to encourage more creation. In the US it has never covered highly transformative uses; in fact, it shouldn't even matter if embedded in the AI were a literal encoding of the whole work, though that generally isn't the case. All our creations are fundamentally derivative, and fortunately the judiciary seems to have a better handle on what copyright is than much of the public does.
If anything, the rise of generative AI tools is a sign that the copyright tradeoff should shift towards being more permissive: we don't need as much restriction on people's actual natural rights as we once did to get valuable and important work created, because creating it has never been less expensive, less risky, or easier to monetize through non-restrictive means than it is today.
We don't get to choose whether to live in a world with these tools; the genie is out of the bottle. But we probably do get a lot of choice about their openness and everyone's level of access to create and use them. Let's not choose poorly.
Is this not what we currently have? Large corporations own the data centers, and there will never be a collectively-owned data center unless our dominant mode of production changes.
I know there are open models, but how do you serve them to users who don't have the compute?
Sure, not every user can obtain the compute. But the fact that a great many people can, and that the people for whom it makes the most difference can, does a tremendous amount to level the playing field.
Imagine that welding could only be performed by WeldCo, and what a negative effect that would have. Fortunately anyone can weld, though most people won't. But if you found yourself dead in the water and WeldCo were trying to extort you, you'd just pick up the equipment, teach yourself, and commence with the welding (or hire someone to do so). Now realize that LLMs may well turn out to be even more general-purpose than welding. So the freedom to access these tools is all the more critical, even if many will find they don't need to exercise it. The widespread access is precisely why you may not need to.