This PhD thesis gives a very good introduction: https://arxiv.org/abs/2104.10544
In the context of generative AI, it's precisely because we're dealing with lossy compression that it works at all. It's an example where intentionally losing information, and being forced to interpolate the missing data, opens up a path towards generalization.
Lossless LLMs would not be very interesting (other than for the typical uses we have for lossless compression). That paper is interesting precisely because it uses lossless compression, which is quite unusual in machine learning.
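A toy sketch of the distinction being drawn here (my own illustration, not from the paper; zlib and the byte-quantization scheme are just stand-ins): a lossless codec reconstructs the input bit-for-bit, while a lossy one throws information away up front, so decompression can only approximate the original.

```python
import zlib

data = bytes(range(256)) * 4

# Lossless: decompression recovers the input exactly.
packed = zlib.compress(data)
assert zlib.decompress(packed) == data

# Lossy (toy): quantize each byte to a multiple of 16 before compressing.
# The low nibble is discarded, so the exact input is unrecoverable;
# any reconstruction has to interpolate within that error band.
quantized = bytes((b // 16) * 16 for b in data)
restored = zlib.decompress(zlib.compress(quantized))
assert restored != data                                  # information is gone
assert all(abs(a - b) < 16 for a, b in zip(restored, data))  # but bounded error
```

The analogy to generative models is loose but instructive: what a lossy representation forces you to do, fill in detail that was never stored, is the behavior that looks like generalization.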
Do you want a few large corporations to have absolute and total control of all AI? Because that's how you get that: under that reasoning, Google and the like will just stick requirements in their terms of service saying they can train on your data, and they'll be the only parties that can effectively make useful models.
Copyright isn't some natural law; were it one, all your works would be preempted by their presence deep inside pi. Instead, it's a pragmatic compromise intended to give creators a time-limited monopoly on reproduction of their work in order to encourage more creation. In the US it has never covered highly transformative uses; in fact, it shouldn't even matter if embedded in the AI were a literal encoding of the whole work, though that generally isn't the case. All our creations are fundamentally derivative, and fortunately the judiciary seems to have a better handle on what copyright is than much of the public does.
If anything, the rise of generative AI tools is a sign that the copyright tradeoff should shift towards being more permissive: we don't need as much restriction on people's actual natural rights as we once did to get valuable and important work created, because creating it has never been less expensive, less risky, or easier to monetize through non-restrictive means than it is today.
We don't get to choose whether to live in a world with these tools; the genie is out of the bottle. But we probably do get a lot of choice about their openness and everyone's level of access to create and use them. Let's not choose poorly.
Is this not what we currently have? Large corporations own the data centers, and there will never be a collectively-owned data center unless our dominant mode of production changes.
I know there are open models, but how do you serve them to users who don't have the compute?
Sure, not every user can obtain the compute. But the fact that a great many people can, and that the people for whom it makes the most difference can, does a tremendous amount to level the playing field.
Imagine that welding could only be performed by WeldCo, and what a negative effect that would have. Fortunately anyone can weld, though most people won't. But if you found yourself dead in the water and WeldCo were trying to extort you, you'd just pick up the equipment, teach yourself, and commence with the welding (or hire someone to do so). Now realize that LLMs may well turn out to be even more general-purpose than welding. So the freedom to access these tools is all the more critical, even if many will find they don't need to exercise it. The widespread access is precisely why you may not need to.