
181 points ekiauhce | 1 comment
omoikane No.42224958
The original email thread was from 2001, and it gets posted to HN periodically:

https://news.ycombinator.com/from?site=patrickcraig.co.uk

For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):

http://prize.hutter1.net/

https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)

replies(2): >>42225155 #>>42232092 #
vlovich123 No.42225155
I have a fundamental problem with the Hutter Prize stating that intelligence is related to compression & then sponsoring a prize for lossless compression. Intelligence is related to lossy compression; lossless compression is mainly a mechanistic act.
replies(5): >>42225198 #>>42225266 #>>42231630 #>>42232462 #>>42233497 #
_hark No.42225266
Say we have some dataset composed of D bytes.

Next, say I find some predictive model M of the data, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.

Then, if N + H < D, my model compresses the data.

It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
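
For concreteness, here's a minimal sketch of that last point (Python; the toy_model, the 16-byte model size, and the bit-level accounting are made-up illustrations, not anything from the prize rules). It computes the ideal entropy-coded length H of a dataset under a probabilistic model and checks the N + H < D condition:

    import math

    def ideal_code_length_bits(data, model_prob):
        # Shannon code length of the data under a predictive model:
        # sum of -log2 P(symbol | context). An entropy coder (e.g. arithmetic
        # coding) gets within a couple of bits of this.
        total, context = 0.0, []
        for symbol in data:
            p = model_prob(context, symbol)
            total += -math.log2(p)
            context.append(symbol)
        return total

    # Toy model (made up for illustration): each bit repeats the previous one
    # with probability 0.9.
    def toy_model(context, symbol):
        if not context:
            return 0.5
        return 0.9 if symbol == context[-1] else 0.1

    data = [0] * 500 + [1] * 500          # D = 1000 bits of raw data
    H = ideal_code_length_bits(data, toy_model)
    N = 16 * 8                            # pretend the model's description costs 16 bytes
    print(N + H < len(data))              # True: the probabilistic model compresses losslessly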

One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:

test error <= train error + model complexity

That is, we don't expect to do any worse on new (test) data than the sum of the train error and the model complexity (roughly, the model's smallest compressed size). Notice that if you interpret the train error as the entropy of the data under the model (i.e. the error under a cross-entropy loss), then minimizing this bound means minimizing train entropy plus model size, which is exactly the total compressed size from the argument above. In other words: the model which produces the shortest description of the data is the one which is expected to generalize best.
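
A toy numeric illustration of using that bound for model selection (the candidate models and every bit count below are invented for the example): the model that minimizes train entropy plus description length is the one the bound favors.

    # Invented numbers, in bits: (model description length, entropy of the
    # training data under the model).
    candidates = {
        "small model":  (1_000,     90_000),
        "medium model": (20_000,    40_000),
        "huge model":   (5_000_000,  1_000),   # memorizes the data: tiny error, huge complexity
    }

    def bound_bits(name):
        model_bits, train_entropy_bits = candidates[name]
        return train_entropy_bits + model_bits   # train error + model complexity

    for name in candidates:
        print(name, bound_bits(name))
    print("expected to generalize best:", min(candidates, key=bound_bits))   # medium model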

replies(2): >>42225426 #>>42231019 #
vlovich123 No.42231019
> It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.

But if you can choose to lose information, you can obviously achieve a higher compression ratio. That's literally what visual & auditory compression exploits. Indeed, we know people generally don't memorize the entire Wikipedia article. Rather, they convert what they learn into some internally consistent story that they can then recite at any time, and each time they recite it, it's worded a bit differently (maybe memorizing some facts that help solidify the story).

Again, I have no problem with compression and decompression being equated to intelligence (or at least one facet of intelligence), provided both are allowed to be lossy. That's because you get to inject structure into the stored representation that may not otherwise exist in the original data, and you get to choose how to hydrate that representation. That's why LZMA isn't "more intelligent" than ZIP - the algorithm itself is "smarter" at compression, but you're not getting to AGI by working on a better LZMA.

It's also why H264 and MP3 aren't intelligent either: while the compression is lossy, the decompression is deterministic. That's why we can characterize LLMs as "more intelligent" than LZMA, even though LZMA compresses losslessly better than LLMs.
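
To make the lossy-vs-lossless distinction concrete, here's a small sketch (Python with zlib; the synthetic "signal" and the choice to drop the low 3 bits are made up): throwing away detail a listener or viewer wouldn't miss before coding losslessly buys a much better ratio, but the original can no longer be recovered.

    import random, zlib

    random.seed(0)
    # Synthetic data: a simple repeating pattern plus low-order noise, one byte per sample.
    signal = bytes((i % 32) * 8 + random.randint(0, 7) for i in range(10_000))

    lossless = zlib.compress(signal, 9)                # keeps every bit

    quantized = bytes(b & 0b11111000 for b in signal)  # lossy: drop the noisy low 3 bits
    lossy = zlib.compress(quantized, 9)

    print(len(lossless), len(lossy))   # the lossy pipeline is far smaller...
    print(quantized == signal)         # ...but False: the original is gone for good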

replies(1): >>42231118 #
_hark No.42231118
I agree with you in spirit. I just thought you might be interested in some of the technical details regarding the relationship between compression and generalization!

I'll have a paper out next week which makes your point precise, using the language of algorithmic rate-distortion theory (lossy compression applied to algorithmic information theory + neural nets).

I think another way of understanding this is through the "Value Equivalence Principle", which points out that if we are learning a model of our environment, we don't want to model everything in full detail; we only want to model the things which affect our value function, which in turn determines how we will act. The value function, in this sense, implies a distortion function that we can define lossy compression relative to.
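
A toy sketch of that idea (Python/NumPy; the linear value function and the 5-feature state are invented for illustration): a "value-equivalent" lossy code can discard every state feature the value function ignores, giving zero distortion as far as acting is concerned while still losing information.

    import numpy as np

    rng = np.random.default_rng(0)
    value_weights = np.array([2.0, -1.5, 0.0, 0.0, 0.0])   # value ignores features 3-5

    def value(state):
        return float(value_weights @ state)

    def lossy_encode(state):
        # keep only the features the value function can "see"
        return state * (value_weights != 0)

    state = rng.normal(size=5)
    coded = lossy_encode(state)

    print(value(state) - value(coded))    # 0.0: no distortion relative to the value function
    print(np.allclose(state, coded))      # False: information was still thrown away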