The $5000 Compression Challenge (2001)

(www.patrickcraig.co.uk)

181 points ekiauhce | 2 comments | 23 Nov 24 21:10 UTC

Xcelerate ◴[24 Nov 24 21:41 UTC] No.42230884[source]▶

There's another interesting loophole to these general compression challenges. A universal Turing machine will necessarily compress some number of strings (despite the fact that almost all strings are incompressible). The set of compressible strings varies depending on the choice of UTM, and if your UTM is fixed, you're out of luck for random data. But if the UTM is unspecified, then for any specific string there exist infinitely many UTMs that will compress it.

To some extent, this resembles the approach of "hiding the data in the decompressor". But the key difference here is that you can make it less obvious by selecting a particular programming language capable of universal computation, and it is the choice of language that encodes the missing data. For example, suppose we have ~17k programming languages to choose from—the language selection itself encodes about 14 bits (log2(17000)) of information.
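As a quick check of that figure, here is a minimal sketch of the arithmetic. The function name `selection_bits` is made up for illustration, and 17,000 is just the rough language count used above.

```python
import math

def selection_bits(num_choices: int) -> float:
    """Bits of information conveyed by picking one option out of num_choices."""
    return math.log2(num_choices)

if __name__ == "__main__":
    # ~17k languages, the rough figure from the comment above
    print(f"{selection_bits(17_000):.2f} bits")
```

Put differently, a challenge that lets you pick the language freely hands you about 14 bits that never show up in the compressed file.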

If there are m bits of truly random data to compress and n choices of programming languages capable of universal computation, then as n/m approaches infinity, the probability of at least one language being capable of compressing an arbitrary string approaches 1. This ratio is likely unrealistically large for any amount of random data more than a few bits in length.
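A rough sketch of this estimate, under the assumption (not a realistic one) that each language compresses an independent set of strings. It uses the standard counting bound: there are fewer than 2^(m-k+1) programs of length at most m-k, so a fixed machine can shorten fewer than a 2^(1-k) fraction of all m-bit strings by k or more bits. The function names are hypothetical.

```python
def max_fraction_compressible(k: int) -> float:
    """Upper bound on the fraction of m-bit strings that one fixed machine
    can shorten by at least k bits. The bound does not depend on m."""
    # Fewer than 2**(m-k+1) programs have length <= m-k, out of 2**m strings.
    return min(1.0, 2.0 ** (1 - k))

def prob_some_language_compresses(n: int, k: int) -> float:
    """Upper bound on the chance that at least one of n machines saves >= k bits
    on a random string, ASSUMING the machines behave independently."""
    p = max_fraction_compressible(k)
    return 1.0 - (1.0 - p) ** n

if __name__ == "__main__":
    # Saving 15 bits would just outweigh the ~14 bits spent naming the language.
    print(prob_some_language_compresses(n=17_000, k=15))
```

Even this optimistic bound only gets interesting when the number of languages is comparable to 2^k, where k is the number of bits you hope to save.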

There's also the additional caveat that we're assuming the set of compressible strings is algorithmically independent for each choice of UTM. This certainly isn't the case. The invariance theorem states that ∀x, |K_U(x) − K_V(x)| < c for UTMs U and V, where K is Kolmogorov complexity and c is a constant that depends only on U and V. So in our case, c is effectively the size of the largest minimal transpiler between any two programming languages that we have to choose from, and I'd imagine c is quite small.

Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.

Nevertheless it's an interesting thought experiment. I'm curious what the smallest value of m is such that we could realistically compress a random string of length 2^m given the available programming languages out there. Unfortunately, I imagine it's probably like 6 bits, and no one is going to give you an award for compressing 6 bits of random data.

replies(3): >>42231360 #>>42231365 #>>42232405 #

_hark ◴[24 Nov 24 23:02 UTC] No.42231365[source]▶

>>42230884 #

The issue with the invariance theorem you point out always bugged me.

Let s be an algorithmically random string relative to UTM A. Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small? I.e. the blank print statement of S produces s. And there always exists such an S for any s?

Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM? It seems intuitive that although some pathological UTM might exist that can "compress" whichever string you have, its construction appears very unnatural. Is there some way of formalizing this "naturalness"?

replies(2): >>42233214 #>>42233484 #

1. Xcelerate ◴[25 Nov 24 04:20 UTC] No.42233214[source]▶

>>42231365 #

> Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small

Yes. It’s not even that hard to create. Just take a standard UTM and perform a branching “if” statement to check if the input is the string of interest before executing any other instructions.
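One way to read this construction, as a toy sketch only: `toy_reference_machine` is a hypothetical stand-in that merely echoes its program (a real UTM would interpret it), and `make_pathological_machine` is an illustrative name. The wrapper maps the empty program to the target string and otherwise drops one byte and defers to the base machine.

```python
from typing import Callable

Machine = Callable[[bytes], bytes]

def toy_reference_machine(program: bytes) -> bytes:
    """Hypothetical stand-in for a reference UTM: echoes its program.
    A real UTM would interpret the program instead."""
    return program

def make_pathological_machine(target: bytes, base: Machine) -> Machine:
    """Wrap `base` so that the empty program outputs `target`."""
    def machine(program: bytes) -> bytes:
        if program == b"":
            # The special case: zero bytes of input yield the whole target.
            return target
        # Any other program: drop the first byte and defer to the base machine.
        return base(program[1:])
    return machine

if __name__ == "__main__":
    s = bytes(range(256))  # placeholder for the "random" string of interest
    S = make_pathological_machine(s, toy_reference_machine)
    print(len(S(b"")), S(b"\x00hello"))
```

Every program of the base machine still runs on the wrapper at a cost of one extra byte, so the two machines differ only by a small constant, which is what the invariance theorem allows.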

> Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM?

Haha, not that anyone knows of. This is one of the issues with Solomonoff induction as well. Which UTM do we pick to make our predictions? If no UTM is privileged over any other, then some will necessarily give very bad predictions. Averaged over all possible induction problems, no single UTM can be said to be superior to the others either. Solomonoff wrote an interesting article about this predicament a while ago.

(A lot of people will point to the constant offset of Kolmogorov complexity due to choice of UTM as though it somehow trivializes the issue. It does not. That constant is not like the constant in time complexity which is usually safe to ignore. In the case of Solomonoff induction, it totally changes the probability distribution over possible outcomes.)
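A toy illustration of that last point, with made-up numbers. It weights each hypothesis by 2^(-description length), which is a crude simplification of the Solomonoff prior (the real prior sums over all programs, not just the shortest). The hypothesis names and lengths are purely illustrative, and the two machines disagree by only 3 bits on each.

```python
def normalized_prior(description_lengths: dict[str, int]) -> dict[str, float]:
    """Weight each hypothesis by 2**(-length) and normalize to sum to 1."""
    weights = {h: 2.0 ** -k for h, k in description_lengths.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Illustrative description lengths in bits under two machines U and V.
lengths_under_u = {"h1": 10, "h2": 13}
lengths_under_v = {"h1": 13, "h2": 10}

prior_u = normalized_prior(lengths_under_u)
prior_v = normalized_prior(lengths_under_v)

if __name__ == "__main__":
    print(prior_u)
    print(prior_v)
```

A 3-bit additive shift in description length is a factor of 8 in prior weight, so here the two machines rank the same two hypotheses in opposite order.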

replies(1): >>42235581 #

2. _hark ◴[25 Nov 24 12:13 UTC] No.42235581[source]▶

>>42233214 (TP) #

Interesting. I guess then we would only be interested in the normalized complexity of infinite strings, e.g. lim_{n→∞} K(X|n)/n, where X is an infinite sequence of numbers (e.g. the decimal expansion of some real number), and K(X|n) is the complexity of the first n of them. This quantity should still be unique w/o reference to the choice of UTM, no?
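A short derivation of why the machine dependence would drop out, assuming the invariance bound quoted earlier in the thread holds for every prefix with the same constant c, and assuming the limit exists at all (otherwise the same argument applies to the lim sup and lim inf). Here K_U and K_V denote complexity relative to two UTMs U and V:

```latex
\[
\left| \frac{K_U(X|n)}{n} - \frac{K_V(X|n)}{n} \right|
  < \frac{c}{n},
\qquad \frac{c}{n} \to 0 \quad \mbox{as} \quad n \to \infty .
\]
```

So if the limit exists for one machine, it exists and has the same value for any other, because the constant is divided away.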
