The $5000 Compression Challenge (2001)

(www.patrickcraig.co.uk)

181 points ekiauhce | 3 comments | 23 Nov 24 21:10 UTC | HN request time: 0.622s | source

Show context

Xcelerate ◴[24 Nov 24 21:41 UTC] No.42230884[source]▶

There's another interesting loophole to these general compression challenges. A universal Turing machine will necessarily compress some number of strings (despite the fact that almost all strings are incompressible). The set of compressible strings varies depending on the choice of UTM, and if your UTM if fixed, you're out of luck for random data. But if the UTM is unspecified, then there exist an infinite number of UTMs that will compress any specific string.

To some extent, this resembles the approach of "hiding the data in the decompressor". But the key difference here is that you can make it less obvious by selecting a particular programming language capable of universal computation, and it is the choice of language that encodes the missing data. For example, suppose we have ~17k programming languages to choose from—the language selection itself encodes about 14 bits (log2(17000)) of information.

If there are m bits of truly random data to compress and n choices of programming languages capable of universal computation, then as n/m approaches infinity, the probability of at least one language being capable of compressing an arbitrary string approaches 1. This ratio is likely unrealistically large for any amount of random data more than a few bits in length.

There's also the additional caveat that we're assuming the set of compressible strings is algorithmically independent for each choice of UTM. This certainly isn't the case. The invariance theorem states that ∀x|K_U(x) - K_V(x)| < c for UTMs U and V, where K is Kolmogorov complexity and c is a constant that depends only on U and V. So in our case, c is effectively the size of the largest minimal transpiler between any two programming languages that we have to choose from, and I'd imagine c is quite small.

Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.

Nevertheless it's an interesting thought experiment. I'm curious what the smallest value of m is such that we could realistically compress a random string of length 2^m given the available programming languages out there. Unfortunately, I imagine it's probably like 6 bits, and no one is going to give you an award for compressing 6 bits of random data.

replies(3): >>42231360 #>>42231365 #>>42232405 #

1. SideQuark ◴[25 Nov 24 01:42 UTC] No.42232405[source]▶

>>42230884 #

The problem with the UTM argument designed for a specific string is that the UTM size grows without bound. You also don’t need a UTM, which will be much larger than a simple TM or finite automaton. And now we’re back to the original problem: designing a machine capable of compressing the specific string and with smaller description.

UTM provides no gain.

replies(1): >>42233322 #

2. Xcelerate ◴[25 Nov 24 04:44 UTC] No.42233322[source]▶

>>42232405 (TP) #

You cannot specify the “size” of an arbitrary Turing machine without knowing the index of that machine in some encoding scheme. If we are to account for the possibility of any string being selected as the one to compress, and if we wish to consider all possible algorithmic ways to do this, then the encoding scheme necessarily corresponds to that of a universal Turing machine.

It’s a moot point anyway as most programming languages are capable of universal computation.

Again, I’m not claiming we can compress random data. I’m claiming we can (sometimes) make it look like we can compress random data by selecting a specific programming language from one of many possibilities. This selection itself constitutes the “missing” information.

replies(1): >>42236391 #

3. SideQuark ◴[25 Nov 24 14:06 UTC] No.42236391[source]▶

>>42233322 #

> You cannot specify the “size” of an arbitrary Turing machine

Related to compression this is, as I said, irrelevant. Kolmogorov complexity, in the first order naive sense, picks a machine then talks about length. In the complete sense, due to Kolmorogov invariance which extends the naive version to any possible encoding, it shows that there is a fundamental minimal length over any machine Then one proves that the vast majority of strings are incompressible. Withtout this machine invariance there could be no notion of inherent Kolmorogov complexity of a string.

This has all been known since the 1960s.

So no, it matters not which UTM, FSM, TM, Java, Hand coded wizard machine, or whatever you choose. This line of reasoning has been investigated decades ago and thrown out since it does not work.

Your claim

>then there exist an infinite number of UTMs that will compress any specific string

combined with your claim

> For example, suppose we have ~17k programming languages to choose from—the language selection itself encodes about 14 bits (log2(17000)) of information.

Does not let you use that information to compress the string you wish to compress, unless you're extremely lucky that the string matches the particular information in the language choice matches, say, the string prefix so you know how to use the language encoded information to apply to the string. Otherwise you need to also encode the information of how the language choice maps to the prefix (or any other part), and that information is going to be as large or larger than what you want. Your program, to reconstruct the data you think language choice hides, will have to restore that data - how does your program choice exactly do this? Does it list all 17k languages and look up the proper information? Whoops, once again your program is too big.

So no, you cannot get a free ride with this choice.

By your argument, Kolmogorov complexity (the invariance version over any possible machine) allows compression of any string, which has been proven impossible for decades.

↑