Say we have some dataset composed of D bytes.
Next, say we find a predictive model M of the data, where M takes N bytes to describe. Further, say that the entropy of the dataset under the model is H bytes.
Then, if N + H < D, the model compresses the data.
It doesn't matter whether the model is deterministic or probabilistic: a probabilistic model can be used to losslessly compress the dataset with entropy coding (e.g. arithmetic coding), and a deterministic model can do the same by storing only the residuals where its predictions differ from the data.
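Here is a minimal sketch of that accounting (the byte-level unigram model, the JSON serialization of it, and the file used as data are arbitrary toy choices of mine, not anything canonical): fit a probability table to the data, take N to be the serialized size of the table and H to be the data's code length under it, then check N + H against the raw size D. An arithmetic coder driven by the same table would produce an actual compressed file within a few bytes of H.

```python
import json
import math
from collections import Counter

def compression_accounting(data: bytes):
    """Toy accounting for the N + H < D criterion with a byte-level unigram model."""
    D = len(data)  # raw size of the dataset, in bytes

    # "Model": an empirical probability table over byte values.
    counts = Counter(data)
    total = sum(counts.values())
    probs = {b: c / total for b, c in counts.items()}

    # N: bytes needed to write the model down (here, a JSON-serialized table).
    N = len(json.dumps({str(b): p for b, p in probs.items()}).encode())

    # H: code length of the data under the model, converted from bits to bytes.
    # An entropy coder (e.g. arithmetic coding) gets within a few bytes of this.
    H = sum(-math.log2(probs[b]) for b in data) / 8

    return D, N, H

if __name__ == "__main__":
    data = open(__file__, "rb").read()  # any byte string works as a stand-in dataset
    D, N, H = compression_accounting(data)
    print(f"D = {D} bytes, N = {N} bytes, H = {H:.1f} bytes")
    print("model compresses the data" if N + H < D else "model does not compress the data")
```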
One more argument for compression being equivalent to intelligence: across many fields of statistical machine learning, there are generalization bounds of the form:
test error <= train error + model complexity
That is, we don't expect to do any worse on new (test) data than the sum of the train error and the model complexity (the model's smallest compressed size). Notice that if you interpret the train error as the entropy of the data under the model (i.e. the cross-entropy loss), then the right-hand side is exactly N + H from above, and the models that minimize the bound are the ones that best compress the data. In other words: the model which produces the shortest description of the data is the one expected to generalize best.
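To see this concretely, here is a toy sketch of my own construction (the choice of byte-level context models, add-one smoothing, and a 70/30 split are all assumptions, not a formal bound): fit models of two different orders to a training prefix, treat N + H on the training data as each model's description length, and measure cross-entropy on the held-out suffix. On small training sets the higher-order model typically achieves lower train cross-entropy but pays for it in table size N and in held-out cross-entropy, so the model with the shorter total description tends to generalize better.

```python
import json
import math
from collections import Counter, defaultdict

def fit(data: bytes, order: int):
    """Count (context, next-byte) pairs for an order-`order` byte model."""
    counts = defaultdict(Counter)
    for i in range(order, len(data)):
        counts[data[i - order:i]][data[i]] += 1
    return counts

def model_bytes(counts) -> int:
    """N: size of the serialized count tables (a crude description length)."""
    table = {ctx.hex(): dict(c) for ctx, c in counts.items()}
    return len(json.dumps(table).encode())

def cross_entropy_bytes(counts, order: int, data: bytes) -> float:
    """H: code length of `data` under the model, with add-one smoothing, in bytes."""
    bits = 0.0
    for i in range(order, len(data)):
        c = counts.get(data[i - order:i], Counter())
        total = sum(c.values()) + 256          # add-one over all 256 byte values
        bits += -math.log2((c[data[i]] + 1) / total)
    return bits / 8

if __name__ == "__main__":
    text = open(__file__, "rb").read()
    split = int(0.7 * len(text))
    train, test = text[:split], text[split:]

    for order in (0, 2):
        m = fit(train, order)
        N = model_bytes(m)
        H_train = cross_entropy_bytes(m, order, train)
        H_test = cross_entropy_bytes(m, order, test)
        print(f"order {order}: N + H_train = {N + H_train:.0f} bytes, "
              f"held-out cross-entropy = {8 * H_test / max(1, len(test)):.2f} bits/byte")
```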