Compression algorithms for nearly uniform data

I've seen questions on compression algorithms around SE, but none quite fit what I'm looking for. Clearly truly uniformly distributed data cannot be compressed, but how close can we get?

My (probably incorrect) thoughts: I would imagine that by transforming the data (normalizing in some way?), you could accentuate the non-uniformity aspects of nearly uniform data and then use that transformed set to compress, perhaps along with the inverse transform or its parameters. But maybe I'm totally wrong and they all perform equally terribly as the data approaches uniformity?

When I look at lists of (lossless) compression algorithms, I don't see them ranked by how effective they are against certain types of data, at least not in any concrete terms. Does anyone know of a source that dives into this?

As background, I have an application where the data set is not independent, but nevertheless appears to be nearly uniform (most of the symbols have very low frequencies, and none of them have very high frequencies). So I was wondering if there are algorithms that can exploit the sampling dependence even if the data frequencies are mostly low. Then of course it would be more helpful to have a source that detailed exactly why some compression algorithms might perform better at this than others, if such a thing existed.

1 answer

  • answered 2020-10-22 17:42 btilly

    The short answer is no. Such a thing both does not and cannot exist.

    The long answer involves information theory.

    What matters to a compression algorithm is not how hard it is to say the thing you are specifying. It is how many equally likely things could you have said instead, but didn't. That is, if you have M things you might have said that were equally likely, you must send a signal long enough that it specifies which of the M you said. And that requires log_2(M) bits to make it clear which one you actually said.

    In the case of a stream of independent symbols, each with a known probability, we can figure out how many messages could be sent with equal likelihood. And thereby put a lower bound on how efficiently a message can be compressed. That lower bound is the entropy bits per symbol sent. This lower bound is actually achieved by Huffman coding.

    In order to do better than Huffman coding, we must find some additional structure to our messages. For example language often has correlations where "h" is likely to follow "t". Or in images, the color of a pixel tends to be similar to the color of a nearby pixel. Any such structure reduces the number of equally likely messages we could have sent, and opens up the possibility of a better compression algorithm.

    However you've not described such a structure. So Huffman coding is the best you can do. And if the symbol probabilities are close to each other, it won't give you very much.