Information-Theoretic Limits of Lossless Compression

Shannon's source coding theorem establishes a hard boundary: no lossless compression scheme can achieve an average code length shorter than the entropy of the source. This theorem is over seventy years old, and yet its implications are still being discovered.

Entropy as a Limit

For a discrete memoryless source with probability distribution P, the entropy H(X) = -Σ P(x) log₂ P(x) gives the minimum average bits per symbol. Any scheme achieving this rate is optimal. Any scheme claiming to beat it is either wrong or operating under different assumptions about the source.

The Kolmogorov Connection

Shannon entropy measures the average complexity of a source. Kolmogorov complexity measures the complexity of individual strings. The two are related but distinct: Kolmogorov complexity is uncomputable, while Shannon entropy is not. However, the expected Kolmogorov complexity of strings drawn from a source converges to the Shannon entropy of that source.

Compression is the dual of prediction. A good compressor is a good predictor, and vice versa. This duality is not metaphorical — it is a mathematical equivalence.

Beyond Memoryless Sources

Real data is not memoryless. Context modeling — predicting the next symbol based on previous symbols — is where modern compressors gain their advantage. The entropy rate of a stationary process, defined as the limit of conditional entropies, generalizes Shannon's bound to sources with memory.

Arithmetic coding achieves this rate asymptotically when paired with a good context model. The compression ratio is entirely determined by the quality of the model, not the coding scheme. This is why neural compression methods, which use powerful sequence models, can outperform traditional compressors on structured data.

The limit is real and cannot be circumvented by cleverness. What can be improved is our model of the source. Better models mean shorter codes. That is the only game in town.