Remembrance of lzip Past
Looking forward to the future of lossy compression
Written by Christopher Hesse
April 1st, 2025
25 years ago today, lzip revolutionized the world of lossy compression (not to be confused with the lossless compression program lzip). However, nobody expected the tragic loss of the source code and binary artifacts a couple of years ago due to a misplaced space in a command line that compressed the maintainer's root filesystem to 0 bytes.
As a reminder, lossless compression works by removing repeated data from your file, but many files don't contain that much repeated data. lzip's insight was that most files do contain a lot of unimportant data, and that by getting rid of that data, you can completely surpass the best lossless compression.
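To see how limited lossless compression is by comparison, here is a small illustration (not from lzip itself) using Python's zlib: repeated data compresses dramatically, while data with little repetition barely shrinks at all.
Python
import os
import zlib

# Highly repetitive data compresses very well losslessly...
repetitive = b"meow" * 25_000  # 100,000 bytes
print(len(zlib.compress(repetitive)))  # on the order of a few hundred bytes

# ...but data with little repeated content barely compresses at all.
unrepetitive = os.urandom(100_000)  # 100,000 random bytes
print(len(zlib.compress(unrepetitive)))  # roughly 100,000 bytes, sometimes slightly more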
loss-e
Having irreversibly lost the source code, we now turn to modern machine learning approaches to try to emulate the power of lzip. As you may recall from the previous post, we had made a "toy" version of lzip using cutting edge machine learning models available at the time. Since the technology has improved over the past few years, is an updated "toy" version good enough to replace the original lzip for some use cases? Let us compare.
We have created a command line tool, loss-e, that implements compression and decompression of images and text using language models, provided the user can supply an API key.
Terminal
pip install loss-e
export OPENAI_API_KEY="..."
# compression
loss-e mode=compress input=cat.jpg output=cat.lse
# decompression
loss-e mode=decompress input=cat.lse output=cat.png
Note that because 4o image generation is not yet available in the OpenAI API, image decoding in the command line tool currently uses dall-e-3.
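loss-e's internals are not shown above, but a minimal sketch of the idea, using the openai Python package, might look like the following. The prompt wording and the compress_image/decompress_image helpers are illustrative assumptions, not loss-e's actual implementation.
Python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def compress_image(path: str) -> str:
    # "Compression": ask a vision model for a short description of the image.
    # (Illustrative sketch; not loss-e's actual code or prompt.)
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one detailed sentence."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def decompress_image(description: str) -> str:
    # "Decompression": regenerate an image from the description with dall-e-3.
    response = client.images.generate(model="dall-e-3", prompt=description, size="1024x1024")
    return response.data[0].url  # URL of the reconstructed image
In this sketch the compressed .lse file would contain only the text description, which is why the compression ratio can be so dramatic, and why decompression is best described as optimistic.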
Evaluation
We will use the OpenAI 4o model for both compression and decompression.
To measure the distance between the original and decompressed versions of images, we will use the common "corporate similarity" metric: we ask a model to find the differences between two pictures and assign a score from 1 to 10, where 1 means they are completely different and 10 means they're the same picture.
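A minimal sketch of such a judge, again assuming the openai Python package and gpt-4o as the judge model (the prompt wording is an assumption):
Python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def data_url(path: str) -> str:
    # Encode a local image file as a base64 data URL for the API.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def corporate_similarity(original: str, reconstruction: str) -> str:
    # Ask the judge model to compare the two images and return a 1-10 score.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Find the differences between these two pictures, then assign a "
                    "similarity score from 1 to 10, where 1 means completely different "
                    "and 10 means they are the same picture. Reply with only the number."
                )},
                {"type": "image_url", "image_url": {"url": data_url(original)}},
                {"type": "image_url", "image_url": {"url": data_url(reconstruction)}},
            ],
        }],
    )
    return response.choices[0].message.content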
Results
Amazing! Given that the "toy" version outputs all had Similarity: 1, the similarity score has gone up dramatically in the last 5 years. Who knows what the future holds!
Credits
Thanks to Shantanu Jain for reviewing this article.