Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB
83
sspidy__ 3 days ago 43 comments
DE version is available. Content is displayed in original English for accuracy.
I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.
Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.
On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).
It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).
Checkout the repo - https://github.com/samyak112/pym-particles

Discussion (43 Comments)Read Original on HackerNews
main drawback is that it's not lossless ;-)
but this is great. I hope this actually becomes a format that wraps the weights and transformer module (maybe this can also be NAS-optimized too?). Maybe it would even work for video?
It's like calling gzip but instead of compression level you choose kolmogorov complexity level
Just amazing, wow
I am curious. A classic machine learning ensemble approach is to overfit a collection of small models then bag them (e.g. voting) allowing the models to generalize.
I'm sure someone's tried to overfit a bunch of transformers for compression like this, then bag them to see how well it does?
I know the top submission was able to get it to 13 mb.
Still trying some ideas to get better compression.
Can you point me to something that i can read? I really wanna try this approach , diffusion model does sounds interesting for compression.
Edit: oh wait that's too easy. Need to generate /publish random digits so everyone can use it.
Decompressor: Take any old algorithm for finding digets of pi, find first 100M of them, print them.
Compression ratio of 0! :0
Random data does not mean it does not match a pattern in your dictionary for example.
A non-general compression algorithm (model - I don't mean a distinct llm, but "modeling data") targeted at a specific dataset will always do better than a general algorithm.
The reason I mentioned the "encoder" doesn't matter - arithmetic coding, for the data it is presented, will beat huffman/adaptive huffman every day, but it's the model that is where the real "compression" comes into play.
I've implemented enough "coders" over the years, including arithmetic for both commercial and research purposes (was a student of Glen Langdon).
1. How much was AI used to generate documentation for this project?
2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?
3. I am surprised to see performance comparisons only between your transformer and WinZIP. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?
2. Have added the link for downloading both the enwik9 slice and the nyc dataset. Apologies I forgot to add it.
You can get it from here - https://github.com/samyak112/pym-particles/blob/main/README....
3. Other than zip i tested it with zstd19, and now that you mentioned LZMA2 and BZIP2
I got results on enwik9 100mb slice as
zstd - 28mb bzip2 - 30mb lzma2 - 26mb
I will mention these and results from ZPAQ in the readme for both files, thanks for pointing them out!!!
But the thing is this neural compression approach cant be used right now, as it takes hours to compress and de compress a 100mb file so not really usable and more of a fun project.
So apply this same logic to compressing a bigger model within a smaller model
I know this is absolutely regarded, but humour me please
Compression is such an interesting field
Check it out: https://github.com/samyak112/pym-particles/blob/main/arithme...
Its not an issue is it? I am not sure.
So the codec would be something like: <header describing image size + transformer layer shape> <transformer data itself>
I've seen experiments where people have a "fixed" pipeline but I think having something more dynamic would work quite well.
BUT my model size is just 900KB (for 100mb file atleast) so it is negligible
I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).
I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah still not losing gains.
But it’s only for the game I’m building and it’s not pure compression work, I had to do some tricky things
For context these numbers are for a grid based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2-1024+ players (human players) playing simultaneously
So if you do the math, my compression feat is effectively ~99% compression on naive best case. And if you compare it to the raw data, it’s closing in on an even higher number than that I haven’t done the math but the raw data is another factor of 10 greater than ~100KB so the “compression” versus raw data is ~99.9%
It sounds absolutely bullshit I know :D
But I will be posting a blog post soon once I release the game.
I do compression in quotes because it’s not a pure compression feat, the 99%+ feat is effectively being clever about what actually requires compression to achieve the same outcome