{"id":19071735,"url":"https://github.com/blackhc/mnist_by_zip","last_synced_at":"2025-06-17T04:03:10.058Z","repository":{"id":147711558,"uuid":"197083080","full_name":"BlackHC/mnist_by_zip","owner":"BlackHC","description":"Compression algorithms (like the well-known zip file compression) can be used for machine learning purposes, specifically for classifying hand-written digits (MNIST)","archived":false,"fork":false,"pushed_at":"2019-12-11T22:32:19.000Z","size":13259,"stargazers_count":36,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-17T04:02:48.616Z","etag":null,"topics":["colab-notebook","machine-learning","mnist-classification","notebook-jupyter"],"latest_commit_sha":null,"homepage":"https://www.blackhc.net/blog/2019/mnist-by-zip.html","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BlackHC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-15T22:56:26.000Z","updated_at":"2024-09-14T12:26:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"e64def91-50af-4077-a003-f22925fef88c","html_url":"https://github.com/BlackHC/mnist_by_zip","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BlackHC/mnist_by_zip","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackHC%2Fmnist_by_zip","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackHC%2Fmnist_by_zip/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/BlackHC%2Fmnist_by_zip/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackHC%2Fmnist_by_zip/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BlackHC","download_url":"https://codeload.github.com/BlackHC/mnist_by_zip/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackHC%2Fmnist_by_zip/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260288450,"owners_count":22986661,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["colab-notebook","machine-learning","mnist-classification","notebook-jupyter"],"created_at":"2024-11-09T01:30:17.270Z","updated_at":"2025-06-17T04:03:10.008Z","avatar_url":"https://github.com/BlackHC.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MNIST by zip\n\nFor minimally more maths, see the blog post: \u003chttps://www.blackhc.net/blog/2019/mnist-by-zip/\u003e.\n\n**tl;dr: We can use compression algorithms (like the well-known zip file compression) for machine learning purposes, specifically for classifying hand-written digits (MNIST).**\n\nLearning means reducing information into knowledge. Through learning, we build concise models of the world that help us navigate it. 
We reduce cognitive load by finding representations of information that require less \"storage\" for common events than for rare ones.\n\nThis insight connects (machine) learning, information theory, probability theory, *and compression*.\n\n## A bit of machine learning, probability theory, and information theory\n\nIn machine learning, we want to solve problems like recognizing hand-written digits in the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). Using probability theory, we can express this as finding the digit class that maximizes the probability of the test image being of that class given the information in our dataset.\n \nInformation theory is concerned with encoding information. An important result is that given a probability distribution that tells us how often or likely certain events occur, there is an optimal encoding: the more likely an event, the fewer bits will be used in this encoding.\n\nIn the case of hand-written digits, we can look at 10 different probability distributions: one for each class of digits. We then want to find the class that has minimal encoding length for a given image.\n\nUsing Bayes' theorem, and assuming all digit classes are equally likely a priori, we can see that both approaches are equivalent.\n\nHowever, of course, we don't know the actual distribution or the optimal encoding. \n\nIn machine learning, we would try to learn the probability distribution using deep neural networks and gradient descent, for example. \n\nThrough information theory, we can also try to learn the optimal encoding using compression algorithms. Any statistical compression algorithm can be used for this. \n\nStatistical compression algorithms compress data by looking at the structure and statistics of the data itself, for example by building look-up tables or by creating Huffman trees, as the well-known zip compression algorithm does.\n\nWe could use zip compression to approximate optimal encoding! 
^[Note: in particular, I am referring to the deflate compression algorithm here, which is usually used by zip compressors.]\n\n## Using zip compression for classification\n\nWe can train one compressor for each class of digits. We do this by separating our training data by digit class, so we have 10 files `digit_0.training`, `digit_1.training`, …, `digit_9.training`. We compress each using zip compression and record the compressed size.\n\nWhen we want to classify a test image of a digit, given in a file `test.image`, we simply append it to each training file to obtain test files: `digit_0.test`, `digit_1.test`, …, `digit_9.test`, which we then compress as well. We compute the difference in compressed size to the training file, which tells us how well the test image was compressed with the training data for the different digits. The digit class with the highest compression (smallest compressed size difference) is our prediction. \n\nBecause zip compression makes use of data statistics, it is more likely that the test image will be compressed better when it is appended to the training data of the digit that it represents. \n\n## Performance\n\nWith default compression parameters, and MNIST stored as a raw byte stream, we can obtain 27% accuracy on the full test set. 
By trying a couple of different parameters (compression level 9 and window bits -10, whatever that means…), we can obtain 35% accuracy using zip on MNIST:\n\n```\nClassification report for classifier ZipCompressionClassifier():\n              precision    recall  f1-score   support\n\n           0       0.58      0.28      0.37       980\n           1       0.62      0.94      0.74      1135\n           2       0.19      0.07      0.11      1032\n           3       0.22      0.34      0.26      1010\n           4       0.22      0.21      0.21       982\n           5       0.20      0.17      0.18       892\n           6       0.56      0.41      0.47       958\n           7       0.23      0.32      0.27      1028\n           8       0.36      0.33      0.35       974\n           9       0.27      0.31      0.29      1009\n\n    accuracy                           0.35     10000\n   macro avg       0.35      0.34      0.33     10000\nweighted avg       0.35      0.35      0.33     10000\n```\n![Fig 1. 
Confusion matrix for the zip compression classifier on MNIST's test set.](assets/confusion_matrix.png)\n\n### Binarized MNIST\n\nIf we binarize the dataset by thresholding on 128 (still using one byte per pixel though), we can achieve 45% accuracy:\n\n```\nClassification report for classifier ZipCompressionClassifier():\n              precision    recall  f1-score   support\n\n           0       0.38      0.71      0.50       980\n           1       0.95      0.70      0.80      1135\n           2       0.29      0.44      0.35      1032\n           3       0.37      0.32      0.34      1010\n           4       0.49      0.39      0.43       982\n           5       0.29      0.33      0.31       892\n           6       0.68      0.33      0.44       958\n           7       0.47      0.46      0.46      1028\n           8       0.51      0.57      0.54       974\n           9       0.42      0.24      0.30      1009\n\n    accuracy                           0.45     10000\n   macro avg       0.49      0.45      0.45     10000\nweighted avg       0.49      0.45      0.45     10000\n```\n![Fig 2. 
Confusion matrix for the zip compression classifier on a binarized MNIST's test set.](assets/confusion_matrix_45.png)\n\n### Binarized MNIST with chunked compressors \n\n@ylecun pointed out that 45% accuracy is not exactly great.\n\nLooking at how zip's [deflate algorithm](https://www.w3.org/Graphics/PNG/RFC-1951) works, we learn that it independently compresses blocks of up to 64 KiB size.\nThis means that separating the training set by class and compressing it as one will only take into account the last 82 (~ 65536/28/28 - 1) training samples for each class.\n\nThat is not a lot.\n\nOne way to improve on this is to chunk the training data per class into chunks of fewer samples and measure how each of these compresses the test image.\n\nTo combine these, we can either compute the average of the compressed lengths, or we can treat the compressed length as an information length which is an upper bound on the optimal encoding length.\nThe optimal encoding length is the negative logarithm of the conditional probability, and we can average those using [logsumexp](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.logsumexp.html).\n\nUsing the latter, we achieve 74% accuracy on binarized MNIST.\n\n```\nClassification report for classifier ChunkedZipCompressionClassifier(max_samples_per_compressor=64):\n              precision    recall  f1-score   support\n\n           0       0.60      0.89      0.72       980\n           1       0.97      0.95      0.96      1135\n           2       0.81      0.48      0.60      1032\n           3       0.60      0.74      0.66      1010\n           4       0.91      0.77      0.83       982\n           5       0.73      0.45      0.56       892\n           6       0.92      0.72      0.81       958\n           7       0.80      0.68      0.74      1028\n           8       0.69      0.80      0.74       974\n           9       0.59      0.85      0.70      1009\n\n   accuracy                            0.74     
10000\n   macro avg       0.76      0.73      0.73     10000\nweighted avg       0.76      0.74      0.74     10000\n```\n![Fig 3. Confusion matrix for the chunked zip compression classifier on a binarized MNIST's test set.](assets/confusion_matrix_74.png) \n\nFrom the confusion matrix, we can see quite clearly that the classes 2, 3, and 5 are being confused the most, which makes sense\ngiven their similar sub-structures. \n\n### Conclusion\n\nUsing zip compression as a classifier is significantly better than random (10% expected accuracy). As another comparison, we have also implemented a classifier that just counts the number of pixels in images for different digits and picks the nearest class for a test image: it achieves 20% accuracy. Overall, zip compression can capture some of MNIST's structure to make somewhat informed classification decisions.\n\nWe are uncertain whether this is an endorsement of zip compression or an indictment of the MNIST dataset.  \n\n## Acknowledgements\n\nThanks to Christopher Mattern (from DeepMind) for mentioning this to me (as a joke?) a couple of years ago at Friday Drinks and to [Owen Campbell Moore](https://twitter.com/owencm) for turning a random afternoon conversation into a tiny hack project later. I have always remembered it as a fun fact and was surprised when no one else knew about it. So, here we go. :)\n\n## Code\n\nFor simplicity, I have implemented a `ZipCompressionClassifier` that is compatible with [https://scikit-learn.org/](https://scikit-learn.org/).\n\nFor fun, I have also added a BASH version using `gzip`. 
To reproduce the results on the first 10 entries from MNIST's test data, run:\n```bash\nfor test_file in data/*.image; do ./classify.sh \"$test_file\"; done\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblackhc%2Fmnist_by_zip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblackhc%2Fmnist_by_zip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblackhc%2Fmnist_by_zip/lists"}