{"id":15669758,"url":"https://github.com/alexgustafsson/compdec","last_synced_at":"2026-05-20T21:03:26.156Z","repository":{"id":83002353,"uuid":"316272528","full_name":"AlexGustafsson/compdec","owner":"AlexGustafsson","description":"CompDec is a novel approach to automatically detect the compression algorithm used for file fragments using machine learning","archived":false,"fork":false,"pushed_at":"2021-01-12T11:07:43.000Z","size":5472,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-19T07:23:39.636Z","etag":null,"topics":["bth","carving","cnn","compression","digital-forensics","forensics","machine-learning","paper","research","study"],"latest_commit_sha":null,"homepage":"","language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AlexGustafsson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-11-26T15:39:37.000Z","updated_at":"2023-09-30T15:27:31.000Z","dependencies_parsed_at":"2023-10-20T16:34:56.832Z","dependency_job_id":null,"html_url":"https://github.com/AlexGustafsson/compdec","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGustafsson%2Fcompdec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGustafsson%2Fcompdec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGustafsson%2Fcompdec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGustafsson%2Fcompdec/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AlexGustafsson","download_url":"https://codeload.github.com/AlexGustafsson/compdec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243230101,"owners_count":20257644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bth","carving","cnn","compression","digital-forensics","forensics","machine-learning","paper","research","study"],"created_at":"2024-10-03T14:41:08.443Z","updated_at":"2025-12-25T22:17:06.555Z","avatar_url":"https://github.com/AlexGustafsson.png","language":"TeX","funding_links":[],"categories":[],"sub_categories":[],"readme":"CompDec\n======\n\nA project in machine learning and digital forensics for the courses DV2578 (Machine Learning) and DV2579 (Advanced Course in Digital Forensics).\n\nIn digital forensics *data carving* is the act of extracting files directly from some memory media - without any metadata or known filesystem. Conventional techniques use simple heuristics such as magic numbers, headers etc. These techniques do not scale well due to a limited number of supported file types, slow processing speeds and insufficient accuracy.\n\nRecently, machine learning has been applied to the subject, achieving state-of-the-art results both in terms of scale, accuracy and speed. These techniques utilize an efficient feature extraction from files that can be turned into a small image or other representation of the features. 
The images are then fed to convolutional neural networks, which learn to identify parts of files.\n\nThese techniques focus on generality, identifying common file types such as documents (.txt, .docx, .ppt, .pdf) and images (.jpg, .png). There is a gap in research when it comes to effectively identifying compressed files and the algorithm that was used. Compression algorithms seek to make data as dense as possible, which in turn likely yields a higher entropy than that of a typical file. In theory, this could make detection much harder.\n\nThis project aims to fill this gap, answering the following questions:\n\n* How do compressed files compare to non-compressed files in terms of entropy?\n* How can a machine-learning system be designed and trained to detect compression algorithms?\n\n**TL;DR** CompDec is a novel approach to automatically detect the compression algorithm used for file fragments using machine learning.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./samples/sample-predictions.png\"\u003e\n\u003c/p\u003e\n\n_Predicted labels for some randomly chosen samples. Format: prediction (confidence) (label)._\n\n## Table of Contents\n\n[Quickstart](#quickstart)\u003cbr /\u003e\n[Dataset](#dataset)\u003cbr /\u003e\n[Development](#development)\u003cbr /\u003e\n[Development - Quickstart](#development-quickstart)\u003cbr /\u003e\n[Development - Quickstart - Setup](#development-quickstart-setup)\u003cbr /\u003e\n[Development - Quickstart - Data Preparation](#development-quickstart-data)\u003cbr /\u003e\n[Development - Quickstart - Training and Evaluation](#development-quickstart-training)\u003cbr /\u003e\n[Development - Tools](#development-tools)\n\n## Quickstart\n\u003ca name=\"quickstart\"\u003e\u003c/a\u003e\n\n_Note: These instructions are only for inference using the pre-trained model._\n\nFirst, download the latest release from [releases](https://github.com/AlexGustafsson/compdec/releases). 
The release contains three files; a pre-trained model, a python script and a Dockerfile.\n\nIf you wish not to install all the prerequisites mentioned under [Development - Quickstart](#development-quickstart), build the Docker image instead like so:\n\n```sh\ncd compdec\ndocker build -t compdec .\n```\n\nNow you may use the tool natively or via Docker:\n\n```sh\n# Docker\ndocker run -it -v \"$/path/to/samples:/samples\" compdec /samples/unknown-file1.bin /samples/unknown-file2.bin\n# Native\npython3 ./compdec.py /path/to/samples/unknown-file1.bin /path/to/samples/unknown-file2.bin\n```\n\nThe tool will produce output like so:\n\n```\n/path/to/samples/unknown-file1.bin\n7z       : 0.00%\nbrotli   : 0.00%\nbzip2    : 0.00%\ncompress : 0.00%\ngzip     : 0.00%\nlz4      : 100.00%\nrar      : 0.00%\nzip      : 0.00%\n/path/to/samples/unknown-file2.bin\n7z       : 0.00%\nbrotli   : 0.00%\nbzip2    : 0.00%\ncompress : 100.00%\ngzip     : 0.00%\nlz4      : 0.00%\nrar      : 0.00%\nzip      : 0.00%\n```\n\n## Dataset\n\u003ca name=\"dataset\"\u003e\u003c/a\u003e\n\n### Samples\n\nIn the samples directory are file chunks, visualizations and NIST Statistical tests performed on the dataset.\n\nBelow is an example visualization and NIST test for the 7-zip tool.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./samples/visualizations/7z.png\"\u003e\n\u003c/p\u003e\n\n```\n...\nSUMMARY\n-------\nmonobit_test                             0.23712867340389365 PASS\nfrequency_within_block_test              0.28036273314388394 PASS\nruns_test                                0.11846733945572493 PASS\nlongest_run_ones_in_a_block_test         0.5251306363531703 PASS\nbinary_matrix_rank_test                  0.0                FAIL\ndft_test                                 0.753290157881333  PASS\nnon_overlapping_template_matching_test   0.9999999736364428 PASS\noverlapping_template_matching_test       0.0                FAIL\nmaurers_universal_test                   0.0                
FAIL\nlinear_complexity_test                   0.0                FAIL\nserial_test                              0.1862667243373838 PASS\napproximate_entropy_test                 0.18385318163162168 PASS\ncumulative_sums_test                     0.17770673343194865 PASS\nrandom_excursion_test                    0.24443855795386374 PASS\nrandom_excursion_variant_test            0.013229883923921373 PASS\n```\n\nThere are two pseudo-random samples, `random` and `urandom` taken from `/dev/random` and `/dev/urandom` respectively. There is also a true random sample, `true-random` taken from random.org. These random samples have one NIST test report each, available in the `.txt` file with the same name. Each \"random\" and random sample consists of 4096 bytes.\n\n## Development\n\u003ca name=\"development\"\u003e\u003c/a\u003e\n\n### Quickstart\n\u003ca name=\"development-quickstart\"\u003e\u003c/a\u003e\n\n#### Setting up the project\n\u003ca name=\"development-quickstart-setup\"\u003e\u003c/a\u003e\n\nPrerequisites:\n* Ubuntu 20.04 for training and evaluation\n* macOS 11 for development and CPU inference\n* CuDNN 8.0.4\n* Tensorflow 2.4\n* CUDA 11.1\n* Python 3.8\n  * matplotlib\n  * seaborn\n  * numpy\n  * pyyaml\n  * h5py\n  * PIL\n* Docker 19\n\nSee: https://medium.com/@cwbernards/tensorflow-2-3-on-ubuntu-20-04-lts-with-cuda-11-0-and-cudnn-8-0-fb136a829e7f.\n\nTo start, first clone this repository.\n\n```sh\ngit clone --recurse-submodules https://github.com/AlexGustafsson/compdec.git \u0026\u0026 cd compdec\n```\n\nTo train the model, you'll need some training data. The paper uses the [GovDocs](https://digitalcorpora.org/corpora/files) dataset, but any larger dataset with a wide variety of files should work fine. For ease of use, a tool is included to download the data. The commands below download a small subset of the dataset, suitable for testing and developing. 
This procedure can be repeated for any number of available threads.\n\n#### Preparing data\n\u003ca name=\"development-quickstart-data\"\u003e\u003c/a\u003e\n```sh\nmkdir -p data\n./tools/govdocs.sh download data threads/thread0.zip\nunzip -d data/govdocs data/threads/thread0.zip\n```\n\nGiven the base data, we can now compress it using the available tools. These tools require Docker and the Docker images available as part of this project. Build and tag them using `./tools/build.sh`.\n\n```sh\n./tools/create-dataset.sh ./data/govdocs ./data/dataset\n```\n\nNow we'll need an index of the dataset: which files there are and how large they are. This is easily created using the following command. In this case we're picking chunks of at most 4096 bytes, a common block size in widely used file systems.\n\n```sh\npython3 ./tools/create_index.py 4096 ./data/dataset \u003e ./data/index.csv\n```\n\nAs part of our analysis we want to study the entropy of compressed files. This can be done by first creating a stratified sample.\n\nWith the index created, one can perform stratified sampling to extract a sample from the population with the following command. In this case we're picking a strata size of 20 samples and we're using the seed `seed`.\n\n```sh\npython3 ./tools/stratified_sampling.py seed ./data/index.csv 20 \u003e ./data/strata.csv\n```\n\nUsing the stratified sample, we can run the NIST statistical test suite on it using the following command:\n\n```sh\npython3 ./tools/nist_test.py ./data/strata.csv \u003e ./data/tests.txt\n```\n\nWe can now create two strata, one for training and one for evaluation. This is done in the same way as before. Note that we're now using even sampling to ensure the same number of samples for each algorithm. 
This is to ensure that algorithms that compress poorly (and therefore yield more chunks) are not over-represented.\n\n```sh\npython3 ./tools/even_sampling.py seed ./data/index.csv 80 \u003e ./data/training-strata.csv\npython3 ./tools/even_sampling.py seed ./data/index.csv 20 ./data/evaluation-strata.csv \u003e ./data/test-strata.csv\n```\n\nMake sure that you apply an appropriate split of the data. Although a small number was used in this example, you may use the full sample size of the dataset.\n\n#### Training and evaluating the model\n\u003ca name=\"development-quickstart-training\"\u003e\u003c/a\u003e\nGiven the dataset, we can now train a model like so:\n\n```sh\npython3 ./model/train.py --model-name my-model --training-strata ./data/training-strata.csv --evaluation-strata ./data/evaluation-strata.csv --save-model --enable-tensorboard --enable-gpu\n```\n\nTraining creates a checkpoint file under `./data/checkpoints/my-model-name`. The trained model is saved to `./data/models/my-model-name.h5`, overwriting any existing file with the same name.\n\nTo start TensorBoard, run the following command:\n\n```sh\n# --bind_all optional. Makes the site available to the local network\ntensorboard --logdir ./data/tensorboard --bind_all\n```\n\nWith the model trained, we can predict the algorithm of a file or chunk using the following script:\n\n```sh\npython3 ./model/predict.py --model ./data/models/my-model.h5 --sample ./data/dataset/000233/compressed.brotli\n```\n\nWe'll get an output like so:\n\n```\n7z       : 0.34%\nbrotli   : 95.39%\nbzip2    : 0.20%\ncompress : 0.06%\ngzip     : 3.07%\nlz4      : 0.57%\nrar      : 0.27%\nzip      : 0.09%\n```\n\nThe prediction utility requires at least as many bytes as the model was trained with. 
By default this is 4096 bytes, but it can be changed.\n\nTo evaluate the performance of the model, one can render a confusion matrix like so:\n\n```\npython3 ./model/plot.py --type confusion-matrix --model ./data/models/my-model.h5 --strata ./data/evaluation-strata.csv\n```\n\nAn example plot, trained on 2M samples for 5 epochs looks like this:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./samples/confusion-matrix.png\"\u003e\n\u003c/p\u003e\n\n### Model\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./samples/network-architecture.png\"\u003e\n\u003c/p\u003e\n\n_The network architecture based on the work of Q. Chen et al._\n\nFor instructions on how to train and evaluate the model, refer to the quickstart.\n\nThe model is defined as a Keras model in `model/utilities/model_utilities.py`:\n\n```python\nmodel = tf.keras.models.Sequential()\nmodel.add(tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation=\"relu\", padding=\"same\", input_shape=(dataset_utilities.IMAGE_SIZE, dataset_utilities.IMAGE_SIZE, 1)))\nmodel.add(tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation=\"relu\", padding=\"same\"))\nmodel.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))\nmodel.add(tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation=\"relu\", padding=\"same\"))\nmodel.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))\nmodel.add(tf.keras.layers.Conv2D(filters=126, kernel_size=(3, 3), activation=\"relu\", padding=\"same\"))\nmodel.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))\nmodel.add(tf.keras.layers.Conv2D(filters=256, kernel_size=(3, 3), activation=\"relu\", padding=\"same\"))\nmodel.add(tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))\nmodel.add(tf.keras.layers.Flatten())\nmodel.add(tf.keras.layers.Dense(2048, activation=\"relu\"))\nmodel.add(tf.keras.layers.Dense(2048, 
activation=\"relu\"))\nmodel.add(tf.keras.layers.Dense(len(dataset_utilities.CLASS_NAMES), activation=\"softmax\"))\n```\n\n### Tools\n\u003ca name=\"development-tools\"\u003e\u003c/a\u003e\n\n#### chunk.sh\n\nChunking tool for splitting a file into chunks.\n\nUsage:\n```sh\n./tools/chunk.sh \u003cchunk size\u003e \u003cinput file\u003e \u003coutput directory\u003e\n```\n\nExample:\n```sh\n# Extract 4096B chunks from this file to the output directory\n./tools/chunk.sh 4096 ./tools/chunk.sh ./output\n```\n\nExample output:\n```\nfile,\"chunk size\",size\n\"./tools/chunk.sh\",4096,999\n```\n\n#### create_index.py\n\nCreate an index for the dataset.\n\nUsage:\n```sh\npython3 tools/create_index.py \u003cchunk size\u003e \u003cinput directory\u003e\n```\n\nExample:\n```sh\npython3 tools/create_index.py 4096 ./data/dataset\n```\n\nExample output:\n```\n\"file path\",\"file size\",\"chunk size\",\"chunks\",extension\n\"/path/to/compdec/data/dataset/thread0.zip\",322469174,4096,78728,\"application/zip\"\n\"/path/to/compdec/data/dataset/909/909820.pdf\",291569,4096,72,\"application/pdf\"\n\"/path/to/compdec/data/dataset/135/135778.pdf\",14013,4096,4,\"application/pdf\"\n\"/path/to/compdec/data/dataset/135/135495.html\",18127,4096,5,\"text/html\"\n...\n```\n\n#### govdocs.sh\n\nThis is a tool to simplify communication with GovDocs: https://digitalcorpora.org/corpora/files.\n\nUsage:\n```sh\n./tools/govdocs.sh download \u003ctarget-directory\u003e \u003cfile 1\u003e [file 2] [file 3] ...\n```\n\nExample:\n```sh\n# Download a single thread (about 300MB)\n./tools/govdocs.sh download data threads/thread0.zip\n```\n\nExample output:\n```\n[Download started] http://downloads.digitalcorpora.org/corpora/files/govdocs1/threads/thread0.zip -\u003e data/threads/thread0.zip\n[Download complete] http://downloads.digitalcorpora.org/corpora/files/govdocs1/threads/thread0.zip -\u003e data/threads/thread0.zip\n```\n\n#### stratified_sampling.py\n\nThis is a tool to perform a stratified 
sampling of a dataset.\n\nUsage:\n```sh\npython3 ./tools/stratified_sampling.py \u003cseed\u003e \u003cindex path\u003e \u003cstrata size\u003e\n```\n\nExample:\n```sh\npython3 tools/stratified_sampling.py 1.3035772690 index.csv 20\n```\n\nExample output:\n```\nextension,samples,frequency\n\"zip\",78728,0.35\n\"pdf\",37438,0.17\n\"html\",3590,0.016\n\"txt\",45112,0.2\n\"jpeg\",9875,0.044\n\"docx\",6659,0.03\n\"xml\",598,0.0027\n\"ppt\",29038,0.13\n\"gif\",580,0.0026\n\"csv\",679,0.003\n\"xls\",6953,0.031\n\"ps\",2535,0.011\n\"png\",604,0.0027\n\"flash\",362,0.0016\nTotal samples: 224026\nStrata size: 20\n\"file path\",offset,\"chunk size\",extension\n\"/path/to/compdec/data/dataset/thread0.zip\",108646400,4096,\"zip\"\n\"/path/to/compdec/data/dataset/191/191969.txt\",125845504,4096,\"txt\"\n\"/path/to/compdec/data/dataset/354/354930.doc\",307200,4096,\"docx\"\n\"/path/to/compdec/data/dataset/thread0.zip\",34136064,4096,\"zip\"\n...\n```\n\n#### even_sampling.py\n\nThis is a tool to perform an even sampling of a dataset.\n\nUsage:\n```sh\npython3 ./tools/even_sampling.py \u003cseed\u003e \u003cindex path\u003e \u003cstrata size\u003e\n```\n\nExample:\n```sh\npython3 tools/even_sampling.py 1.3035772690 index.csv 20\n```\n\nExample output:\n```\nextension,samples,frequency\n\"zip\",78728,0.35\n\"pdf\",37438,0.17\n\"html\",3590,0.016\n\"txt\",45112,0.2\n\"jpeg\",9875,0.044\n\"docx\",6659,0.03\n\"xml\",598,0.0027\n\"ppt\",29038,0.13\n\"gif\",580,0.0026\n\"csv\",679,0.003\n\"xls\",6953,0.031\n\"ps\",2535,0.011\n\"png\",604,0.0027\n\"flash\",362,0.0016\nTotal samples: 224026\nStrata size: 20\n\"file path\",offset,\"chunk size\",extension\n\"/path/to/compdec/data/dataset/thread0.zip\",108646400,4096,\"zip\"\n\"/path/to/compdec/data/dataset/191/191969.txt\",125845504,4096,\"txt\"\n\"/path/to/compdec/data/dataset/354/354930.doc\",307200,4096,\"docx\"\n\"/path/to/compdec/data/dataset/thread0.zip\",34136064,4096,\"zip\"\n...\n```\n\n#### compress.sh\n\nThis is a tool to 
simplify interfacing with various compression algorithms. Due to its dependencies, it's preferably used via Docker. To build it run: `./tools/build.sh`.\n\nInstead of `./tools/compress.sh`, you may use `docker run -it --rm compdec:compress`.\n\nUsage:\n```\n# Show versions of used tools\n./tools/compress.sh versions\n# Show this help dialog\n./tools/compress.sh help\n# Compress a file with all algorithms\n./tools/compress.sh compress \u003coutput prefix\u003e \u003cinput file\u003e\n```\n\nExample:\n```\n./tools/compress.sh compress output/compressed-file input/test-file\n```\n\n#### create-dataset.sh\n\nThis is a tool to simplify creating the dataset (compressing GovDocs).\n\nUsage:\n```\n./tools/create-dataset.sh \u003cbase-dir\u003e \u003ctarget-dir\u003e\n```\n\nExamples:\n\n```\n./tools/create-dataset.sh ./data/govdocs ./data/dataset\n# Only compress part of the dataset\nMAXIMUM_FILES=10 ./tools/create-dataset.sh ./data/govdocs ./data/dataset\n```\n\n#### nist_test.py\n\nThis is a tool to perform the NIST statistical test suite on samples.\n\nUsage:\n```\npython3 ./tools/nist_test.py ./data/strata.csv\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexgustafsson%2Fcompdec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexgustafsson%2Fcompdec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexgustafsson%2Fcompdec/lists"}