{"id":13735984,"url":"https://github.com/FFRI/ffridataset-scripts","last_synced_at":"2025-05-08T12:31:55.701Z","repository":{"id":118805567,"uuid":"214375515","full_name":"FFRI/ffridataset-scripts","owner":"FFRI","description":"Make datasets like FFRI Dataset","archived":false,"fork":false,"pushed_at":"2024-07-23T03:27:29.000Z","size":37171,"stargazers_count":10,"open_issues_count":3,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-15T03:15:41.066Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FFRI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-11T07:46:48.000Z","updated_at":"2024-10-05T13:29:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"ea5d49b0-9404-4e10-903d-eecf1bccfa3e","html_url":"https://github.com/FFRI/ffridataset-scripts","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FFRI%2Fffridataset-scripts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FFRI%2Fffridataset-scripts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FFRI%2Fffridataset-scripts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FFRI%2Fffridataset-scripts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FFRI","download_url":"https://codeload.github.com/FFRI/ffr
idataset-scripts/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253068630,"owners_count":21848848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T03:01:13.796Z","updated_at":"2025-05-08T12:31:50.974Z","avatar_url":"https://github.com/FFRI.png","language":"Python","funding_links":[],"categories":[":bookmark_tabs: Datasets"],"sub_categories":["Scientific Research"],"readme":"# FFRI Dataset scripts\n\nThis script allows you to create datasets in the same format as the FFRI dataset.\n\n## Requirements\n\nWe recommend using Docker to create datasets. For more information, refer to the [Using Docker](#Using-Docker) section.\n\nAlternatively, you can run this script natively by installing the following dependencies on [tested platforms](#Tested). For detailed instructions, see the [Run this script natively](#Run-This-Script-Natively) section.\n\n- Python 3.12\n- [Poetry](https://python-poetry.org/) 1.7+\n\n## Using Docker\n\n### Make A CSV File\n\nThis script requires a CSV file that contains file information such as labels, dates, and file paths. 
For example:\n\n```\npath,label,date\n./data/cleanware/test0.exe,0,2018/01/01\n./data/malware/test1.exe,1,2018/01/02\n```\n\nPlease note that the file paths in the CSV file should be specified as relative paths from the container's working directory.\n\n### Make Datasets\n\nYou can create datasets using the following commands:\n\n```\ndocker build --target production --tag ffridataset-scripts .\ndocker run -v \u003cpath/to/here\u003e/testbin:/work/testbin ffridataset-scripts test_main.py\n# Note: The data directory should contain a CSV file and the executable files you want to process.\ndocker run -v \u003cpath/to/here\u003e/data:/work/data -v \u003cpath/to/here\u003e/out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver \u003cversion_string\u003e\n```\n\nPlease ensure the following:\n\n- The host directory containing the CSV file and executable files is mounted to the container’s `/work/data`.\n- The host directory where you want to save the JSON files is mounted to the container’s `/work/out_dir`.\n- Replace `\u003cversion_string\u003e` with vYYYY (e.g., use v2024 for the FFRI Dataset 2024).\n\nTo process non-PE files, include the --not-pe-only flag:\n```\ndocker run -v \u003cpath/to/here\u003e/data:/work/data -v \u003cpath/to/here\u003e/out_dir:/work/out_dir ffridataset-scripts main.py --csv ./data/target.csv --out ./out_dir --log ./dataset.log --ver \u003cversion_string\u003e --not-pe-only\n```\n\n## Run This Script Natively\n\n### Prepare To Use\n\n**Attention** We recommend running the following commands in the working directory (the ffridataset-scripts directory).\n```\nexport LC_ALL=C.UTF-8\nexport LANG=C.UTF-8\n\nsudo apt update\nsudo apt install -y --no-install-recommends wget git gcc g++ make autoconf libfuzzy-dev unar cmake mlocate libssl-dev libglib2.0-0 curl libboost-regex-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev build-essential libpcre2-dev 
libdouble-conversion-dev\nsudo apt install -y --no-install-recommends libqt5core5a libqt5svg5 libqt5gui5 libqt5widgets5 libqt5opengl5 libqt5dbus5 libqt5scripttools5 libqt5script5 libqt5network5 libqt5sql5\nsudo apt install -y --no-install-recommends libffi-dev libncurses5-dev zlib1g zlib1g-dev libreadline-dev libbz2-dev libsqlite3-dev liblzma-dev\nsudo apt install -y --no-install-recommends software-properties-common gpg-agent gpg clang\nwget https://github.com/horsicq/DIE-engine/releases/download/3.09/die_3.09_Ubuntu_22.04_amd64.deb\nsudo apt --fix-broken install ./die_3.09_Ubuntu_22.04_amd64.deb\nrm die_3.09_Ubuntu_22.04_amd64.deb\n\nwget mark0.net/download/trid_linux_64.zip\nunar trid_linux_64.zip\ncp trid_linux_64/trid ./\nchmod u+x trid\ncp triddefs_dir/triddefs-dataset2024.trd triddefs.trd\n\ncd workspace\n\ngit clone https://github.com/JPCERTCC/impfuzzy.git\ncd impfuzzy\ngit checkout b30548d005c9d980b3e3630648b39830597293fc\ncd ../\n\ngit clone https://github.com/JusticeRage/Manalyze.git\ncd Manalyze\ngit checkout b6800ffcf2f7f4e82fe1f94d0eb2736e75e175ec\ncmake .\nmake\ncd ../\n\ngit clone https://github.com/lief-project/LIEF.git\ncd LIEF\ngit checkout 573c885de5a2bb217d4d0255b54f9b53d9a4d7c9\ngit apply ../../patches/lief.patch\ncd ../\n\ngit clone  https://github.com/trendmicro/tlsh.git\ncd tlsh\ngit checkout 96536e3f5b9b322b44ce88d36126121685e45a77\n./make.sh\ncd ../\n\ngit clone https://github.com/erocarrera/pefile.git\ncd pefile\ngit checkout ceab92e003b3436d2e52b74e9c903e812a4aeae1\ncd ../../\n\nwget https://github.com/ninja-build/ninja/releases/download/v1.12.1/ninja-linux.zip\nunar ninja-linux.zip\nsudo mv ninja /usr/bin/\n\npoetry install --no-root\n```\n\nIf something goes wrong, refer to the Dockerfile.\n\n### Run Tests\n\n**Attention** Do not store a file named `test.exe` in the working directory. 
The test script copies `testbin/test.exe` into the directory and then removes it.\n```\npoetry run python test_main.py\n```\n\n### Make Datasets\n\nBefore running this script, you need to create the CSV file described in the [Make A CSV File](#Make-A-CSV-File) section and specify its file path as an argument. Unlike when using Docker, file paths can be specified as absolute paths.\n\n**Attention** Do not store malware and cleanware in the working directory. This script copies malware and cleanware into the directory and then removes them.\n\n```\npoetry run python main.py --csv \u003cpath/to/csv\u003e --out \u003cpath/to/output_dataset_dir\u003e --log \u003cpath/to/log_file\u003e --ver \u003cversion_string\u003e\n```\n\n## Notes About Hashes\n\n- TLSH may sometimes be an empty string. This occurs because a file must possess a sufficient level of complexity to generate a valid TLSH. For more details, visit https://github.com/trendmicro/tlsh/blob/master/README.md.\n- The peHashes (crits, endgame, and totalhash) can be null due to bugs in their implementation.\n\n## Notes About TrID Definition File\n\n- The TrID definition files located in [triddefs_dir](triddefs_dir) are redistributed with permission from the TrID author, Marco Pontello.\n- The latest definition file can be obtained from the [TrID website](https://mark0.net/soft-trid-e.html).\n\n## Tested\n\n- Ubuntu 22.04.2 LTS\n- Ubuntu 22.04 on WSL2 on Windows 10\n\n## Development\n\n### Profiling Measurement\n\nFirst, create two folders:\n```\nmkdir out_dir\nmkdir measurement\n```\n\nNext, build a Docker image by specifying the measurement target:\n```\ndocker build --target measurement --tag ffridataset-scripts .\n```\n\nThen, run the following command to generate executables and a CSV file:\n```\ndocker run -v \u003cpath/to/here\u003e\\testbin:/work/testbin -v \u003cpath/to/here\u003e\\measurement\\:/work/measurement ffridataset-scripts poetry run python create_measurement_env.py\n```\n\nNow you're ready to do 
profiling. To generate a cProfile result file, run:\n```\ndocker run -v \u003cpath/to/here\u003e\\measurement:/work/data -v \u003cpath/to/here\u003e\\out_dir:/work/out_dir ffridataset-scripts poetry run python -m cProfile -o ./out_dir/profiling.stats main.py --csv ./data/test.csv --out ./out_dir --log ./test.log --ver v2023\n```\n\nThen, execute the following command:\n```\ndocker run -v \u003cpath/to/here\u003e\\out_dir\\:/work/out_dir/ --rm -p 8080:8080 ffridataset-scripts poetry run snakeviz /work/out_dir/profiling.stats  -s -p 8080 -H 0.0.0.0\n```\n\nNow, you can view the profiling results through your browser.\n\n## Author\n\nYuki Mogi. \u0026copy; FFRI, Inc. 2019-2024\n\nKoh M. Nakagawa. \u0026copy; FFRI, Inc. 2019-2024\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFFRI%2Fffridataset-scripts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFFRI%2Fffridataset-scripts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFFRI%2Fffridataset-scripts/lists"}