{"id":20275644,"url":"https://github.com/softsec-kaist/binkit","last_synced_at":"2025-08-22T06:32:28.604Z","repository":{"id":43662136,"uuid":"314792627","full_name":"SoftSec-KAIST/BinKit","owner":"SoftSec-KAIST","description":"Binary Code Similarity Analysis (BCSA) Benchmark","archived":false,"fork":false,"pushed_at":"2023-12-15T02:24:01.000Z","size":108,"stargazers_count":135,"open_issues_count":3,"forks_count":24,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-12-08T14:35:51.828Z","etag":null,"topics":["benchmark","binary-analysis"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SoftSec-KAIST.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-21T11:08:44.000Z","updated_at":"2024-12-03T12:16:34.000Z","dependencies_parsed_at":"2023-02-18T03:10:16.319Z","dependency_job_id":"818fac7e-f8ff-4ab8-94de-27209044fe35","html_url":"https://github.com/SoftSec-KAIST/BinKit","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftSec-KAIST%2FBinKit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftSec-KAIST%2FBinKit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftSec-KAIST%2FBinKit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SoftSec-KAIST%2FBinKit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SoftSec
-KAIST","download_url":"https://codeload.github.com/SoftSec-KAIST/BinKit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230568588,"owners_count":18246378,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","binary-analysis"],"created_at":"2024-11-14T13:10:24.284Z","updated_at":"2024-12-20T10:08:14.767Z","avatar_url":"https://github.com/SoftSec-KAIST.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BinKit 2.0\n\nBinKit is a binary code similarity analysis (BCSA) benchmark. BinKit provides\nscripts for building a cross-compiling environment, as well as the compiled\ndataset. The current dataset includes 1,904 distinct combinations of compiler\noptions of 8 architectures, 6 optimization levels, and 23 compilers. It includes\n371,928 binaries.\n\nCompared to the paper version, the latest version of BinKit adds support for\nrelatively newer compiler versions for major compilation options, and support\nfor the Ofast optimization option.\n\nIn particular, BinKit now includes GCC and Clang versions up to 11 and 13,\nrespectively. Currently, a total of 6 optimization options (O0, O1, O2, O3, Os,\nOfast) are supported. See the [Currently supported compile\noptions](https://github.com/SoftSec-KAIST/BinKit#currently-supported-compile-options)\nsection below for more detailed options.\n\nIn the BinKit 2.0 dataset, the gsl package is missing 8 binaries for the Ofast\noption due to compiler bugs. 
See the [Missing binaries](https://github.com/SoftSec-KAIST/BinKit#Missing-binaries)\npart of the [Issues](https://github.com/topcue/tmp#issues) section for more\ninformation.\n\n## BinKit 1.0 (paper version)\nThe original dataset includes 1,352 distinct combinations of compiler options of\n8 architectures, 5 optimization levels, and 13 compilers. It includes 243,128\nbinaries. We tested this code in Ubuntu 16.04.\n\nFor more details, please check [our\npaper](https://0xdkay.me/pub/2020/kim-arxiv2020.pdf).\n\n# BCSA tool and Ground Truth Building\nFor a BCSA tool and ground truth building, please check\n[TikNib](https://github.com/SoftSec-KAIST/TikNib).\n\n## Pre-compiled dataset and toolchain\nYou can download our dataset and toolchain as below. The link will be changed to\n`git-lfs` soon.\n\n[//]: # (Cloning this repository also downloads below pre-compiled dataset and toolchain\nwith `git-lfs`. Please use `GIT_LFS_SKIP_SMUDGE=1` to skip the download.)\n\n### Dataset (latest version)\n\n- [BinKit 2.0 dataset](https://drive.google.com/file/d/1TrjFnv6BMpVEXYukVxrhlQ78S0NPKEXa/view?usp=share_link)\n\n### Dataset (old)\nBelow datasets are for reproduction of paper\n\n- [Normal dataset](https://drive.google.com/file/d/1K9ef-OoRBr0X5u8g2mlnYqh9o1i6zFij/view?usp=sharing)\n- [SizeOpt dataset](https://drive.google.com/file/d/1QgwbEfd8vdzg5glNZFL7dg4l4hrkoWO3/view?usp=sharing)\n- [Noinline dataset](https://drive.google.com/file/d/1wt7GY-DDp8J_2zeBBVUrcfWIyerg_xLO/view?usp=sharing)\n- [PIE dataset](https://drive.google.com/file/d/1IfEbnS9RtHhVhW8oiqnE7G75uPej1FPx/view?usp=sharing)\n- [LTO dataset](https://drive.google.com/file/d/1Tsd-WNO_JDlEX0GylBOxsFjOPUmUyeGh/view?usp=sharing)\n- [Obfus dataset](https://drive.google.com/file/d/1H5k3pfJH9zN4anfxKi1WvNqTKmjVjUUU/view?usp=sharing)\n- [Obfus 2-Loop dataset](https://drive.google.com/file/d/1C3SXt896R4rJvpvxcItFu9NIgN-hAxz8/view?usp=sharing)\n\nBelow data is only used for our evaluation.\n- [ASE 
dataset](https://drive.google.com/file/d/1MwXHRXjuPoQJAON6SZVoKcK6Xr2NMHdF/view?usp=sharing)\n\n### `.pickle` Files\nThese files include the extracted features and useful information for each function.\n- [Normal dataset `.pickle`](https://drive.google.com/file/d/1GjVoSXPvc7oTMJM4bIpmIOd6If7PTuOm/view?usp=sharing)\n- [SizeOpt dataset `.pickle`](https://drive.google.com/file/d/1MeT9Z5aaYf0kAtGxaCnHk8nyXJddPfqC/view?usp=sharing)\n- [Noinline dataset `.pickle`](https://drive.google.com/file/d/1bXj2ZjnNOGAijleBh5Tki1XZLG2i1Hng/view?usp=sharing)\n- [PIE dataset `.pickle`](https://drive.google.com/file/d/1mVzTKeJ4OzH1fyuSCn-_CPF8BmFMkAEw/view?usp=sharing)\n- [LTO dataset `.pickle`](https://drive.google.com/file/d/1ELxkiapNnMrjfcMdltWvritwBH9pBJ7o/view?usp=sharing)\n- [Obfus dataset `.pickle`](https://drive.google.com/file/d/12r4kdMvZYE4zTD3f4FDA-kxDOy8GU5ZL/view?usp=sharing)\n\nBelow data is only used for our evaluation.\n- [ASE dataset `.pickle`](https://drive.google.com/file/d/1NbhNRBpV5_evRXrNeUReBU7ju9bEtKxq/view?usp=sharing)\n\n### Toolchain\n- [tools](https://drive.google.com/file/d/1Ar8CT4xZceT083jMy2dU5q-CgcMHqrQ0/view?usp=sharing)\n\n# Currently supported compile options\n### Architecture\n- x86_32\n- x86_64\n- arm_32 (little endian)\n- arm_64 (little endian)\n- mips_32 (little endian)\n- mips_64 (little endian)\n- mipseb_32 (big endian)\n- mipseb_64 (big endian)\n\n### Optimization\n- O0\n- O1\n- O2\n- O3\n- Os\n- Ofast\n\n### Compilers\n- gcc\n  - gcc-4.9.4\n  - gcc-5.5.0\n  - gcc-6.4.0\n  - gcc-6.5.0\n  - gcc-7.3.0\n  - gcc-8.2.0\n  - gcc-8.5.0\n  - gcc-9.4.0\n  - gcc-10.3.0\n  - gcc-11.2.0\n- clang\n  - clang-4.0.0\n  - clang-5.0.2\n  - clang-6.0.1\n  - clang-7.0.1\n  - clang-8.0.0\n  - clang-9.0.1\n  - clang-10.0.1\n  - clang-11.0.1\n  - clang-12.0.1\n  - clang-13.0.0\n- clang-obfus\n  - clang-obfus-fla (Obfuscator-LLVM - FLA)\n  - clang-obfus-sub (Obfuscator-LLVM - SUB)\n  - clang-obfus-bcf (Obfuscator-LLVM - BCF)\n  - clang-obfus-all (Obfuscator-LLVM - 
FLA + SUB + BCF)\n\n# How to use\n### 1. Configure the environment in `scripts/env.sh`\n- `NUM_JOBS`: for `make`, `parallel`, and `python` multiprocessing\n- `MAX_JOBS`: maximum for `make`\n\n### 2. Build cross-compiling environment (takes lots of time)\nThis step builds the crosstool-ng and clang environments. If you downloaded the\npre-compiled toolchain, please skip this step.\n\n```bash\n$ source scripts/env.sh\n# We may have missed some packages here ... please check\n$ scripts/install_default_deps.sh # install default packages for dataset compilation\n$ scripts/setup_ctng.sh       # setup crosstool-ng binaries\n$ scripts/setup_gcc.sh        # build ct-ng environment. Takes a lot of time\n$ scripts/cleanup_ctng.sh     # cleaning up ctng leftovers\n$ scripts/setup_clang.sh      # setup clang and llvm-obfuscator\n```\n\n### 3. Link toolchains\n```bash\n$ scripts/link_toolchains.sh  # link base toolchain\n```\nTo undo the linking, please check `scripts/unlink_toolchains.sh`\n\n### 4. Build dataset\nPlease configure the variables in `compile_packages.sh` and run the commands\nbelow. The script automatically downloads the source code of GNU packages and\ncompiles them to build the entire dataset. However, building all of them may\ntake a very long time.\n\n- *NOTE* that it takes *SIGNIFICANT* time.\n- *NOTE* that some packages may not compile for some compiler options.\n\n```bash\n$ scripts/install_gnu_deps.sh # install default packages for dataset compilation\n$ ./compile_packages.sh\n```\n\n### 4-1. 
Build dataset (manual)\n\nYou can download the source code of GNU packages of your interest as below.\n- Please check step 1 before running the command.\n- You must give *ABSOLUTE PATH* for `--base_dir`.\n\n```bash\n$ source scripts/env.sh\n$ python gnu_compile_script.py \\\n    --base_dir \"/home/dongkwan/binkit/dataset/gnu\" \\\n    --num_jobs 8 \\\n    --whitelist \"config/whitelist.txt\" \\\n    --download\n```\n\nYou can compile only the packages or compiler options of your interest as below.\n\n```bash\n$ source scripts/env.sh\n$ python gnu_compile_script.py \\\n    --base_dir \"/home/dongkwan/binkit/dataset/gnu\" \\\n    --num_jobs 8 \\\n    --config \"config/normal.yml\" \\\n    --whitelist \"config/whitelist.txt\"\n```\n\nYou can check the compiled binaries as below.\n\n```bash\n$ source scripts/env.sh\n$ python compile_checker.py \\\n    --base_dir \"/home/dongkwan/binkit/dataset/gnu\" \\\n    --num_jobs 8 \\\n    --config \"config/normal.yml\"\n```\n\nFor more details, please check `compile_packages.sh`\n\n### 4-2. Build dataset with customized options\n\nTo build datasets by customizing options, you can make your own configuration\nfile (`.yml`) and select target compiler options. You can check the format in\nthe existing sample files in the `/config` directory. Here, please make sure\nthat the name of your config file is not included in the blacklist in the\n[compilation\nscript](/SoftSec-KAIST/BinKit/blob/master/do_compile_utils.sh#L347).\n\n\n# Issues\n\n### Tested environment\nWe ran all our experiments on a server equipped with four Intel Xeon E7-8867v4\n2.40 GHz CPUs (total 144 cores), 896 GB DDR4 RAM, and 4 TB SSD. 
We set up Ubuntu\n16.04 on the server.\n\n### Tested python version\n- Python 3.8.0\n\n### Running example\n\nRunning the script below took `7` hours on our machine.\n\n```bash\n$ python gnu_compile_script.py \\\n    --base_dir \"/home/dongkwan/binkit/dataset/gnu\" \\\n    --num_jobs 72 \\\n    --config \"config/normal.yml\" \\\n    --whitelist \"config/whitelist.txt\"\n```\n\n### Compilation failure\n\nIf compilation fails, you may have to adjust the number of jobs for parallel\nprocessing in step 1, which is machine-dependent.\n\n### Missing binaries\n\nIn the BinKit 2.0 dataset, the gsl package is missing 8 binaries for the Ofast\noption due to compiler bugs. Clang-8 and clang-9 hang when compiling the gsl\npackage for 32-bit ARM with the Ofast option. We reported this issue to bug-gsl\nand llvm-project. However, bug-gsl did not reply, and the llvm-project replied\nthat these versions are not currently supported. The bug reports are as\nfollows:\n[bug-gsl](https://lists.gnu.org/archive/html/bug-gsl/2023-02/msg00000.html),\n[llvm-project](https://github.com/llvm/llvm-project/issues/60692)\n\n# Authors\nThis project was conducted by the following authors at KAIST.\n* [Dongkwan Kim](https://0xdkay.me/)\n* [Eunsoo Kim](https://hahah.kim)\n* [Sang Kil Cha](https://softsec.kaist.ac.kr/~sangkilc/)\n* [Sooel Son](https://sites.google.com/site/ssonkaist/home)\n* [Yongdae Kim](https://syssec.kaist.ac.kr/~yongdaek/)\n\n# Citation\nWe would appreciate it if you consider citing [our\npaper](https://ieeexplore.ieee.org/document/9813408) when using BinKit.\n```bibtex\n@ARTICLE{kim:tse:2022,\n  author={Kim, Dongkwan and Kim, Eunsoo and Cha, Sang Kil and Son, Sooel and Kim, Yongdae},\n  journal={IEEE Transactions on Software Engineering}, \n  title={Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned}, \n  year={2022},\n  volume={},\n  number={},\n  pages={1-23},\n  
doi={10.1109/TSE.2022.3187689}\n}\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftsec-kaist%2Fbinkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoftsec-kaist%2Fbinkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftsec-kaist%2Fbinkit/lists"}