{"id":19867306,"url":"https://github.com/huzecong/ghcc","last_synced_at":"2025-05-02T06:30:59.730Z","repository":{"id":46805718,"uuid":"213501816","full_name":"huzecong/ghcc","owner":"huzecong","description":"GitHub Cloner \u0026 Compiler","archived":false,"fork":false,"pushed_at":"2021-09-24T14:14:30.000Z","size":584,"stargazers_count":69,"open_issues_count":0,"forks_count":17,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-06T23:14:01.415Z","etag":null,"topics":["c","compilation","decompilation","docker"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huzecong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-10-07T22:56:13.000Z","updated_at":"2025-03-29T14:05:24.000Z","dependencies_parsed_at":"2022-09-23T04:34:20.087Z","dependency_job_id":null,"html_url":"https://github.com/huzecong/ghcc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huzecong%2Fghcc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huzecong%2Fghcc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huzecong%2Fghcc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huzecong%2Fghcc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huzecong","download_url":"https://codeload.github.com/huzecong/ghcc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251998221,"owners_count":
21677950,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","compilation","decompilation","docker"],"created_at":"2024-11-12T15:28:59.779Z","updated_at":"2025-05-02T06:30:59.421Z","avatar_url":"https://github.com/huzecong.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GitHub Cloner \u0026 Compiler\n\nThis project serves as the data collection process for training neural decompilers, such as\n[CMUSTRUDEL/DIRE](https://github.com/CMUSTRUDEL/DIRE).\n\nThe code for compilation is adapted from\n[bvasiles/decompilationRenaming](https://github.com/bvasiles/decompilationRenaming). The code for decompilation is\nadapted from [CMUSTRUDEL/DIRE](https://github.com/CMUSTRUDEL/DIRE). \n\n\n## Setup\n\n1. Install [Docker](https://docs.docker.com/install/) and [MongoDB](https://docs.mongodb.com/manual/installation/).\n2. Install required Python packages by:\n   ```bash\n   pip install -r requirements.txt\n   ```\n3. Rename `database-config-example.json` to `database-config.json`, and fill in appropriate values. This will be used\n   to connect to your MongoDB server.\n4. Build the Docker image used for compiling programs by:\n   ```bash\n   docker build -t gcc-custom .\n   ```\n\n\n## Usage\n\n### Running the Compiler\n\nYou will need a list of GitHub repository URLs to run the code. 
The current code expects one URL per line, for example:\n```\nhttps://github.com/huzecong/ghcc.git\nhttps://www.github.com/torvalds/linux\nFFmpeg/FFmpeg\nhttps://api.github.com/repos/pytorch/pytorch\n```\n\nTo run, simply execute:\n```bash\npython main.py --repo-list-file path/to/your/list [arguments...]\n```\n\nThe following arguments are supported:\n\n- `--repo-list-file [path]`: Path to the list of repository URLs.\n- `--clone-folder [path]`: The temporary directory to store cloned repository files. Defaults to `repos/`.\n- `--binary-folder [path]`: The directory to store compiled binaries. Defaults to `binaries/`.\n- `--archive-folder [path]`: The directory to store archived repository files. Defaults to `archives/`.\n- `--n-procs [int]`: Number of worker processes to spawn. Defaults to 0 (single-process execution).\n- `--log-file [path]`: Path to the log file. Defaults to `log.txt`.\n- `--clone-timeout [int]`: Maximum cloning time (seconds) for one repository. Defaults to 600 (10 minutes).\n- `--force-reclone`: If specified, all repositories are cloned regardless of whether they have been processed before or\n  whether an archived version exists.\n- `--compile-timeout [int]`: Maximum compilation time (seconds) for all Makefiles under a repository. Defaults to 900\n  (15 minutes).\n- `--force-recompile`: If specified, all repositories are compiled regardless of whether they have been processed before.\n- `--docker-batch-compile`: Batch compile all Makefiles in one repository using one Docker invocation. This is on by\n  default, and you almost always want this. Use the `--no-docker-batch-compile` flag to disable it. \n- `--compression-type [str]`: Format of the repository archive. Available options are `gzip` (faster) and `xz`\n  (smaller). Defaults to `gzip`.\n- `--max-archive-size [int]`: Maximum size (bytes) of repositories to archive. Repositories with greater sizes will not\n  be archived. 
Defaults to 104,857,600 (100MB).\n- `--record-libraries [path]`: If specified, a list of libraries used during failed compilations will be written to the\n  specified path. See [Collecting and Installing Libraries](#collecting-and-installing-libraries) for details.\n- `--logging-level [str]`: The logging level. Defaults to `info`.\n- `--max-repos [int]`: If specified, only the first `max_repos` repositories from the list will be processed.\n- `--recursive-clone`: If specified, any submodules in the repository will also be cloned. This is on by default.\n  Use the `--no-recursive-clone` flag to disable it.\n- `--write-db`: If specified, compilation results will be written to the database. This is on by default. Use the\n  `--no-write-db` flag to disable it.\n- `--record-metainfo`: If specified, additional statistics will be recorded.\n- `--gcc-override-flags`: If specified, these are passed as compiler flags to GCC. By default `-O0` is used.\n\n### Utilities\n\n- If compilation is interrupted, there may be leftovers that cannot be removed due to privilege issues. Purge them by:\n  ```bash\n  ./purge_folder.py /path/to/clone/folder\n  ``` \n  This is because intermediate files are created under different permissions, and we need root privileges (sneakily\n  obtained via Docker) to purge those files. This is also performed at the beginning of the `main.py` script.\n- If something goes seriously wrong, drop the database by:\n  ```bash\n  python -m ghcc.database clear\n  ```\n- If the code is modified, remember to rebuild the image since the `batch_make.py` script (executed inside Docker to\n  compile Makefiles) depends on the library code. If you don't do so, well, GHCC will remind you and refuse to proceed.\n\n### Running the Decompiler\n\nDecompilation requires an active installation of IDA with the Hex-Rays plugin. 
To run, simply execute:\n```bash\npython run_decompiler.py --ida path/to/idat64 [arguments...]\n```\n\nThe following arguments are supported:\n\n- `--ida [path]`: Path to the `idat64` executable found under the IDA installation folder.\n- `--binaries-dir [path]`: The directory where binaries are stored, i.e. the same value as `--binary-folder` in the\n  compilation arguments. Defaults to `binaries/`.\n- `--output-dir [path]`: The directory to store decompiled code. Defaults to `decompile_output/`. \n- `--log-file [path]`: Path to the log file. Defaults to `decompile-log.txt`.\n- `--timeout [int]`: Maximum decompilation time (seconds) for one binary. Defaults to 30.\n- `--n-procs [int]`: Number of worker processes to spawn. Defaults to 0 (single-process execution). \n\n\n## Advanced Topics\n\n### Heuristics for Compilation\n\nThe following procedure happens when compiling a Makefile:\n\n1. **Check if directory is \"make\"-able:** A directory is marked as \"make\"-able if it contains (case-insensitively) at\n   least one set of files among the following:\n\n   - *(Make)* `Makefile`\n   - *(automake)* `Makefile.am`\n\n   If the directory is not \"make\"-able, skip the following steps.\n\n2. **Clean Git repository:**\n\n   ```bash\n   git reset --hard  # reset modified files\n   git clean -xffd  # clean unversioned files\n   # do the same for submodules\n   git submodule foreach --recursive git reset --hard\n   git submodule foreach --recursive git clean -xffd\n   ```\n\n   If any command fails, ignore it and continue executing the rest.\n\n3. **Build:**\n\n   1. If a file named `Makefile.am` exists, run `automake`:\n\n      ```bash\n      autoreconf \u0026\u0026 automake --add-missing\n      ```\n\n   2. 
If a file named `configure` exists, run the configuration script:\n\n      ```bash\n      chmod +x ./configure \u0026\u0026 ./configure --disable-werror\n      ```\n\n      The `--disable-werror` flag prevents warnings from being treated as errors in cases where `-Werror` is specified.\n      \n      If the command fails within 2 seconds, try again without `--disable-werror`.\n\n   3. Run `make`:\n\n      ```bash\n      make --always-make --keep-going -j1\n      ```\n      \n      The `--always-make` flag rebuilds all dependent targets even if they exist. The `--keep-going` flag allows Make to\n      continue for targets if errors occur in non-dependent targets.\n\n      If the command fails within 2 seconds and the output contains `\"Missing separator\"`, try again with `bmake`\n      *(BSD Make)*.\n\n      **Note:** We override certain programs with our \"wrapped\" versions by modifying the `PATH` variable. The\n      wrapped programs are:\n\n      - **GCC:** (`gcc`, `cc`, `clang`) Swallows unnecessary and/or error-prone flags (`-Werror`, `-march`,\n        `-mlittle-endian`), records libraries used (`-l`), overrides the optimization level (`-O0`), adds override flags\n        specified in the arguments, and calls the real GCC. If the real GCC fails, writes the libraries to a predefined\n        path.\n      - **`sudo`:** Does not prompt for the password, but instead just tries to execute the command without privileges.\n      - **`pkg-config`:** Records libraries used, and calls the real `pkg-config`. If it fails (meaning packages cannot\n        be resolved), writes the libraries to a predefined path.\n\n### Collecting and Installing Libraries\n\nMost repositories require linking to external libraries. To collect libraries that are linked to in Makefiles, run the\nscript with the flag `--record-libraries path/to/library_log.txt`. 
Only libraries in commands that failed to execute\n(GCC return code is non-zero) are recorded in the log file.\n\nAfter gathering the library log, run `install_libraries.py path/to/library_log.txt` to resolve libraries to package\nnames (based on `apt-cache`). This step requires actually installing packages, so it's recommended to run it in a Docker\nenvironment:\n```bash\ndocker run --rm \\\n    -v /absolute/path/to/directory/:/usr/src/ \\\n    gcc-custom \\\n    \"install_libraries.py /usr/src/library_log.txt\"\n```\nThis gives a list of packages to install. Add the list of packages to `Dockerfile` (the command that begins with\n`RUN apt-get install -y --no-install-recommends`) and rebuild the image to apply changes.\n\n### Notes on Docker Safety\n\nCompiling random code from GitHub is basically equivalent to running `curl | bash`, and doing so in Docker would be like\n`curl | sudo bash` as Docker (by default) doesn't protect you against kernel panics and fork bombs. The following notes\ndescribe what is done to (partly) ensure the safety of the host machine when compiling code.\n\n1. Never run Docker as root. This means two things: 1) don't use `sudo docker run ...`, and 2) don't execute commands in\n   Docker as the root user (default). The first goal can be achieved by creating a `docker` user group, and the second\n   can be achieved using a special entry-point: create a non-privileged user and use `gosu` to switch to that user and\n   run commands.\n\n   **Caveats:** When creating the non-privileged user, assign the same UID (user ID) or GID (group ID) as the host user,\n   so files created inside the container can be accessed/modified by the host user.\n\n2. Limit the number of processes. This is to prevent things like fork bombs or badly written recursive Makefiles from\n   taking up kernel memory. 
A simple solution is to use `ulimit -u \u003cnprocs\u003e` to set the maximum allowed number of\n   processes, but such limits are on a per-user basis instead of a per-container or per-process-tree basis.\n\n   What we can do is: for each container we spawn, create a user that has the same GID as the host user, but with a\n   distinct UID, and call `ulimit` for that user. This serves as a workaround for per-container limits.\n   \n   Don't forget to `chmod g+w` for files that need to be accessed from host.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuzecong%2Fghcc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuzecong%2Fghcc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuzecong%2Fghcc/lists"}