{"id":15116156,"url":"https://github.com/user1342/AutoCorpus","last_synced_at":"2025-09-27T21:32:37.802Z","repository":{"id":232805914,"uuid":"785263606","full_name":"user1342/AutoCorpus","owner":"user1342","description":"AutoCorpus is a tool backed by a large language model (LLM) for automatically generating corpus files for fuzzing.","archived":false,"fork":false,"pushed_at":"2024-04-23T00:01:04.000Z","size":399,"stargazers_count":50,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-09T03:08:16.985Z","etag":null,"topics":["corpus-generator","dynamic-analysis","fuzzing","large-language-models","llm","vulnerability-research"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/user1342.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-11T14:31:43.000Z","updated_at":"2025-01-04T11:09:54.000Z","dependencies_parsed_at":"2024-04-22T23:28:40.841Z","dependency_job_id":"c5600808-ede6-488e-b437-fe8d0e57c25f","html_url":"https://github.com/user1342/AutoCorpus","commit_stats":null,"previous_names":["user1342/autocorpus"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/user1342%2FAutoCorpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/user1342%2FAutoCorpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/user1342%2FAutoCorpus/releases","manifests_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/user1342%2FAutoCorpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/user1342","download_url":"https://codeload.github.com/user1342/AutoCorpus/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234461897,"owners_count":18837187,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus-generator","dynamic-analysis","fuzzing","large-language-models","llm","vulnerability-research"],"created_at":"2024-09-26T01:44:11.518Z","updated_at":"2025-09-27T21:32:32.473Z","avatar_url":"https://github.com/user1342.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg width=100% src=\"logo.gif\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e 🤖 Automated Fuzzing Corpus Generation 📁 \u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n![GitHub contributors](https://img.shields.io/github/contributors/user1342/AutoCorpus)\n![GitHub Repo stars](https://img.shields.io/github/stars/user1342/AutoCorpus?style=social)\n![GitHub watchers](https://img.shields.io/github/watchers/user1342/AutoCorpus?style=social)\n![GitHub last commit](https://img.shields.io/github/last-commit/user1342/AutoCorpus)\n\u003cbr\u003e\n\n\u003c/div\u003e\n\nAutoCorpus is a tool backed by a large language model (LLM) for automatically generating corpus files for fuzzing. 
\n\nAutoCorpus works best when generating corpus files in human-readable text formats, such as JSON, XML, or other config files. \n\n# ⚙️ Setup\n\n## System Requirements\nAutoCorpus uses the Mistral-7B-Instruct-v0.2 model and, where possible, offloads processing to your system's GPU. It is recommended to run AutoCorpus on a machine with at least 16GB of RAM and a dedicated Nvidia GPU with at least 4GB of memory. It can run on lower-spec machines, albeit significantly more slowly.\n\n**AutoCorpus has been tested on Windows 11; however, it should be compatible with Unix and other systems.**\n\n## Dependencies\nAutoCorpus requires **Nvidia CUDA** for enhanced LLM performance. Follow the steps below:\n- Ensure your Nvidia drivers are up to date: [Nvidia Drivers](https://www.nvidia.com/en-us/geforce/drivers/)\n- Install the PyTorch build that matches your system from the [PyTorch install page](https://pytorch.org/get-started/locally/)\n- Validate the CUDA installation by running the following command and confirming that it prints a tensor rather than an error: ```python -c \"import torch; print(torch.rand(2,3).cuda())\"```\n\nPython dependencies can be found in the `requirements.txt` file:\n\n```\npip install -r requirements.txt\n```\n\nAutoCorpus can then be installed using the ```./setup.py``` script as below:\n\n```\npython -m pip install .\n```\n\n## Running\n\nAutoCorpus can generate corpus files in three scenarios:\n\n### A Single Prompt\nFor example, to ask AutoCorpus to generate an XML file:\n```\nAutoCorpus.exe -o \"out\" -p \"xml file\"\n```\n### Existing Corpus File(s)\nAutoCorpus can base new corpus files on existing ones.\n```\nAutoCorpus.exe -i \"input_folder\" -o \"out\"\n```\n### Both Existing Corpus Files and a Prompt\nGeneration can also use both an existing corpus and a prompt.\n```\nAutoCorpus.exe -i \"input_folder\" -o \"out\" -p \"xml file\"\n```\n\n### Usage\n```\nusage: AutoCorpus [-h] [--input_folder INPUT_FOLDER] [--output_folder OUTPUT_FOLDER] 
[--number_of_corpus_files NUMBER_OF_CORPUS_FILES] [--prompt PROMPT]\n                  [--size SIZE] [--verbose]\n\nA tool for automatically generating initial fuzzing input corpus test cases\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --input_folder INPUT_FOLDER, -i INPUT_FOLDER\n                        The input folder to base generated corpus files off. If no prompt is given, the folder needs at least 1 file.\n  --output_folder OUTPUT_FOLDER, -o OUTPUT_FOLDER\n                        The folder to save generated corpus files to (will default to input folder).\n  --number_of_corpus_files NUMBER_OF_CORPUS_FILES, -n NUMBER_OF_CORPUS_FILES\n                        The number of corpus files to generate\n  --prompt PROMPT, -p PROMPT\n                        A sentence defining what the corpus files are for. This helps steer generation.\n  --size SIZE, -s SIZE  Max size of tokens created by the LLM\n  --verbose, -v         Provides verbose outputs\n```\n\n### Examples\n\n#### JSON Corpus Generation\nGenerates 5 corpus files based solely on the prompt ```complex json files with varying data```.\n```\nAutoCorpus.exe -o \"out\" -p \"complex json files with varying data\"\n```\n\n```\n[{\"id\": 1, \"name\": \"John Doe\", \"age\": 30, \"city\": \"New York\"},\n\n{\"id\": 2, \"name\": \"Jane Smith\", \"age\": 28, \"city\": \"Los Angeles\"},\n\n{\"id\": 3, \"name\": \"Mike Johnson\", \"age\": 35, \"city\": \"Chicago\"},\n\n{\"id\": 4, \"name\": \"Emma Watson\", \"age\": 27, \"city\": \"London\"}]\n```\n\n#### AWK Config Corpus Generation\n\nCreates AWK configs based on the existing example AWK configs in the ``` ..\\corpus\\awk\\``` directory, along with the prompt ```config file for busybox awk```.\n```\nAutoCorpus.exe -i ..\\corpus\\awk\\ -p \"config file for busybox awk\" -n 10 -s 700\n```\n```\n```bash\n#!/usr/bin/awk -f\n\nBEGIN {\n  FS=\"\\t\"\n  if (ARGC != 3) {\n    print \"Usage: awk-script.awk \u003cfile\u003e \u003cfield\u003e 
\u003cdelimiter\u003e\"\n    exit 1\n  }\n  print \"Input file:\", ARGV[1]\n  print \"Field to print:\", ARGC[2]\n  print \"Delimiter:\", ARGC[3]\n\n  FILENAME = ARGV[1]\n  if (open(FILENAME, \"r\")) {\n    while ((getline line \u003c FILENAME) \u003e 0) {\n      gsub(/[[:space:]]+/, \"\", line) # remove whitespaces\n      split(line, fields, FS)\n      for (i = 1; i \u003c= NF; i++) {\n        if (length(fields[i]) \u003e 0 \u0026\u0026 fields[i] ~ /\\Q\"[\\\"']\"ARGV[2]\"\\Q/ \u0026\u0026 i != ARGC[3]) {\n          next\n        }\n        if (i == ARGC[3] || i == NF) {\n          print fields[i]\n          break\n        }\n      }\n    }\n    close(FILENAME)\n  } else {\n    print \"Error opening file:\", FILENAME\n    exit 1\n  }\n}```\n```\n\n# 🤖 Mistral-7B-Instruct-v0.2\nBehind the scenes AutoCorpus uses the ```Mistral-7B-Instruct-v0.2``` model from The Mistral AI Team - see [here](https://arxiv.org/abs/2310.06825). The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.2. More can be found on the model [here!](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).\n- 7.24B params\n- Tensor type: BF16\n- 32k context window (vs 8k context in v0.1)\n- Rope-theta = 1e6\n- No Sliding-Window Attention\n\n# 🙏 Contributions\nAutoCorpus is an open-source project and welcomes contributions from the community. 
If you would like to contribute to AutoCorpus, please follow these guidelines:\n\n- Fork the repository to your own GitHub account.\n- Create a new branch with a descriptive name for your contribution.\n- Make your changes and test them thoroughly.\n- Submit a pull request to the main repository, including a detailed description of your changes and any relevant documentation.\n- Wait for feedback from the maintainers and address any comments or suggestions.\n- Once your changes have been reviewed and approved, they will be merged into the main repository.\n\n# ⚖️ Code of Conduct\nAutoCorpus follows the Contributor Covenant Code of Conduct. Please review and adhere to this code of conduct when contributing to AutoCorpus.\n\n# 🐛 Bug Reports and Feature Requests\nIf you encounter a bug or have a suggestion for a new feature, please open an issue in the GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the proposed feature. Your feedback is valuable and will help improve AutoCorpus for everyone.\n\n# 📜 License\n\n[GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuser1342%2FAutoCorpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuser1342%2FAutoCorpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuser1342%2FAutoCorpus/lists"}