{"id":16271184,"url":"https://github.com/morpheuslord/cve-llm_dataset","last_synced_at":"2025-05-09T01:45:38.819Z","repository":{"id":190172411,"uuid":"678698822","full_name":"morpheuslord/CVE-llm_dataset","owner":"morpheuslord","description":"This is a dataset intended to train a LLM model for a completely CVE focused input and output.","archived":false,"fork":false,"pushed_at":"2024-11-25T06:28:14.000Z","size":186349,"stargazers_count":59,"open_issues_count":0,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-09T01:45:34.366Z","etag":null,"topics":["ai-dataset","ai-finetune","ai-training","llama2","openai","openai-chatgpt","textgeneration"],"latest_commit_sha":null,"homepage":"https://huggingface.co/datasets/morpheuslord/cve-llm-training","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/morpheuslord.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-08-15T06:48:25.000Z","updated_at":"2025-04-19T13:44:27.000Z","dependencies_parsed_at":"2023-11-18T08:38:33.767Z","dependency_job_id":"ee50f848-b593-4f3d-8873-b8efe392d15a","html_url":"https://github.com/morpheuslord/CVE-llm_dataset","commit_stats":{"total_commits":18,"total_committers":3,"mean_commits":6.0,"dds":0.2222222222222222,"last_synced_commit":"4dabd5f0e706b66213a64e4ddb746a06c2269f64"},"previous_names":["morpheuslord/cve-llm_dataset"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morpheuslord%2FCVE-llm_dat
aset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morpheuslord%2FCVE-llm_dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morpheuslord%2FCVE-llm_dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morpheuslord%2FCVE-llm_dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/morpheuslord","download_url":"https://codeload.github.com/morpheuslord/CVE-llm_dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253176443,"owners_count":21866142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-dataset","ai-finetune","ai-training","llama2","openai","openai-chatgpt","textgeneration"],"created_at":"2024-10-10T18:12:48.239Z","updated_at":"2025-05-09T01:45:38.799Z","avatar_url":"https://github.com/morpheuslord.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CVE-llm_dataset\r\nThis dataset is intended to train an LLM on entirely CVE-focused input and output.\r\n\r\n## Data extraction:\r\nFor the data extraction, I first downloaded the CVE database from the NVD lists and then loaded it using `cve_dataset_2.py` and `cve_dataset.py`. The two scripts produce different datasets: one for llama and the other for OpenAI GPT.\r\n\r\nThe CVE JSON files are organized in this layout:\r\n```\r\ncves:\r\n|\r\n├─1999\r\n|   ├─0xxx\r\n|   |   ├─CVE-1999-0001.json\r\n|   |   └─CVE-1999-0999.json\r\n|   └─1xxx\r\n|      ├─CVE-1999-1000.json\r\n|      
└─CVE-1999-1598.json\r\n└─2023\r\n\r\n``` \r\nThe programs traverse these folders, extract the data from the files, and arrange it into usable formats for the fine-tuning process.\r\n\r\n## llama2 Model dataset:\r\nThe llama2 fine-tune dataset follows this format:\r\n```\r\n    {\r\n        \"instruction\": \"Explain CVE-1999-0001\",\r\n        \"input\": \"Explain the vulnerability: CVE-1999-0001\",\r\n        \"output\": \"ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.\\nAffected Products: n/a\\nReferences: [{'tags': ['x_refsource_CONFIRM'], 'url': 'http://www.openbsd.org/errata23.html#tcpfix'}, {'name': '5707', 'tags': ['vdb-entry', 'x_refsource_OSVDB'], 'url': 'http://www.osvdb.org/5707'}]\\nCVE State: PUBLISHED\"\r\n    }\r\n```\r\nThe instruction is what we tell the AI to do with the data provided. For example, we can instruct the AI `To take in user input analyze it and then based on what he asks returns an answer`. This is also where we can add a `role` or a `persona` to the AI.\r\n\r\nThe input is the user's input: the main query or data that the AI must process. 
This is a crucial piece of information that the AI will process in order to provide an output.\r\n\r\nThe output is the answer we want the AI to give to the question asked, written in the format we define.\r\n\r\n## OpenAI fine-tune dataset:\r\nThe OpenAI fine-tune format differs considerably from the Llama format: it requires us to define roles and messages, which lets us provide more detail and can improve answer accuracy.\r\n\r\n```\r\n    {\r\n        \"messages\": [\r\n            {\r\n                \"role\": \"system\",\r\n                \"content\": \"CVE Vulnerability Information\"\r\n            },\r\n            {\r\n                \"role\": \"user\",\r\n                \"content\": \"Explain the vulnerability: CVE-1999-0001\"\r\n            },\r\n            {\r\n                \"role\": \"assistant\",\r\n                \"content\": \"ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.\\nAffected Products: n/a\\nReferences: [{'tags': ['x_refsource_CONFIRM'], 'url': 'http://www.openbsd.org/errata23.html#tcpfix'}, {'name': '5707', 'tags': ['vdb-entry', 'x_refsource_OSVDB'], 'url': 'http://www.osvdb.org/5707'}]\\nCVE State: PUBLISHED\"\r\n            }\r\n        ]\r\n    }\r\n```\r\nIn this dataset we define the AI and user roles, along with the AI's output for the user's content. 
The core principle is the same as in the llama dataset or any other text-generation dataset.\r\n\r\n## Trained model on this dataset.\r\nSomeone has trained a model on this dataset; here is the [LINK](https://huggingface.co/basit0513/LLM), but its accuracy was not great, so I modified the dataset to be more robust and more useful to others.\r\n\r\n## OpenAI price calculation:\r\nThe `price-openai.py` file calculates the dataset's total token count and uses it to estimate the overall price of training a custom GPT model with OpenAI. `tokencount.py` is similar, but it only counts the total number of tokens in the dataset.\r\n\r\n## Cite this\r\n```\r\n@misc {chiranjeevi_g_2024,\r\n\tauthor       = { {Chiranjeevi G} },\r\n\ttitle        = { cve-llm-training (Revision b224515) },\r\n\tyear         = 2024,\r\n\turl          = { https://huggingface.co/datasets/morpheuslord/cve-llm-training },\r\n\tdoi          = { 10.57967/hf/3627 },\r\n\tpublisher    = { Hugging Face }\r\n}\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmorpheuslord%2Fcve-llm_dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmorpheuslord%2Fcve-llm_dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmorpheuslord%2Fcve-llm_dataset/lists"}