{"id":19589714,"url":"https://github.com/jackaduma/threatreportextractor","last_synced_at":"2026-03-05T18:38:22.636Z","repository":{"id":41366610,"uuid":"409034195","full_name":"jackaduma/ThreatReportExtractor","owner":"jackaduma","description":"Extracting Attack Behavior from Threat Reports","archived":false,"fork":false,"pushed_at":"2023-04-28T22:16:45.000Z","size":22660,"stargazers_count":53,"open_issues_count":1,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2023-08-02T20:13:14.886Z","etag":null,"topics":["advanced-persistent-threat","cyber-threat-intelligence","cybersecurity","deep-learning","deeplearning","graph","graph-algorithms","machine-learning","machine-learning-algorithms","natural-language-processing","nlp","nlp-machine-learning","nlp-parsing","security","threat-analysis","threat-intelligence"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jackaduma.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-22T02:05:11.000Z","updated_at":"2023-07-31T16:02:20.000Z","dependencies_parsed_at":"2022-07-19T00:47:00.293Z","dependency_job_id":null,"html_url":"https://github.com/jackaduma/ThreatReportExtractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackaduma%2FThreatReportExtractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackaduma%2FThreatReportExtractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackaduma%2FThreatReportE
xtractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackaduma%2FThreatReportExtractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jackaduma","download_url":"https://codeload.github.com/jackaduma/ThreatReportExtractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224070410,"owners_count":17250652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["advanced-persistent-threat","cyber-threat-intelligence","cybersecurity","deep-learning","deeplearning","graph","graph-algorithms","machine-learning","machine-learning-algorithms","natural-language-processing","nlp","nlp-machine-learning","nlp-parsing","security","threat-analysis","threat-intelligence"],"created_at":"2024-11-11T08:20:23.395Z","updated_at":"2026-03-05T18:38:17.599Z","avatar_url":"https://github.com/jackaduma.png","language":"Python","funding_links":["https://paypal.me/jackaduma?locale.x=zh_XC"],"categories":[],"sub_categories":[],"readme":"\u003c!--\n * @Author: Kun\n * @Date: 2021-09-16 11:11:28\n * @LastEditTime: 2023-04-29 06:15:22\n * @LastEditors: Kun\n * @Description: \n * @FilePath: /my_open_projects/ThreatReportExtractor/README.md\n--\u003e\n\n# **ThreatReportExtractor**\n\n[![standard-readme 
compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/jackaduma/ThreatReportExtractor)\n[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://paypal.me/jackaduma?locale.x=zh_XC)\n\n[**中文说明**](./README.zh-CN.md) | [**English**](./README.md)\n\n------\n\nThis code is an implementation of the paper [EXTRACTOR: Extracting Attack Behavior from Threat Reports](https://arxiv.org/abs/2104.08618), a nice work on **threat report extraction** in Cyber Threat Intelligence (CTI).\n\n- [x] Environment\n  - [x] NLP submodules\n  - [x] NLP pretrained models\n  - [x] Dependent libraries\n- [x] Usage\n  - [x] Example\n- [x] Demo\n- [x] Reference\n\n------\n\n## **EXTRACTOR: Extracting Attack Behavior from Threat Reports**\n\n### [**Paper Page**](https://arxiv.org/abs/2104.08618)\n\n\nThe knowledge on attacks contained in **Cyber Threat Intelligence (CTI)** reports is very important to effectively identify and quickly respond to cyber threats. However, this knowledge is often embedded in large amounts of text, and therefore difficult to use effectively. To address this challenge, we propose a novel approach and tool called EXTRACTOR that allows precise automatic extraction of concise attack behaviors from CTI reports. EXTRACTOR makes no strong assumptions about the text and is capable of extracting attack behaviors as provenance graphs from unstructured text. We evaluate EXTRACTOR using real-world incident reports from various sources as well as reports of DARPA adversarial engagements that involve several attack campaigns on various OS platforms of Windows, Linux, and FreeBSD. 
Our evaluation results show that EXTRACTOR can extract concise provenance graphs from CTI reports and show that these graphs can successfully be used by cyber-analytics tools in threat-hunting.\n\n\n------\n## **Environment**\n\nThis code supports Python 3 only; Python 2 is not supported.\n\n### **spacy**\n\nDownload the model for spacy:\n\n```\npython -m spacy download en_core_web_lg\n```\n\n### **nltk**\n\nDownload the nltk tagger; it is needed when the `crf` parameter is set to false:\n\n```python\nimport nltk\nnltk.download('averaged_perceptron_tagger')\n```\n------\n\n## **submodules**\n\n```bash\ncd $PROJECT_HOME\ngit submodule init\ngit submodule update\n```\n\n### **allennlp**\n\nDownload the pretrained model for allennlp:\n\n```bash\nwget -c -t 0 https://s3-us-west-2.amazonaws.com/allennlp/models/srl-model-2018.05.25.tar.gz\nmv srl-model-2018.05.25.tar.gz srl-model.tar.gz  # in current dir\n```\n\n------\n\n## **graphviz**\n\n### installation\n\nLinux:\n\n```\n# Ubuntu / Debian\nsudo apt install graphviz\n# Fedora\nsudo yum install graphviz\n# Red Hat / CentOS (stable and development RPMs are available but out of date)\nsudo yum install graphviz\n```\nMac:\n```\n# MacPorts\nsudo port install graphviz\n# or Homebrew\nbrew install graphviz\n```\n\n\n### generate an image file with graphviz\n\n```bash\ndot xxx.dot -T png -o xxx.png\n```\n\n## **Usage**\n\nRun EXTRACTOR with\n```\npython3 main.py [-h] [--asterisk ASTERISK] [--crf CRF] [--rmdup RMDUP] [--elip ELIP] [--gname GNAME] [--input_file INPUT_FILE]\n```\n\nDepending on the usage, each argument provides a different representation of the attack behavior.\n\n`[--asterisk true]` creates an abstraction by replacing anything that is not recognized as an IOC/system entity with a wildcard. This representation can then be searched for within the audit logs.\n\n`[--crf true/false]` activates or deactivates the co-referencing module.\n\n`[--rmdup true/false]` enables removal of duplicate nodes and edges.
\n\n`[--elip true/false]` chooses whether to replace ellipsis (omitted) subjects with the surrounding subject.\n\n`[--input_file path/filename.txt]` passes the input text file to the application.\n\n`[--gname graph_name]` specifies the name of the output graph (two files will be created, e.g., graph.pdf and graph.dot).\n\n\n## **Example**\n```\npython main.py --asterisk false --crf false --rmdup false --input_file input.txt\n```\n\n```\npython main.py --asterisk false --crf true --rmdup false --input_file input.txt\n```\n\n```\npython main.py --asterisk true --crf true --rmdup true --elip true --input_file input.txt --gname mygraph\n```\n\n```\npython main.py --asterisk true --crf false --rmdup true --elip true --input_file input.txt --gname mygraph\n```\n\n------\n\n## **Reference**\n1. **EXTRACTOR: Extracting Attack Behavior from Threat Reports**. [Paper](https://arxiv.org/abs/2104.08618)\n2. EXTRACTOR. [Code](https://github.com/ksatvat/EXTRACTOR)\n3. Passive/Active sentence Transformer. 
[Code](https://github.com/DanManN/pass2act)\n\n------\n## **Star-History**\n\n![star-history](https://api.star-history.com/svg?repos=jackaduma/ThreatReportExtractor\u0026type=Date \"star-history\")\n\n------\n\n## Donation\nIf this project helps you save development time, you can buy me a cup of coffee :)\n\nAliPay (支付宝)\n\u003cdiv align=\"center\"\u003e\n\t\u003cimg src=\"./misc/ali_pay.png\" alt=\"ali_pay\" width=\"400\" /\u003e\n\u003c/div\u003e\n\nWeChat Pay (微信)\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./misc/wechat_pay.png\" alt=\"wechat_pay\" width=\"400\" /\u003e\n\u003c/div\u003e\n\n------\n\n## **License**\n\n[GPL-3.0](LICENSE) © Kun\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackaduma%2Fthreatreportextractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjackaduma%2Fthreatreportextractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackaduma%2Fthreatreportextractor/lists"}