{"id":13540400,"url":"https://github.com/InQuest/iocextract","last_synced_at":"2025-04-02T07:31:03.099Z","repository":{"id":37664794,"uuid":"129942209","full_name":"InQuest/iocextract","owner":"InQuest","description":"Defanged Indicator of Compromise (IOC) Extractor.","archived":false,"fork":false,"pushed_at":"2023-12-07T21:05:36.000Z","size":845,"stargazers_count":485,"open_issues_count":2,"forks_count":88,"subscribers_count":28,"default_branch":"master","last_synced_at":"2024-04-16T04:18:40.778Z","etag":null,"topics":["base64","decoding","defang","dfir","indicators-of-compromise","ioc","ioc-extractor","library","malware-research","osint","threat-intelligence","threat-sharing","threatintel","yara"],"latest_commit_sha":null,"homepage":"https://inquest.readthedocs.io/projects/iocextract/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InQuest.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-04-17T17:37:18.000Z","updated_at":"2024-04-15T20:24:59.000Z","dependencies_parsed_at":"2024-01-07T09:40:28.948Z","dependency_job_id":"9d2b84eb-eded-4da7-8bd5-da14c3892d1b","html_url":"https://github.com/InQuest/iocextract","commit_stats":null,"previous_names":["inquest/python-iocextract"],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InQuest%2Fiocextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InQuest%2Fiocextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InQuest%2Fiocextract/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InQuest%2Fiocextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InQuest","download_url":"https://codeload.github.com/InQuest/iocextract/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246774366,"owners_count":20831528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["base64","decoding","defang","dfir","indicators-of-compromise","ioc","ioc-extractor","library","malware-research","osint","threat-intelligence","threat-sharing","threatintel","yara"],"created_at":"2024-08-01T09:01:48.735Z","updated_at":"2025-04-02T07:30:58.085Z","avatar_url":"https://github.com/InQuest.png","language":"Python","readme":"iocextract\n==========\n\n![Developed by InQuest](https://inquest.net/images/inquest-badge.svg)\n![Build Status](https://github.com/InQuest/iocextract/workflows/iocextract-build/badge.svg)\n[![Documentation Status](https://readthedocs.org/projects/iocextract/badge/?version=latest)](https://inquest.readthedocs.io/projects/iocextract/en/latest/?badge=latest)\n![PyPI Version](https://img.shields.io/pypi/v/iocextract.svg)\n\n[**Indicator of Compromise**](https://en.wikipedia.org/wiki/Indicator_of_compromise) (IOC) extractor for some of the most commonly ingested artifacts.\n\nTable of contents\n=================\n\n* [Overview](#overview)\n    * [The Problem](#the-problem)\n    * [Our Solution](#our-solution)\n    * [Example Use Case](#example-use-case)\n* [Installation](#installation)\n* [Usage](#usage)\n    * 
[Library](#library)\n    * [Command Line Interface](#command-line-interface)\n* [Helpful Information](#helpful-information)\n    * [FAQ](#faq)\n    * [More Details](#more-details)\n    * [Custom Regex](#custom-regex)\n    * [Related Projects](#related-projects)\n    * [Contributing](#contributing)\n\nOverview\n========\n\nThe `iocextract` package is a library and command line interface (CLI) for extracting URLs, IP addresses, MD5/SHA hashes, email addresses, and YARA rules from text corpora. It lets you extract encoded and \"defanged\" IOCs and optionally decode or refang them.\n\nThe Problem\n-----------\n\nIt is common practice for malware analysts or endpoint software to \"defang\" IOCs such as URLs and IP addresses, in order to prevent accidental exposure to live malicious content. Being able to extract and aggregate these IOCs is often valuable for analysts. Unfortunately, existing \"IOC extraction\" tools often pass right by them, as they are not caught by standard regex.\n\nFor example, consider the simple defanging technique of surrounding periods with brackets:\n```\n127[.]0[.]0[.]1\n```\n\nExisting tools that use a simple IP address regex will ignore this IOC entirely.\n\nOur Solution\n------------\n\nBy combining specially crafted regex with some custom post-processing, we are able to both detect and deobfuscate \"defanged\" IOCs. 
This saves time and effort for the analyst, who might otherwise have to manually find and convert IOCs into machine-readable format.\n\nExample Use Case\n-----------------\n\nMany Twitter users post C2s or other valuable IOC information with defanged URLs.\nFor example, [this tweet from @InQuest](https://twitter.com/InQuest/status/969469856931287041):\n\n```\nRecommended reading and great work from @unit42_intel:\nhttps://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/ ...\nInQuest customers have had detection for threats delivered from hotfixmsupload[.]com\nsince 6/3/2017 and cdnverify[.]net since 2/1/18.\n```\n\nIf we run this through the extractor, we can easily pull out the URLs:\n\n```\nhttps://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/\nhotfixmsupload[.]com\ncdnverify[.]net\n```\n\nPassing in `refang=True` at extraction time would remove the obfuscation, but since these are real IOCs, let's leave them defanged in our documentation.\n\nInstallation\n============\n\nYou may need to install the Python development headers in order to install the `regex` dependency. 
On Ubuntu/Debian-based systems, try:\n\n```bash\nsudo apt-get install python-dev\n```\n\nThen install `iocextract` from pip:\n\n```bash\npip install iocextract\n```\n\nIf you have problems installing on Windows, try installing `regex` directly by downloading the [appropriate wheel from PyPI](https://pypi.org/project/regex/#files) and installing via `pip`:\n\n```bash\npip install regex-2018.06.21-cp27-none-win_amd64.whl\n```\n\nUsage\n=====\n\nLibrary\n-------\n\nTry extracting some defanged URLs:\n\n```python\nimport iocextract\n\ncontent = \\\n\"\"\"\nI really love example[.]com!\nAll the bots are on hxxp://example.com/bad/url these days.\nC2: tcp://example[.]com:8989/bad\n\"\"\"\n\nfor url in iocextract.extract_urls(content):\n    print(url)\n\n    # Output\n\n    # hxxp://example.com/bad/url\n    # tcp://example[.]com:8989/bad\n    # example[.]com\n    # tcp://example[.]com:8989/bad\n```\n\nNOTE: Some URLs may show up twice if they are caught by multiple regexes.\n\nIf you want, you can also \"refang\", or remove common obfuscation methods from IOCs:\n\n```python\nimport iocextract\n\nfor url in iocextract.extract_urls(content, refang=True):\n    print(url)\n\n    # Output\n\n    # http://example.com/bad/url\n    # http://example.com:8989/bad\n    # http://example.com\n    # http://example.com:8989/bad\n```\n\nIf you don't want to defang the extracted IOCs at all during extraction, you can disable this as well:\n\n```python\nimport iocextract\n\ncontent = \\\n\"\"\"\nhttp://example.com/bad/url\nhttp://example.com:8989/bad\nhttp://example.com\nhttp://example.com:8989/bad\n\"\"\"\n\nfor url in iocextract.extract_urls(content, defang=False):\n    print(url)\n\n    # Output\n\n    # http://example.com/bad/url\n    # http://example.com:8989/bad\n    # http://example.com\n    # http://example.com:8989/bad\n```\n\nAll `extract_*` functions in this library return iterators, not lists. 
The benefit of this behavior is that `iocextract` can process extremely large inputs, with a very low overhead. However, if for some reason you need to iterate over the IOCs more than once, you will have to save the results as a list:\n\n```python\nimport iocextract\n\ncontent = \\\n\"\"\"\nI really love example[.]com!\nAll the bots are on hxxp://example.com/bad/url these days.\nC2: tcp://example[.]com:8989/bad\n\"\"\"\n\nprint(list(iocextract.extract_urls(content)))\n# ['hxxp://example.com/bad/url', 'tcp://example[.]com:8989/bad', 'example[.]com', 'tcp://example[.]com:8989/bad']\n```\n\nCommand Line Interface\n----------------------\n\nA command-line tool is also included:\n\n```bash\n$ iocextract -h\n    usage: iocextract [-h] [--input INPUT] [--output OUTPUT] [--extract-emails]\n                  [--extract-ips] [--extract-ipv4s] [--extract-ipv6s]\n                  [--extract-urls] [--extract-yara-rules] [--extract-hashes]\n                  [--custom-regex REGEX_FILE] [--refang] [--strip-urls]\n                  [--wide]\n\n    Advanced Indicator of Compromise (IOC) extractor. If no arguments are\n    specified, the default behavior is to extract all IOCs.\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      --input INPUT         default: stdin\n      --output OUTPUT       default: stdout\n      --extract-emails\n      --extract-ips\n      --extract-ipv4s\n      --extract-ipv6s\n      --extract-urls\n      --extract-yara-rules\n      --extract-hashes\n      --custom-regex REGEX_FILE file with custom regex strings, one per line, with one capture group each\n      --refang              default: no\n      --strip-urls          remove possible garbage from the end of urls. default: no\n      --wide                preprocess input to allow wide-encoded character matches. 
default: no\n```\n\nNOTE: Only URLs, emails, and IPv4 addresses can be \"refanged\".\n\nHelpful Information\n===================\n\nFAQ\n---\n\nAre you...\n\n\u003e Q. Extracting possibly-defanged IOCs from plain text, like the contents of tweets or blog posts?\n\u003e\u003e A. Yes! This is exactly what iocextract was designed for, and where it performs best. Want to go a step farther and automate extraction and storage? Check out [ThreatIngestor](https://github.com/InQuest/ThreatIngestor).\n\n\u003e Q. Extracting URLs that have been hex or base64 encoded?\n\u003e\u003e A. Yes, but the CLI might not give you the best results. Try writing a Python script and calling `iocextract.extract_encoded_urls` directly.\n\nNote: You will most likely end up with extra garbage at the end of URLs.\n\n\u003e Q. Extracting IOCs that have not been defanged, from HTML/XML/RTF?\n\u003e\u003e A. Maybe, but you should consider using the `--strip-urls` CLI flag (or the `strip=True` parameter in the library), and you may still get some extra garbage in your output. If you're extracting from HTML, consider using something like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to first isolate the text content, and then pass that to iocextract, [like this](https://gist.github.com/rshipp/d399491305c5d293357a800d5a51b0aa).\n\n\u003e Q. Extracting IOCs that have not been defanged, from binary data like executables, or very large inputs?\n\u003e\u003e A. There is a very simplistic version of this available when running as a library, but it requires the `defang=False` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. 
If you're unable to collect the information you need, consider using something like [Cacador](https://github.com/sroberts/cacador) instead.\n\nMore Details\n------------\n\nThis library currently supports the following IOCs:\n\n* IP Addresses\n    * IPv4 fully supported\n    * IPv6 partially supported\n* URLs\n    * With protocol specifier: http, https, tcp, udp, ftp, sftp, ftps\n    * With `[.]` anchor, even with no protocol specifier\n    * IPv4 and IPv6 (RFC2732) URLs are supported\n    * Hex-encoded URLs with protocol specifier: http, https, ftp\n    * URL-encoded URLs with protocol specifier: http, https, ftp, ftps, sftp\n    * Base64-encoded URLs with protocol specifier: http, https, ftp\n* Emails\n    * Partially supported, anchoring on `@` or `at`\n* YARA rules\n    * With imports, includes, and comments\n* Hashes\n    * MD5\n    * SHA1\n    * SHA256\n    * SHA512\n* Telephone numbers\n* Custom regex\n    * With exactly one capture group\n\nFor IPv4 addresses, the following defang techniques are supported:\n\n| Technique       | Defanged      | Refanged  |\n|-----------------|---------------|-----------|\n| `.` -\u003e `[.]`    | 1[.]1[.]1[.]1 | 1.1.1.1   |\n| `.` -\u003e `(.)`    | 1(.)1(.)1(.)1 | 1.1.1.1   |\n| `.` -\u003e `\\.`     | 1\\\\.1\\\\.1\\\\.1  | 1.1.1.1  |\n| Partial         | 1[.1[.1.]1    | 1.1.1.1   |\n| Any combination | 1\\.)1[.1.)1   | 1.1.1.1   |\n\nFor email addresses, the following defang techniques are supported:\n\n| Technique       | Defanged           | Refanged       |\n|-----------------|--------------------|----------------|\n| `.` -\u003e `[.]`    | me@example[.]com   | me@example.com |\n| `.` -\u003e `(.)`    | me@example(.)com   | me@example.com |\n| `.` -\u003e `{.}`    | me@example{.}com   | me@example.com |\n| `.` -\u003e `_dot_`  | me@example dot com | me@example.com |\n| `@` -\u003e `[@]`    | me[@]example.com   | me@example.com |\n| `@` -\u003e `(@)`    | me(@)example.com   | me@example.com |\n| `@` -\u003e `{@}`    | 
me{@}example.com   | me@example.com |\n| `@` -\u003e `_at_`   | me at example.com  | me@example.com |\n| Partial         | me@} example[.com  | me@example.com |\n| Added spaces    | me@example [.] com | me@example.com |\n| Any combination | me @example [.)com | me@example.com |\n\nFor URLs, the following defang techniques are supported:\n\n| Technique       | Defanged                                           | Refanged                  |\n|-----------------|----------------------------------------------------|---------------------------|\n| `.` -\u003e `[.]`    | `example[.]com/path`                               | `http://example.com/path` |\n| `.` -\u003e `(.)`    | `example(.)com/path`                               | `http://example.com/path` |\n| `.` -\u003e `\\.`     | `example\\.com/path`                                | `http://example.com/path` |\n| Partial         | `http://example[.com/path`                         | `http://example.com/path` |\n| `/` -\u003e `[/]`    | `http://example.com[/]path`                        | `http://example.com/path` |\n| [Cisco ESA](https://www.cisco.com/c/en/us/support/docs/security/email-security-appliance/118775-technote-esa-00.html)   | `http:// example .com /path`                       | `http://example.com/path` |\n| `://` -\u003e `__`   | `http__example.com/path`                           | `http://example.com/path` |\n| `://` -\u003e `:\\\\`  | `http:\\\\example.com/path`                          | `http://example.com/path` |\n| `:` -\u003e `[:]`    | `http[:]//example.com/path`                        | `http://example.com/path` |\n| `hxxp`          | `hxxp://example.com/path`                          | `http://example.com/path` |\n| Any combination | `hxxp__ example( .com[/]path`                      | `http://example.com/path` |\n| Hex encoded     | `687474703a2f2f6578616d706c652e636f6d2f70617468`   | `http://example.com/path` |\n| URL encoded     | `http%3A%2F%2fexample%2Ecom%2Fpath`                | 
`http://example.com/path` |\n| Base64 encoded  | `aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK`                 | `http://example.com/path` |\n\nNOTE: The tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via [GitHub Issues](https://github.com/inquest/iocextract/issues).\n\nThe base64 regex was generated with [@deadpixi](https://github.com/deadpixi)'s [base64 regex tool](https://www.erlang-factory.com/upload/presentations/225/ErlangFactorySFBay2010-RobKing.pdf).\n\nCustom Regex\n------------\n\nIf you'd like to use the CLI to extract IOCs using your own custom regex, create a plain text file with one regex string per line, and pass it in with the `--custom-regex` flag. Be sure each regex string includes exactly one [capture group](https://www.regular-expressions.info/brackets.html).\n\nFor example:\n\n```\nhttp://(example\\.com)/\n(?:https|ftp)://(example\\.com)/\n```\n\nThis custom regex file will extract the domain `example.com` from matching URLs. The `(?: )` noncapture group won't be included in matches.\n\nIf you would like to extract the entire match, just put parentheses around your entire regex string, like this:\n\n```\n(https?://.*?\\.com)\n```\n\nIf your regex is invalid, you'll see an error message like this:\n\n```\nError in custom regex: missing ) at position 5\n```\n\nIf your regex does not include a capture group, you'll see an error message like this:\n\n```\nError in custom regex: no such group\n```\n\nAlways use a single capture group when working with custom regex. Here's a quick example:\n\n```python\n[\n    r'(my regex)',  # This yields 'my regex' if the pattern matches\n    r'my (re)gex',  # This yields 're' if the pattern matches\n]\n```\n\nUsing more than a single capture group can cause unexpected results. 
Check out this example:\n\n```python\n[\n    r'my regex',  # This doesn't yield anything\n    r'(my) (re)gex',  # This yields 'my' if the pattern matches\n]\n```\n\nWhy? Because the result will always yield only the first *group* match from each regex.\n\nFor more complicated regex queries, you can combine capture and non-capture groups like so:\n\n```python\n[\n    r'(?:my|your) (re)gex',  # This yields 're' if the pattern matches\n]\n```\n\nYou can now compare the `(?: )` syntax for noncapture groups vs the `( )` syntax for the capture group.\n\n\nRelated Projects\n----------------\n\nIf iocextract doesn't fit your use case, several similar projects exist. Check out the [defang](https://github.com/topics/defang)  and [indicators-of-compromise](https://github.com/topics/indicators-of-compromise) tags on GitHub, as well as:\n\n* [Cacador](https://github.com/sroberts/cacador) in Go\n* [ioc-extractor](https://github.com/ninoseki/ioc-extractor) in JavaScript\n* [Cyobstract](https://github.com/cmu-sei/cyobstract) in Python\n\nIf you'd like to automate IOC extraction, enrichment, export, and more, check out [ThreatIngestor](https://github.com/InQuest/ThreatIngestor).\n\nIf you're working with YARA rules, you may be interested in [plyara](https://github.com/plyara/plyara).\n\nContributing\n------------\n\nIf you have a defang technique that doesn't make it through the extractor, or if you find any bugs, Pull Requests and Issues are always welcome. The library is released under a GPL-2.0 [license](https://github.com/InQuest/iocextract/blob/master/LICENSE).\n\nWho's using iocextract?\n-----------------------\n\n- [InQuest](https://inquest.net)\n- [PacketTotal](https://www.packettotal.com)\n\nAre you using it? Want to see your site listed here? 
Let us know!","funding_links":[],"categories":["\u003ca id=\"f56806b5b229bdf6c118f5fb1092e141\"\u003e\u003c/a\u003e威胁情报"],"sub_categories":["\u003ca id=\"3e10f389acfbd56b79f52ab4765e11bf\"\u003e\u003c/a\u003eIOC"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FInQuest%2Fiocextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FInQuest%2Fiocextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FInQuest%2Fiocextract/lists"}