{"id":15374427,"url":"https://github.com/ivan-sincek/file-scraper","last_synced_at":"2025-04-15T11:32:59.608Z","repository":{"id":206460881,"uuid":"622365300","full_name":"ivan-sincek/file-scraper","owner":"ivan-sincek","description":"Scrape files for sensitive information, and generate an interactive HTML report. Based on Rabin2.","archived":false,"fork":false,"pushed_at":"2025-03-17T09:21:12.000Z","size":929,"stargazers_count":11,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-28T20:51:26.749Z","etag":null,"topics":["bug-bounty","desktop-penetration-testing","ethical-hacking","incident-response","malware-analysis","mobile-penetration-testing","offensive-security","penetration-testing","python","rabin2","radare2","red-team-engagement","scraping","secrets-finder","secrets-management","security","sensitive-data","sensitive-files","strings","web-penetration-testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ivan-sincek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-01T22:22:03.000Z","updated_at":"2025-03-17T09:19:23.000Z","dependencies_parsed_at":"2024-09-17T23:27:50.760Z","dependency_job_id":"12bd9845-691e-4c5d-b0fa-fd756c0a17cf","html_url":"https://github.com/ivan-sincek/file-scraper","commit_stats":{"total_commits":2,"total_committers":1,"mean_commits":2.0,"dds":0.0,"last_synced_commit":"5afd64b2bd071c9d06919555b937c3fb99f73ed6"},"previous_names":["ivan-sincek/file-scraper"],"tags_count":9,"templ
ate":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-sincek%2Ffile-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-sincek%2Ffile-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-sincek%2Ffile-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-sincek%2Ffile-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ivan-sincek","download_url":"https://codeload.github.com/ivan-sincek/file-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249061213,"owners_count":21206470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bug-bounty","desktop-penetration-testing","ethical-hacking","incident-response","malware-analysis","mobile-penetration-testing","offensive-security","penetration-testing","python","rabin2","radare2","red-team-engagement","scraping","secrets-finder","secrets-management","security","sensitive-data","sensitive-files","strings","web-penetration-testing"],"created_at":"2024-10-01T13:58:46.150Z","updated_at":"2025-04-15T11:32:59.602Z","avatar_url":"https://github.com/ivan-sincek.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# File Scraper\n\nScrape files for sensitive information, and generate an interactive HTML report. 
Based on Rabin2.\n\nThis tool is only as good as your [RegEx](https://github.com/ivan-sincek/file-scraper?tab=readme-ov-file#build-the-template--run) skills.\n\nYou can also style your own [report](https://github.com/ivan-sincek/file-scraper/blob/main/src/file_scraper/reports/default.html).\n\nTested on Kali Linux v2024.2 (64-bit).\n\nMade for educational purposes. I hope it will help!\n\n## Table of Contents\n\n* [How to Install](#how-to-install)\n\t* [Install Radare2](#install-radare2)\n\t* [Standard Install](#standard-install)\n\t* [Build and Install From the Source](#build-and-install-from-the-source)\n* [Build the Template \u0026 Run](#build-the-template--run)\n* [Usage](#usage)\n* [Images](#images)\n\n## How to Install\n\n### Install Radare2\n\nOn Kali Linux, run:\n\n```bash\napt-get -y install radare2\n```\n\n---\n\nOn Windows OS, download and unpack [radareorg/radare2](https://github.com/radareorg/radare2/releases), then, add the `bin` directory to Windows `PATH` environment variable.\n\n---\n\nOn macOS, run:\n\n```bash\nbrew install radare2\n```\n\n### Standard Install\n\n```bash\npip3 install --upgrade file-scraper\n```\n\n### Build and Install From the Source\n\n```bash\ngit clone https://github.com/ivan-sincek/file-scraper \u0026\u0026 cd file-scraper\n\npython3 -m pip install --upgrade build\n\npython3 -m build\n\npython3 -m pip install dist/file_scraper-4.6-py3-none-any.whl\n```\n\n## Build the Template \u0026 Run\n\nPrepare a template such as [the default template](https://github.com/ivan-sincek/file-scraper/blob/main/src/file_scraper/templates/default.json):\n\n```json\n{\n   \"Auth.\":{\n      \"query\":\"(?:basic|bearer)\\\\ \",\n      \"ignorecase\":true,\n      \"search\":true\n   },\n   \"Variables\":{\n      
\"query\":\"(?:access|account|admin|auth|card|conf|cookie|cred|customer|email|history|ident|info|jwt|key|kyc|log|otp|pass|pin|priv|refresh|salt|secret|seed|session|setting|sign|token|transaction|transfer|user)[\\\\w\\\\d\\\\-\\\\_]*(?:\\\\\\\"\\\\ *\\\\:|\\\\ *\\\\=[^\\\\=]{1})\",\n      \"ignorecase\":true,\n      \"search\":true\n   },\n   \"Comments\":{\n      \"query\":\"(?:(?\u003c!\\\\:)\\\\/\\\\/|\\\\#).*(?:bug|compatibility|crash|deprecated|fix|issue|legacy|problem|review|security|todo|to do|to-do|to_do|vuln|warning)\",\n      \"ignorecase\":true,\n      \"search\":true\n   },\n   \"Abs. URL\":{\n      \"query\":\"[\\\\w\\\\d\\\\+]*\\\\:\\\\/\\\\/[\\\\w\\\\d\\\\@\\\\-\\\\_\\\\.\\\\:\\\\/\\\\?\\\\\u0026\\\\=\\\\%\\\\#]+\",\n      \"unique\":true,\n      \"collect\":true\n   },\n   \"IPv4\":{\n      \"query\":\"(?:\\b25[0-5]|\\b2[0-4][0-9]|\\b[01]?[0-9][0-9]?)(?:\\\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}\",\n      \"unique\":true,\n      \"collect\":true\n   },\n   \"Base64\":{\n      \"query\":\"(?:[a-zA-Z0-9\\\\+\\\\/]{4})*(?:[a-zA-Z0-9\\\\+\\\\/]{4}|[a-zA-Z0-9\\\\+\\\\/]{3}\\\\=|[a-zA-Z0-9\\\\+\\\\/]{2}\\\\=\\\\=)\",\n      \"minimum\":8,\n      \"decode\":\"base64\",\n      \"minimum_decode\":6,\n      \"unique\":true,\n      \"collect\":true\n   },\n   \"HEX\":{\n      \"query\":\"(?:(?:0x|(?:\\\\\\\\)+x)[a-fA-F0-9]{2})+|(?:[a-fA-F0-9]{2})+\",\n      \"minimum\":12,\n      \"decode\":\"hex\",\n      \"minimum_decode\":6,\n      \"unique\":true,\n      \"collect\":true\n   },\n   \"PEM\":{\n      \"query\":\"-----BEGIN (?:CERTIFICATE|PRIVATE KEY)-----[\\\\s\\\\S]+?-----END (?:CERTIFICATE|PRIVATE KEY)-----\",\n      \"decode\":\"pem\",\n      \"unique\":true,\n      \"collect\":true\n   }\n}\n```\n\n**Make sure your regular expressions return only one capturing group, e.g., `[1, 2, 3, 4]`, and not a tuple, e.g., `[(1, 2), (3, 4)]`.**\n\nMake sure to properly escape regular expression specific symbols in your template file, e.g., make sure to escape 
dot `.` as `\\\\.`, and forward slash `/` as `\\\\/`, etc.\n\n| Name | Type | Required | Description |\n| --- | --- | --- | --- |\n| query | str | yes | Regular expression query. |\n| search | bool | no | Highlight matches within the searched lines; otherwise, extract the matches. |\n| ignorecase | bool | no | Case-insensitive search. |\n| minimum | int | no | Only accept matches longer than `int` characters. |\n| maximum | int | no | Only accept matches shorter than `int` characters. |\n| decode | str | no | Decode the matches. Available decodings: `url`, `base64`, `hex`, `pem`. |\n| minimum_decode | int | no | Only accept decodings longer than `int` characters. |\n| maximum_decode | int | no | Only accept decodings shorter than `int` characters. |\n| unique | bool | no | Filter out duplicates. |\n| collect | bool | no | Collect all the matches in one place. |\n\n`minimum_decode` and `maximum_decode` check the length of the decoded string after bad characters are removed.\n\n---\n\nHow I typically run the tool:\n\n```fundamental\nfile-scraper -dir directory -o results.html -e default\n```\n\nDefault (built-in) exclude file types:\n\n```fundamental\ncar, css, gif, jpeg, jpg, mp3, mp4, nib, ogg, otf, eot, png, storyboard, strings, svg, ttf, webp, woff, woff2, xib, vtt\n```\n\n## Usage\n\n```fundamental\nFile Scraper v4.6 ( github.com/ivan-sincek/file-scraper )\n\nUsage:   file-scraper -dir directory -o out          [-t template     ] [-th threads]\nExample: file-scraper -dir decoded   -o results.html [-t template.json] [-th 10     ]\n\nDESCRIPTION\n    Scrape files for sensitive information\nDIRECTORY\n    Directory containing files or a single file to scrape\n    -dir, --directory\u003e = decoded | files | test.exe | etc.\nTEMPLATE\n    File containing extraction details or a single RegEx to use\n    Default: built-in JSON template file\n    -t, --template = template.json | \"secret\\: [\\w\\d]+\" | etc.\nEXCLUDES\n    Exclude all files ending with the specified 
extension\n    Specify 'default' to load the built-in list\n    Use comma-separated values\n    -e, --excludes = mp3 | default,jpg,png | etc.\nINCLUDES\n    Include all files ending with the specified extension\n    Overrides the excludes\n    Use comma-separated values\n    -i, --includes = java | json,xml,yaml | etc.\nBEAUTIFY\n    Beautify [minified] JavaScript (.js) files\n    -b, --beautify\nTHREADS\n    Number of parallel threads to run\n    Default: 30\n    -th, --threads = 10 | etc.\nOUT\n    Output file\n    -o, --out = results.html | etc.\nDEBUG\n    Enable debug output\n    -dbg, --debug\n```\n\n## Images\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/ivan-sincek/file-scraper/blob/main/img/interactive_report_1.png\" alt=\"Interactive Report (1)\"\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003eFigure 1 - Interactive Report (1)\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/ivan-sincek/file-scraper/blob/main/img/interactive_report_2.png\" alt=\"Interactive Report (2)\"\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003eFigure 2 - Interactive Report (2)\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/ivan-sincek/file-scraper/blob/main/img/interactive_report_3.png\" alt=\"Interactive Report (3)\"\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003eFigure 3 - Interactive Report (3)\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivan-sincek%2Ffile-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fivan-sincek%2Ffile-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivan-sincek%2Ffile-scraper/lists"}