{"id":14064783,"url":"https://github.com/PadishahIII/SecretScraper","last_synced_at":"2025-07-29T19:31:18.855Z","repository":{"id":233499404,"uuid":"783732290","full_name":"PadishahIII/SecretScraper","owner":"PadishahIII","description":"SecretScraper is a web scraper that crawl through target websites, scrape from http response and extract secret information via regular expression.","archived":false,"fork":false,"pushed_at":"2024-05-01T06:38:25.000Z","size":572,"stargazers_count":18,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-05-02T19:10:09.213Z","etag":null,"topics":["crawler","cyper","hyperscan","pentest-tool","pentesting","python","sensitivity-analysis","webscraper"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/secretscraper/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PadishahIII.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-08T13:14:57.000Z","updated_at":"2024-06-02T10:06:26.312Z","dependencies_parsed_at":"2024-06-02T10:06:25.070Z","dependency_job_id":"d5d375b7-0016-4c2d-b54d-c506ab9611fc","html_url":"https://github.com/PadishahIII/SecretScraper","commit_stats":null,"previous_names":["padishahiii/secretscraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PadishahIII%2FSecretScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PadishahIII%2FSecretScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PadishahIII%2FSecretScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PadishahIII%2FSecretScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PadishahIII","download_url":"https://codeload.github.com/PadishahIII/SecretScraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227656969,"owners_count":17799908,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","cyper","hyperscan","pentest-tool","pentesting","python","sensitivity-analysis","webscraper"],"created_at":"2024-08-13T07:04:04.652Z","updated_at":"2024-12-04T03:31:30.591Z","avatar_url":"https://github.com/PadishahIII.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# SecretScraper\n\n![Tests](https://github.com/PadishahIII/SecretScraper/actions/workflows/main.yml/badge.svg)\n![Pypi Python Version](https://img.shields.io/pypi/pyversions/secretscraper.svg?style=plastic)\n\n## Overview\n\nSecretScraper is a highly configurable web scrape tool that crawl links  from target websites and scrape sensitive\ndata via regular expression.\n\n\n \u003cimg alt=\"Shows an illustrated sun in light mode and a moon with stars in dark mode.\" src=\"https://github.com/PadishahIII/SecretScraper/assets/83501709/d1aa763f-5711-47c4-8b8f-9309bac88ae2\" width=800\u003e\n\n## Feature\n- Web crawler: extract links via both DOM hierarchy and regex\n- Support domain white list and black list\n- Support multiple targets, input target URLs from a file\n- Support local file scan\n- Scalable customization: header, proxy, timeout, cookie, scrape depth, follow redirect, etc.\n- Built-in regex to search for sensitive information\n- Flexible configuration in yaml format\n\n## Prerequisite\n- Platform: Test on MacOS, Ubuntu and Windows.\n- Python Version \u003e= 3.9\n\n## Usage\n\n### Install\n\n```bash\npip install secretscraper\n```\n\n### Update\n\n```bash\npip install --upgrade secretscraper\n```\n**Note that**, since _Secretscraper_ generates a default configuration under the work directory if `settings.yml` is absent, so remember to update the `settings.yml` to the latest version(just copy from [Customize Configuration](https://github.com/PadishahIII/SecretScraper?tab=readme-ov-file#customize-configuration)).\n\n### Basic Usage\n\nStart with single target:\n\n```bash\nsecretscraper -u https://scrapeme.live/shop/\n```\n\nStart with multiple targets:\n\n```bash\nsecretscraper -f urls\n```\n\n```text\n# urls\nhttp://scrapeme.live/1\nhttp://scrapeme.live/2\nhttp://scrapeme.live/3\nhttp://scrapeme.live/4\nhttp://scrapeme.live/1\n```\nSample output:\n\u003cimg width=\"971\" alt=\"image\" src=\"https://github.com/PadishahIII/SecretScraper/assets/83501709/e2f12441-a1ec-4fea-933e-17cdc3e31583\"\u003e\n\n\u003cimg width=\"904\" alt=\"image\" src=\"https://github.com/PadishahIII/SecretScraper/assets/83501709/d19faa0e-3ab2-452b-9a82-95e6607d54c6\"\u003e\n\n\n\nAll supported options:\n```bash\n\u003e secretscraper --help\nUsage: secretscraper [OPTIONS]\n\n  Main commands\n\nOptions:\n  -V, --version                Show version and exit.\n  --debug                      Enable debug.\n  -a, --ua TEXT                Set User-Agent\n  -c, --cookie TEXT            Set cookie\n  -d, --allow-domains TEXT     Domain white list, wildcard(*) is supported,\n                               separated by commas, e.g. *.example.com,\n                               example*\n  -D, --disallow-domains TEXT  Domain black list, wildcard(*) is supported,\n                               separated by commas, e.g. *.example.com,\n                               example*\n  -f, --url-file FILE          Target urls file, separated by line break\n  -i, --config FILE            Set config file, defaults to settings.yml\n  -m, --mode [1|2]             Set crawl mode, 1(normal) for max_depth=1,\n                               2(thorough) for max_depth=2, default 1\n  --max-page INTEGER           Max page number to crawl, default 100000\n  --max-depth INTEGER          Max depth to crawl, default 1\n  -o, --outfile FILE           Output result to specified file in csv format\n  -s, --status TEXT            Filter response status to display, seperated by\n                               commas, e.g. 200,300-400\n  -x, --proxy TEXT             Set proxy, e.g. http://127.0.0.1:8080,\n                               socks5://127.0.0.1:7890\n  -H, --hide-regex             Hide regex search result\n  -F, --follow-redirects       Follow redirects\n  -u, --url TEXT               Target url\n  --detail                     Show detailed result\n  --validate                   Validate the status of found urls\n  -l, --local PATH             Local file or directory, scan local\n                               file/directory recursively\n  --help                       Show this message and exit.\n```\n\n### Advanced Usage\n#### Validate the Status of Links\nUse `--validate` option to check the status of found links, this helps reduce invalid links in the result.\n```bash\nsecretscraper -u https://scrapeme.live/shop/ --validate --max-page=10\n```\n\n#### Thorough Crawl\n\nThe max depth is set to 1, which means only the start urls will be crawled. To change that, you can specify\nvia `--max-depth \u003cnumber\u003e`. Or in a simpler way, use `-m 2` to run the crawler in thorough mode which is equivalent\nto `--max-depth 2`. By default the normal mode `-m 1` is adopted with max depth set to 1.\n```bash\nsecretscraper -u https://scrapeme.live/shop/ -m 2\n```\n\n#### Write Results to Csv File\n```bash\nsecretscraper -u https://scrapeme.live/shop/ -o result.csv\n```\n\n#### Domain White/Black List\nSupport wildcard(*), white list:\n```bash\nsecretscraper -u https://scrapeme.live/shop/ -d *scrapeme*\n```\nBlack list:\n```bash\nsecretscraper -u https://scrapeme.live/shop/ -D *.gov\n```\n\n#### Hide Regex Result\nUse `-H` option to hide regex-matching results. Only found links will be displayed.\n```bash\nsecretscraper -u https://scrapeme.live/shop/ -H\n```\n\n#### Extract secrets from local file\n```bash\nsecretscraper -l \u003cdir or file\u003e\n```\n\n#### Switch to hyperscan\nI have implemented the regex matching functionality with both `hyperscan` and `re` module, `re` module is used as default, if you purse higher performance, you can switch to `hyperscan` by changing the `handler_type` to `hyperscan` in `settings.yml`.\n\nThere are some pitfalls of `hyperscan` which you have to take caution to use it:\n1. not support regex group: you can not extract content by parentheses.\n2. different syntax from `re`\n\nYou'd better write regex separately for the two regex engine.\n\n#### Customize Configuration\nThe built-in config is shown as below. You can assign custom configuration via `-i settings.yml`.\n```yaml\nverbose: false\ndebug: false\nloglevel: critical\nlogpath: log\nhandler_type: re\n\nproxy: \"\" # http://127.0.0.1:7890\nmax_depth: 1 # 0 for no limit\nmax_page_num: 1000 # 0 for no limit\ntimeout: 5\nfollow_redirects: true\nworkers_num: 1000\nheaders:\n  Accept: \"*/*\"\n  Cookie: \"\"\n  User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0\n\nurlFind:\n  - \"[\\\"'‘“`]\\\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]{2,250}?)\\\\s{0,6}[\\\"'‘“`]\"\n  - \"=\\\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]{2,250})\"\n  - \"[\\\"'‘“`]\\\\s{0,6}([#,.]{0,2}/[-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]{2,250}?)\\\\s{0,6}[\\\"'‘“`]\"\n  - \"\\\"([-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]+?[/]{1}[-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]+?)\\\"\"\n  - \"href\\\\s{0,6}=\\\\s{0,6}[\\\"'‘“`]{0,1}\\\\s{0,6}([-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]{2,250})|action\\\\s{0,6}=\\\\s{0,6}[\\\"'‘“`]{0,1}\\\\s{0,6}([-a-zA-Z0-9()@:%_\\\\+.~#?\u0026//={}]{2,250})\"\njsFind:\n  - (https{0,1}:[-a-zA-Z0-9（）@:%_\\+.~#?\u0026//=]{2,100}?[-a-zA-Z0-9（）@:%_\\+.~#?\u0026//=]{3}[.]js)\n  - '[\"''‘“`]\\s{0,6}(/{0,1}[-a-zA-Z0-9（）@:%_\\+.~#?\u0026//=]{2,100}?[-a-zA-Z0-9（）@:%_\\+.~#?\u0026//=]{3}[.]js)'\n  - =\\s{0,6}[\",',’,”]{0,1}\\s{0,6}(/{0,1}[-a-zA-Z0-9（）@:%_\\+.~#?\u0026//=]{2,100}?[-a-zA-Z0-9（）@:%_\\+.~#?\u0026//=]{3}[.]js)\n\ndangerousPath:\n  - logout\n  - update\n  - remove\n  - insert\n  - delete\n\nrules:\n  - name: Swagger\n    regex: \\b[\\w/]+?((swagger-ui.html)|(\\\"swagger\\\":)|(Swagger UI)|(swaggerUi)|(swaggerVersion))\\b\n    loaded: true\n  - name: ID Card\n    regex: \\b((\\d{8}(0\\d|10|11|12)([0-2]\\d|30|31)\\d{3})|(\\d{6}(18|19|20)\\d{2}(0[1-9]|10|11|12)([0-2]\\d|30|31)\\d{3}(\\d|X|x)))\\b\n    loaded: true\n  - name: Phone\n    regex: \"['\\\"](1(3([0-35-9]\\\\d|4[1-8])|4[14-9]\\\\d|5([\\\\d]\\\\d|7[1-79])|66\\\\d|7[2-35-8]\\\\d|8\\\\d{2}|9[89]\\\\d)\\\\d{7})['\\\"]\"\n    loaded: true\n  - name: JS Map\n    regex: \\b([\\w/]+?\\.js\\.map)\n    loaded: true\n  - name: URL as a Value\n    regex: (\\b\\w+?=(https?)(://|%3a%2f%2f))\n    loaded: false\n  - name: Email\n    regex: \"['\\\"]([\\\\w]+(?:\\\\.[\\\\w]+)*@(?:[\\\\w](?:[\\\\w-]*[\\\\w])?\\\\.)+[\\\\w](?:[\\\\w-]*[\\\\w])?)['\\\"]\"\n    loaded: true\n  - name: Internal IP\n    regex: '[^0-9]((127\\.0\\.0\\.1)|(10\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})|(172\\.((1[6-9])|(2\\d)|(3[01]))\\.\\d{1,3}\\.\\d{1,3})|(192\\.168\\.\\d{1,3}\\.\\d{1,3}))'\n    loaded: true\n  - name: Cloud Key\n    regex: \\b((accesskeyid)|(accesskeysecret)|\\b(LTAI[a-z0-9]{12,20}))\\b\n    loaded: true\n  - name: Shiro\n    regex: (=deleteMe|rememberMe=)\n    loaded: true\n  - name: Suspicious API Key\n    regex: \"[\\\"'][0-9a-zA-Z]{32}['\\\"]\"\n    loaded: true\n  - name: Jwt\n    regex: \"['\\\"](ey[A-Za-z0-9_-]{10,}\\\\.[A-Za-z0-9._-]{10,}|ey[A-Za-z0-9_\\\\/+-]{10,}\\\\.[A-Za-z0-9._\\\\/+-]{10,})['\\\"]\"\n    loaded: true\n\n```\n\n---\n\n# TODO\n- [ ] Support headless browser\n- [ ] Add regex doc reference\n- [ ] Fuzz path that are 404\n- [x] Separate subdomains in the result\n- [x] Optimize url collector\n[//]: # (- [ ] Employ jsbeautifier)\n- [x] Generate configuration file\n- [x] Detect dangerous paths and avoid requesting them\n- [x] Support url-finder output format, add `--detail` option\n- [x] Support windows\n- [x] Scan local file\n- [x] Extract links via regex\n\n---\n\n# Change Log\n## 2024.5.25 Version 1.4\n- Support csv output\n- Set `re` module as regex engine by default\n- Support to select regex engine by configuration `handler_type`\n## 2024.4.30 Version 1.3.9\n- Add `--validate` option: Validate urls after the crawler finish, which helps reduce useless links\n- Optimize url collector\n- Optimize built-in regex\n## 2024.4.29 Version 1.3.8\n- Optimize log output\n- Optimize the performance of `--debug` option\n## 2024.4.29 Version 1.3.7\n- Test on multiple python versions\n- Support python 3.9~3.11\n## 2024.4.29 Version 1.3.6\n- Repackage\n\n## 2024.4.28 Version 1.3.5\n- **New Features**\n  - Support windows\n  - Optimize crawler\n  - Prettify output, add `--detail` option\n  - Generate default configuration to settings.yml\n  - Avoid requesting dangerous paths\n\n## 2024.4.28 Version 1.3.2\n- **New Features**\n  - Extract links via regex\n\n## 2024.4.26 Version 1.3.1\n- **New Features**\n  - [x] Support scan local files\n\n## 2024.4.15\n- [x] Add status to url result\n- [x] All crawler test passed\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPadishahIII%2FSecretScraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPadishahIII%2FSecretScraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPadishahIII%2FSecretScraper/lists"}