{"id":24857272,"url":"https://github.com/blackhatinside/python_osint_dork_scrap","last_synced_at":"2025-03-26T15:41:08.808Z","repository":{"id":132828437,"uuid":"474140761","full_name":"blackhatinside/python_osint_dork_scrap","owner":"blackhatinside","description":"python_osint_dork_scrap","archived":false,"fork":false,"pushed_at":"2022-03-25T19:59:32.000Z","size":1015,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-31T17:16:29.077Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blackhatinside.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-03-25T19:31:33.000Z","updated_at":"2022-08-06T23:18:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"ed4338b7-d7f2-48cb-bc5f-d3e19785a3df","html_url":"https://github.com/blackhatinside/python_osint_dork_scrap","commit_stats":null,"previous_names":["blackhatinside/python_osint_dork_scrap"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackhatinside%2Fpython_osint_dork_scrap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackhatinside%2Fpython_osint_dork_scrap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackhatinside%2Fpython_osint_dork_scrap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackhatinside%2Fpython_osint_dork_scrap/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blackhatinside","download_url":"https://codeload.github.com/blackhatinside/python_osint_dork_scrap/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245683301,"owners_count":20655537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-31T17:16:48.776Z","updated_at":"2025-03-26T15:41:08.780Z","avatar_url":"https://github.com/blackhatinside.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pagodo - Passive Google Dork\r\n\r\n## Introduction\r\n\r\n`pagodo` automates Google searching for potentially vulnerable web pages and applications on the Internet.  It replaces\r\nperforming Google dork searches manually in a web browser.\r\n\r\nThere are 2 parts.  The first is `ghdb_scraper.py`, which retrieves the latest Google dorks, and the second is\r\n`pagodo.py`, which leverages the information gathered by `ghdb_scraper.py`.\r\n\r\nThe core Google search library now uses the more flexible [yagooglesearch](https://github.com/opsdisk/yagooglesearch)\r\ninstead of [googlesearch](https://github.com/MarioVilas/googlesearch).  Check out the\r\n[yagooglesearch README](https://github.com/opsdisk/yagooglesearch/blob/master/README.md) for a more in-depth explanation\r\nof the library differences and capabilities.\r\n\r\nThis version of `pagodo` also natively supports HTTP(S) and SOCKS5 proxies, so you no longer need to wrap it in a tool\r\nlike `proxychains4` if you need proxy support.  
You can specify multiple proxies to use in a round-robin fashion by\r\nproviding a comma-separated string of proxies using the `-p` switch.\r\n\r\n## What are Google dorks?\r\n\r\nOffensive Security maintains the Google Hacking Database (GHDB) found here:\r\n\u003chttps://www.exploit-db.com/google-hacking-database\u003e.  It is a collection of Google searches, called dorks, that can be\r\nused to find potentially vulnerable boxes or other juicy info that is picked up by Google's search bots.\r\n\r\n## Terms and Conditions\r\n\r\nThe terms and conditions for `pagodo` are the same terms and conditions found in\r\n[yagooglesearch](https://github.com/opsdisk/yagooglesearch#terms-and-conditions).\r\n\r\nThis code is supplied as-is and you are fully responsible for how it is used.  Scraping Google Search results may\r\nviolate Google's [Terms of Service](https://policies.google.com/terms).  Another Python Google search library had some\r\ninteresting information/discussion on it:\r\n\r\n* [Original issue](https://github.com/aviaryan/python-gsearch/issues/1)\r\n* [A response](https://github.com/aviaryan/python-gsearch/issues/1#issuecomment-365581431)\r\n* The author created a separate [Terms and Conditions](https://github.com/aviaryan/python-gsearch/blob/master/T_AND_C.md)\r\n* ...that contained a link to this [blog](https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/)\r\n\r\nGoogle's preferred method is to use their [API](https://developers.google.com/custom-search/v1/overview).\r\n\r\n## Installation\r\n\r\nScripts are written for Python 3.6+.  
Clone the git repository and install the requirements.\r\n\r\n```bash\r\ngit clone https://github.com/opsdisk/pagodo.git\r\ncd pagodo\r\nvirtualenv -p python3.7 .venv  # If using a virtual environment.\r\nsource .venv/bin/activate  # If using a virtual environment.\r\npip install -r requirements.txt\r\n```\r\n\r\n## ghdb_scraper.py\r\n\r\nTo start off, `pagodo.py` needs a list of all the current Google dorks.  The repo contains a `dorks/` directory with\r\nthe current dorks when the `ghdb_scraper.py` was last run. It's advised to run `ghdb_scraper.py` to get the freshest\r\ndata before running `pagodo.py`.  The `dorks/` directory contains:\r\n\r\n* the `all_google_dorks.txt` file which contains all the Google dorks, one per line\r\n* the `all_google_dorks.json` file which is the JSON response from GHDB\r\n* Individual category dorks\r\n\r\nDork categories:\r\n\r\n```python\r\ncategories = {\r\n    1: \"Footholds\",\r\n    2: \"File Containing Usernames\",\r\n    3: \"Sensitives Directories\",\r\n    4: \"Web Server Detection\",\r\n    5: \"Vulnerable Files\",\r\n    6: \"Vulnerable Servers\",\r\n    7: \"Error Messages\",\r\n    8: \"File Containing Juicy Info\",\r\n    9: \"File Containing Passwords\",\r\n    10: \"Sensitive Online Shopping Info\",\r\n    11: \"Network or Vulnerability Data\",\r\n    12: \"Pages Containing Login Portals\",\r\n    13: \"Various Online devices\",\r\n    14: \"Advisories and Vulnerabilities\",\r\n}\r\n```\r\n\r\n### Using ghdb_scraper.py as a script\r\n\r\nWrite all dorks to `all_google_dorks.txt`, `all_google_dorks.json`, and individual categories if you want more\r\ncontextual data about each dork.\r\n\r\n```bash\r\npython ghdb_scraper.py -s -j -i\r\n```\r\n\r\n### Using ghdb_scraper as a module\r\n\r\nThe `ghdb_scraper.retrieve_google_dorks()` function returns a dictionary with the following data structure:\r\n\r\n```python\r\nghdb_dict = {\r\n    \"total_dorks\": total_dorks,\r\n    \"extracted_dorks\": extracted_dorks,\r\n    
\"category_dict\": category_dict,\r\n}\r\n```\r\n\r\nUsing a Python shell (like `python` or `ipython`) to explore the data:\r\n\r\n```python\r\nimport ghdb_scraper\r\n\r\ndorks = ghdb_scraper.retrieve_google_dorks(save_all_dorks_to_file=True)\r\ndorks.keys()\r\ndorks[\"total_dorks\"]\r\n\r\ndorks[\"extracted_dorks\"]\r\n\r\ndorks[\"category_dict\"].keys()\r\n\r\ndorks[\"category_dict\"][1][\"category_name\"]\r\n```\r\n\r\n## \u003cspan\u003epagodo.py\u003c/span\u003e\r\n\r\n### Using \u003cspan\u003epagodo.py\u003c/span\u003e as a script\r\n\r\n```bash\r\npython pagodo.py -d example.com -g dorks.txt \r\n```\r\n\r\n### Using pagodo as a module\r\n\r\nThe `pagodo.Pagodo.go()` function returns a dictionary with the data structure below (dorks used are made up examples):\r\n\r\n```python\r\n{\r\n    \"dorks\": {\r\n        \"inurl:admin\": {\r\n            \"urls_size\": 3,\r\n            \"urls\": [\r\n                \"https://github.com/marmelab/ng-admin\",\r\n                \"https://github.com/settings/admin\",\r\n                \"https://github.com/akveo/ngx-admin\",\r\n            ],\r\n        },\r\n        \"inurl:gist\": {\r\n            \"urls_size\": 3,\r\n            \"urls\": [\r\n                \"https://gist.github.com/\",\r\n                \"https://gist.github.com/index\",\r\n                \"https://github.com/defunkt/gist\",\r\n            ],\r\n        },\r\n    },\r\n    \"initiation_timestamp\": \"2021-08-27T11:35:30.638705\",\r\n    \"completion_timestamp\": \"2021-08-27T11:36:42.349035\",\r\n}\r\n```\r\n\r\nUsing a Python shell (like `python` or `ipython`) to explore the data:\r\n\r\n```python\r\nimport pagodo\r\n\r\npg = pagodo.Pagodo(\r\n    google_dorks_file=\"dorks.txt\",\r\n    domain=\"github.com\",\r\n    max_search_result_urls_to_return_per_dork=3,\r\n    save_pagodo_results_to_json_file=True,\r\n    save_urls_to_file=True,\r\n    verbosity=5,\r\n)\r\npagodo_results_dict = 
pg.go()\r\n\r\npagodo_results_dict.keys()\r\n\r\npagodo_results_dict[\"initiation_timestamp\"]\r\n\r\npagodo_results_dict[\"completion_timestamp\"]\r\n\r\nfor key, value in pagodo_results_dict[\"dorks\"].items():\r\n    print(f\"dork: {key}\")\r\n    for url in value[\"urls\"]:\r\n        print(url)\r\n```\r\n\r\n## Tuning Results\r\n\r\n### Scope to a specific domain\r\n\r\nThe `-d` switch can be used to scope the results to a specific domain and functions as the Google search operator:\r\n\r\n```none\r\nsite:github.com\r\n```\r\n\r\n### Wait time between Google dork searches\r\n\r\n* `-i` - Specify the **minimum** delay between dork searches, in seconds.  Don't make this too small, or your IP will\r\nget HTTP 429'd quickly.\r\n* `-x` - Specify the **maximum** delay between dork searches, in seconds.  Don't make this too big or the searches will\r\ntake a long time.\r\n\r\nThe values provided by `-i` and `-x` are used to generate a list of 20 random wait times, one of which is randomly selected\r\nbetween each Google dork search.\r\n\r\n### Number of results to return\r\n\r\n`-m` - The total max search results to return per Google dork.  Each Google search request can pull back at most 100\r\nresults at a time, so if you pick `-m 500`, 5 separate search queries will have to be made for each Google dork search,\r\nwhich will increase the total time to complete.\r\n\r\n## Google is blocking me!\r\n\r\nPerforming 7300+ search requests to Google as fast as possible will simply not work.  Google will rightfully detect it\r\nas a bot and block your IP for a set period of time.  
One solution is to use a bank of HTTP(S)/SOCKS proxies and pass\r\nthem to `pagodo`.\r\n\r\n### Native proxy support\r\n\r\nPass a comma-separated string of proxies to `pagodo` using the `-p` switch.\r\n\r\n```bash\r\npython pagodo.py -g dorks.txt -p http://myproxy:8080,socks5h://127.0.0.1:9050,socks5h://127.0.0.1:9051\r\n```\r\n\r\nYou could even decrease the `-i` and `-x` values because you will be leveraging different proxy IPs.  The proxies passed\r\nto `pagodo` are selected by round robin.\r\n\r\n### proxychains4 support\r\n\r\nAnother solution is to use `proxychains4` to round robin the lookups.\r\n\r\nInstall `proxychains4`:\r\n\r\n```bash\r\napt install proxychains4 -y\r\n```\r\n\r\nEdit the `/etc/proxychains4.conf` configuration file to round robin the lookups through different proxy servers.  In\r\nthe example below, 2 different dynamic SOCKS proxies have been set up with different local listening ports (9050 and\r\n9051).\r\n\r\n```bash\r\nvim /etc/proxychains4.conf\r\n```\r\n\r\n```ini\r\nround_robin\r\nchain_len = 1\r\nproxy_dns\r\nremote_dns_subnet 224\r\ntcp_read_time_out 15000\r\ntcp_connect_time_out 8000\r\n[ProxyList]\r\nsocks4 127.0.0.1 9050\r\nsocks4 127.0.0.1 9051\r\n```\r\n\r\nThrow `proxychains4` in front of the `pagodo.py` script and each *request* lookup will go through a different proxy (and\r\nthus source from a different IP).\r\n\r\n```bash\r\nproxychains4 python pagodo.py -g dorks/all_google_dorks.txt -o -s\r\n```\r\n\r\nNote that this may not appear natural to Google if you:\r\n\r\n1) Simulate \"browsing\" to `google.com` from IP #1\r\n2) Make the first search query from IP #2\r\n3) Simulate clicking \"Next\" to make the second search query from IP #3\r\n4) Simulate clicking \"Next\" to make the third search query from IP #1\r\n\r\nFor that reason, using the built-in `-p` proxy support is preferred because, as stated in the `yagooglesearch`\r\ndocumentation, the \"provided proxy is used for the entire life cycle of the search to make it 
look more human, instead\r\nof rotating through various proxies for different portions of the search.\"\r\n\r\n## License\r\n\r\nDistributed under the GNU General Public License v3.0. See [LICENSE](./LICENSE) for more information.\r\n\r\n## Contact\r\n\r\n[@opsdisk](https://twitter.com/opsdisk)\r\n\r\nProject Link: [https://github.com/opsdisk/pagodo](https://github.com/opsdisk/pagodo)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblackhatinside%2Fpython_osint_dork_scrap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblackhatinside%2Fpython_osint_dork_scrap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblackhatinside%2Fpython_osint_dork_scrap/lists"}