{"id":15631364,"url":"https://github.com/rosesecurity/scrappy","last_synced_at":"2025-04-04T08:05:21.197Z","repository":{"id":64697745,"uuid":"561860366","full_name":"RoseSecurity/ScrapPY","owner":"RoseSecurity","description":"ScrapPY is a Python utility for scraping manuals, documents, and other sensitive PDFs to generate wordlists that can be utilized by offensive security tools to perform brute force, forced browsing, and dictionary attacks against targets. The tool dives deep to discover keywords and phrases leading to potential passwords or hidden directories.","archived":false,"fork":false,"pushed_at":"2025-01-19T20:55:51.000Z","size":209,"stargazers_count":207,"open_issues_count":2,"forks_count":23,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-28T07:04:09.007Z","etag":null,"topics":["cybersecurity","hacking","pdf","python3","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RoseSecurity.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-04T16:50:59.000Z","updated_at":"2025-03-26T16:36:00.000Z","dependencies_parsed_at":"2023-11-07T15:31:19.308Z","dependency_job_id":"85761e02-770d-477b-97d5-5cfdf6434859","html_url":"https://github.com/RoseSecurity/ScrapPY","commit_stats":{"total_commits":40,"total_committers":4,"mean_commits":10.0,"dds":0.25,"last_synced_commit":"96c22a744c9591e488df787975bbb39c52cde2c5"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/RoseSecurity%2FScrapPY","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoseSecurity%2FScrapPY/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoseSecurity%2FScrapPY/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoseSecurity%2FScrapPY/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RoseSecurity","download_url":"https://codeload.github.com/RoseSecurity/ScrapPY/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247140953,"owners_count":20890553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cybersecurity","hacking","pdf","python3","scraper"],"created_at":"2024-10-03T10:40:04.182Z","updated_at":"2025-04-04T08:05:21.166Z","avatar_url":"https://github.com/RoseSecurity.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :dog2: ScrapPY: PDF Scraping Made Easy\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=40% height=40% src=\"https://user-images.githubusercontent.com/72598486/200046477-94c17a93-2dc8-418b-96eb-2b554227dce2.png\"\u003e\n\u003c/p\u003e\n\nScrapPY is a Python utility for scraping manuals, documents, and other sensitive PDFs to generate targeted wordlists that can be utilized by offensive security tools to perform brute force, forced browsing, and dictionary attacks. 
ScrapPY performs word frequency, entropy, and metadata analysis, and can also run in full output mode to craft custom wordlists for targeted attacks. The tool dives deep to discover keywords and phrases leading to potential passwords or hidden directories, outputting to a text file that is readable by tools such as Hydra, Dirb, and Nmap. Expedite initial access, vulnerability discovery, and lateral movement with ScrapPY!\n\n# Demo:\n\nhttps://user-images.githubusercontent.com/72598486/201235531-6b037daf-d1f3-4d33-b256-8411e3a0b3da.mov\n\n# Install:\n\nDownload Repository:\n\n```\n$ git clone https://github.com/RoseSecurity/ScrapPY.git\n$ cd ScrapPY/\n```\n\nInstall Dependencies:\n\n```\n$ python3 -m venv .venv\n$ source .venv/bin/activate\n$ pip3 install -r requirements.txt\n```\n\n# ScrapPY Usage:\n\n```\nusage: ScrapPY.py [-h] [-f FILE] [-m {word-frequency,full,metadata,entropy}] [-o OUTPUT]\n```\n\n\nOutput metadata of document:\n\n```\n$ python3 ScrapPY.py -f example.pdf -m metadata\n```\n\nOutput top 100 frequently used keywords to a file named ```Top_100_Keywords.txt```:\n\n```\n$ python3 ScrapPY.py -f example.pdf -m word-frequency -o Top_100_Keywords.txt\n```\n\nOutput all keywords to the default ScrapPY.txt file:\n\n```\n$ python3 ScrapPY.py -f example.pdf\n```\n\nOutput top 100 keywords with highest entropy rating:\n\n```\n$ python3 ScrapPY.py -f example.pdf -m entropy\n```\n\nScrapPY Output:\n\n```\n# ScrapPY writes ScrapPY.txt, or the file name specified with -o, to the directory in which the tool was run. 
To view the first fifty lines of the file, run this command:\n\n$ head -50 ScrapPY.txt\n\n# To see how many words were generated, run this command:\n\n$ wc -l ScrapPY.txt\n```\n\n# Integration with Offensive Security Tools:\n\nEasily integrate with tools such as Dirb to expedite the process of discovering hidden subdirectories:\n\n```\nroot@RoseSecurity:~# dirb http://192.168.1.123/ /root/ScrapPY/ScrapPY.txt\n\n-----------------\nDIRB v2.21\nBy The Dark Raver\n-----------------\n\nSTART_TIME: Fri May 16 13:41:45 2014\nURL_BASE: http://192.168.1.123/\nWORDLIST_FILES: /root/ScrapPY/ScrapPY.txt\n\n-----------------\n\nGENERATED WORDS: 4592\n\n---- Scanning URL: http://192.168.1.123/ ----\n==\u003e DIRECTORY: http://192.168.1.123/vi/\n+ http://192.168.1.123/programming (CODE:200|SIZE:2726)\n+ http://192.168.1.123/s7-logic/ (CODE:403|SIZE:1122)\n==\u003e DIRECTORY: http://192.168.1.123/config/\n==\u003e DIRECTORY: http://192.168.1.123/docs/\n==\u003e DIRECTORY: http://192.168.1.123/external/\n```\n\nUtilize ScrapPY with Hydra for advanced brute force attacks:\n\n```\nroot@RoseSecurity:~# hydra -l root -P /root/ScrapPY/ScrapPY.txt -t 6 ssh://192.168.1.123\nHydra v7.6 (c)2013 by van Hauser/THC \u0026 David Maciejak - for legal purposes only\n\nHydra (http://www.thc.org/thc-hydra) starting at 2014-05-19 07:53:33\n[DATA] 6 tasks, 1 server, 1003 login tries (l:1/p:1003), ~167 tries per task\n[DATA] attacking service ssh on port 22\n```\n\nEnhance Nmap scripts with ScrapPY wordlists:\n\n```\nnmap -p445 --script smb-brute.nse --script-args userdb=users.txt,passdb=ScrapPY.txt 192.168.1.123\n```\n\n## Future Development:\n\n- [x] Allow for custom output file naming and increased verbosity\n- [x] Integrate different modes of operation including word frequency analysis\n- [x] Allow for metadata analysis\n- [x] Search for high-entropy data\n- [ ] Prepare packaging for `homebrew` installation\n- [ ] Search for path-like data \n- [ ] Implement image OCR to enumerate data from images 
in PDFs\n- [ ] Allow for processing of multiple PDFs\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frosesecurity%2Fscrappy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frosesecurity%2Fscrappy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frosesecurity%2Fscrappy/lists"}