{"id":24553937,"url":"https://github.com/nathabonfim59/py-extract-links","last_synced_at":"2025-03-16T14:46:20.240Z","repository":{"id":241853967,"uuid":"808044605","full_name":"nathabonfim59/py-extract-links","owner":"nathabonfim59","description":"Extract the links from any URL or HTML file","archived":false,"fork":false,"pushed_at":"2024-05-30T09:39:32.000Z","size":10,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-23T02:16:44.249Z","etag":null,"topics":["pentesting-tools","python-scraper","scrapper-script"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nathabonfim59.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-30T09:16:00.000Z","updated_at":"2024-05-30T09:39:35.000Z","dependencies_parsed_at":"2024-05-30T10:54:02.559Z","dependency_job_id":"aed0371f-a5ef-4542-8713-d7dba1ef3520","html_url":"https://github.com/nathabonfim59/py-extract-links","commit_stats":null,"previous_names":["nathabonfim59/py-extract-links"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathabonfim59%2Fpy-extract-links","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathabonfim59%2Fpy-extract-links/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathabonfim59%2Fpy-extract-links/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nat
habonfim59%2Fpy-extract-links/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nathabonfim59","download_url":"https://codeload.github.com/nathabonfim59/py-extract-links/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243885891,"owners_count":20363644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pentesting-tools","python-scraper","scrapper-script"],"created_at":"2025-01-23T02:16:46.947Z","updated_at":"2025-03-16T14:46:20.217Z","avatar_url":"https://github.com/nathabonfim59.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What is it?\n\nWhen I'm doing a pentest, I have to go through the tedious process of extracting all the links from a given webpage to see if there is anything interesting.\nSometimes the links are buried inside JSON, JavaScript, and other embedded content. 
This is just a script to automate this otherwise kind of tedious process.\n\nIf you find it useful, give us a star, and if you find a bug or have a suggestion, feel free to open a PR.\n\n**TLDR:** just some hacked together regexes to extract links from a webpage.\n\n## Usage\n\n```\nusage: extract_links.py [-h] [--domains DOMAINS [DOMAINS ...]] [--summary] [--subdomains] source\n\nExtract all links from an HTML file\n\npositional arguments:\n  source                URL or file path of the HTML content to extract domains from\n\noptions:\n  -h, --help            show this help message and exit\n  --domains DOMAINS [DOMAINS ...]\n                        A list of domains with wildcards like *.google.com\n  --summary             Return a summary separated by root domain\n  --subdomains          Have a list of subdomains in the summary\n```\n\n\nExample:\n\n### From URL\n```\n./extract_links.py http://google.com\nDomains extracted:\n----------------------------------------------------------------------------------------------------\nhttps://mail.google.com/mail/?tab=wm\nhttps://drive.google.com/?tab=wo\nhttp://www.google.com/setprefdomain?prefdom=BR\u0026amp;prev=http://www.google.com.br/\u0026amp;sig=K_0YZ7AcnuSOXsvin5UXnjkzw3HJA%3D\nhttp://www.google.com.br/history/optout?hl=pt-BR\nhttps://www.google.com/url?q=https://gemini.google.com/advanced%3Futm_source%3DHPP-ms%26utm_medium%3DOwned%26utm_campaign%3Di18n-adv-may\u0026amp;source=hpp\u0026amp;id=19042168\u0026amp;ct=3\u0026amp;usg=AOvVa\nw259on_boc9RupjNMiGrnfV\u0026amp;sa=X\u0026amp;ved=0ahUKEwjbxd7RhbWGAxUdLrkGHbFOCvMQ8IcBCAY\nhttps://play.google.com/?hl=pt-BR\u0026tab=w8\nhttp://schema.org/WebPage\nhttps://www.youtube.com/?tab=w1\nhttps://www.google.com/imghp?hl=pt-BR\u0026tab=wi\nhttps://www.google.com/images/hpp/gemini-advanced-sparkle-rgb-1-42px.png\nhttps://accounts.google.com/ServiceLogin?hl=pt-BR\u0026passive=true\u0026continue=http://www.google.com/\u0026ec=GAZAAQ\nhttps://news.google.com/?tab=wn\nhttp://m
aps.google.com.br/maps?hl=pt-BR\u0026tab=wl\nhttps://www.google.com.br/intl/pt-BR/about/products?tab=wh\n```\n\n### Summary root domains\n```\n./extract_links.py http://google.com --summary\nSummary separated by root domain:\n----------------------------------------------------------------------------------------------------\n   9 occurrences: google.com\n   3 occurrences: google.com.br\n   1 occurrences: schema.org\n   1 occurrences: youtube.com\n```\n\n\n### Summary subdomains\n```\n./extract_links.py http://google.com --summary --subdomains\nSummary separated by root domain:\n----------------------------------------------------------------------------------------------------\n   4 occurrences: www.google.com\n   2 occurrences: www.google.com.br\n   1 occurrences: news.google.com\n   1 occurrences: mail.google.com\n   1 occurrences: drive.google.com\n   1 occurrences: accounts.google.com\n   1 occurrences: schema.org\n   1 occurrences: play.google.com\n   1 occurrences: maps.google.com.br\n   1 occurrences: www.youtube.com\n```\n\n### Filter in by domain\n\n\u003e youtube and google urls (you can use `--summary` as well)\n\n```\n./extract_links.py http://google.com --domains *google.com.br *youtube.com\nDomains extracted:\n----------------------------------------------------------------------------------------------------\nhttp://maps.google.com.br/maps?hl=pt-BR\u0026tab=wl\nhttp://www.google.com.br/history/optout?hl=pt-BR\nhttps://www.youtube.com/?tab=w1\nhttps://www.google.com.br/intl/pt-BR/about/products?tab=wh\n```\n\n# License\nMIT - Basically, you can do whatever you want with it, and I'm not responsible for anything you do with it ;)\nSee the details in the LICENSE 
file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathabonfim59%2Fpy-extract-links","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnathabonfim59%2Fpy-extract-links","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathabonfim59%2Fpy-extract-links/lists"}