https://github.com/nathabonfim59/py-extract-links
Extract the links from any URL or HTML file
- Host: GitHub
- URL: https://github.com/nathabonfim59/py-extract-links
- Owner: nathabonfim59
- License: mit
- Created: 2024-05-30T09:16:00.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-30T09:39:32.000Z (about 1 year ago)
- Last Synced: 2025-01-23T02:16:44.249Z (5 months ago)
- Topics: pentesting-tools, python-scraper, scrapper-script
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# What is it?
When I'm doing a pentest, extracting all the links from a given webpage to see if there is anything interesting is a tedious process.
Sometimes the links are buried inside JSON, JavaScript, and other embedded content. This is just a script to automate that process.

If you find it useful, give us a star, and if you find a bug or have a suggestion, feel free to open a PR.

**TLDR:** just some hacked-together regexes to extract links from a webpage.
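
To make the idea concrete, here is a minimal sketch of regex-based link extraction. It is not the script's actual code: the pattern and the `extract_links` name are assumptions chosen for illustration, and the real regexes may be stricter or looser.

```python
import re

# Assumption: a deliberately loose pattern that stops at whitespace, quotes,
# angle brackets, backslashes and closing parentheses. The script's real
# regexes may differ.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\)]+""")


def extract_links(text):
    """Return the unique http(s) URLs found anywhere in text, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(URL_PATTERN.findall(text)))


sample = (
    '<a href="https://example.com/a">x</a>'
    '<script>var cfg = {"api": "http://api.example.com/v1?x=1"};</script>'
    '<a href="https://example.com/a">duplicate</a>'
)
print(extract_links(sample))
```

Because the pattern runs over the raw text rather than a parsed DOM, it also picks up URLs embedded in inline JSON or JavaScript, which is the case described above.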
## Usage
```
usage: extract_links.py [-h] [--domains DOMAINS [DOMAINS ...]] [--summary] [--subdomains] source

Extract all links from an HTML file

positional arguments:
  source                URL or file path of the HTML content to extract domains from

options:
  -h, --help            show this help message and exit
  --domains DOMAINS [DOMAINS ...]
                        A list of domains with wildcards like *.google.com
  --summary             Return a summary separated by root domain
  --subdomains          Have a list of subdomains in the summary
```

Example:
### From URL
```
./extract_links.py http://google.com
Domains extracted:
----------------------------------------------------------------------------------------------------
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
http://www.google.com/setprefdomain?prefdom=BR&prev=http://www.google.com.br/&sig=K_0YZ7AcnuSOXsvin5UXnjkzw3HJA%3D
http://www.google.com.br/history/optout?hl=pt-BR
https://www.google.com/url?q=https://gemini.google.com/advanced%3Futm_source%3DHPP-ms%26utm_medium%3DOwned%26utm_campaign%3Di18n-adv-may&source=hpp&id=19042168&ct=3&usg=AOvVaw259on_boc9RupjNMiGrnfV&sa=X&ved=0ahUKEwjbxd7RhbWGAxUdLrkGHbFOCvMQ8IcBCAY
https://play.google.com/?hl=pt-BR&tab=w8
http://schema.org/WebPage
https://www.youtube.com/?tab=w1
https://www.google.com/imghp?hl=pt-BR&tab=wi
https://www.google.com/images/hpp/gemini-advanced-sparkle-rgb-1-42px.png
https://accounts.google.com/ServiceLogin?hl=pt-BR&passive=true&continue=http://www.google.com/&ec=GAZAAQ
https://news.google.com/?tab=wn
http://maps.google.com.br/maps?hl=pt-BR&tab=wl
https://www.google.com.br/intl/pt-BR/about/products?tab=wh
```

### Summary root domains
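
A summary like the one in this section can be approximated by counting URLs per root domain. The sketch below is only an illustration under stated assumptions: `root_domain`, `summarize`, and the small `TWO_LABEL_SUFFIXES` set are hypothetical names, and the actual script may well determine root domains differently (for example with the full Public Suffix List).

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a tiny hand-written set of two-label public suffixes, just
# enough for this example. It is not a complete list.
TWO_LABEL_SUFFIXES = {"com.br", "co.uk", "com.au"}


def root_domain(host):
    """Return the registrable part of a hostname, e.g. mail.google.com -> google.com."""
    labels = host.lower().split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def summarize(urls, subdomains=False):
    """Count URLs per full hostname (subdomains=True) or per root domain."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:  # skip strings that have no network location
            continue
        counts[host if subdomains else root_domain(host)] += 1
    return counts.most_common()


urls = [
    "https://mail.google.com/mail/?tab=wm",
    "https://drive.google.com/?tab=wo",
    "http://www.google.com.br/history/optout?hl=pt-BR",
    "http://schema.org/WebPage",
]
for domain, count in summarize(urls):
    print(f"{count} occurrences: {domain}")
```

Passing `subdomains=True` keeps the full hostname as the key, which mirrors what the `--subdomains` flag appears to do in the summary output.
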
```
./extract_links.py http://google.com --summary
Summary separated by root domain:
----------------------------------------------------------------------------------------------------
9 occurrences: google.com
3 occurrences: google.com.br
1 occurrences: schema.org
1 occurrences: youtube.com
```

### Summary subdomains
```
./extract_links.py http://google.com --summary --subdomains
Summary separated by root domain:
----------------------------------------------------------------------------------------------------
4 occurrences: www.google.com
2 occurrences: www.google.com.br
1 occurrences: news.google.com
1 occurrences: mail.google.com
1 occurrences: drive.google.com
1 occurrences: accounts.google.com
1 occurrences: schema.org
1 occurrences: play.google.com
1 occurrences: maps.google.com.br
1 occurrences: www.youtube.com
```

### Filter in by domain
> Keep only the YouTube and Google URLs (you can use `--summary` as well)
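
Wildcard filtering of this kind can be sketched with shell-style pattern matching on each URL's hostname. This is an assumption about the approach, not the script's confirmed implementation, and `filter_by_domain` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def filter_by_domain(urls, patterns):
    """Keep URLs whose hostname matches at least one wildcard pattern."""
    # Hostnames from urlsplit are already lowercased, so lowercase the patterns too.
    lowered = [pattern.lower() for pattern in patterns]
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if any(fnmatchcase(host, pattern) for pattern in lowered):
            kept.append(url)
    return kept


urls = [
    "http://maps.google.com.br/maps?hl=pt-BR&tab=wl",
    "https://news.google.com/?tab=wn",
    "https://www.youtube.com/?tab=w1",
]
print(filter_by_domain(urls, ["*google.com.br", "*youtube.com"]))
```

Note that a pattern such as `*google.com.br` also matches hostnames that merely end in that string, and that an unquoted `*` on the command line may be expanded by the shell if a matching file exists, so quoting the patterns is the safer habit.
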
```
./extract_links.py http://google.com --domains *google.com.br *youtube.com
Domains extracted:
----------------------------------------------------------------------------------------------------
http://maps.google.com.br/maps?hl=pt-BR&tab=wl
http://www.google.com.br/history/optout?hl=pt-BR
https://www.youtube.com/?tab=w1
https://www.google.com.br/intl/pt-BR/about/products?tab=wh
```

# License
MIT - Basically, you can do whatever you want with it, and I'm not responsible for anything you do with it ;)
See the details in the LICENSE file.