https://github.com/thiiagoms/links-extractor
:hammer: extract all links from website
extractor hack-the-world link-extraction python pythonic scrapy
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/thiiagoms/links-extractor
- Owner: thiiagoms
- License: mit
- Created: 2019-12-12T15:26:52.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2024-07-06T13:15:17.000Z (over 1 year ago)
- Last Synced: 2025-01-07T05:51:30.659Z (about 1 year ago)
- Topics: extractor, hack-the-world, link-extraction, python, pythonic, scrapy
- Language: Python
- Homepage:
- Size: 150 KB
- Stars: 7
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
A Python library for extracting links from web pages.
- [Dependencies :heavy_plus_sign:](#dependencies)
- [Install :package:](#install)
- [Run :runner:](#run)
- [Bonus :medal_sports:](#bonus)
## Dependencies
- Python 3.8+
- Requests
- BeautifulSoup
## Install
01 -) Clone:
```shell
$ git clone https://github.com/thiiagoms/links-extractor
```
02 -) Go to the `links-extractor` directory:
```shell
$ cd links-extractor
links-extractor $
```
## Run
01 -) In your script (for example, `example.py`), import and call the main `Extractor` class:
```python
from src.services.extractor import Extractor
from src.utils.printer import Printer

# Pages to scan for links
urls = ['https://github.com', 'https://google.com']

extractor = Extractor()
# As the loop below shows, `links` maps each URL to the links found on it
links = extractor.extract(urls, timeout=10)

for url, extracted_links in links.items():
    Printer.message(f"Url: {url}")
    for link in extracted_links:
        Printer.success(f" {link}")
    # Separator printed after each page's links
    Printer.message("###############")
```
Running the script should print output similar to the following (the exact links depend on the live pages):
```text
$ python example.py
Url: https://github.com
#start-of-content
https://github.com/
/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home
/features/actions
/features/packages
/features/security
###############
Url: https://google.com
https://www.google.com/imghp?hl=pt-BR&tab=wi
https://maps.google.com.br/maps?hl=pt-BR&tab=wl
https://play.google.com/?hl=pt-BR&tab=w8
###############
```
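The output above mixes fragment-only links (`#start-of-content`), relative paths (`/features/actions`), and absolute URLs. If you need absolute URLs only, something like the sketch below could post-process the results. It is not part of this library: `HrefCollector` and `absolute_links` are hypothetical names, and it uses only the Python standard library (rather than Requests and BeautifulSoup) so it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HrefCollector(HTMLParser):
    """Collect the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip <a> tags whose href is missing or empty.
                if name == "href" and value:
                    self.hrefs.append(value)


def absolute_links(base_url, html):
    """Return de-duplicated absolute URLs found in `html`, in page order."""
    collector = HrefCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        # Resolve relative paths against the page URL, then drop the #fragment.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample = """
<a href="#start-of-content">Skip</a>
<a href="https://github.com/">Home</a>
<a href="/features/actions">Actions</a>
"""
print(absolute_links("https://github.com/", sample))
# ['https://github.com/', 'https://github.com/features/actions']
```

The fragment link and the home link resolve to the same URL, so only one entry is kept. Whether to drop fragments or keep duplicates is a design choice that depends on what the links are used for.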
## Bonus
01 -) Run tests with **pytest**:
```bash
links-extractor $ pytest
```
02 -) Format a file in place with **autopep8**, for example:
```bash
links-extractor $ autopep8 --in-place --aggressive --aggressive src/services/extractor.py
```