https://github.com/thiiagoms/links-extractor
:hammer: extract all links from website
- Host: GitHub
- URL: https://github.com/thiiagoms/links-extractor
- Owner: thiiagoms
- License: mit
- Created: 2019-12-12T15:26:52.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2024-07-06T13:15:17.000Z (7 months ago)
- Last Synced: 2024-07-06T14:33:41.919Z (7 months ago)
- Topics: extractor, hack-the-world, link-extraction, python, pythonic, scrapy
- Language: Python
- Homepage:
- Size: 150 KB
- Stars: 6
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Library that allows for the extraction of links from web pages.

- [Dependencies :heavy_plus_sign:](#dependencies)
- [Install :package:](#install)
- [Run :runner:](#run)
- [Bonus :medal_sports:](#bonus)

## Dependencies
- Python 3.8+
- Requests
- BeautifulSoup

## Install
01 -) Clone:
```shell
$ git clone https://github.com/thiiagoms/links-extractor
```

02 -) Go to `links-extractor` directory:
```shell
$ cd links-extractor
links-extractor $
```

## Run
01 -) In your `script.py`, call the main `Extractor` class like this:
```python
from src.services.extractor import Extractor
from src.utils.printer import Printer

# Pages to scan for links
urls = ['https://github.com', 'https://google.com']

extractor = Extractor()
# The result is iterated below as a mapping of each URL to its extracted links
links = extractor.extract(urls, timeout=10)

for url, extracted_links in links.items():
    Printer.message(f"Url: {url}")
    for link in extracted_links:
        Printer.success(f" {link}")
    # Separator printed after each URL's links
    Printer.message("###############")
```

And you should receive this output:
```text
$ python example.py

Url: https://github.com
#start-of-content
https://github.com/
/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home
/features/actions
/features/packages
/features/security

###############
Url: https://google.com
https://www.google.com/imghp?hl=pt-BR&tab=wi
https://maps.google.com.br/maps?hl=pt-BR&tab=wl
https://play.google.com/?hl=pt-BR&tab=w8

###############
```

## Bonus
01 -) Run tests with **pytest**:
```bash
links-extractor $ pytest
```

02 -) Run **autopep8** lint on files like:
```bash
links-extractor $ autopep8 --in-place --aggressive --aggressive src/services/extractor.py
```
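
The sample output in the Run section mixes absolute links with relative ones such as `/features/actions` and `#start-of-content`. If you need absolute URLs, one option is to resolve each link against the page URL. The block below is a minimal sketch of that idea and not the library's own code: it uses the standard-library `html.parser` as a stand-in for BeautifulSoup so it runs without extra dependencies, and `AnchorCollector` and `extract_links` are hypothetical names that do not exist in `links-extractor`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag != "a":
            return
        for name, value in attrs:
            # Skip missing or empty href attributes
            if name == "href" and value:
                self.hrefs.append(value)


def extract_links(html, base_url):
    """Return every link in html resolved against base_url, in document order."""
    collector = AnchorCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Small inline page so the sketch runs without network access
sample = """
<a href="#start-of-content">Skip to content</a>
<a href="https://github.com/">Home</a>
<a href="/features/actions">Actions</a>
"""

for link in extract_links(sample, "https://github.com/"):
    print(link)
```

The same `urljoin` call can be applied to the lists returned by `extractor.extract(...)`, assuming each key of the result is the URL of the page the links came from, as the Run example suggests.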