Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gabfl/sitecrawl
Simple Python module to crawl a website and extract URLs
- Host: GitHub
- URL: https://github.com/gabfl/sitecrawl
- Owner: gabfl
- License: mit
- Created: 2022-01-24T23:50:52.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-05-09T00:08:05.000Z (over 1 year ago)
- Last Synced: 2024-08-09T02:55:36.715Z (5 months ago)
- Topics: crawl, crawler, crawler-python, crawling-sites
- Language: Python
- Homepage:
- Size: 30.3 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# sitecrawl
[![Pypi](https://img.shields.io/pypi/v/sitecrawl.svg)](https://pypi.org/project/sitecrawl)
[![Build Status](https://github.com/gabfl/sitecrawl/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/gabfl/sitecrawl/actions)
[![codecov](https://codecov.io/gh/gabfl/sitecrawl/branch/main/graph/badge.svg)](https://codecov.io/gh/gabfl/sitecrawl)
[![MIT licensed](https://img.shields.io/badge/license-MIT-green.svg)](https://raw.githubusercontent.com/gabfl/sitecrawl/main/LICENSE)

Simple Python module to crawl a website and extract URLs.
## Installation
Using pip:
```bash
pip3 install sitecrawl

sitecrawl --help
```

Or build from sources:
```bash
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .
```

## Usage
### CLI
```bash
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
```

->
```
* Found 4 internal URLs
https://www.yahoo.com
https://www.yahoo.com/entertainment
https://www.yahoo.com/lifestyle
https://www.yahoo.com/plus

* Found 5 external URLs
https://mail.yahoo.com/
https://news.yahoo.com/
https://finance.yahoo.com/
https://sports.yahoo.com/
https://shopping.yahoo.com/

* Skipped 0 URLs
```

### As a module
Basic example:
```py
from sitecrawl import crawl

crawl.base_url = 'https://www.yahoo.com'
crawl.deep_crawl(depth=2)

print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
```

A more detailed example is available in [example.py](https://github.com/gabfl/sitecrawl/blob/main/example.py).
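The internal/external split shown in the output above can be reproduced for any list of links with the standard library alone. The helper below is a hypothetical sketch (not part of sitecrawl's API) of how a crawler typically classifies discovered URLs: compare each URL's hostname against the base URL's hostname, treating subdomains such as `mail.yahoo.com` as external, which matches the CLI output above.

```python
from urllib.parse import urlparse


def classify_urls(base_url, found_urls):
    """Split URLs into internal and external by comparing hostnames."""
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for url in found_urls:
        # Subdomains (e.g. mail.yahoo.com vs www.yahoo.com) count as external
        if urlparse(url).netloc == base_host:
            internal.append(url)
        else:
            external.append(url)
    return internal, external


internal, external = classify_urls(
    'https://www.yahoo.com',
    ['https://www.yahoo.com/lifestyle', 'https://mail.yahoo.com/'],
)
print('Internal:', internal)  # ['https://www.yahoo.com/lifestyle']
print('External:', external)  # ['https://mail.yahoo.com/']
```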