https://github.com/gabfl/sitecrawl

Simple Python module to crawl a website and extract URLs
https://github.com/gabfl/sitecrawl

crawl crawler crawler-python crawling-sites


Host: GitHub
URL: https://github.com/gabfl/sitecrawl
Owner: gabfl
License: mit
Created: 2022-01-24T23:50:52.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-05-09T00:08:05.000Z (over 1 year ago)
Last Synced: 2024-08-09T02:55:36.715Z (6 months ago)
Topics: crawl, crawler, crawler-python, crawling-sites
Language: Python
Homepage:
Size: 30.3 KB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# sitecrawl

[![Pypi](https://img.shields.io/pypi/v/sitecrawl.svg)](https://pypi.org/project/sitecrawl)

[![Build Status](https://github.com/gabfl/sitecrawl/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/gabfl/sitecrawl/actions)

[![codecov](https://codecov.io/gh/gabfl/sitecrawl/branch/main/graph/badge.svg)](https://codecov.io/gh/gabfl/sitecrawl)

[![MIT licensed](https://img.shields.io/badge/license-MIT-green.svg)](https://raw.githubusercontent.com/gabfl/sitecrawl/main/LICENSE)

Simple Python module to crawl a website and extract URLs.

## Installation

Using pip:

```bash
pip3 install sitecrawl
sitecrawl --help
```

Or build from source:

```bash
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .
```

## Usage

### CLI

```bash
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
```

Example output:

```
* Found 4 internal URLs
  https://www.yahoo.com
  https://www.yahoo.com/entertainment
  https://www.yahoo.com/lifestyle
  https://www.yahoo.com/plus

* Found 5 external URLs
  https://mail.yahoo.com/
  https://news.yahoo.com/
  https://finance.yahoo.com/
  https://sports.yahoo.com/
  https://shopping.yahoo.com/

* Skipped 0 URLs
```
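If the CLI report needs to be consumed by another script, the text format shown above can be split back into lists. The following is a minimal sketch, not part of sitecrawl: `parse_sitecrawl_output` is a hypothetical helper, and it assumes the report always uses the `* Found N internal URLs` / `* Found N external URLs` / `* Skipped N URLs` headers followed by one URL per line, as in the sample.

```py
def parse_sitecrawl_output(text):
    """Group URLs from a sitecrawl-style report by section (hypothetical helper).

    Assumes the layout shown in the sample output: header lines start
    with '* ' and every other non-empty line is a URL of the current section.
    """
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith('* '):
            words = line[2:].lower().split()
            # "found 4 internal urls" -> "internal"; "skipped 0 urls" -> "skipped"
            current = words[2] if words[0] == 'found' else words[0]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections
```

The text to parse could be captured with the standard library, for example `subprocess.run([...], capture_output=True, text=True).stdout`, assuming the report is written to standard output. When working from Python, the module API described in the next section is likely the more direct route.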

### As a module

Basic example:

```py
from sitecrawl import crawl

# Set the starting URL for the crawl
crawl.base_url = 'https://www.yahoo.com'

# Crawl the site with a depth of 2
crawl.deep_crawl(depth=2)

# Print the URLs collected during the crawl
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
```

A more detailed example is available in [example.py](https://github.com/gabfl/sitecrawl/blob/main/example.py).
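
To illustrate the internal/external distinction, here is a minimal sketch of how links could be sorted into the three groups using only the standard library. This is not sitecrawl's actual implementation: `classify_links` is a hypothetical function, "internal" is assumed to mean "same hostname as the base URL" (which is consistent with the CLI output above, where `mail.yahoo.com` is reported as external), and "skipped" is assumed to mean links that are not HTTP(S).

```py
from urllib.parse import urljoin, urlparse


def classify_links(base_url, hrefs):
    """Sort links into internal, external and skipped lists (illustrative only)."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external, skipped = [], [], []
    for href in hrefs:
        # Resolve relative links such as "/lifestyle" against the base URL
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            # Assumption: non-HTTP(S) links (mailto:, javascript:, ...) are skipped
            skipped.append(href)
        elif parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external, skipped
```

For example, `classify_links('https://www.yahoo.com', ['/lifestyle', 'https://mail.yahoo.com/'])` places the first link in the internal list and the second in the external list.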