https://github.com/e73b025/simple-python-url-crawler

Super simple Python3 website URL scraper/crawler. Multi-threaded.
crawler googlebot lightweight link-collection multi-threaded python python3 scraper simple

Host: GitHub
URL: https://github.com/e73b025/simple-python-url-crawler
Owner: e73b025
License: mit
Created: 2020-03-03T15:27:19.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2022-05-16T13:46:30.000Z (over 2 years ago)
Last Synced: 2024-10-12T09:45:56.179Z (4 months ago)
Topics: crawler, googlebot, lightweight, link-collection, multi-threaded, python, python3, scraper, simple
Language: Python
Size: 21.5 KB
Stars: 2
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

## Description

A super simple multi-threaded website URL crawler. It returns a Python list of all URLs found and can be configured to return internal URLs, external URLs, or both.
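
To make the internal/external distinction concrete, here is a minimal sketch of one way to classify a link relative to the site being crawled. It assumes "internal" means "same host as the start URL"; the helper name `classify_url` is hypothetical and is not part of this project's API, and the crawler's own rule may differ.

```python
from urllib.parse import urljoin, urlparse


def classify_url(base_url, link):
    """Return (kind, absolute_url), where kind is "internal" or "external".

    Assumption: a link is internal when its host matches the host of base_url.
    """
    # Resolve relative links such as "/cv/" against the base URL.
    absolute = urljoin(base_url, link)
    same_host = urlparse(absolute).netloc.lower() == urlparse(base_url).netloc.lower()
    return ("internal" if same_host else "external"), absolute


print(classify_url("https://strongscot.com", "/cv/"))
print(classify_url("https://strongscot.com", "https://github.com/strongscot"))
```

Comparing hosts rather than string prefixes keeps relative links and differently cased hostnames on the internal side.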

## Dependencies

- pip install requests

- pip install beautifulsoup4

## Features

- Super simple; two lines of code to get a list of URLs on a website.
- Multi-threaded.
- Logging can be enabled or disabled.
- Can return internal URLs, external URLs, or both.
- Accepts an optional callback that is invoked live, as each URL is found.
- Not much else.
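
As a rough illustration of what a multi-threaded crawl with a live callback can look like, here is a minimal sketch built on the standard library's `ThreadPoolExecutor`. It is not the code in `SiteUrlCrawler.py`: the function `crawl_sketch` and its `fetch_links` parameter are hypothetical, and page fetching is injected as a function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_sketch(start_url, fetch_links, max_threads=5, callback=None):
    """Breadth-first crawl. fetch_links(url) must return the links on that page."""
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        while frontier:
            # Fetch every page of the current level in parallel.
            results = pool.map(fetch_links, frontier)
            next_frontier = []
            # Bookkeeping stays on the main thread, so no lock is needed here.
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        if callback is not None:
                            callback(link)  # report the find immediately
                        next_frontier.append(link)
            frontier = next_frontier
    return sorted(seen)


# A tiny in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {"/": ["/cv/", "/contact/"], "/cv/": ["/", "/projects/"]}
print(crawl_sketch("/", lambda url: FAKE_SITE.get(url, [])))
```

In a real crawler, `fetch_links` would download the page (for example with `requests`) and extract its anchors (for example with `beautifulsoup4`).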

## Usage

The following code sample crawls the site "strongscot.com" using 5 threads, with logging disabled.

### Find Internal and External URLs

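```python
# Assumption: SiteUrlCrawler.py is importable from the working directory
from SiteUrlCrawler import SiteUrlCrawler

# Crawl with 5 threads and logging disabled
crawler = SiteUrlCrawler("https://strongscot.com", 5, False)

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.ALL):
    print("Found: " + url)
```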

Will output something similar to this:

```
Found: https://strongscot.com/
Found: https://strongscot.com/projects/
Found: https://strongscot.com/cv/
Found: https://strongscot.com/contact/
Found: https://github.com/strongscot
Found: https://strongscot.com/blog/20/03/03/simple-site-crawler.html
Found: https://strongscot.com/blog/20/02/19/birthday.html
Found: https://strongscot.com/blog/19/12/09/new-site.html
Found: https://strongscot.com/blog/19/09/09/body-goals.html
Found: https://strongscot.com/blog/19/09/09/cool-dropdown-ui.html
Found: https://strongscot.com/blog/19/09/09/flying-in-a-flight-machine.html
Found: https://github.com/strongscot/simple-python-url-crawler
```

### Find Only Internal URLs

```python
crawler = SiteUrlCrawler("https://strongscot.com")

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.INTERNAL):
    print("Found: " + url)
```

### Find Only External URLs

```python
crawler = SiteUrlCrawler("https://strongscot.com")

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.EXTERNAL):
    print("Found: " + url)
```

Will output:

```
Found: https://github.com/strongscot
Found: https://twitter.com/thestrongscot
```

## Using Callback (getting live URL finds as they happen)

To receive each URL as it is found, rather than in a list at the end, pass an optional callback as the second argument to the ``crawl()`` method. For example:

```python
crawler = SiteUrlCrawler("https://strongscot.com")

def callback(url):
    print("Found: " + url)

# Get ALL urls and print them
crawler.crawl(SiteUrlCrawler.Mode.ALL, callback)
```
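
If the callback needs to do more than print, for example collect unique URLs, a small closure can hold the state. The sketch below is only an illustration: `make_collector` is a hypothetical helper, and the lock is there on the assumption that the callback might be invoked from worker threads, which this README does not specify.

```python
import threading


def make_collector():
    """Return (callback, urls). The callback appends each new URL to urls."""
    urls = []
    lock = threading.Lock()

    def callback(url):
        # Guard the shared list in case the callback runs on several threads.
        with lock:
            if url not in urls:
                urls.append(url)

    return callback, urls


callback, urls = make_collector()

# Simulate live finds; with the real crawler, the callback would instead be
# passed as the second argument to crawl().
callback("https://strongscot.com/")
callback("https://strongscot.com/cv/")
callback("https://strongscot.com/")
print(urls)
```

Duplicates are dropped while the order of first discovery is preserved.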

## Bad-Tip

Want to turn it into a small Googlebot? Comment out lines ``134`` - ``136`` in the file ``SiteUrlCrawler.py`` and it will crawl external links as well.

## Author

@strongscot

## License

MIT