Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/RuedigerVoigt/exoskeleton
A Python framework to build polite, but tenacious crawlers / scrapers with a MariaDB backend
crawler crawling-framework database machine-learning mariadb network python python-3 scraping
- Host: GitHub
- URL: https://github.com/RuedigerVoigt/exoskeleton
- Owner: RuedigerVoigt
- License: apache-2.0
- Created: 2019-10-23T13:55:48.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-10-15T19:33:43.000Z (about 1 year ago)
- Last Synced: 2024-07-27T22:44:18.334Z (3 months ago)
- Topics: crawler, crawling-framework, database, machine-learning, mariadb, network, python, python-3, scraping
- Language: Python
- Homepage:
- Size: 706 KB
- Stars: 21
- Watchers: 4
- Forks: 1
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: contributing.md
- License: LICENSE
# Exoskeleton
![pypi version](https://img.shields.io/pypi/v/exoskeleton)
![Supported Python Versions](https://img.shields.io/pypi/pyversions/exoskeleton)
![Build](https://github.com/RuedigerVoigt/exoskeleton/workflows/Build/badge.svg)
![Last commit](https://img.shields.io/github/last-commit/RuedigerVoigt/exoskeleton)
[![Downloads](https://pepy.tech/badge/exoskeleton)](https://pepy.tech/project/exoskeleton)
[![Coverage](https://img.shields.io/badge/coverage-85%25-lightgreen)](https://www.ruediger-voigt.eu/coverage/exoskeleton/index.html)

Machine learning and other applications often make it necessary to download thousands, sometimes hundreds of thousands, of files.
With a high-speed connection, you risk running an unintentional denial-of-service attack on the servers that provide those files and webpages.
Exoskeleton is a Python framework that helps you build a crawler / scraper that avoids putting too much load on the connection and instead runs continuously and fault-tolerantly until all files are downloaded.
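
The idea of being polite but tenacious can be pictured with a small standalone sketch. This is not exoskeleton's implementation. The names `process_politely`, `fetch`, `wait_seconds`, and `max_attempts` are hypothetical and chosen only for illustration. The sketch assumes that a fixed pause between requests and a limited number of retries are enough to show the principle.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple


def process_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    wait_seconds: float = 5.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, bytes], List[str]]:
    """Work through a queue one URL at a time, pausing between requests.

    Hypothetical helper for illustration only, not part of exoskeleton.
    """
    queue: Deque[Tuple[str, int]] = deque((url, 1) for url in urls)
    done: Dict[str, bytes] = {}
    failed: List[str] = []
    while queue:
        url, attempt = queue.popleft()
        try:
            done[url] = fetch(url)
        except OSError:
            # Connection and timeout errors are OSError subclasses.
            # Tenacious: put the URL back at the end of the queue,
            # but give up after max_attempts so one URL cannot block the bot.
            if attempt < max_attempts:
                queue.append((url, attempt + 1))
            else:
                failed.append(url)
        if queue:
            # Polite: never send requests back to back.
            sleep(wait_seconds)
    return done, failed
```

Passing `fetch` and `sleep` as parameters keeps the sketch testable without network access. A real bot would also persist the queue, which is what exoskeleton uses MariaDB for.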
Exoskeleton's main features are:

* Managing the download queue and document data within a MariaDB database.
* Avoiding processing the same URL more than once.
* Working through the queue by either
    * downloading files to disk,
    * storing the page source code into a database table,
    * storing the page text,
    * or making PDF copies of webpages.
* Managing already downloaded files:
    * Storing multiple versions of a specific file.
    * Assigning labels to downloads, so they can be found and grouped easily.
* Sending progress reports to the admin.

# Documentation
## How To Use Exoskeleton
* [Installation and Requirements](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/installation.md)
* [Create a Bot](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/create-a-bot.md)
* [Dealing with result pages](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/parse-search-results.md)
* [Avoiding duplicates](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/avoiding-duplicates.md)
* [The Queue: Downloading files / Saving the page code / Creating PDF](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/handling-pages.md)
* [Bot Behavior](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/behavior-settings.md)
* [Progress Reports via Email](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/progress-reports-via-email.md)
* [File Versions and Labels](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/versions-and-labels.md)
* [Using the Blocklist](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/blocklist.md)

## Example Uses
* [Downloading an Archive](https://www.ruediger-voigt.eu/exoskeleton-download-an-archive.html): A fairly complex use case that requires some custom SQL. This is the project that triggered the development of exoskeleton.
## Technical Documentation
* [Contributing](https://github.com/RuedigerVoigt/exoskeleton/tree/master/contributing.md)
* [Database Structure](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/database-schema.md)
* [Testing](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/testing-exoskeleton.md)

## Example
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # True to stop once the queue is empty. Otherwise the bot
    # keeps checking the queue for new tasks:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the website:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# Work through the queue:
exo.process_queue()
```
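
The comments in the example describe two behaviors: a duplicate URL is not queued again although its labels are still stored, and the filename is built from the prefix, the database id, and the file extension. The following in-memory sketch illustrates that logic without a database. It is not how exoskeleton implements it. `TinyQueue`, `add_url`, and `filename_for` are hypothetical names, and the id is assumed to be a simple counter.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional, Set
from urllib.parse import urlparse


class TinyQueue:
    """In-memory stand-in for a download queue (illustration only)."""

    def __init__(self, filename_prefix: str = "bot_") -> None:
        self.filename_prefix = filename_prefix
        self._ids: Dict[str, int] = {}           # URL -> assumed numeric id
        self.labels: Dict[str, Set[str]] = {}    # URL -> labels

    def add_url(self, url: str, labels: Optional[Set[str]] = None) -> bool:
        """Return True if the URL was newly queued, False for a duplicate."""
        is_new = url not in self._ids
        if is_new:
            self._ids[url] = len(self._ids) + 1
            self.labels[url] = set()
        # Labels are attached even when the URL itself is a duplicate.
        self.labels[url].update(labels or set())
        return is_new

    def filename_for(self, url: str) -> str:
        """Build prefix + id + extension, with the extension taken from the URL path."""
        extension = PurePosixPath(urlparse(url).path).suffix
        return f"{self.filename_prefix}{self._ids[url]}{extension}"


queue = TinyQueue(filename_prefix="bot_")
url = "https://www.ruediger-voigt.eu/examplefile.txt"
first = queue.add_url(url)                              # newly queued
second = queue.add_url(url, {"example-label", "foo"})   # duplicate, labels kept
print(first, second, queue.filename_for(url))
```

In exoskeleton itself this bookkeeping lives in MariaDB, so it survives restarts of the bot.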