{"id":15297484,"url":"https://github.com/ruedigervoigt/exoskeleton","last_synced_at":"2025-04-13T22:32:22.542Z","repository":{"id":57427603,"uuid":"217070596","full_name":"RuedigerVoigt/exoskeleton","owner":"RuedigerVoigt","description":"A Python framework to build polite, but tenacious crawlers / scrapers with a MariaDB backend","archived":false,"fork":false,"pushed_at":"2023-10-15T19:33:43.000Z","size":723,"stargazers_count":21,"open_issues_count":6,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-12T14:41:09.564Z","etag":null,"topics":["crawler","crawling-framework","database","machine-learning","mariadb","network","python","python-3","scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RuedigerVoigt.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-10-23T13:55:48.000Z","updated_at":"2024-01-03T14:16:42.000Z","dependencies_parsed_at":"2024-03-31T10:46:33.170Z","dependency_job_id":null,"html_url":"https://github.com/RuedigerVoigt/exoskeleton","commit_stats":{"total_commits":555,"total_committers":4,"mean_commits":138.75,"dds":0.05585585585585584,"last_synced_commit":"2cdeaeca0094a7aa37c5e2b78a0e4c82da609817"},"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuedigerVoigt%2Fexoskeleton","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuedigerVoigt%2Fexoskeleton/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/RuedigerVoigt%2Fexoskeleton/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RuedigerVoigt%2Fexoskeleton/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RuedigerVoigt","download_url":"https://codeload.github.com/RuedigerVoigt/exoskeleton/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248790711,"owners_count":21162076,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling-framework","database","machine-learning","mariadb","network","python","python-3","scraping"],"created_at":"2024-09-30T19:17:47.649Z","updated_at":"2025-04-13T22:32:22.224Z","avatar_url":"https://github.com/RuedigerVoigt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Exoskeleton\n\n![pypi version](https://img.shields.io/pypi/v/exoskeleton)\n![Supported Python Versions](https://img.shields.io/pypi/pyversions/exoskeleton)\n![Build](https://github.com/RuedigerVoigt/exoskeleton/workflows/Build/badge.svg)\n![Last commit](https://img.shields.io/github/last-commit/RuedigerVoigt/exoskeleton)\n[![Downloads](https://pepy.tech/badge/exoskeleton)](https://pepy.tech/project/exoskeleton)\n[![Coverage](https://img.shields.io/badge/coverage-85%25-lightgreen)](https://www.ruediger-voigt.eu/coverage/exoskeleton/index.html)\n\nMachine Learning and other applications make it necessary to download thousands or sometimes hundreds of thousands of files.\n\nUsing a high-speed-connection carries the 
risk of running an involuntary denial-of-service attack on the servers that provide those files and webpages.\n\nExoskeleton is a Python framework that helps you build a crawler / scraper that avoids excessive load on the connection and instead runs continuously and fault-tolerantly until all files are downloaded.\n\nIts main features are:\n* Managing the download queue and document data within a MariaDB database.\n* Avoiding processing the same URL more than once.\n* Working through the queue by either\n    * downloading files to disk,\n    * storing the page source code into a database table,\n    * storing the page text,\n    * or making PDF copies of webpages.\n* Managing already downloaded files:\n    * Storing multiple versions of a specific file.\n    * Assigning labels to downloads, so they can be found and grouped easily.\n* Sending progress reports to the admin.\n\n# Documentation\n\n## How To Use Exoskeleton\n\n* [Installation and Requirements](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/installation.md)\n* [Create a Bot](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/create-a-bot.md)\n* [Dealing with result pages](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/parse-search-results.md)\n* [Avoiding duplicates](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/avoiding-duplicates.md)\n* [The Queue: Downloading files / Saving the page code / Creating PDF](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/handling-pages.md)\n* [Bot Behavior](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/behavior-settings.md)\n* [Progress Reports via Email](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/progress-reports-via-email.md)\n* [File Versions and Labels](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/versions-and-labels.md)\n* [Using the 
Blocklist](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/blocklist.md)\n\n## Example Uses\n\n* [Downloading an Archive](https://www.ruediger-voigt.eu/exoskeleton-download-an-archive.html) : A fairly complex use case requiring some custom SQL. This is the actual project that triggered the development of exoskeleton.\n\n## Technical Documentation\n\n* [Contributing](https://github.com/RuedigerVoigt/exoskeleton/tree/master/contributing.md)\n* [Database Structure](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/database-schema.md)\n* [Testing](https://github.com/RuedigerVoigt/exoskeleton/tree/master/documentation/testing-exoskeleton.md)\n\n\n\n## Example\n\n```python\n#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\nimport logging\n\nimport exoskeleton\n\nlogging.basicConfig(level=logging.DEBUG)\n\n# Create a bot\n# exoskeleton makes reasonable assumptions about\n# parameters left out, like:\n# - host = localhost\n# - port = 3306 (MariaDB standard)\n# - ...\nexo = exoskeleton.Exoskeleton(\n    project_name='Bot',\n    database_settings={'database': 'exoskeleton',\n                       'username': 'exoskeleton',\n                       'passphrase': ''},\n    # True to stop once the queue is empty; otherwise the bot\n    # keeps checking the queue for new tasks:\n    bot_behavior={'stop_if_queue_empty': True},\n    filename_prefix='bot_',\n    chrome_name='chromium-browser',\n    target_directory='/home/myusername/myBot/'\n)\n\nexo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')\n# =\u003e Will be saved in the target directory. 
The filename will be the\n#    chosen prefix followed by the database id and .txt.\n\nexo.add_file_download(\n    'https://www.ruediger-voigt.eu/examplefile.txt',\n    {'example-label', 'foo'})\n# =\u003e Duplicate will be recognized and not added to the queue,\n#    but the labels will be associated with the file in the\n#    database.\n\n\nexo.add_file_download(\n    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')\n# =\u003e Nonexistent file: will be marked, but will not stop the bot.\n\n# Save a page's code into the database:\nexo.add_save_page_code('https://www.ruediger-voigt.eu/')\n\n# Use chromium or Google chrome to generate a PDF of the website:\nexo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')\n\n# work through the queue:\nexo.process_queue()\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruedigervoigt%2Fexoskeleton","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fruedigervoigt%2Fexoskeleton","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruedigervoigt%2Fexoskeleton/lists"}