Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/archiveteam/webarchiver
Decentralized web archiving
https://github.com/archiveteam/webarchiver
archiver archiving crawler decentralized python warc web webarchiving
Last synced: 2 months ago
JSON representation
Decentralized web archiving
- Host: GitHub
- URL: https://github.com/archiveteam/webarchiver
- Owner: ArchiveTeam
- Created: 2018-04-22T19:20:55.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-08-07T15:28:31.000Z (over 6 years ago)
- Last Synced: 2024-11-13T14:23:01.730Z (2 months ago)
- Topics: archiver, archiving, crawler, decentralized, python, warc, web, webarchiving
- Language: Python
- Homepage:
- Size: 323 KB
- Stars: 19
- Watchers: 17
- Forks: 3
- Open Issues: 2
-
Metadata Files:
- Readme: README.rst
Awesome Lists containing this project
README
WebArchiver
====**WebArchiver** is a decentralized web archiving system. It allows for servers to be added and removed and minimizes data-loss when a server is offline.
This project is still being developed.
Usage
----WebArchiver has the following dependencies:
* ``flask``
* ``requests``
* ``warcio``Install these by running ``pip install flask requests warcio`` or use ``pip3`` in case your default Python version is Python 2.
``wget`` is also required, this can be installed using::
sudo apt-get install wget
To run WebArchiver:
#. ``git clone`` this repository,
#. ``cd`` into it,
#. Run ``python main.py`` with options or use ``python3`` if your default Python version is Python 2.Options
~~~~The following options are available for setting up a server in a network or creating a network.
* ``-h``
``--help``
* ``-v````-version``: Get the version of WebArchiver.
* ``-S SORT````--sort=SORT``: The sort of server to be created. ``SORT`` can be ``stager`` for a stager or ``crawler`` for a crawler. This argument is required.
* ``-SH HOST````--stager-host=HOST``: The host of the stager to connect to. This should not be set if this is the first stager.
* ``-SP PORT````--stager-port=PORT``: The port of the stager to connect to. This should not be set if this is the first stager.
* ``-H HOST````--host=HOST``: The host to use for communication. If not set the scripts will try to determine the host.
* ``-P PORT````--port=PORT``: The port to use for communication. If not set a random port between 3000 and 6000 will be chosen.
* ``--no-dashboard``: Do not create a dashboard.
* ``--dashboard-port=PORT``: The port to use for the dashboard. Default port is 5000.Add a job
~~~~A crawl of a website or a list of URLs is called a job. To add a job a configuration file needs to be processed and added to WebArchiver. The configuration file has the identifier and the following possible options.
* ``url``: URL to crawl.
* ``urls file``: Filename of a file containing a list of URLs.
* ``urls url``: URL to a webpage containing a raw list of URLs.
* ``rate``: URL crawl rate in URLs per second.
* ``allow regex``: Regular expression a discovered URL should match.
* ``ignore regex``: Regular expression a discovered URL should not match.
* ``depth``: Maximum depth to crawl.For all settings except ``rate`` and ``depth`` multiple entries are possible.
An example of a configuration file is
.. code:: ini
[identifier]
url = https://example.com/
url = https://example.com/page2
urls file = list
urls url = https://pastebin.com/raw/tMpQQk7B
rate = 4
allow regex = https?://(?:www)?example\.com/
allow regex = https?://[^/]+\.london
ignore regex = https?://[^/]+\.nl
depth = 3To process the configuration file and add it to WebArchiver, run ``python add_job.py FILENAME``, where ``FILENAME`` is the name of the configuration file.
Servers
----WebArchiver consists of stagers and crawlers. Stagers divide the work among crawlers and other stagers.
Stager
~~~~The stager distributes new jobs and URLs and received WARCs from crawlers.
Crawling
~~~~The crawler received URLs from the stager it is connected to, crawls these URLs and send back the WARC and new found URLs.