{"id":21035475,"url":"https://github.com/archiveteam/webarchiver","last_synced_at":"2025-05-15T14:31:17.245Z","repository":{"id":140744392,"uuid":"130599962","full_name":"ArchiveTeam/WebArchiver","owner":"ArchiveTeam","description":"Decentralized web archiving","archived":false,"fork":false,"pushed_at":"2018-08-07T15:28:31.000Z","size":331,"stargazers_count":20,"open_issues_count":2,"forks_count":4,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-05-12T08:31:37.893Z","etag":null,"topics":["archiver","archiving","crawler","decentralized","python","warc","web","webarchiving"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveTeam.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-04-22T19:20:55.000Z","updated_at":"2025-03-20T17:38:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"39f8c56f-575c-4762-83ff-9cb35872d3d4","html_url":"https://github.com/ArchiveTeam/WebArchiver","commit_stats":{"total_commits":63,"total_committers":2,"mean_commits":31.5,"dds":"0.015873015873015928","last_synced_commit":"c2144a498764de6f086fc2fca3c653b6c4adcf57"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FWebArchiver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FWebArchiver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FWebArchiver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2
FWebArchiver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveTeam","download_url":"https://codeload.github.com/ArchiveTeam/WebArchiver/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254358688,"owners_count":22057959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiver","archiving","crawler","decentralized","python","warc","web","webarchiving"],"created_at":"2024-11-19T13:15:01.060Z","updated_at":"2025-05-15T14:31:17.205Z","avatar_url":"https://github.com/ArchiveTeam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"WebArchiver\n====\n\n**WebArchiver** is a decentralized web archiving system. It allows servers to be added and removed and minimizes data loss when a server is offline.\n\nThis project is still being developed.\n\nUsage\n----\n\nWebArchiver has the following dependencies:\n\n * ``flask``\n * ``requests``\n * ``warcio``\n\nInstall these by running ``pip install flask requests warcio``, or use ``pip3`` if your default Python version is Python 2.\n\n``wget`` is also required; it can be installed using::\n\n    sudo apt-get install wget\n\nTo run WebArchiver:\n #. ``git clone`` this repository,\n #. ``cd`` into it,\n #. 
Run ``python main.py`` with options, or use ``python3`` if your default Python version is Python 2.\n\nOptions\n~~~~\n\nThe following options are available for setting up a server in a network or creating a network.\n\n * ``-h``\n\n   ``--help``: Show the help message.\n * ``-v``\n\n   ``--version``: Get the version of WebArchiver.\n * ``-S SORT``\n\n   ``--sort=SORT``: The type of server to create. ``SORT`` can be ``stager`` for a stager or ``crawler`` for a crawler. This argument is required.\n * ``-SH HOST``\n\n   ``--stager-host=HOST``: The host of the stager to connect to. This should not be set if this is the first stager.\n * ``-SP PORT``\n\n   ``--stager-port=PORT``: The port of the stager to connect to. This should not be set if this is the first stager.\n * ``-H HOST``\n\n   ``--host=HOST``: The host to use for communication. If not set, the scripts will try to determine the host.\n * ``-P PORT``\n\n   ``--port=PORT``: The port to use for communication. If not set, a random port between 3000 and 6000 will be chosen.\n * ``--no-dashboard``: Do not create a dashboard.\n * ``--dashboard-port=PORT``: The port to use for the dashboard. Default port is 5000.\n\nAdd a job\n~~~~\n\nA crawl of a website or a list of URLs is called a job. To add a job, a configuration file needs to be processed and added to WebArchiver. The configuration file has a section named after the job identifier, which accepts the following options.\n\n * ``url``: URL to crawl.\n * ``urls file``: Filename of a file containing a list of URLs.\n * ``urls url``: URL of a webpage containing a raw list of URLs.\n * ``rate``: URL crawl rate in URLs per second.\n * ``allow regex``: Regular expression a discovered URL should match.\n * ``ignore regex``: Regular expression a discovered URL should not match.\n * ``depth``: Maximum depth to crawl.\n\nFor all settings except ``rate`` and ``depth``, multiple entries are possible.\n\nAn example of a configuration file is:\n\n.. 
code:: ini\n\n    [identifier]\n    url = https://example.com/\n    url = https://example.com/page2\n    urls file = list\n    urls url = https://pastebin.com/raw/tMpQQk7B\n    rate = 4\n    allow regex = https?://(?:www\\.)?example\\.com/\n    allow regex = https?://[^/]+\\.london\n    ignore regex = https?://[^/]+\\.nl\n    depth = 3\n\nTo process the configuration file and add it to WebArchiver, run ``python add_job.py FILENAME``, where ``FILENAME`` is the name of the configuration file.\n\nServers\n----\n\nWebArchiver consists of stagers and crawlers. Stagers divide the work among crawlers and other stagers.\n\nStager\n~~~~\n\nThe stager distributes new jobs and URLs and receives WARCs from crawlers.\n\nCrawling\n~~~~\n\nThe crawler receives URLs from the stager it is connected to, crawls these URLs, and sends back the WARC and newly found URLs.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Fwebarchiver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchiveteam%2Fwebarchiver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Fwebarchiver/lists"}