{"id":13586091,"url":"https://github.com/oduwsdl/archivenow","last_synced_at":"2025-04-05T22:08:42.298Z","repository":{"id":48598308,"uuid":"81448050","full_name":"oduwsdl/archivenow","owner":"oduwsdl","description":"A Tool To Push Web Resources Into Web Archives","archived":false,"fork":false,"pushed_at":"2024-01-23T17:39:56.000Z","size":21443,"stargazers_count":419,"open_issues_count":18,"forks_count":41,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-03-29T21:05:46.916Z","etag":null,"topics":["internet-archive","web-archiving"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oduwsdl.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-09T12:29:08.000Z","updated_at":"2025-03-12T18:38:08.000Z","dependencies_parsed_at":"2024-11-06T04:42:59.587Z","dependency_job_id":null,"html_url":"https://github.com/oduwsdl/archivenow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Farchivenow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Farchivenow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Farchivenow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Farchivenow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oduwsdl","download_url":"https://codeload.github.com/oduwsdl/ar
chivenow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247406091,"owners_count":20933803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["internet-archive","web-archiving"],"created_at":"2024-08-01T15:05:19.256Z","updated_at":"2025-04-05T22:08:42.277Z","avatar_url":"https://github.com/oduwsdl.png","language":"Python","funding_links":[],"categories":["Tools \u0026 Software","Python","工具","[💾 sysadmin-devops](https://github.com/stars/ketsapiwiq/lists/sysadmin-devops)","Onboarding","Web Archiving"],"sub_categories":["Acquisition","信息备份","Capture Operators \u0026 Services"],"readme":"Archive Now (archivenow)\n=============================\nA Tool To Push Web Resources Into Web Archives\n----------------------------------------------\n\nArchive Now (**archivenow**) currently is configured to push resources into four public web archives. You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and place it inside the folder \"handlers\". \n\nUpdate January 2021\n~~~~~~~~~\nOriginally, **archivenow** was configured to push to 6 different public web archives. The two removed web archives are `WebCite \u003chttps://www.webcitation.org/\u003e`_ and `archive.st \u003chttp://archive.st/\u003e`_. WebCite was removed from **archivenow** as they are no longer accepting archiving requests. Archive.st was removed from **archivenow** due to encountering a Captcha when attempting to push to the archive. 
In addition to removing those 2 archives, the method for pushing to `archive.today \u003chttps://archive.vn/\u003e`_ and `megalodon.jp \u003chttps://megalodon.jp/\u003e`_ from **archivenow** has been updated. In order to push to `archive.today \u003chttps://archive.vn/\u003e`_ and `megalodon.jp \u003chttps://megalodon.jp/\u003e`_, `Selenium \u003chttps://selenium-python.readthedocs.io/\u003e`_ is used.\n\nAs explained below, this library can be used through:\n\n- Command Line Interface (CLI)\n\n- A Web Service\n\n- A Docker Container\n\n- Python\n\n\nInstalling\n----------\nThe latest release of **archivenow** can be installed using pip:\n\n.. code-block:: bash\n\n      $ pip install archivenow\n\nThe latest development version containing changes not yet released can be installed from source:\n\n.. code-block:: bash\n      \n      $ git clone git@github.com:oduwsdl/archivenow.git\n      $ cd archivenow\n      $ pip install -r requirements.txt\n      $ pip install ./\n      \nIn order to push to `archive.today \u003chttps://archive.vn/\u003e`_ and `megalodon.jp \u003chttps://megalodon.jp/\u003e`_, **archivenow** must use `Selenium \u003chttps://selenium-python.readthedocs.io/\u003e`_, which has already been added to the requirements.txt. However, Selenium additionally needs a driver to interface with the chosen browser. 
It is recommended to use Selenium and **archivenow** with `Firefox \u003chttps://www.mozilla.org/en-US/firefox/releases/\u003e`_ and Firefox's corresponding `GeckoDriver \u003chttps://github.com/mozilla/geckodriver/releases\u003e`_.\n\nYou can download the latest versions of `Firefox \u003chttps://www.mozilla.org/en-US/firefox/releases/\u003e`_ and the `GeckoDriver \u003chttps://github.com/mozilla/geckodriver/releases\u003e`_ to use with **archivenow**.\n\nAfter installing the driver, you can push to `archive.today \u003chttps://archive.vn/\u003e`_ and `megalodon.jp \u003chttps://megalodon.jp/\u003e`_ from **archivenow**.\n\nCLI USAGE \n---------\nUsage of sub-commands in **archivenow** can be accessed through providing the `-h` or `--help` flag, like any of the below.\n\n.. code-block:: bash\n\n      $ archivenow -h\n      usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]]\n                           [--is] [--ia] [--warc [WARC]] [-v] [--all]\n                           [--server] [--host [HOST]] [--agent [AGENT]]\n                           [--port [PORT]]\n                           [URI]\n\n      positional arguments:\n        URI                   URI of a web resource\n\n      optional arguments:\n        -h, --help            show this help message and exit\n        --mg                  Use Megalodon.jp\n        --cc                  Use The Perma.cc Archive\n        --cc_api_key [CC_API_KEY]\n                              An API KEY is required by The Perma.cc Archive\n        --is                  Use The Archive.is\n        --ia                  Use The Internet Archive\n        --warc [WARC]         Generate WARC file\n        -v, --version         Report the version of archivenow\n        --all                 Use all possible archives\n        --server              Run archiveNow as a Web Service\n        --host [HOST]         A server address\n        --agent [AGENT]       Use \"wget\" or \"squidwarc\" for WARC generation\n        
--port [PORT]         A port number to run a Web Service\n\nExamples\n--------\n\n\nExample 1\n~~~~~~~~~\n\nTo save the web page (www.foxnews.com) in the Internet Archive:\n\n.. code-block:: bash\n\n      $ archivenow --ia www.foxnews.com\n      https://web.archive.org/web/20170209135625/http://www.foxnews.com\n\nExample 2\n~~~~~~~~~\n\nBy default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:\n\n.. code-block:: bash\n\n      $ archivenow www.foxnews.com\n      https://web.archive.org/web/20170215164835/http://www.foxnews.com\n\nExample 3\n~~~~~~~~~\n\nTo save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is:\n\n.. code-block:: bash\n      \n      $ archivenow --ia --is www.foxnews.com\n      https://web.archive.org/web/20170209140345/http://www.foxnews.com\n      http://archive.is/fPVyc\n\n\nExample 4\n~~~~~~~~~\n\nTo save the web page (https://nypost.com/) in all configured web archives. In addition to preserving the page in all configured archives, this command will also locally create a WARC file:\n\n.. code-block:: bash\n      \n      $ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key\n      http://archive.is/dcnan\n      https://perma.cc/53CC-5ST8\n      https://web.archive.org/web/20181002081445/https://nypost.com/\n      https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/\n      https_nypost.com__96ec2300.warc\n\nExample 5\n~~~~~~~~~\n\nTo download the web page (https://nypost.com/) and create a WARC file:\n\n.. code-block:: bash\n      \n      $ archivenow --warc=mypage --agent=wget https://nypost.com/\n      mypage.warc\n      \nServer\n------\n\nYou can run **archivenow** as a web service. You can specify the server address and/or the port number (e.g., --host localhost  --port 12345)\n\n.. 
code-block:: bash\n      \n      $ archivenow --server\n      \n      Running on http://0.0.0.0:12345/ (Press CTRL+C to quit)\n\n\nExample 6\n~~~~~~~~~\n\nTo save the web page (www.foxnews.com) in The Internet Archive through the web service:\n\n.. code-block:: bash\n\n      $ curl -i http://0.0.0.0:12345/ia/www.foxnews.com\n      \n          HTTP/1.0 200 OK\n          Content-Type: application/json\n          Content-Length: 95\n          Server: Werkzeug/0.11.15 Python/2.7.10\n          Date: Tue, 02 Oct 2018 08:20:18 GMT\n\n          {\n            \"results\": [\n              \"https://web.archive.org/web/20181002082007/http://www.foxnews.com\"\n            ]\n          }\n      \nExample 7\n~~~~~~~~~\n\nTo save the web page (www.foxnews.com) in all configured archives through the web service:\n\n.. code-block:: bash\n      \n      $ curl -i http://0.0.0.0:12345/all/www.foxnews.com\n\n          HTTP/1.0 200 OK\n          Content-Type: application/json\n          Content-Length: 385\n          Server: Werkzeug/0.11.15 Python/2.7.10\n          Date: Tue, 02 Oct 2018 08:23:53 GMT\n\n          {\n            \"results\": [\n              \"Error (The Perma.cc Archive): An API Key is required \", \n              \"http://archive.is/ukads\", \n              \"https://web.archive.org/web/20181002082007/http://www.foxnews.com\", \n              \"Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... \", \n              \"http://www.webcitation.org/72rbKsX8B\"\n            ]\n          }\n\nExample 8\n~~~~~~~~~\n\nBecause an API Key is required by Perma.cc, the HTTP request should be as follows:\n        \n.. code-block:: bash\n      \n      $ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key\n\nOr use only Perma.cc:\n\n.. 
code-block:: bash\n\n      $ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key\n\nRunning as a Docker Container\n-----------------------------\n\n.. code-block:: bash\n\n    $ docker image pull oduwsdl/archivenow\n\nDifferent ways to run archivenow    \n\n.. code-block:: bash\n\n    $ docker container run -it --rm oduwsdl/archivenow -h\n\nAccessible at 127.0.0.1:12345:\n\n.. code-block:: bash\n\n    $ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0\n\nAccessible at 127.0.0.1:22222:\n\n.. code-block:: bash\n\n    $ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0\n\n.. image:: http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif\n   :width: 10pt\n\n\nTo save the web page (http://www.cnn.com) in The Internet Archive\n\n.. code-block:: bash\n\n    $ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com\n    \n\nPython Usage\n------------\n\n.. code-block:: bash\n   \n    \u003e\u003e\u003e from archivenow import archivenow\n\nExample 9\n~~~~~~~~~~\n\nTo save the web page (www.foxnews.com) in all configured archives:\n\n.. code-block:: bash\n\n      \u003e\u003e\u003e archivenow.push(\"www.foxnews.com\",\"all\")\n      ['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']\n\nExample 10\n~~~~~~~~~~\n\nTo save the web page (www.foxnews.com) in The Perma.cc:\n\n.. code-block:: bash\n\n      \u003e\u003e\u003e archivenow.push(\"www.foxnews.com\",\"cc\",{\"cc_api_key\":\"$YOUR-Perma-cc-API-KEY\"})\n      ['https://perma.cc/8YYC-C7RM']\n      \nExample 11\n~~~~~~~~~~\n\nTo start the server from Python do the following. The server/port number can be passed (e.g., start(port=1111, host='localhost')):\n\n.. 
code-block:: bash\n\n      \u003e\u003e\u003e archivenow.start()\n      \n          2017-02-09 15:02:37\n          Running on http://127.0.0.1:12345\n          (Press CTRL+C to quit)\n\n\nConfiguring a new archive or removing an existing one\n-----------------------------------------------------\nAdditional archives may be added by creating a handler file in the \"handlers\" directory.\n\nFor example, if I want to add a new archive named \"My Archive\", I would create a file \"ma_handler.py\" and store it in the folder \"handlers\". The \"ma\" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write:\n\n\n.. code-block:: python\n\n      archivenow.push(\"www.cnn.com\",\"ma\")\n      \n\nIn the file \"ma_handler.py\", the name of the class must be \"MA_handler\". This class must have at least one function called \"push\" which has one argument. See the existing `handler files`_ for examples on how to organize a newly configured archive handler.\n\nRemoving an archive can be done by one of the following options:\n\n- Removing the archive handler file from the folder \"handlers\"\n\n- Renaming the archive handler file to another name that does not end with \"_handler.py\"\n\n- Setting the variable \"enabled\" to \"False\" inside the handler file\n\n\nNotes\n-----\nThe Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the \"same\" resource. \n\nFor example, if you send a request to IA to capture (www.cnn.com) at 10:00pm, IA will create a new copy (*C*) of this URI. IA will then return *C* for all requests to the archive for this URI received until 10:02pm. Using this same submission procedure for Archive.is requires a time gap of five minutes.  \n\n.. _handler files: https://github.com/oduwsdl/archivenow/tree/master/archivenow/handlers\n\n\nCiting Project\n--------------\n\n.. 
code-block:: latex\n\n      @INPROCEEDINGS{archivenow-jcdl2018,\n        AUTHOR    = {Mohamed Aturban and\n                     Mat Kelly and\n                     Sawood Alam and\n                     John A. Berlin and\n                     Michael L. Nelson and\n                     Michele C. Weigle},\n        TITLE     = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation},\n        BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries},\n        SERIES    = {{JCDL} '18},\n        PAGES     = {321--322},\n        MONTH     = {June},\n        YEAR      = {2018},\n        ADDRESS   = {Fort Worth, Texas, USA},\n        URL       = {https://doi.org/10.1145/3197026.3203880},\n        DOI       = {10.1145/3197026.3203880}\n      }\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foduwsdl%2Farchivenow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foduwsdl%2Farchivenow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foduwsdl%2Farchivenow/lists"}