Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-web-archiving

An Awesome List for getting started with web archiving
https://github.com/iipc/awesome-web-archiving

Last synced: 3 days ago

  • Training/Documentation

  • Tools & Software

    • Acquisition

      • WebMemex - Browser extension for Firefox and Chrome which lets you archive web pages you visit. *(In Development)*
      • WARCreate - A [Google Chrome](https://www.google.com/intl/en/chrome/browser/) extension for archiving an individual webpage or website to a WARC file. *(Stable)*
      • Web Curator Tool - Open-source workflow management for selective web archiving. *(Stable)*
      • Wget - An open source file retrieval utility that, as of [version 1.14, supports writing WARCs](http://www.archiveteam.org/index.php?title=Wget_with_WARC_output). *(Stable)*
      • ArchiveBox - A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly `Bookmark Archiver`). *(In Development)*
      • ArchiveWeb.Page - A plugin for Chrome and other Chromium-based browsers that lets you interactively archive web pages, replay them, and export them as WARC data. Also available as an Electron-based desktop application.
      • Crawl - A simple web crawler in Golang. *(Stable)*
      • Heritrix - An open source, extensible, web-scale, archival quality web crawler. *(Stable)*
      • Heritrix Q&A - A discussion forum for asking questions and getting answers about using Heritrix.
      • HTTrack - An open source website copying utility. *(Stable)*
      • SiteStory - A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. *(Stable)*
      • DiskerNet - A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse, making it available for offline replay. *(In Development)*
      • Webrecorder - Create high-fidelity, interactive recordings of any web site you browse. *(Stable)*
      • Wpull - A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler. *(Stable)*
      • SecurityTrails - Web-based archive for WHOIS and DNS records. REST API available free of charge.
      • Tempas v1 - Temporal web archive search based on [Delicious](https://en.wikipedia.org/wiki/Delicious_(website)) tags. *(Stable)*
      • Tempas v2 - Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., [Obama@2005-2009 in Tempas](http://tempas.l3s.de/v2/query?q=obama&from=2005&to=2009)). *(Stable)*
    • Utilities

      • cdx-toolkit - Library and CLI to consult cdx indexes and create WARC extractions of subsets. Abstracts away Common Crawl's unusual crawl structure. *(Stable)*
      • httpreserve.info - Service that returns the status of a web page or saves it to the Internet Archive. HTTPreserve disambiguates well-known short link services. It returns JSON in the browser, or on the command line via curl using GET. It describes web sites using their earliest and latest dates in the Internet Archive and demonstrates the construction of Robust Links in its output using that range (Golang). *(Stable)*
      • The Archive Browser - A program that lets you browse and extract the contents of archives. It lets you open files from inside archives and preview them using Quick Look. WARC is supported (macOS only, proprietary app).
      • The Unarchiver - Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, proprietary app).
    • WARC I/O Libraries

      • Jwat - Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). *(Stable)*
    • Analysis

    • Quality Assurance

      • Chrome Check My Links - Browser extension: a link checker with more options.
      • Chrome link checker - Browser extension: basic link checker.
      • Chrome Open Multiple URLs - Browser extension: opens multiple URLs and also extracts URLs from text.
      • Chrome Revolver - Browser extension: switches between browser tabs.
      • FlameShot - Screen capture and annotation on Ubuntu.
      • PlayOnLinux - For running Xenu and Notepad++ on Ubuntu.
      • PlayOnMac - For running Xenu and Notepad++ on macOS.
      • Windows Snipping Tool - Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture).
      • WineBottler - For running Xenu and Notepad++ on macOS.
      • Xenu - Desktop link checker for Windows.
      • Chrome link gopher - Browser extension: link harvester on a page.
    • Curation

      • Zotero Robust Links Extension - A [Zotero](https://www.zotero.org/) extension that submits to and reads from web archives. Source [on GitHub](https://github.com/lanl/Zotero-Robust-Links-Extension). Supersedes [leonkt/zotero-memento](https://github.com/leonkt/zotero-memento).
    • Replay

      • PyWb - A Python (2 and 3) implementation of web archival replay tools, sometimes also known as 'Wayback Machine'. *(Stable)*
      • ReplayWeb.page - A browser-based, fully client-side replay engine for both local and remote WARC & WACZ files. *(Stable)*
  • Resources for Web Publishers

  • Community Resources

  • Web Archiving Service Providers

    • Self-hostable, Open Source

      • Browsertrix - From [Webrecorder](https://webrecorder.net/), source available at <https://github.com/webrecorder/browsertrix>.
      • Browsertrix Cloud - From [Webrecorder](https://webrecorder.net/), source available at <https://github.com/webrecorder/browsertrix-cloud>.
      • Conifer - From [Rhizome](https://rhizome.org/), source available at <https://github.com/Rhizome-Conifer>.
    • Hosted, Closed Source