Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-web-archiving

An Awesome List for getting started with web archiving
https://github.com/ibnesayeed/awesome-web-archiving

Last synced: 5 days ago

  • Training/Documentation

  • Tools & Software

    • Utilities

      • The Archive Browser - The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app).
      • The Unarchiver - Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).
    • Acquisition

      • Crawl - A simple web crawler in Golang. (Stable)
      • Heritrix - An open source, extensible, web-scale, archival quality web crawler. (Stable)
      • HTTrack - An open source website copying utility. (Stable)
      • SiteStory - A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)
      • WebMemex - Browser extension for Firefox and Chrome which lets you archive web pages you visit. (In Development)
      • Wget - An open source file retrieval utility that, as of [version 1.14, supports writing WARCs](http://www.archiveteam.org/index.php?title=Wget_with_WARC_output). (Stable)
      • SecurityTrails - Web based archive for WHOIS and DNS records. REST API available free of charge.
      • Tempas v1 - Temporal web archive search based on [Delicious](https://en.wikipedia.org/wiki/Delicious_(website)) tags. (Stable)
      • Tempas v2 - Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., [Obama@2005-2009 in Tempas](http://tempas.l3s.de/v2/query?q=obama&from=2005&to=2009)). (Stable)
    • WARC I/O Libraries

      • Jwat - Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)
    • Analysis

      • Archives Unleashed Cloud - Archives Unleashed Cloud (AUK) is a web interface for analysing web archives. Currently, it can sync with Archive-It collections and extract hyperlink networks, full text, and other information from your collections. (Stable)
    • Quality Assurance

  • Resources for Web Publishers

  • Community Resources

    • Blogs and Scholarship

      • IIPC Blog
      • Web Archiving Roundtable - Unofficial blog of the Web Archiving Roundtable of the [Society of American Archivists](https://www2.archivists.org/), maintained by the Roundtable's members.
      • The Web as History - An open-source book that provides a conceptual overview of web archiving research, as well as several case studies.
      • WS-DL Blog - The Web Science and Digital Libraries Research Group blogs about various web archiving-related topics, scholarly work, and academic trip reports.
      • DSHR's Blog - David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.
    • Slack

      • IIPC Slack - Ask [@netpreserve](https://twitter.com/NetPreserve) for access.
      • Archives Unleashed Slack - [Fill out this request form](https://docs.google.com/forms/d/e/1FAIpQLScXPIH0Ssw63yWqyMkUqHVYmz2-ItBMzHiJQ-sOlJwTA8u5AQ/viewform?usp=sf_link) for access to a group of researchers working with web archives.
    • Twitter