https://github.com/ruarxive/awesome-digital-preservation

Awesome list dedicated to digital and data preservation tools, sources, services and so on.

# Awesome digital preservation

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

Awesome list of digital preservation tools

## Table of contents

* [Web Archiving](#web-archiving)
* [Social networks](#social-networks)
* [Other digital objects](#other-digital-objects)
* [Standards and specifications](#standards-and-specifications)
* [Organizations](#organizations)
* [Major digital archives](#major-digital-archives)
* [Knowledge bases](#knowledge-bases)
* [Related lists](#related-lists)

## Web archiving

### Crawlers
* [Wget](https://www.gnu.org/software/wget/) - a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols.
* [WPull](https://github.com/ArchiveTeam/wpull) - Wget-compatible web downloader and crawler.
* [Conifer](https://github.com/Rhizome-Conifer/conifer) - collect and revisit web pages
* [grab-site](https://github.com/ArchiveTeam/grab-site) - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
* [Heritrix3](https://github.com/internetarchive/heritrix3) - Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
* [WAIL](https://github.com/machawk1/wail) - Web Archiving Integration Layer: One-Click User Instigated Preservation
* [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) - run a high-fidelity browser-based crawler in a single Docker container
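
Several of the crawlers above can write WARC output directly; GNU Wget, for example, has native WARC support via its `--warc-file` option. A minimal sketch (Python; the URL and file names are illustrative placeholders, and actually running the command requires `wget` on your PATH):

```python
# Sketch: compose a wget invocation that mirrors a site into a WARC file.
# Assumes GNU Wget with --warc-file support; nothing is executed here.

def build_wget_warc_command(url: str, warc_basename: str) -> list[str]:
    """Return the argv list for a WARC-producing mirror crawl."""
    return [
        "wget",
        "--mirror",                    # recursive crawl with timestamping
        "--page-requisites",           # fetch CSS/JS/images needed to render pages
        "--warc-file", warc_basename,  # writes <basename>.warc.gz alongside the files
        url,
    ]

cmd = build_wget_warc_command("https://example.org/", "example")
print(" ".join(cmd))
```

To actually run the crawl, pass `cmd` to `subprocess.run`.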

### Replay tools
* [ArchiveWeb.page](https://github.com/webrecorder/archiveweb.page) - A high-fidelity web archiving extension for Chrome and Chromium-based browsers
* [ReplayWeb.page](https://github.com/webrecorder/replayweb.page) - Serverless web archive replay directly in the browser
* [pywb](https://github.com/webrecorder/pywb) - Core Python Web Archiving Toolkit for replay and recording of web archives
* [webrecorder-player](https://github.com/webrecorder/webrecorder-player) - Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
* [ipwb](https://github.com/oduwsdl/ipwb) - InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

### Analysis and data processing
* [AUT](https://github.com/archivesunleashed/aut/) - The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
* [AUT Notebooks](https://github.com/archivesunleashed/notebooks) - Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
* [WARCIO](https://github.com/webrecorder/warcio) - Streaming WARC/ARC library for fast web archive IO
* [Metawarc](https://github.com/datacoon/metawarc) - Metadata extractor from WARC files
* [WarcDB](https://github.com/Florents-Tselai/WarcDB) - WarcDB: Web crawl data as SQLite databases
* [ArchiveSpark](https://github.com/helgeho/ArchiveSpark) - An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
* [CDX Toolkit](https://github.com/cocrawler/cdx_toolkit) - A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
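
Most of the tools above consume CDX indexes alongside the WARC data itself. A sketch of parsing one line of the common 11-column CDX layout (real CDX files declare their column order in a header line, so treat this fixed layout as an assumption):

```python
# Sketch: split a space-separated CDX line into named fields, following
# the 11-column layout commonly produced by Wayback Machine indexing tools.

CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "length", "offset", "filename",
]

def parse_cdx_line(line: str) -> dict[str, str]:
    """Map each whitespace-separated value to its conventional field name."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(values)}")
    return dict(zip(CDX_FIELDS, values))

record = parse_cdx_line(
    "org,example)/ 20230101000000 https://example.org/ text/html 200 "
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 crawl-00000.warc.gz"
)
print(record["timestamp"], record["original"])
```

The `offset` and `filename` fields together tell a replay tool where in which WARC file the archived record lives.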

### Page pushers
* [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) - Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more
* [Wayback](https://github.com/wabarc/wayback) - A self-hosted toolkit for archiving webpages to the Internet Archive, archive.today, IPFS, and local file systems
* [Archivenow](https://github.com/oduwsdl/archivenow) - A Tool To Push Web Resources Into Web Archives
* [iagitup](https://github.com/gdamdam/iagitup) - A command line tool to archive a git repository from GitHub to the Internet Archive.
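
Tools in this category generally work by requesting a capture endpoint; the Internet Archive's Wayback Machine, for instance, exposes "Save Page Now" at `https://web.archive.org/save/<url>`. A minimal sketch that only builds the request URL (no request is sent; submitting one needs `urllib.request.urlopen` and network access):

```python
# Sketch: construct a Wayback Machine "Save Page Now" request URL.
from urllib.parse import quote

WAYBACK_SAVE = "https://web.archive.org/save/"

def save_request_url(target: str) -> str:
    """Build the capture URL for a target page."""
    # Keep scheme, path, and query characters intact; escape anything else.
    return WAYBACK_SAVE + quote(target, safe=":/?=&")

url = save_request_url("https://example.org/page?id=1")
print(url)
```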

### Online services
* [Archive-It](https://archive-it.org/) - subscription web archiving service from the Internet Archive

## Social Networks

### Twitter
* [twarc](https://github.com/DocNow/twarc) - A command line tool (and Python library) for archiving Twitter JSON

### Instagram
* [instaloader](https://github.com/instaloader/instaloader) - Download pictures (or videos) along with their captions and other metadata from Instagram.

### Universal
* [sfm-ui](https://github.com/gwu-libraries/sfm-ui) - Social Feed Manager user interface application.
* [Media downloader](https://github.com/awesome-yasin/Media-Downloader) - downloads Instagram Reels, Stories, posts and profiles, Facebook public videos, YouTube videos (with YouTube-to-MP3 conversion), SoundCloud MP3s and Dailymotion videos. Built with Node.js, Express, React and RapidAPI.

## Other digital objects

### Online storage
* [ydiskarc](https://github.com/ruarxive/ydiskarc) - A command-line tool to back up public resources from the Yandex.Disk (disk.yandex.ru / yadi.sk) file storage service
* [filegetter](https://github.com/ruarxive/filegetter) - A command-line tool to collect files from public data sources using URL patterns and config files

### Messengers and chats
* [tgarc](https://github.com/ruarxive/tgarc) - A command line tool for archiving Telegram JSON

### Specific CMS
* [wparc](https://github.com/ruarxive/wparc) - Command-line tool for archiving WordPress API data and files
* [spcrawler](https://github.com/ruarxive/spcrawler) - A command-line tool to back up data from public SharePoint installations via their open API endpoints

### Public Data API
* [apibackuper](https://github.com/ruarxive/apibackuper) - Python library and command-line tool to back up API calls
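
The core pattern such tools automate is paging through an API until it is exhausted, storing every record. A stdlib-only sketch with a stub in place of real HTTP calls (`fetch_page` is a hypothetical stand-in you would replace with requests against the API's actual endpoint):

```python
# Sketch of the paginated-backup pattern: request pages until one comes
# back empty, accumulating all records along the way.
from typing import Callable

def backup_all(fetch_page: Callable[[int], list[dict]]) -> list[dict]:
    """Collect every record by requesting successive pages until one is empty."""
    records, page = [], 0
    while True:
        batch = fetch_page(page)
        if not batch:
            return records
        records.extend(batch)
        page += 1

# Stub API serving 25 records, 10 per page, for demonstration.
DATA = [{"id": i} for i in range(25)]
def fake_fetch(page: int) -> list[dict]:
    return DATA[page * 10:(page + 1) * 10]

dump = backup_all(fake_fetch)
print(len(dump))  # 25
```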

## Standards and specifications

* [The WARC Format 1.1](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) - The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.
* [CDX File format](https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/) - specification of CDX files, plain-text indexes of the records stored in WARC files
* [WARC Specifications](https://iipc.github.io/warc-specifications/) - collection of WARC related specifications and formats
* [The WACZ Format 1.1.1](https://specs.webrecorder.net/wacz/1.1.1/) - Web Archive Collection Zipped. WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file.
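
To make the WARC layout concrete, here is a sketch that serializes one minimal WARC/1.1 record by hand: named header fields, CRLF line endings, a `Content-Length` counting only the payload, and two CRLFs terminating the record. Real tools should use a library such as warcio instead.

```python
# Sketch: build a minimal WARC/1.1 "resource" record as raw bytes.
import uuid
from datetime import datetime, timezone

def make_warc_record(payload: bytes, target_uri: str) -> bytes:
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",  # payload bytes only, not headers
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # Blank line separates headers from payload; two CRLFs end the record.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = make_warc_record(b"hello, archive", "https://example.org/")
print(record.decode("utf-8").splitlines()[0])  # WARC/1.1
```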

## Organizations

* [Digital Preservation Coalition](https://www.dpconline.org/) - The DPC is a not-for-profit company dedicated to digital preservation initiatives
* [International Internet Preservation Consortium](https://netpreserve.org/) - Leading consortium for web archiving

## Knowledge bases
* [Archiveteam Wiki](https://wiki.archiveteam.org/) - Wiki about various archival topics and file formats

## Major digital archives

* [Internet Archive](https://archive.org/) - the largest digital archive, including extensive web archives
* [Common Crawl](https://commoncrawl.org) - an open repository of web crawl data covering a large portion of the public web

## Related lists

* [Awesome Web Archiving](https://github.com/iipc/awesome-web-archiving) - An Awesome List for getting started with web archiving
* [Awesome data takeout](https://github.com/ivbeg/awesome-data-takeout) - An Awesome Data Takeout list of services to take out your personal data from major online services and providers