An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with web-archiving

A curated list of projects in awesome lists tagged with web-archiving.

https://github.com/archivebox/archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 09 Sep 2025

https://github.com/pirate/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 01 Apr 2025

https://github.com/ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl

Last synced: 13 Mar 2025

https://github.com/webrecorder/pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

python pywb wayback web-archives web-archiving

Last synced: 14 May 2025

https://github.com/webrecorder/archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

archiving browser-extension chromium extension wacz warc web-archiving webrecorder

Last synced: 13 Apr 2025

https://github.com/ray-d-song/web-archive

Free web archiving and sharing service based on Cloudflare.

cloudflare cloudflare-pages d1 free hono self-hosted serverless web-archive web-archiving

Last synced: 15 May 2025

https://github.com/Ray-D-Song/web-archive

Free web archiving and sharing service based on Cloudflare.

cloudflare cloudflare-pages d1 free hono self-hosted serverless web-archive web-archiving

Last synced: 27 Mar 2025

https://github.com/webrecorder/replayweb.page

Serverless replay of web archives directly in the browser

replay-web-page service-worker wacz warc wayback-machine web-archive web-archiving web-replay

Last synced: 14 May 2025

https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler crawling wacz warc web-archiving web-crawler webrecorder

Last synced: 15 May 2025

https://github.com/gildas-lormeau/single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

archiving cli crawler deno dockerfile nodejs scraping-websites single-file web-archiving web-crawler web-scraper web-scraping

Last synced: 15 May 2025

https://github.com/oduwsdl/ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

docker ipfs memento memento-rfc python service-worker warc wayback web-archiving

Last synced: 13 Apr 2025

https://github.com/bellingcat/auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

archive docker open-source-research python scraping service web-archiving

Last synced: 20 Apr 2025

https://github.com/harvard-lil/perma

Indelible links

libraries web-archiving

Last synced: 03 Apr 2025

https://github.com/webrecorder/webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

electron pywb warc web-archiving webrecorder

Last synced: 03 Apr 2025

https://github.com/oduwsdl/archivenow

A Tool To Push Web Resources Into Web Archives

internet-archive web-archiving

Last synced: 05 Apr 2025

https://github.com/florents-tselai/warcdb

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 04 Apr 2025

https://github.com/Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 08 Apr 2025

https://github.com/machawk1/wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

gui heritrix openwayback pyinstaller python warc wayback web-archiving

Last synced: 16 May 2025

https://github.com/webrecorder/warcio

Streaming WARC/ARC library for fast web archive IO

python pywb warc web-archives web-archiving

Last synced: 15 May 2025

https://github.com/archivebox/archivebox-browser-extension

Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.

archivebox archiving browser-extension chrome-extension digipres digital-preservation firefox-extension internet-archiving svelte web-archiving

Last synced: 07 Jul 2025

https://github.com/ArchiveBox/archivebox-browser-extension

Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.

archivebox archiving browser-extension chrome-extension digipres digital-preservation firefox-extension internet-archiving svelte web-archiving

Last synced: 03 Apr 2025

https://github.com/webrecorder/browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

archiving cloud kubernetes wacz warc web-archive web-archiving webrecorder

Last synced: 16 May 2025

https://github.com/machawk1/warcreate

Chrome extension to "Create WARC files from any webpage"

chrome-extension warc web-archiving

Last synced: 10 Apr 2025

https://github.com/archivebox/electron-archivebox

Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)

archivebox desktop desktop-electron digipres docker electron gui internet-archiving linux macos web-archiving windows

Last synced: 05 Aug 2025

https://github.com/ArchiveBox/electron-archivebox

Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)

archivebox desktop desktop-electron digipres docker electron gui internet-archiving linux macos web-archiving windows

Last synced: 14 Mar 2025

https://github.com/cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

cdx cdx-api commoncrawl python warc web-archives web-archiving

Last synced: 14 Dec 2025

https://gwu-libraries.github.io/sfm-ui/

Social Feed Manager user interface application.

code4lib social-feed-manager social-media web-archiving

Last synced: 22 Apr 2025

https://github.com/gwu-libraries/sfm-ui

Social Feed Manager user interface application.

code4lib social-feed-manager social-media web-archiving

Last synced: 08 Apr 2025

https://github.com/helgeho/archivespark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 05 Apr 2025

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 08 Apr 2025

https://github.com/N0taN3rd/wail

🐋 One-Click User Instigated Preservation

browser-based-presrevation electron high-fidelity-preservation warc web-archiving

Last synced: 03 Apr 2025

https://github.com/maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

crawling duckdb parquet warc web-archiving

Last synced: 16 May 2025

https://github.com/n0tan3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 07 May 2025

https://github.com/N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 06 Aug 2025

https://github.com/xarantolus/Collect

A server to collect & archive websites that also supports video downloads

archive self-hosted video-downloader web-archiving webinterface website-archive website-scraper

Last synced: 10 May 2025

https://github.com/oduwsdl/warrick

Recover lost websites from the Web Infrastructure

memento memento-rfc recovery web-archiving

Last synced: 20 Feb 2025

https://github.com/xarantolus/collect

A server to collect & archive websites that also supports video downloads

archive self-hosted video-downloader web-archiving webinterface website-archive website-scraper

Last synced: 23 Apr 2025

https://github.com/oduwsdl/memgator

A Memento Aggregator CLI and Server in Go

memento memento-rfc timemap web-archiving

Last synced: 13 Apr 2025

https://github.com/oduwsdl/MemGator

A Memento Aggregator CLI and Server in Go

memento memento-rfc timemap web-archiving

Last synced: 06 Aug 2025

https://github.com/pirate/internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

archivebox censorship ethics internet-archiving slideshow talks warc web-archiving wget

Last synced: 24 Mar 2025

https://github.com/Own-Data-Privateer/hoardy-web

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, mirroring, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.

archive archiver archiving auto-save backups browser-extension cli internet internet-archiving offline-reading self-hosted snapshot wayback-machine web-archive web-archiving web-browsing website-archive

Last synced: 11 Mar 2025

https://github.com/TarekJor/bookmark-archiver

🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...

archive backup bookmarks browser chromium firefox google-chrome headless-browser headless-chrome html-export pinboard pocket preservation python rss safari web-archive web-archiving web-browser wget

Last synced: 27 Mar 2025

https://github.com/tarekjor/bookmark-archiver

🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...

archive backup bookmarks browser chromium firefox google-chrome headless-browser headless-chrome html-export pinboard pocket preservation python rss safari web-archive web-archiving web-browser wget

Last synced: 05 Oct 2025

https://github.com/zytedata/web-snap

Create "perfect" snapshots of web pages

capture-page javascript playwright web-archives web-archiving

Last synced: 07 Oct 2025

https://github.com/nla/httrack2warc

Converts HTTrack crawls to WARC files

web-archiving

Last synced: 19 Jul 2025

https://github.com/pkharsimran/website-downloader

Website-downloader is a powerful and versatile Python script designed to download entire websites along with all their assets. This tool allows you to create a local copy of a website, including HTML pages, images, CSS, JavaScript files, and other resources. It is ideal for web archiving, offline browsing, and web development.

automation beautifulsoup data-mining html internet-tools offline-browsing open-source python python-scripts requests web-archiving web-scraping website-cloner website-downloader wget

Last synced: 01 Sep 2025

https://github.com/ArchiveBox/pocket-exporter

A service to help export your Pocket bookmarks, tags, saved article text, and more...

archivebox archiving bookmarks getpocket html internet-archiving pocket urls web-archiving

Last synced: 19 Aug 2025

https://github.com/webrecorder/cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Last synced: 07 Apr 2025

https://github.com/archivebox/archivebox-proxy

Official ArchiveBox MITM proxy: saves URLs of all requests passing through to an ArchiveBox server for archival.

archivebox digipres digital-preservation https-proxy internet-archiving mitmproxy proxy web-archiving web-proxy

Last synced: 07 Jul 2025

https://github.com/internetarchive/sandcrawler

Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

web-archiving

Last synced: 17 Oct 2025

https://github.com/internetarchive/scrapy-warcio

Support for writing WARC files with Scrapy

python scrapy warc web-archiving

Last synced: 14 Jul 2025

https://github.com/archivebox/docs

Source for the GitHub Wiki / ReadTheDocs documentation for ArchiveBox, the self-hosted internet archiving solution.

archivebox cli community digipres documentation internet-archiving python rest sphinx ui usage web-archiving wiki

Last synced: 07 Jul 2025

https://github.com/webrecorder/dat-share

A prototype server to swarm multiple DATs for Webrecorder

dat dat-protocol hyperdrive web-archiving

Last synced: 21 Apr 2025

https://github.com/archivebox/pip-archivebox

Official Python package for ArchiveBox, the self-hosted internet archiving solution.

archivebox digipres internet-archiving pip pypi python sdist setuptools web-archiving wheel

Last synced: 07 Jul 2025

https://github.com/dbeley/archiveboxmatic

ArchiveBoxMatic: configure ArchiveBox with the simplicity of a yaml file.

archivebox archiving web-archiving

Last synced: 29 Apr 2025

https://github.com/rhizome-conifer/conifer-deploy

Conifer setup and deployment via Ansible

ansible-playbook web-archiving webrecorder

Last synced: 12 Apr 2025

https://github.com/anjackson/sliver

A tool for collecting archival slivers of the web and web archives

web-archive web-archives web-archiving

Last synced: 09 Apr 2025

https://github.com/helgeho/hadoopconcatgz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

hadoop spark warc web-archiving webarchive

Last synced: 14 Apr 2025

https://github.com/bellingcat/auto-archiver-api

API to manage users/sheets/URLs and call the auto-archiver in dedicated workers.

celery digital-preservation fastapi web-archiving

Last synced: 26 Jun 2025

https://github.com/webis-de/scriptor

Plug-and-play reproducible web analysis.

automation browser nodejs playwright user-simulation web-analysis web-archiving

Last synced: 29 Oct 2025

https://github.com/rybesh/capture-urls

Archive a list of URLs using the Wayback Machine

save-page-now wayback-machine web-archiving

Last synced: 04 Apr 2025

https://github.com/project-polymorph/news-website

Archive site for Chinese-language transgender-related news

news transgender web-archiving

Last synced: 05 Jan 2026

https://github.com/caltechlibrary/eprints2archives

Send records from an EPrints server to the Internet Archive and other web archives

archiving eprints internet-archive memento preservation python terminal web-archives web-archiving

Last synced: 14 Apr 2025

https://github.com/project-polymorph/platform-home

Homepage and platform for the Chinese trans digital archive

transgender web-archiving

Last synced: 12 Oct 2025

https://github.com/helgeho/warcpartitioner

Partition (W)ARC Files by MIME Type and Year

hadoop warc web-archiving webarchive

Last synced: 14 Apr 2025

https://github.com/archivebox/pocket-exporter

A service to help export your Pocket bookmarks, tags, saved article text, and more...

archivebox archiving bookmarks getpocket html internet-archiving pocket urls web-archiving

Last synced: 03 Jul 2025

https://github.com/cvyl/cf-static-archive-worker

A serverless website archiving solution built with Cloudflare Workers. This tool crawls and archives static websites, storing all assets (HTML, CSS, JS, images, etc.) in Cloudflare R2 storage.

archiver cloudflare cloudflare-r2 cloudflare-worker cloudflare-workers web-archiving

Last synced: 05 Apr 2025

https://github.com/operating-function/packrat

Next-gen browser history

chrome-extension wacz web-archiving

Last synced: 22 Jun 2025

https://github.com/usiqwerty/cheburashka

Extensible web archiving tool

archive cheburnet web-archiving

Last synced: 03 Apr 2025

https://github.com/yuzhoumo/edbox

Ed course archiver and viewer

alpinejs edstem jinja2 python web-archiving

Last synced: 04 Oct 2025

https://github.com/helgeho/tempas2archivespark

ArchiveSpark DataSpec to analyze the Internet Archive's Web archive through temporal search results returned by Tempas (v2)

archivespark information-retrieval temporal web-archives web-archiving

Last synced: 14 Apr 2025

https://github.com/oduwsdl/offtopic-goldstandard-data

Data for testing the Offtopic detection software

dataset memento offtopic web-archives web-archiving

Last synced: 22 Aug 2025

https://github.com/meequrox/flb-archiver

Flareboard web archiver in C using libcurl

curl libxml2 multithreading pthread web-archiving

Last synced: 15 Sep 2025