Projects in Awesome Lists tagged with webarchive

A curated list of projects in awesome lists tagged with webarchive.

https://github.com/karust/gogetcrawl

Extract web archive data using Wayback Machine and Common Crawl

commoncrawl concurrency crawler golang wayback-machine webarchive

Last synced: 15 Jan 2026

https://github.com/helgeho/archivespark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 05 Apr 2025

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 08 Apr 2025

https://github.com/chatnoir-eu/chatnoir-resiliparse

A robust web archive analytics toolkit

bigdata cpp cython extraction htmlparser python warc web webarchive

Last synced: 04 Apr 2026

https://github.com/n0tan3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 07 May 2025

https://github.com/N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving

Last synced: 06 Aug 2025

https://github.com/rcarmo/python-webarchive

Create WebKit/Safari .webarchive files on any platform

asyncio python3 webarchive

Last synced: 21 Jul 2025

https://github.com/mathis2001/webhackurls

Simple Python OSINT tool for URL recon using the Wayback Machine.

bugbounty osint pentesting recon wayback-machine webarchive

Last synced: 27 Apr 2025

https://github.com/mhucka/devilfish

A utility for simultaneously creating full-page PDF snapshots and web archives of web pages in DEVONthink Pro.

archiving devonthink pdf web webarchive

Last synced: 24 Feb 2025

https://github.com/ticky/webarchive

📑 Rust utilities for working with Apple's Web Archive file format

rust-crate rust-lang safari webarchive

Last synced: 17 Feb 2026

https://github.com/helgeho/hadoopconcatgz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

hadoop spark warc web-archiving webarchive

Last synced: 14 Apr 2025

https://github.com/gonejack/webarchive-to-singlefile

This command-line tool converts a .webarchive file to an .html file with embedded resources.

html webarchive

Last synced: 29 Jan 2026

https://github.com/sicos1977/webarchiveextractor

A .NET Standard 2.0 library to extract a Safari web archive to a folder

macos safari webarchive

Last synced: 23 Aug 2025

https://github.com/q-m/scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

scrapy wacz warc webarchive webarchive-data-scraping

Last synced: 24 Apr 2025

https://github.com/mccallofthewild/alexandrias-revenge

🔥The bold new archive that can’t be burned, bulldozed or battering-rammed #PoweredByArweave

archive article-extractor arweave blockchain webarchive

Last synced: 21 Apr 2025

https://github.com/ganapativs/puppeteer-warc

Create WARC (Web ARChive) of a web page

puppeteer warc webarchive

Last synced: 18 May 2026

https://github.com/ibnesayeed/archival-tests

A set of web archival replay test cases

archival-replay memento replay-tests testing webarchive webarchiving

Last synced: 12 Jan 2026

https://github.com/helgeho/warcpartitioner

Partition (W)ARC Files by MIME Type and Year

hadoop warc web-archiving webarchive

Last synced: 14 Apr 2025

https://github.com/airborne-commando/link-extractor-and-archive

A link extractor and archiving tool that uses archive.ph as its archiving service; useful for simple, bare-bones sites.

archive cli gui-python python terminal webarchive webarchiving

Last synced: 29 Apr 2026

https://github.com/maxmmueller/404-to-archive-redirector

Greasemonkey script that redirects from a 404 page to the Wayback Machine.

404-redirect greasemonkey javascript tampermonkey webarchive

Last synced: 17 Feb 2026

https://github.com/n0tan3rd/node-cdxj

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with node.js

cdxj web-archives webarchive webarchiving

Last synced: 17 Aug 2025

https://github.com/piecelet/neodb-trending-history

Trending history of books, movies, TV, music, games, podcasts, and collections for NeoDB, an open-source fediverse community where you can discover, track, share, and discuss books, movies, TV, music, games, podcasts, and shows. See https://github.com/neodb-social/neodb for NeoDB.

archive bluesky book books douban fediverse game games goodreads historical-data history imdb letterboxd mastodon movies music neodb podcast tv webarchive

Last synced: 07 May 2026

https://github.com/commoncrawl/arc2warc-conversion

Experiences converting Common Crawl's ARC files from the crawls 2008 - 2012 to the WARC format

arc arc-files warc warc-files warc-format webarchive webarchiving

Last synced: 16 Feb 2026

https://github.com/pereslavtsev/memento-client

Node.js library for the Time Travel APIs with full support for the Memento protocol.

memento timetravel wayback webarchive

Last synced: 29 Jun 2025

https://github.com/vishwas-r/internet-archive-assistant

Firefox Addon & Chrome Extension for effortlessly saving web pages to the Internet Archive or viewing their latest archived versions. Perfect for preserving content and retrieving snapshots.

chrome-extension firefox-addon internetarchive webarchive

Last synced: 31 Mar 2025

https://github.com/gonejack/html-to-webarchive

This command-line tool converts an .html file to a Safari .webarchive file.

html safari webarchive

Last synced: 14 Jan 2026