Projects in Awesome Lists tagged with web-archiving
A curated list of projects in awesome lists tagged with web-archiving.
https://github.com/archivebox/archivebox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl
Last synced: 09 Sep 2025
https://github.com/pirate/ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl
Last synced: 01 Apr 2025
https://github.com/ArchiveBox/ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl
Last synced: 13 Mar 2025
https://github.com/webrecorder/pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
python pywb wayback web-archives web-archiving
Last synced: 14 May 2025
https://github.com/rhizome-conifer/conifer
Collect and revisit web pages.
archives docker python pywb warc wayback web-archiving webrecorder
Last synced: 08 Apr 2025
https://github.com/Rhizome-Conifer/conifer
Collect and revisit web pages.
archives docker python pywb warc wayback web-archiving webrecorder
Last synced: 26 Mar 2025
https://github.com/webrecorder/archiveweb.page
A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!
archiving browser-extension chromium extension wacz warc web-archiving webrecorder
Last synced: 13 Apr 2025
https://github.com/ray-d-song/web-archive
Free web archiving and sharing service based on Cloudflare.
cloudflare cloudflare-pages d1 free hono self-hosted serverless web-archive web-archiving
Last synced: 15 May 2025
https://github.com/Ray-D-Song/web-archive
Free web archiving and sharing service based on Cloudflare.
cloudflare cloudflare-pages d1 free hono self-hosted serverless web-archive web-archiving
Last synced: 27 Mar 2025
https://github.com/webrecorder/replayweb.page
Serverless replay of web archives directly in the browser
replay-web-page service-worker wacz warc wayback-machine web-archive web-archiving web-replay
Last synced: 14 May 2025
https://github.com/webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
crawler crawling wacz warc web-archiving web-crawler webrecorder
Last synced: 15 May 2025
https://github.com/gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
archiving cli crawler deno dockerfile nodejs scraping-websites single-file web-archiving web-crawler web-scraper web-scraping
Last synced: 15 May 2025
https://github.com/oduwsdl/ipwb
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
docker ipfs memento memento-rfc python service-worker warc wayback web-archiving
Last synced: 13 Apr 2025
https://github.com/bellingcat/auto-archiver
Automatically archive links to videos, images, and social media content from Google Sheets (and more).
archive docker open-source-research python scraping service web-archiving
Last synced: 20 Apr 2025
https://github.com/akamhy/waybackpy
Wayback Machine API interface & a command-line tool
archive-webpage archive-webpages cdx-api internet-archive internet-archiving osint savepagenow wayback-machine wayback-machine-api wayback-machine-python web-archiving webarchiving
Last synced: 15 May 2025
https://github.com/webrecorder/webrecorder-player
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
electron pywb warc web-archiving webrecorder
Last synced: 03 Apr 2025
https://github.com/rahiel/archiveror
Archiveror will help you preserve the webpages you love. 💾
archiving bookmark browser-extension chrome-extension firefox-extension javascript linkrot mhtml web-archiving webextension
Last synced: 07 Apr 2025
https://github.com/oduwsdl/archivenow
A Tool To Push Web Resources Into Web Archives
internet-archive web-archiving
Last synced: 05 Apr 2025
https://github.com/florents-tselai/warcdb
WarcDB: Web crawl data as SQLite databases.
cli crawling database sqlite warc web-archiving web-data
Last synced: 04 Apr 2025
https://github.com/Florents-Tselai/WarcDB
WarcDB: Web crawl data as SQLite databases.
cli crawling database sqlite warc web-archiving web-data
Last synced: 08 Apr 2025
https://github.com/machawk1/wail
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
gui heritrix openwayback pyinstaller python warc wayback web-archiving
Last synced: 16 May 2025
https://github.com/webrecorder/warcio
Streaming WARC/ARC library for fast web archive IO
python pywb warc web-archives web-archiving
Last synced: 15 May 2025
https://github.com/archivebox/archivebox-browser-extension
Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.
archivebox archiving browser-extension chrome-extension digipres digital-preservation firefox-extension internet-archiving svelte web-archiving
Last synced: 07 Jul 2025
https://github.com/ArchiveBox/archivebox-browser-extension
Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.
archivebox archiving browser-extension chrome-extension digipres digital-preservation firefox-extension internet-archiving svelte web-archiving
Last synced: 03 Apr 2025
https://github.com/webrecorder/browsertrix
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
archiving cloud kubernetes wacz warc web-archive web-archiving webrecorder
Last synced: 16 May 2025
https://github.com/machawk1/warcreate
Chrome extension to "Create WARC files from any webpage"
chrome-extension warc web-archiving
Last synced: 10 Apr 2025
https://github.com/archivebox/electron-archivebox
Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)
archivebox desktop desktop-electron digipres docker electron gui internet-archiving linux macos web-archiving windows
Last synced: 05 Aug 2025
https://github.com/ArchiveBox/electron-archivebox
Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)
archivebox desktop desktop-electron digipres docker electron gui internet-archiving linux macos web-archiving windows
Last synced: 14 Mar 2025
https://github.com/cocrawler/cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
cdx cdx-api commoncrawl python warc web-archives web-archiving
Last synced: 14 Dec 2025
https://gwu-libraries.github.io/sfm-ui/
Social Feed Manager user interface application.
code4lib social-feed-manager social-media web-archiving
Last synced: 22 Apr 2025
https://github.com/gwu-libraries/sfm-ui
Social Feed Manager user interface application.
code4lib social-feed-manager social-media web-archiving
Last synced: 08 Apr 2025
https://github.com/helgeho/archivespark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
archivespark internet-archive spark spark-framework warc web-archiving webarchive
Last synced: 05 Apr 2025
https://github.com/helgeho/ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
archivespark internet-archive spark spark-framework warc web-archiving webarchive
Last synced: 08 Apr 2025
https://github.com/programminghistorian/ph-submissions
The repository and website hosting the peer review process for new Programming Historian lessons
api data-management dh digital-history digital-humanities distant-reading linked-open-data mapping multi-lingual network-analysis open-educational-resources open-source pedagogy programming-historian python r-studio web-archiving web-scraping
Last synced: 08 May 2025
https://github.com/N0taN3rd/wail
🐋 One-Click User Instigated Preservation
browser-based-presrevation electron high-fidelity-preservation warc web-archiving
Last synced: 03 Apr 2025
https://github.com/maxcountryman/warc-parquet
🗄️ A simple CLI for converting WARC to Parquet.
crawling duckdb parquet warc web-archiving
Last synced: 16 May 2025
https://github.com/internetarchive/fatcat
Perpetual Access To The Scholarly Record
digital-library open-access postgresql python rust scholarly-communication web-archiving
Last synced: 07 Apr 2025
https://github.com/n0tan3rd/node-warc
Parse And Create Web ARChive (WARC) files with node.js
chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving
Last synced: 07 May 2025
https://github.com/N0taN3rd/node-warc
Parse And Create Web ARChive (WARC) files with node.js
chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving
Last synced: 06 Aug 2025
https://github.com/xarantolus/Collect
A server to collect & archive websites that also supports video downloads
archive self-hosted video-downloader web-archiving webinterface website-archive website-scraper
Last synced: 10 May 2025
https://github.com/oduwsdl/warrick
Recover lost websites from the Web Infrastructure
memento memento-rfc recovery web-archiving
Last synced: 20 Feb 2025
https://github.com/xarantolus/collect
A server to collect & archive websites that also supports video downloads
archive self-hosted video-downloader web-archiving webinterface website-archive website-scraper
Last synced: 23 Apr 2025
https://github.com/oduwsdl/memgator
A Memento Aggregator CLI and Server in Go
memento memento-rfc timemap web-archiving
Last synced: 13 Apr 2025
https://github.com/oduwsdl/MemGator
A Memento Aggregator CLI and Server in Go
memento memento-rfc timemap web-archiving
Last synced: 06 Aug 2025
https://github.com/pirate/internet-archiving-talk
🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
archivebox censorship ethics internet-archiving slideshow talks warc web-archiving wget
Last synced: 24 Mar 2025
https://github.com/Own-Data-Privateer/hoardy-web
Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, mirroring, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.
archive archiver archiving auto-save backups browser-extension cli internet internet-archiving offline-reading self-hosted snapshot wayback-machine web-archive web-archiving web-browsing website-archive
Last synced: 11 Mar 2025
https://github.com/TarekJor/bookmark-archiver
🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...
archive backup bookmarks browser chromium firefox google-chrome headless-browser headless-chrome html-export pinboard pocket preservation python rss safari web-archive web-archiving web-browser wget
Last synced: 27 Mar 2025
https://github.com/tarekjor/bookmark-archiver
🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...
archive backup bookmarks browser chromium firefox google-chrome headless-browser headless-chrome html-export pinboard pocket preservation python rss safari web-archive web-archiving web-browser wget
Last synced: 05 Oct 2025
https://github.com/zytedata/web-snap
Create "perfect" snapshots of web pages
capture-page javascript playwright web-archives web-archiving
Last synced: 07 Oct 2025
https://github.com/gildas-lormeau/mhtml-to-html
Convert MHTML to HTML
bunjs cli deno executable html javascript mhtml nodejs single-file web-archiving
Last synced: 11 Apr 2025
https://github.com/pkharsimran/website-downloader
Website-downloader is a powerful and versatile Python script designed to download entire websites along with all their assets. This tool allows you to create a local copy of a website, including HTML pages, images, CSS, JavaScript files, and other resources. It is ideal for web archiving, offline browsing, and web development.
automation beautifulsoup data-mining html internet-tools offline-browsing open-source python python-scripts requests web-archiving web-scraping website-cloner website-downloader wget
Last synced: 01 Sep 2025
https://github.com/ArchiveBox/pocket-exporter
A service to help export your pocket bookmarks, tags, saved article text, and more...
archivebox archiving bookmarks getpocket html internet-archiving pocket urls web-archiving
Last synced: 19 Aug 2025
https://github.com/archivebox/archivebox-proxy
Official ArchiveBox MITM proxy: saves URLs of all requests passing through to an ArchiveBox server for archival.
archivebox digipres digital-preservation https-proxy internet-archiving mitmproxy proxy web-archiving web-proxy
Last synced: 07 Jul 2025
https://github.com/internetarchive/sandcrawler
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
Last synced: 17 Oct 2025
https://github.com/internetarchive/scrapy-warcio
Support for writing WARC files with Scrapy
python scrapy warc web-archiving
Last synced: 14 Jul 2025
https://github.com/archivebox/docs
Source for the GitHub Wiki / ReadTheDocs documentation for ArchiveBox, the self-hosted internet archiving solution.
archivebox cli community digipres documentation internet-archiving python rest sphinx ui usage web-archiving wiki
Last synced: 07 Jul 2025
https://github.com/webrecorder/dat-share
A prototype server to swarm multiple DATs for Webrecorder
dat dat-protocol hyperdrive web-archiving
Last synced: 21 Apr 2025
https://github.com/archivebox/pip-archivebox
Official Python package for ArchiveBox, the self-hosted internet archiving solution.
archivebox digipres internet-archiving pip pypi python sdist setuptools web-archiving wheel
Last synced: 07 Jul 2025
https://github.com/dbeley/archiveboxmatic
ArchiveBoxMatic: configure ArchiveBox with the simplicity of a yaml file.
archivebox archiving web-archiving
Last synced: 29 Apr 2025
https://github.com/rhizome-conifer/conifer-deploy
Conifer setup and deployment via Ansible
ansible-playbook web-archiving webrecorder
Last synced: 12 Apr 2025
https://github.com/anjackson/sliver
A tool for collecting archival slivers of the web and web archives
web-archive web-archives web-archiving
Last synced: 09 Apr 2025
https://github.com/helgeho/hadoopconcatgz
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
hadoop spark warc web-archiving webarchive
Last synced: 14 Apr 2025
https://github.com/bellingcat/auto-archiver-api
API to manage users/sheets/URLs and call the auto-archiver in dedicated workers.
celery digital-preservation fastapi web-archiving
Last synced: 26 Jun 2025
https://github.com/mrrfv/webarchive
Crawls websites and saves found URLs to a file.
archive archiveteam archiving crawler crawling ia internet-archive scraper web-archiving web-scraping
Last synced: 18 Mar 2025
https://github.com/webis-de/scriptor
Plug-and-play reproducible web analysis.
automation browser nodejs playwright user-simulation web-analysis web-archiving
Last synced: 29 Oct 2025
https://github.com/rybesh/capture-urls
Archive a list of URLs using the Wayback Machine
save-page-now wayback-machine web-archiving
Last synced: 04 Apr 2025
https://github.com/project-polymorph/news-website
Chinese-language archive site for transgender-related news
news transgender web-archiving
Last synced: 05 Jan 2026
https://github.com/caltechlibrary/eprints2archives
Send records from an EPrints server to the Internet Archive and other web archives
archiving eprints internet-archive memento preservation python terminal web-archives web-archiving
Last synced: 14 Apr 2025
https://github.com/oduwsdl/oduwsdl.github.io
ODU Web Science and Digital Libraries Research Group (WS-DL) home page.
digital-libraries digital-preservation information-retrieval machine-learning natural-language-processing web-archiving web-science
Last synced: 20 Feb 2025
https://github.com/project-polymorph/platform-home
Homepage and platform for the Chinese trans digital archive
Last synced: 12 Oct 2025
https://github.com/q-m/replayweb.page-docker
Docker image for ReplayWeb.page
replay-web-page web-archive web-archiving web-replay
Last synced: 22 Feb 2025
https://github.com/n0tan3rd/memgatorbulkdownload
memento memento-protocol memento-rfc memgator timemap timemaps web-archiving
Last synced: 26 Jun 2025
https://github.com/helgeho/warcpartitioner
Partition (W)ARC Files by MIME Type and Year
hadoop warc web-archiving webarchive
Last synced: 14 Apr 2025
https://github.com/archivebox/pocket-exporter
A service to help export your pocket bookmarks, tags, saved article text, and more...
archivebox archiving bookmarks getpocket html internet-archiving pocket urls web-archiving
Last synced: 03 Jul 2025
https://github.com/cvyl/cf-static-archive-worker
A serverless website archiving solution built with Cloudflare Workers. This tool crawls and archives static websites, storing all assets (HTML, CSS, JS, images, etc.) in Cloudflare R2 storage.
archiver cloudflare cloudflare-r2 cloudflare-worker cloudflare-workers web-archiving
Last synced: 05 Apr 2025
https://github.com/operating-function/packrat
Next-gen browser history
chrome-extension wacz web-archiving
Last synced: 22 Jun 2025
https://github.com/usiqwerty/cheburashka
Extensible web archiving tool
archive cheburnet web-archiving
Last synced: 03 Apr 2025
https://github.com/yuzhoumo/edbox
Ed course archiver and viewer
alpinejs edstem jinja2 python web-archiving
Last synced: 04 Oct 2025
https://github.com/helgeho/tempas2archivespark
ArchiveSpark DataSpec to analyze the Internet Archive's Web archive through temporal search results returned by Tempas (v2)
archivespark information-retrieval temporal web-archives web-archiving
Last synced: 14 Apr 2025
https://github.com/oduwsdl/offtopic-goldstandard-data
Data for testing the Offtopic detection software
dataset memento offtopic web-archives web-archiving
Last synced: 22 Aug 2025
https://github.com/oliverwebdev/webarchiver
A powerful desktop application to download, archive, and manage web pages locally with full resource support, built with Python and PyQt6.
archive-management content-archiving data-preservation desktop-application html-editor html-parsing offline-browsing playwright pyqt6 python python-gui selenium sqlite tagging-system web-archiving web-content-management web-scraping website-archiver website-backup
Last synced: 14 Mar 2025
https://github.com/meequrox/flb-archiver
Flareboard web archiver in C using libcurl
curl libxml2 multithreading pthread web-archiving
Last synced: 15 Sep 2025
https://github.com/torahappy/misc
mysterious box of various codes
backup firewall game linux math proton raspberry-pi timestamp tor web-archiving wine
Last synced: 30 Aug 2025