Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with webarchiving
A curated list of projects in awesome lists tagged with webarchiving.
https://github.com/akamhy/waybackpy
Wayback Machine API interface & a command-line tool
archive-webpage archive-webpages cdx-api internet-archive internet-archiving osint savepagenow wayback-machine wayback-machine-api wayback-machine-python web-archiving webarchiving
Last synced: 04 Aug 2024
https://github.com/N0taN3rd/Squidwarc
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 01 Aug 2024
https://github.com/N0taN3rd/node-warc
Parse and create Web ARChive (WARC) files with Node.js
chrome-remote-interface pupeteer warc warc-files web-archives web-archiving webarchive webarchiving
Last synced: 17 Aug 2024
https://github.com/ArchiveTeam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd
Last synced: 06 Aug 2024
https://github.com/peterk/warcworker
A dockerized, queued, high-fidelity web archiver based on Squidwarc
archiving high-fidelity-preservation preservation webarchives webarchiving
Last synced: 01 Aug 2024
https://github.com/commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework
Last synced: 17 Aug 2024
https://github.com/datacoon/metawarc
metawarc: a command-line tool for extracting metadata from files inside WARC (Web ARChive) archives
metadata osint osint-python warc warc-files webarchiving
Last synced: 01 Aug 2024
https://github.com/ArchiveTeam/WebArchiver
Decentralized web archiving
archiver archiving crawler decentralized python warc web webarchiving
Last synced: 01 Aug 2024
https://github.com/ibnesayeed/archival-tests
A set of web archival replay test cases
archival-replay memento replay-tests testing webarchive webarchiving
Last synced: 01 Oct 2024